Rasterio and GDAL_CACHEMAX


Angus Dickey
 

Does GDAL_CACHEMAX have to be set in bytes when using rasterio? The docs have an example that uses MBs, but when I follow it rasterio seems to set a very small cache size. For example, accessing a COG in S3 using rasterio 1.1.8:

import rasterio
from rasterio.env import get_gdal_config

# No problem here
with rasterio.Env() as env:
    # Prints 851132006 (5% of my system RAM in bytes)
    print(get_gdal_config('GDAL_CACHEMAX'))
    with rasterio.open('s3://path/to/cog') as src:
        pass  # Do stuff with the COG

# No problem here either
with rasterio.Env(GDAL_CACHEMAX=536870912) as env:
    # Prints 536870912
    print(get_gdal_config('GDAL_CACHEMAX'))
    with rasterio.open('s3://path/to/cog') as src:
        pass  # Do stuff with the COG

# Really slow
with rasterio.Env(GDAL_CACHEMAX=512) as env:
    # Prints 512 (in bytes?)
    print(get_gdal_config('GDAL_CACHEMAX'))
    with rasterio.open('s3://path/to/cog') as src:
        pass  # Do stuff with the COG


It seems like rasterio is setting the GDAL raster block cache to 512 bytes, and this is causing the slow read. I don't really understand the internals of rasterio, but it looks to be using GDALSetCacheMax() (which only accepts bytes) and passing it 512. I might be misunderstanding the problem, though; it could be something else slowing things down, but that is the only change I am making.
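
To put rough numbers on why a 512-byte block cache would hurt, here is a small sketch of the arithmetic. The tile layout (512 x 512 pixels, one band of uint8) is an assumption for illustration only, not something taken from the COG in question, and `block_size_bytes` is a hypothetical helper, not a rasterio or GDAL function.

```python
def block_size_bytes(width: int, height: int, bytes_per_pixel: int, bands: int = 1) -> int:
    """Uncompressed size in bytes of one raster block."""
    return width * height * bytes_per_pixel * bands


# Assumed for illustration: 512 x 512 internal tiles holding one band of uint8.
tile_bytes = block_size_bytes(512, 512, 1)

# Compare the two cache sizes from the examples above.
for cache_bytes in (512, 536870912):
    whole_tiles = cache_bytes // tile_bytes
    print(f"cache of {cache_bytes} bytes holds {whole_tiles} whole tile(s) of {tile_bytes} bytes")
```

Under that assumption a single tile is 262144 bytes, so a 512-byte cache cannot hold even one decoded tile, while 536870912 bytes holds 2048 of them. That would be consistent with blocks being re-read on every access, although the sketch does not prove that this is the cause of the slowdown.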

Any input is appreciated.

Thanks,

Angus


Sean Gillies
 

Hi Angus,

On Mon, Oct 26, 2020 at 6:04 PM Angus Dickey <angus@...> wrote:
> Does GDAL_CACHEMAX have to be set in bytes when using rasterio?

Yes, GDAL_CACHEMAX passed to `Env()` must be in bytes (since https://github.com/mapbox/rasterio/pull/1042/files).
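
As an illustration of the point above (a sketch, not part of the original reply), one way to avoid the unit mix-up is to convert from megabytes to bytes explicitly before calling `Env()`. The helper name `mb_to_bytes` is hypothetical, and "MB" here is assumed to mean 1024 * 1024 bytes, which matches the 536870912 figure used earlier in the thread. The rasterio import is guarded so the conversion can be checked without rasterio installed.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes (taken as 1024 * 1024 bytes each) to bytes."""
    return megabytes * 1024 * 1024


cache_bytes = mb_to_bytes(512)
print(cache_bytes)

try:
    import rasterio
except ImportError:
    rasterio = None  # the conversion above is still usable on its own

if rasterio is not None:
    # Pass the value in bytes, as the reply above says Env() expects.
    with rasterio.Env(GDAL_CACHEMAX=cache_bytes):
        pass  # open and read datasets here
```

Here `mb_to_bytes(512)` gives 536870912, the value that worked in the second example of the original question.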

--
Sean Gillies


Angus Dickey
 

Sean,

Awesome, thanks for the response. There are a couple of places in the docs where the example sets the cache in MBs:

https://rasterio.readthedocs.io/en/latest/api/rasterio.env.html?highlight=GDAL_CACHEMAX
https://rasterio.readthedocs.io/en/latest/topics/switch.html?highlight=GDAL_CACHEMAX

Not a big deal, but might send people down the wrong path.

Thanks again,

Angus


On Mon, Oct 26, 2020 at 8:58 PM Sean Gillies via groups.io <sean=mapbox.com@groups.io> wrote:
> Yes, GDAL_CACHEMAX passed to `Env()` must be in bytes (since https://github.com/mapbox/rasterio/pull/1042/files).