Speed up reading rasters

Carlos García Rodríguez
 

Hello, I need to read many huge datasets, and read speed is very important to avoid a bottleneck.
I have to read a TIFF file that has 20 bands, using a 224 x 224 window.
Right now I do it like this, and it takes approx. 0.8 seconds.

with rasterio.open('./sentinel.tif') as src:
    sentinel1_1 = src.read(window=window)

What I realized is that if I try to read only one of the bands the required time is approx the same, but when reading a tiff of only one band the amount of time is 10 times shorter.

Can I do something to speed it up? Maybe read bands in parallel, I don't really know.

I appreciate your help.

Thank you.
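
One thing that may be worth checking is how the TIFF is laid out internally. This is an assumption, not a diagnosis of this particular file. GDAL reads whole blocks, so a 224 x 224 window costs more when the file is stored in full-width strips than when it is stored in square tiles. If the file is also pixel-interleaved (`'interleave': 'pixel'` in `src.profile`), each block carries all 20 bands, which would be consistent with one band costing about as much as all of them. The sketch below only counts the blocks a window intersects; `blocks_touched` is a hypothetical helper and the block sizes are made-up illustrations (the real ones are reported by `src.block_shapes`).

```python
def blocks_touched(row_off, col_off, height, width, block_height, block_width):
    """Count the storage blocks that a read window intersects."""
    first_row = row_off // block_height
    last_row = (row_off + height - 1) // block_height
    first_col = col_off // block_width
    last_col = (col_off + width - 1) // block_width
    return (last_row - first_row + 1) * (last_col - first_col + 1)


# Hypothetical layouts for a 224 x 224 window at offset (100, 100).
tiled = blocks_touched(100, 100, 224, 224, block_height=256, block_width=256)
striped = blocks_touched(100, 100, 224, 224, block_height=1, block_width=10980)
print(tiled, striped)
```

If the counts differ a lot for your file, rewriting it as a tiled GeoTIFF may help more than reading bands in parallel, but that would need to be measured.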


Re: Why does shape not return the number of channels

Sean Gillies
 

Correct, the dataset's shape is not the same as the shape of dataset.read().
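
A small sketch of the relationship, using the values from the profile quoted below. `read_shape` is a hypothetical helper, not part of rasterio. The dataset's `shape` is `(height, width)`, while `read()` with no arguments returns an array of shape `(count, height, width)`.

```python
def read_shape(profile):
    """Shape of the array that dataset.read() returns for a whole dataset."""
    return (profile["count"], profile["height"], profile["width"])


# Values taken from the profile in the quoted message.
profile = {"count": 3, "height": 3080, "width": 890}
dataset_shape = (profile["height"], profile["width"])  # what ds.shape reports
array_shape = read_shape(profile)
print(dataset_shape, array_shape)

# With an open dataset `ds`, the same tuple is (ds.count, *ds.shape).
```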

On Thu, Apr 16, 2020 at 1:38 PM himat15 via groups.io <himat15=yahoo.com@groups.io> wrote:

I was pretty surprised when I realized that my bug was due to ds.shape not showing the number of channels in its shape
i.e.

    with rio.open(f, 'r') as ds:
        
        ds_arr = ds.read()
        print(ds.profile) # {'driver': 'GTiff', 'dtype': 'float32', 'nodata': 0.0, 'width': 890, 'height': 3080, 'count': 3, 'crs': CRS.from_epsg(32629), 'transform': Affine(15.0, 0.0, 232060.0, 0.0, -15.0, 1438430.0), 'tiled': False, 'interleave': 'pixel'}
        print(ds.shape) # (3080, 890)
        print(ds_arr.shape) # (3, 3080, 890)

I assume this is the expected behavior, but was just taken aback by this.



--
Sean Gillies


Re: Issue with bounds2raster (contextily) using rasterio

geoterraimage1@...
 

You can try the following:

from rasterio.crs import CRS

raster = rio.open(
        path,
        "w",
        driver="GTiff",
        height=h,
        width=w,
        count=b,
        dtype=str(Z.dtype.name),
        crs=CRS.from_user_input('EPSG:3857'),
        transform=transform,
    )


Re: CRS & EPSG issues

geoterraimage1@...
 

So my issue is resolved. It was caused by having a GDAL installation on my Windows machine, with all the GDAL paths set up as system variables. This was conflicting with the GDAL setup in each conda environment.

I uninstalled all GDAL core and bindings from Windows itself, and then only installed GDAL in the Anaconda environment. No system variables were set after the GDAL was installed in the environment.
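
For anyone hitting the same conflict, here is a minimal sketch of how one might spot it, assuming the problem is a `PROJ_LIB` or `GDAL_DATA` variable that points outside the active environment. `outside_prefix` is a hypothetical helper, and the check is only a heuristic.

```python
import os
import sys

# Variables named in this thread that GDAL/PROJ consult for their data files.
PATH_VARS = ("PROJ_LIB", "GDAL_DATA")


def outside_prefix(env, prefix):
    """Return {name: value} for path variables set to a location outside prefix."""
    prefix = os.path.normcase(os.path.abspath(prefix))
    suspicious = {}
    for name in PATH_VARS:
        value = env.get(name)
        if not value:
            continue  # unset variables cannot conflict
        path = os.path.normcase(os.path.abspath(value))
        if path != prefix and not path.startswith(prefix + os.sep):
            suspicious[name] = value
    return suspicious


if __name__ == "__main__":
    # sys.prefix is the root of the active (conda) environment.
    for name, value in outside_prefix(os.environ, sys.prefix).items():
        print(f"{name}={value} points outside {sys.prefix}")
```

Run inside the activated environment; any line it prints is a candidate for the kind of mismatch described above.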


Issue with bounds2raster (contextily) using rasterio

Tony
 

Hi,

I've been using bounds2raster in contextily to save some raster tiles from OSM, but my code has started throwing up this error.  
>8--------------------------------------------------------
File "rasterio\_crs.pyx", line 327, in rasterio._crs._CRS.from_user_input
CRSError: The WKT could not be parsed. OGR Error code 6
>8--------------------------------------------------------

It worked before, so I am not sure what has changed.  I've looked around online for similar problems and haven't found anything.

I am using an Anaconda installation on Windows and have rasterio 1.0.21 and contextily 1.0rc2.

Contextily calls rasterio with the following in the tile.py file:
>8--------------------------------------------------------
    raster = rio.open(
        path,
        "w",
        driver="GTiff",
        height=h,
        width=w,
        count=b,
        dtype=str(Z.dtype.name),
        crs="epsg:3857",
        transform=transform,
    )
>8--------------------------------------------------------


The full stack trace is:
>8--------------------------------------------------------
    img, ext = ctx.bounds2raster(w, s, e, n, 'test.tif', ll=True)
 
  File "C:\ProgramData\Anaconda3\lib\site-packages\contextily\tile.py", line 117, in bounds2raster
    transform=transform,
 
  File "C:\ProgramData\Anaconda3\lib\site-packages\rasterio\env.py", line 423, in wrapper
    return f(*args, **kwds)
 
  File "C:\ProgramData\Anaconda3\lib\site-packages\rasterio\__init__.py", line 225, in open
    **kwargs)
 
  File "rasterio\_io.pyx", line 1182, in rasterio._io.DatasetWriterBase.__init__
 
  File "rasterio\_io.pyx", line 1222, in rasterio._io.DatasetWriterBase._set_crs
 
  File "C:\ProgramData\Anaconda3\lib\site-packages\rasterio\crs.py", line 434, in from_user_input
    obj._crs = _CRS.from_user_input(value, morph_from_esri_dialect=morph_from_esri_dialect)
 
  File "rasterio\_crs.pyx", line 327, in rasterio._crs._CRS.from_user_input
 
CRSError: The WKT could not be parsed. OGR Error code 6
>8--------------------------------------------------------

Do you have any ideas for me to try?

Thanks in advance,
Tony


Re: CRS & EPSG issues

geoterraimage1@...
 

On my Windows PC, I am running Anaconda to handle my environments.

My guess is that the issue resides in the GDAL installation on Windows, which was performed prior to installing Anaconda.
PROJ_LIB was set as a system variable and pointed to the proj.db file in the GDAL folder.
If I create a new environment in Conda and then install rasterio / gdal / geopandas, is it likely to pick up the GDAL installation currently on Windows and assume that PROJ_LIB (and proj.db) is correct? Is this likely to be leading to the error?

In terms of Conda environments and GDAL/Rasterio, should we treat each environment as having its own GDAL installation and its own GDAL_DATA paths?

I appreciate the assistance, but how can I ensure that each environment maps PROJ to a proj.db that is up to date, with the correct database schema? Is there a good rule of thumb to not stuff this up in future?


Re: CRS & EPSG issues

Even Rouault
 

> ERROR 1: PROJ: proj_create_from_database: cannot build geodeticCRS 4326:

> SQLite error on SELECT name, ellipsoid_auth_name, ellipsoid_code,

> prime_meridian_auth_name, prime_meridian_code, area_of_use_auth_name,

> area_of_use_code, publication_date, deprecated FROM geodetic_datum WHERE

> auth_name = ? AND code = ?: no such column: publication_date Traceback

 

Smells like you have PROJ 6.x.y version pointing to the proj.db of another PROJ 6.z.t version, since there were a few database schema changes. Make sure your PROJ_LIB is correctly set.

 

--

Spatialys - Geospatial professional services

http://www.spatialys.com


CRS & EPSG issues

geoterraimage1@...
 

(flask-cog-terracotta) PS E:\00_UBUNTU_DRIVE\WATER_MONITORING_APPLICATION> python
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 22:22:21) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import rasterio
>>> rasterio.__version__
'1.1.3'
>>> from rasterio.crs import CRS
>>> crs = CRS.from_user_input("EPSG:4326")
ERROR 1: PROJ: proj_create_from_database: cannot build geodeticCRS 4326: SQLite error on SELECT name, ellipsoid_auth_name, ellipsoid_code, prime_meridian_auth_name, prime_meridian_code, area_of_use_auth_name, area_of_use_code, publication_date, deprecated FROM geodetic_datum WHERE auth_name = ? AND code = ?: no such column: publication_date
Traceback (most recent call last):
  File "rasterio\_crs.pyx", line 363, in rasterio._crs._CRS.from_user_input
  File "rasterio\_err.pyx", line 194, in rasterio._err.exc_wrap_ogrerr
rasterio._err.CPLE_BaseError: OGR Error code 6
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\jens.hiestermann\.conda\envs\flask-cog-terracotta\lib\site-packages\rasterio\crs.py", line 459, in from_user_input
    obj._crs = _CRS.from_user_input(value, morph_from_esri_dialect=morph_from_esri_dialect)
  File "rasterio\_crs.pyx", line 367, in rasterio._crs._CRS.from_user_input
rasterio.errors.CRSError: The WKT could not be parsed. OGR Error code 6
>>> crs = CRS.from_user_input("+init=epsg:4326")
Warning 1: +init=epsg:XXXX syntax is deprecated. It might return a CRS with a non-EPSG compliant axis order.
>>> 


Please assist. I am having trouble locating the cause of the issue: is it GDAL, PROJ, Rasterio, or the combination of them in an Anaconda environment?

I am consistently plagued with this error msg:
 
ERROR 1: PROJ: proj_create_from_database: cannot build geodeticCRS 4326: SQLite error on SELECT name, ellipsoid_auth_name, ellipsoid_code, prime_meridian_auth_name, prime_meridian_code, area_of_use_auth_name, area_of_use_code, publication_date, deprecated FROM geodetic_datum WHERE auth_name = ? AND code = ?: no such column: publication_date
 
I have set the PROJ_LIB system variable to point to the proj.db in the GDAL folder, but I am still running into the same error.
 


rio.plot.show with colorbar?

himat15@...
 

How can I add a colorbar after using rio.plot.show?
I've tried a bunch of things but have gotten various errors

Here's one way I tried:

    fig, ax = plt.subplots(figsize=(16, 16))

    retted = rio.plot.show(ds, ax=ax, cmap='Greys_r')

    fig.colorbar(retted, ax=ax)
    plt.title("Original")
    plt.show()

This raises the error: AttributeError: 'AxesSubplot' object has no attribute 'get_array'


Why does shape not return the number of channels

himat15@...
 


I was pretty surprised when I realized that my bug was due to ds.shape not showing the number of channels in its shape
i.e.

    with rio.open(f, 'r') as ds:
        
        ds_arr = ds.read()
        print(ds.profile) # {'driver': 'GTiff', 'dtype': 'float32', 'nodata': 0.0, 'width': 890, 'height': 3080, 'count': 3, 'crs': CRS.from_epsg(32629), 'transform': Affine(15.0, 0.0, 232060.0, 0.0, -15.0, 1438430.0), 'tiled': False, 'interleave': 'pixel'}
        print(ds.shape) # (3080, 890)
        print(ds_arr.shape) # (3, 3080, 890)

I assume this is the expected behavior, but was just taken aback by this.


Re: Inverted axis in numpy

Armstrong Manuvakola Ezequias Ngolo
 

Hi Gabriel, 

I am not sure if I understood what you meant!!
Were you expecting to have the width in the first axis? 
I think this behaviour (height first) is normal in rasterio and in many libraries used to manipulate images!!

But if you really wanna change the axis you can use something like numpy.transpose (recommended) or numpy.rollaxis
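
A short sketch of what that looks like, assuming NumPy is installed. The zero-filled array is only a stand-in for `dataset.read(1)` on the 84 x 44 raster from the question.

```python
import numpy as np

# Stand-in for dataset.read(1) on a raster with width 84 and height 44.
band = np.zeros((44, 84))

# Rows (height) come first, columns (width) second.
rows, cols = band.shape

# Swap the axes only if downstream code really expects (width, height).
swapped = np.transpose(band)  # same as band.T; returns a view, not a copy
print(band.shape, swapped.shape)
```

Since the transpose is a view, code that needs contiguous memory (numba kernels, for instance) may want `np.ascontiguousarray(swapped)` afterwards.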

Good luck!!!



From: main@rasterio.groups.io <main@rasterio.groups.io> on behalf of gabriel@... <gabriel@...>
Sent: April 11, 2020 22:11
To: main@rasterio.groups.io <main@rasterio.groups.io>
Subject: [rasterio] Inverted axis in numpy
 

Hi everyone. I started using rasterio a few days ago to read some DSM/DTM and process them using numba.

I was trying to debug some weird behaviour in my numba function and I realized that the "read" function in Rasterio actually flips the axes of my dataset.
The file I am opening is a GeoTIFF, with EPSG:3003 projection.

This is how I reproduced this behaviour:

import rasterio

dataset = rasterio.open("raster.tif")
band = dataset.read(1)
print(dataset.width)   # -> 84
print(dataset.height)  # -> 44
print(band.shape)      # -> (44, 84)


Is this behaviour normal? I don't understand why my data are being flipped in this way.

Thanks, Gabriel


Inverted axis in numpy

gabriel@...
 

Hi everyone. I started using rasterio a few days ago to read some DSM/DTM and process them using numba.

I was trying to debug some weird behaviour in my numba function and I realized that the "read" function in Rasterio actually flips the axes of my dataset.
The file I am opening is a GeoTIFF, with EPSG:3003 projection.

This is how I reproduced this behaviour:

import rasterio

dataset = rasterio.open("raster.tif")
band = dataset.read(1)
print(dataset.width)   # -> 84
print(dataset.height)  # -> 44
print(band.shape)      # -> (44, 84)


Is this behaviour normal? I don't understand why my data are being flipped in this way.

Thanks, Gabriel


Re: Asyncio + Rasterio for slow network requests?

Dion Häfner
 

Hey Sean,

Sorry, I should have been clearer.

As it stands, my statement is false: GDAL is of course designed to be thread-safe, so doing concurrent reads in different threads *should* work. But in our experience, it doesn't, to the point that we have given up on threads entirely.

Relevant issues from last year:

https://github.com/mapbox/rasterio/issues/1686

https://github.com/OSGeo/gdal/issues/1960

https://github.com/OSGeo/gdal/issues/1244

Even though GDAL#1244 was closed as fixed, we still observed the problem, so I suspect there is another race condition somewhere within GDAL.

Anyway, this wasn't meant as a general statement, just a personal word of advice. To me, multiprocessing seems like a saner alternative at the moment, but YMMV.

Best,
Dion

On 30/03/2020 23.38, Sean Gillies via Groups.Io wrote:
Hi Kyle, Dion:
On Mon, Mar 30, 2020 at 1:41 PM <kylebarron2@...> wrote:
Sorry for the slow response. As Vincent noted, just moving back to
GDAL 2.4 made the process ~8x faster, from 1.7s to read to ~200ms to
read each source tile.

> A constant time regardless of the amount of overlap suggests to
me that your source files may lack the proper tiling.
According to the AWS NAIP docs
<https://docs.opendata.aws/naip/readme.html>,
the COG sources were created with
gdal_translate -b 1 -b 2 -b 3 -of GTiff -co tiled=yes -co
BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=DEFLATE -co
PREDICTOR=2 src_dataset dst_dataset
gdaladdo -r average -ro src_dataset 2 4 8 16 32 64
gdal_translate -b 1 -b 2 -b 3 -of GTiff -co TILED=YES -co
BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=JPEG -co
JPEG_QUALITY=85 -co PHOTOMETRIC=YCBCR -co COPY_SRC_OVERVIEWS=YES
--config GDAL_TIFF_OVR_BLOCKSIZE 512 src_dataset dst_dataset
Thank you for the details.

> asyncio's run_in_executor does the exact same thing as using a
thread pool
That makes sense, and I ultimately expected to not be able to make
progress since it's GDAL making the low level requests.

> Usually, reading a tile from S3 takes something like 10-100ms if you do it right.
Moving back to GDAL 2.4 got around these speeds.

> At the moment, GDAL reads are not thread-safe!
That's really great to keep in mind! Means I'll probably shy away
from attempting concurrency with GDAL in general.
Dion, can you say a little more about reads not being thread-safe?
It's intended that we can call GDAL's RasterIO functions in different threads concurrently as long as we don't share dataset handles between threads. If we observe otherwise, then there is a GDAL bug that we can fix.
There is an additional consideration for VRTs explained in https://gdal.org/drivers/raster/vrt.html#multi-threading-issues. If we have multiple VRTs, used in different threads, pointing to the same URLs, we need to take an extra step to prevent GDAL from accidentally sharing those non-VRT dataset handles between the threads.
--
Sean Gillies
--

Dion Häfner
PhD Student

Niels Bohr Institute
Physics of Ice, Climate and Earth
University of Copenhagen
Tagensvej 16, DK-2200 Copenhagen, DENMARK

_.~"~._.~"~._.~"~._.~"~._.~"~._.~"~._.~"~._


Re: Asyncio + Rasterio for slow network requests?

Sean Gillies
 

Hi Kyle, Dion:

On Mon, Mar 30, 2020 at 1:41 PM <kylebarron2@...> wrote:
Sorry for the slow response. As Vincent noted, just moving back to GDAL 2.4 made the process ~8x faster, from 1.7s to read to ~200ms to read each source tile.

> A constant time regardless of the amount of overlap suggests to me that your source files may lack the proper tiling.

According to the AWS NAIP docs, the COG sources were created with

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co tiled=yes -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=DEFLATE -co PREDICTOR=2 src_dataset dst_dataset

gdaladdo -r average -ro src_dataset 2 4 8 16 32 64

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co TILED=YES -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=JPEG -co JPEG_QUALITY=85 -co PHOTOMETRIC=YCBCR -co COPY_SRC_OVERVIEWS=YES --config GDAL_TIFF_OVR_BLOCKSIZE 512 src_dataset dst_dataset

Thank you for the details.


> asyncio's run_in_executor does the exact same thing as using a thread pool

That makes sense, and I ultimately expected to not be able to make progress since it's GDAL making the low level requests.

> Usually, reading a tile from S3 takes something like 10-100ms if you do it right.

Moving back to GDAL 2.4 got around these speeds.

> At the moment, GDAL reads are not thread-safe!

That's really great to keep in mind! Means I'll probably shy away from attempting concurrency with GDAL in general.

Dion, can you say a little more about reads not being thread-safe?

It's intended that we can call GDAL's RasterIO functions in different threads concurrently as long as we don't share dataset handles between threads. If we observe otherwise, then there is a GDAL bug that we can fix.

There is an additional consideration for VRTs explained in https://gdal.org/drivers/raster/vrt.html#multi-threading-issues. If we have multiple VRTs, used in different threads, pointing to the same URLs, we need to take an extra step to prevent GDAL from accidentally sharing those non-VRT dataset handles between the threads.
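
To make the "don't share dataset handles between threads" rule concrete, here is a minimal sketch in which every task opens its own handle. `read_window` and `read_windows` are hypothetical helper names, and the `opener` argument exists only so the pattern can be tried without rasterio; whether this avoids the failures Dion reported is not something the sketch can show.

```python
from concurrent.futures import ThreadPoolExecutor


def read_window(path, window, opener=None):
    """Open a private dataset handle in the calling thread and read one window."""
    if opener is None:
        import rasterio  # imported lazily so the sketch loads without rasterio

        opener = rasterio.open
    # The handle is created and closed inside this call, so it never
    # crosses a thread boundary.
    with opener(path) as src:
        return src.read(window=window)


def read_windows(path, windows, opener=None, max_workers=4):
    """Read several windows concurrently; no handle is shared between threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda w: read_window(path, w, opener), windows))
```

Opening per task costs some overhead compared with one long-lived handle, which is the price of keeping handles thread-local in this simple form.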

--
Sean Gillies


Re: Asyncio + Rasterio for slow network requests?

kylebarron2@...
 

Sorry for the slow response. As Vincent noted, just moving back to GDAL 2.4 made the process ~8x faster, from 1.7s to read to ~200ms to read each source tile.

> A constant time regardless of the amount of overlap suggests to me that your source files may lack the proper tiling.

According to the AWS NAIP docs, the COG sources were created with

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co tiled=yes -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=DEFLATE -co PREDICTOR=2 src_dataset dst_dataset

gdaladdo -r average -ro src_dataset 2 4 8 16 32 64

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co TILED=YES -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=JPEG -co JPEG_QUALITY=85 -co PHOTOMETRIC=YCBCR -co COPY_SRC_OVERVIEWS=YES --config GDAL_TIFF_OVR_BLOCKSIZE 512 src_dataset dst_dataset

> asyncio's run_in_executor does the exact same thing as using a thread pool

That makes sense, and I ultimately expected to not be able to make progress since it's GDAL making the low level requests.

> Usually, reading a tile from S3 takes something like 10-100ms if you do it right.

Moving back to GDAL 2.4 got around these speeds.

> At the moment, GDAL reads are not thread-safe!

That's really great to keep in mind! Means I'll probably shy away from attempting concurrency with GDAL in general.


Re: Asyncio + Rasterio for slow network requests?

Sean Gillies
 

Hi Vincent,

On Mon, Mar 30, 2020 at 7:12 AM <vincent.sarago@...> wrote:
Hi All, 
I'll answer for Kyle but he can jump back if needed. 

The problem Kyle was facing was due to GDAL3 (running on AWS Lambda, CentOS) being extremely slow for image reprojection. 
We faced this in https://github.com/RemotePixel/amazonlinux/issues/16 and thought it was fixed when we updated the sqlite lib (https://github.com/RemotePixel/amazonlinux/pull/17), but while this made things a bit faster, it seems there is still a `huge` difference between gdal2/proj5 and gdal3/proj6.

We still went through some testing with async, but because Kyle uses AWS Lambda and https://github.com/vincentsarago/lambda-proxy, which is not async compatible, we just switched to gdal2 and to threading.

FYI, I've updated another tiling project using async but I need to run benchmarks https://github.com/developmentseed/titiler/blob/master/titiler/api/api_v1/endpoints/tiles.py#L26 

Vincent

Thanks for the update. This situation points out a downside of using the warped VRT: it abstracts everything (network, reprojection, caching) and makes diagnosing problems difficult.

--
Sean Gillies


Re: Asyncio + Rasterio for slow network requests?

vincent.sarago@...
 

Hi All, 
I'll answer for Kyle but he can jump back if needed. 

The problem Kyle was facing was due to GDAL3 (running on AWS Lambda, CentOS) being extremely slow for image reprojection. 
We faced this in https://github.com/RemotePixel/amazonlinux/issues/16 and thought it was fixed when we updated the sqlite lib (https://github.com/RemotePixel/amazonlinux/pull/17), but while this made things a bit faster, it seems there is still a `huge` difference between gdal2/proj5 and gdal3/proj6.

We still went through some testing with async, but because Kyle uses AWS Lambda and https://github.com/vincentsarago/lambda-proxy, which is not async compatible, we just switched to gdal2 and to threading.

FYI, I've updated another tiling project using async but I need to run benchmarks https://github.com/developmentseed/titiler/blob/master/titiler/api/api_v1/endpoints/tiles.py#L26 

Vincent 


Re: Asyncio + Rasterio for slow network requests?

Dion Häfner
 

Hey Kyle,

maybe I can help out here.

- asyncio's run_in_executor does the exact same thing as using a thread pool, it's just a different API. Until both GDAL and rasterio explicitly support this, you cannot use "real" asynchronous (non-blocking) IO.

- I can second Sean's comment that multithreading should speed up tile retrieval, and I suspect that something is off with your code and/or your raster. Usually, reading a tile from S3 takes something like 10-100ms if you do it right.

- At the moment, GDAL reads are not thread-safe! This leads to seemingly random failing tile reads (we struggled a lot with this in Terracotta). Now we use a process pool that we spawn at server start, which seems to work OK both performance and reliability-wise (https://github.com/DHI-GRAS/terracotta/blob/master/terracotta/drivers/raster_base.py).
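
To illustrate the first point, here is a small standard-library sketch: `run_in_executor` hands a blocking function to a thread pool, and the event loop simply awaits the result. `blocking_read` is a hypothetical stand-in for a blocking call such as a dataset read; the sleep only simulates waiting on I/O.

```python
import asyncio
import threading
import time


def blocking_read(tile_id):
    """Stand-in for a blocking read; returns the id and the worker thread name."""
    time.sleep(0.05)
    return tile_id, threading.current_thread().name


async def read_all(tile_ids):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    futures = [loop.run_in_executor(None, blocking_read, t) for t in tile_ids]
    return await asyncio.gather(*futures)


results = asyncio.run(read_all([1, 2, 3]))
for tile_id, thread_name in results:
    print(tile_id, thread_name)
```

The printed thread names are pool workers rather than the main thread, which is why this is threading behind a different API and not non-blocking I/O.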

Best,
Dion

On 24/03/2020 22.12, kylebarron2 via Groups.Io wrote:
I'm trying to improve performance of dynamic satellite imagery tiling, using
[`cogeo-mosaic-tiler`](https://github.com/developmentseed/cogeo-mosaic-tiler)/[`rio-tiler`](https://github.com/cogeotiff/rio-tiler),
which combines source Cloud-Optimized GeoTIFFs into a web mercator tile on the
fly. I'm using AWS Landsat and NAIP imagery stored in S3 buckets, and running
code on AWS Lambda in the same region.
Since NAIP imagery doesn't overlap cleanly with web mercator tiles, at zoom 12 I
have to load on average [6 assets to create one mercator
tile](https://user-images.githubusercontent.com/15164633/77286861-cfc7df00-6c99-11ea-84e9-8ed584b030c0.png).
While profiling the AWS Lambda instance using AWS X-Ray, I found that the
biggest bottleneck was the [base
call](https://github.com/cogeotiff/rio-tiler/blob/6b0d4df0b6aa1454c50312e8d352ed57f0a4e3cb/rio_tiler/utils.py#L449-L455)
to `WarpedVRT.read()`. That call always takes [between 1.7 and 2.0
seconds](https://user-images.githubusercontent.com/15164633/77289999-c5f5aa00-6ca0-11ea-816a-5aaf248a782c.png)
for each tile, regardless of the amount of overlap with the mercator tile.
When testing tile load times on an EC2 t2.nano in the same region, for the first
tile load, CPU time is 120 ms but wall time is 1.1 seconds. That leads me to
believe that the bottleneck is S3 latency.
If the code running on Lambda shares the same 90% proportion spent on latency
for each asset, that would imply that 9 seconds total are spent waiting on
latency.
Using multithreading with a `ThreadPoolExecutor` takes longer than running
single-threaded. Given the situation, it would seem ideal to use `asyncio` for
the COG network requests to improve performance.
Has this been attempted ever with Rasterio? I saw a [Rasterio example of using
async](https://github.com/mapbox/rasterio/blob/master/examples/async-rasterio.py)
to improve performance on a CPU bound function, and plan to try that out, but
I'm pessimistic about that approach directly because I'd think that the `async`
calls would need to be applied on the core fetch calls directly.
Reproduction for tile loading:
```py
import os
from rio_tiler.main import tile
os.environ['CURL_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
os.environ['AWS_REQUEST_PAYER'] ="requester"
address = 's3://naip-visualization/ca/2018/60cm/rgb/34118/m_3411861_ne_11_060_20180723_20190208.tif'
x = 701
y = 1635
z = 12
tilesize = 512
%time data, mask = tile(address, x, y, z, tilesize)
```
```
CPU times: user 119 ms, sys: 20.3 ms, total: 140 ms
Wall time: 1.1 s
```
--

Dion Häfner
PhD Student

Niels Bohr Institute
Physics of Ice, Climate and Earth
University of Copenhagen
Tagensvej 16, DK-2200 Copenhagen, DENMARK

_.~"~._.~"~._.~"~._.~"~._.~"~._.~"~._.~"~._


Re: Asyncio + Rasterio for slow network requests?

Sean Gillies
 

Hi,

First of all, I'm not very familiar with rio-tiler. Hopefully, Vincent will help us out.

On Tue, Mar 24, 2020 at 3:36 PM <kylebarron2@...> wrote:
I'm trying to improve performance of dynamic satellite imagery tiling, using
cogeo-mosaic-tiler/rio-tiler, which combines source Cloud-Optimized GeoTIFFs
into a web mercator tile on the fly. I'm using AWS Landsat and NAIP imagery
stored in S3 buckets, and running code on AWS Lambda in the same region.

Since NAIP imagery doesn't overlap cleanly with web mercator tiles, at zoom 12 I
have to load on average 6 assets to create one mercator tile.
While profiling the AWS Lambda instance using AWS X-Ray, I found that the
biggest bottleneck was the base call to `WarpedVRT.read()`. That call always
takes between 1.7 and 2.0 seconds for each tile, regardless of the amount of
overlap with the mercator tile.

A constant time regardless of the amount of overlap suggests to me that your source files may lack the proper tiling. If the sources are tiled, the number of bytes transferred (and time) would scale roughly with the amount of overlap.

Can you verify that your sources have overviews? If you're accessing 6 sources to fill a web mercator tile, overviews will help dramatically.
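
A minimal sketch of how one might check both things from Python. `tiling_report` is a hypothetical helper; it relies only on the dataset's `profile`, `block_shapes`, `indexes` and `overviews()` members.

```python
def tiling_report(src):
    """Summarize tiling and overview information for an open dataset."""
    return {
        # The GeoTIFF profile carries a 'tiled' flag.
        "tiled": bool(src.profile.get("tiled", False)),
        # One (rows, cols) block shape per band.
        "block_shapes": list(src.block_shapes),
        # Overview decimation factors per band; an empty list means none.
        "overviews": {i: src.overviews(i) for i in src.indexes},
    }


# Usage, assuming rasterio is installed and `path` points at a source file:
#
#     import rasterio
#     with rasterio.open(path) as src:
#         print(tiling_report(src))
```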
 
 
When testing tile load times on an EC2 t2.nano in the same region, for the first
tile load, CPU time is 120 ms but wall time is 1.1 seconds. That leads me to
believe that the bottleneck is S3 latency.
 
If the code running on Lambda shares the same 90% proportion spent on latency
for each asset, that would imply that 9 seconds total are spent waiting on
latency.
 
Using multithreading with a `ThreadPoolExecutor` takes longer than running
single-threaded. Given the situation, it would seem ideal to use `asyncio` for
the COG network requests to improve performance.

I wonder if Vincent can tell us from his experience if there is a risk of overwhelming GDAL's raster block cache on Lambda when making many concurrent reads? I've seen programs appear to hang when the cache is too small.
 
 
Has this been attempted ever with Rasterio? I saw a Rasterio example of using
async to improve performance on a CPU bound function, and plan to try that out, but
I'm pessimistic about that approach directly because I'd think that the `async`
calls would need to be applied on the core fetch calls directly.

That asyncio example is dated and could be hard to generalize to your problem. I'd love to see a good working example.

You're right that there's only so much we can do in Python about maximizing this concurrency. At some level, it's code in GDAL that is making the HTTP requests for parts of the COGs and using a strategy that we can't entirely control from Python.

--
Sean Gillies


Re: Silencing PROJ errors/warnings

gberardinelli@...
 

The warnings are bare when using rasterio (neither of those two prefixes).

Packages were all installed via conda/conda-forge on Windows 10:

gdal version 3.0.4
rasterio version 1.1.3
proj version 6.3.1

Also my original post had a typo, I had tried rasterio.Env(CPL_LOG=os.devnull) (not "CPL_DEBUG").  Still a shot in the dark, perhaps.

Greg