Re: Asyncio + Rasterio for slow network requests?


Sean Gillies
 

Hi Kyle, Dion:

On Mon, Mar 30, 2020 at 1:41 PM <kylebarron2@...> wrote:
Sorry for the slow response. As Vincent noted, just moving back to GDAL 2.4 made the process ~8x faster, from ~1.7s to ~200ms to read each source tile.

> A constant time regardless of the amount of overlap suggests to me that your source files may lack the proper tiling.

According to the AWS NAIP docs, the COG sources were created with

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co tiled=yes -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=DEFLATE -co PREDICTOR=2 src_dataset dst_dataset

gdaladdo -r average -ro src_dataset 2 4 8 16 32 64

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co TILED=YES -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=JPEG -co JPEG_QUALITY=85 -co PHOTOMETRIC=YCBCR -co COPY_SRC_OVERVIEWS=YES --config GDAL_TIFF_OVR_BLOCKSIZE 512 src_dataset dst_dataset

Thank you for the details.


> asyncio's run_in_executor does the exact same thing as using a thread pool

That makes sense, and I ultimately expected to not be able to make progress since it's GDAL making the low level requests.
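
To illustrate the point about run_in_executor, here is a minimal sketch using only the standard library. `blocking_read` is a hypothetical stand-in for a blocking GDAL/rasterio read (it just sleeps); the function names and worker counts are assumptions made for the example. Both paths end up running the same blocking call in worker threads; asyncio only wraps the result in an awaitable.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_read(tile_id):
    """Hypothetical stand-in for a blocking GDAL/rasterio read."""
    time.sleep(0.05)  # simulate network latency
    return tile_id, threading.current_thread().name


def read_with_pool(tile_ids, max_workers=4):
    # Plain thread pool: each call runs in a worker thread.
    with ThreadPoolExecutor(max_workers=max_workers,
                            thread_name_prefix="pool") as pool:
        return list(pool.map(blocking_read, tile_ids))


async def read_with_asyncio(tile_ids, max_workers=4):
    # run_in_executor submits the same blocking call to a thread pool
    # and hands back a future that the event loop can await.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers,
                            thread_name_prefix="aio") as pool:
        futures = [loop.run_in_executor(pool, blocking_read, t)
                   for t in tile_ids]
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    ids = list(range(8))
    print(read_with_pool(ids))
    print(asyncio.run(read_with_asyncio(ids)))
```

The thread names in the output show that the work happens in pool threads in both cases, so any speedup comes from overlapping blocking I/O, not from asyncio itself.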

> Usually, reading a tile from S3 takes something like 10-100ms if you do it right.

Moving back to GDAL 2.4 got around these speeds.

> At the moment, GDAL reads are not thread-safe!

That's really great to keep in mind! Means I'll probably shy away from attempting concurrency with GDAL in general.

Dion, can you say a little more about reads not being thread-safe?

It's intended that we can call GDAL's RasterIO functions in different threads concurrently as long as we don't share dataset handles between threads. If we observe otherwise, then there is a GDAL bug that we can fix.
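
A minimal sketch of the "no shared dataset handles" pattern, under stated assumptions: each task opens its own dataset inside the worker thread and closes it before returning, so a handle never crosses a thread boundary. `read_tile`, `read_many`, and the `opener` parameter are hypothetical names introduced for this example; the default opener uses `rasterio.open` and reads band 1.

```python
from concurrent.futures import ThreadPoolExecutor


def open_with_rasterio(path):
    # Imported lazily so the sketch can be exercised without rasterio.
    import rasterio
    return rasterio.open(path)


def read_tile(path, opener=open_with_rasterio):
    # Open inside the worker so the handle is created, used, and closed
    # by one thread only.
    with opener(path) as src:
        return src.read(1)


def read_many(paths, opener=open_with_rasterio, max_workers=4):
    # Only paths (plain strings) are passed between threads,
    # never open dataset objects.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: read_tile(p, opener), paths))
```

Reopening a dataset per task costs some extra requests; a per-thread cache of handles (for example via `threading.local`) would be a possible refinement, as long as each handle stays with the thread that opened it.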

There is an additional consideration for VRTs explained in https://gdal.org/drivers/raster/vrt.html#multi-threading-issues. If we have multiple VRTs, used in different threads, pointing to the same URLs, we need to take an extra step to prevent GDAL from accidentally sharing those non-VRT dataset handles between the threads.
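
One reading of that extra step, based on the linked GDAL page, is to mark the sources as non-shared, either with a shared="0" attribute on each SourceFilename element or by setting the VRT_SHARED_SOURCE configuration option to 0. The sketch below applies the first approach to a VRT held as an XML string; `unshare_vrt_sources` is a hypothetical helper, and it is worth checking the GDAL documentation for your version before relying on it.

```python
import xml.etree.ElementTree as ET


def unshare_vrt_sources(vrt_xml):
    """Return the VRT XML with shared="0" on every SourceFilename element."""
    root = ET.fromstring(vrt_xml)
    for elem in root.iter("SourceFilename"):
        elem.set("shared", "0")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical, heavily trimmed VRT used only to exercise the helper.
    vrt = (
        '<VRTDataset rasterXSize="512" rasterYSize="512">'
        '<VRTRasterBand dataType="Byte" band="1">'
        '<SimpleSource>'
        '<SourceFilename relativeToVRT="0">/vsicurl/https://example.com/a.tif'
        '</SourceFilename>'
        '<SourceBand>1</SourceBand>'
        '</SimpleSource>'
        '</VRTRasterBand>'
        '</VRTDataset>'
    )
    print(unshare_vrt_sources(vrt))
```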

--
Sean Gillies
