Request for comment: rasterio wheels on PyPI and GDAL 3.1


Sean Gillies
 

Hi all,

The rasterio 1.1.5 wheels on PyPI include GDAL 2.4.4 and a patched PROJ 4.9.3. Neither of these versions is supported anymore. Recently I have heard from users and contributors who would like to see GDAL 3.1.x in the PyPI wheels soon, to benefit from new features, and also I have heard from those who would rather stick with 2.4.4 for a while yet due to increased latency noticed with the new dependency on proj.db in PROJ versions 6 and greater. I'm sympathetic to both arguments. Rasterio depends on some new GDAL bug fixes to really shine and be its best. On the other hand, I really don't know how best to distribute GDAL 3.1 and PROJ 7.1, particularly when it comes to including PROJ data in the wheels (as we have) or relying on the PROJ CDN.

I'd love to hear comments on whether rasterio wheels should switch over to GDAL 3.1 and PROJ 7.1, especially those based on experience with the latest PROJ in production in distributed and "serverless" systems.

Thanks!

--
Sean Gillies


Alan Snow
 

My first thought is that if you do make the switch, it would make sense to do it in the 1.2 release.

> I have heard from those who would rather stick with 2.4.4 for a while yet due to increased latency noticed with the new dependency on proj.db in PROJ versions 6 and greater.

There will always be the --no-binary option and they can install older versions of GDAL/PROJ. It is what we do at work.
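
For anyone who wants to try the source-build route, here is a minimal sketch of the pip invocation driven from Python. It assumes a system GDAL (with its `gdal-config` program) is already installed and discoverable; the helper name `source_install_command` is hypothetical, and only `--no-binary` is pip's own option.

```python
import subprocess
import sys


def source_install_command(package="rasterio"):
    """Build a pip command that skips the PyPI wheel for `package`.

    pip then builds from the sdist against whatever GDAL/PROJ is
    installed on the machine instead of the copies bundled in the wheel.
    """
    return [sys.executable, "-m", "pip", "install", "--no-binary", package, package]


if __name__ == "__main__":
    cmd = source_install_command()
    print(" ".join(cmd))
    # Uncomment to actually run the install (needs a system GDAL).
    # subprocess.run(cmd, check=True)
```

The package name appears twice on purpose: once as the argument to `--no-binary` and once as the thing to install.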

> I really don't know how best to distribute GDAL 3.1 and PROJ 7.1, particularly when it comes to including PROJ data in the wheels (as we have) or relying on the PROJ CDN.

Feel free to use whatever you need in the transition from here: https://github.com/pyproj4/pyproj-wheels.

pyproj 3 has decided to not include any datum/transformation grids in the wheels and has a page for users to reference for how to download grids themselves: https://pyproj4.github.io/pyproj/latest/transformation_grids.html. This allows for a smaller wheel to download and allows users to only download the grids they need (useful for serverless applications).
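
To make the two options concrete (local grids versus the CDN), here is a rough sketch of configuring PROJ through environment variables before the library is loaded. `PROJ_NETWORK` and `PROJ_LIB` are PROJ's own variables as of the PROJ 7 series; the helper name `proj_environment` and the example directory are assumptions for illustration only.

```python
import os


def proj_environment(grid_dir=None, use_network=False, base=None):
    """Return a copy of the environment with PROJ settings applied.

    grid_dir:    directory PROJ should search for its data (PROJ_LIB).
                 It must contain proj.db as well as any downloaded
                 grids. The path used below is hypothetical.
    use_network: when True, let PROJ 7+ fetch grids from its CDN on demand.
    base:        mapping to start from; defaults to os.environ.
    """
    env = dict(os.environ if base is None else base)
    if grid_dir is not None:
        env["PROJ_LIB"] = str(grid_dir)
    env["PROJ_NETWORK"] = "ON" if use_network else "OFF"
    return env


if __name__ == "__main__":
    # Apply the settings before importing pyproj or rasterio.
    os.environ.update(proj_environment(use_network=True))
    print(os.environ["PROJ_NETWORK"])
```

In a serverless function the same settings would more likely live in the function's configuration than in code; the point is only that they need to be in place before PROJ initializes.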

I would also recommend waiting until PROJ 7.2 for wheels due to the CA Bundle path support needed for wheels (https://github.com/pyproj4/pyproj/issues/689).

On another note, I was unable to get manylinux1 wheels to build with the curl setup required, so manylinux2010 will be the new minimum. There is probably a winning combination, but I was unable to find it.


Howard Butler
 



On Sep 13, 2020, at 4:21 PM, Sean Gillies <sean.gillies@...> wrote:

> Hi all,
>
> The rasterio 1.1.5 wheels on PyPI include GDAL 2.4.4 and a patched PROJ 4.9.3. Neither of these versions is supported anymore. Recently I have heard from users and contributors who would like to see GDAL 3.1.x in the PyPI wheels soon, to benefit from new features, and also I have heard from those who would rather stick with 2.4.4 for a while yet due to increased latency noticed with the new dependency on proj.db in PROJ versions 6 and greater. I'm sympathetic to both arguments. Rasterio depends on some new GDAL bug fixes to really shine and be its best. On the other hand, I really don't know how best to distribute GDAL 3.1 and PROJ 7.1, particularly when it comes to including PROJ data in the wheels (as we have) or relying on the PROJ CDN.
>
> I'd love to hear comments on whether rasterio wheels should switch over to GDAL 3.1 and PROJ 7.1, especially those based on experience with the latest PROJ in production in distributed and "serverless" systems.

As an instigator for the CDN and frequent grid shifter due to our use in the geodetic and lidar domain, I've been very happy with PROJ's CDN approach in Lambda scenarios. You can find my Dockerfile at https://github.com/PDAL/lambda and my latest public layer with GDAL 3.1.3 and PROJ 7.1.1 is at arn:aws:lambda:us-east-1:163178234892:layer:pdal:35

As a library packager, I agree the PROJ data is a burden. Python's wheel approach makes that burden even more pronounced. I like Alan's pyproj approach of pointing the user at how to fetch the data if they need it, or flip on the network bit if they're ok with that. It seems like a good compromise and a packaging approach that is not that much different from the pre-6.0 days when most people ignored the grids entirely.

The latency complaint? Are you doing 100s of lookups per second (MapServer)? There may be more performance to get there, but someone is going to have to spend *a lot* of time to get it. I think the proj.db situation (one actual database across all the projects) is gobs better than before (at least three CSV "databases" with different-but-same information in them and indeterminate code paths for calculation of a definition).
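
One way to put a number on "lookups per second" before deciding whether the latency matters is a small timing harness like the sketch below. The function name `lookups_per_second` is hypothetical, and the dictionary lookup is only a stand-in so the sketch runs without GDAL or PROJ installed; in practice you would pass the real call you care about, for example a CRS constructor.

```python
import time


def lookups_per_second(lookup, codes, repeat=5):
    """Time `lookup(code)` over `codes` and return the best observed rate."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for code in codes:
            lookup(code)
        best = min(best, time.perf_counter() - start)
    return len(codes) / best if best > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in lookup table, not a real CRS database.
    fake_db = {4326: "WGS 84", 3857: "WGS 84 / Pseudo-Mercator"}
    rate = lookups_per_second(fake_db.get, [4326, 3857] * 1000)
    print(f"{rate:,.0f} lookups/s")
```

Taking the best of several runs reduces noise from first-call costs such as opening the database, so it is worth timing the first call separately if cold-start latency is the concern.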

Howard