Date   
Re: [rasterio] multi-dimensional support

Norman Barker
 

Hi Sean,

One of the other reasons I would like to build on top of rasterio for multi-dimensional access is the support for geospatial metadata and coordinate reference systems, that support would need to be built again for any new project.

It is is possible to build GDAL without all of the legacy formats to reduce the network size, though I agree it can still be quite large.

I was at the Pangeo summer meeting last week which is where I heard the most about the use of GDAL/Rasterio for multi-dimensional data, I will let them add their opinion. I would like to build on top of rasterio, though I do take note of your views.

Norman 

On Mon, Aug 26, 2019 at 6:33 PM Sean Gillies <sean.gillies@...> wrote:
Hi Norman,

On Fri, Aug 23, 2019 at 11:30 AM Norman Barker <norman.barker@...> wrote:
I was one of the stakeholders for subdataset support in GDAL with netCDF and it worked well with what we were trying to achieve back then, serving regularly gridded time series netcdf data through a WCS, I believe others have used subdataset support in the same way. It was possible to make this work by using external indexes and subdatasets. 

I also agree with your comment that Rasterio is a relatively small project and the code needs to have active users.

The main benefit is a common api for multi-dimensional data access within gdal. Currently using gdalinfo against hdf, netcdf or TileDB requires reading the output to understand the available data, or writing a parser for each of these format driver's metadata. These drivers have no common way to advertise through an API the dimensions and attributes they support. Because implementing subdataset support has been a little adhoc the access patterns are slightly different across drivers, the new api enforces a convention.

Killer features? A couple come to mind; Accessing data cubes with a common api to retrieve data along a z dimension, or sliced by time. These use cases would benefit from being supported in rasterio and using xarray/dask to process multi-dimensional data.

But I can import xarray now to get all of these features from a netCDF file (or other formats supported by xarray). As a Python developer, I don't need these features built into rasterio, because Python is extendable. Furthermore, if I were building applications around very very large multi-dimensional datasets, I'm not sure I would need or want the legacy GDAL formats at all. It would be a waste of time to compile those drivers and a waste of bandwidth to copying them around my network. I'm skeptical I would want, for example, to be able to work with ArcInfo ASCII grids (or 90% of GDAL raster formats) and Parquet (for example) files in the same program using only Rasterio and GDAL. The GDAL programs like gdalinfo aren't extendable and so everything has to be built in, but this is not the case for a Python programmer.
 
I will create a strawman for the API changes and if you and the community are interested then I can start on the code.

I would be even more curious to see what you might come up with for an API if you weren't restricted to maintaining compatibility with Rasterio's existing API.

Apologies in advance if my critique is too harsh. I really do think that the GDAL legacy formats and the new multi-dimensional array abstraction are different things and would be better off in separate projects/packages. I'm also pretty comfortable with the idea of Rasterio remaining on the legacy side of things, avoiding the cutting edge, and still providing a decent amount of value to the Python community while making space for another project to step in and grow in the new space.

--
Sean Gillies

Re: [rasterio] multi-dimensional support

scott
 

Hi Norman and Sean,

Thanks for discussion this! Just wanted to chime in based on Norman's suggestion. I'm currently involved in a NASA-funded project to facilitate analysis of Cloud-based data archives, and as part of the Pangeo project we are really pushing for contributions to established Python packages. We've been using rasterio extensively to load single image files and multiband VRTs. xarray.open_rasterio() has been a great example of Python tools working very well together.

Currently it is a bit awkward to work with multidimensional data with subdatasets in gdal/rasterio. Alternatively, there are format-specific libraries and readers out there (xarray.open_dataset(), h5py, satpy), but I agree there would be a lot of value in a standard access pattern through rasterio, which other libraries could then inherit for I/O tasks. Xarray for example does not currently account for crs, which has been the topic of a lot of discussion (https://github.com/pydata/xarray/issues/2288 ), and writing is currently limited to netCDF and Zarr. I think the current state of things illustrates that the extendibility of Python is both an advantage and disadvantage for users, because people (especially newcomers) are confused by which packages to use and often end up with hodgepodge solutions.

One argument for incorporating the multidimensional API is that there is a tremendous amount out of netCDF and HDF data out there (in fact all NASA data is archived in these formats) and people are interested in translating to more Cloud-friendly formats (see for example https://github.com/pangeo-data/pangeo/issues/686 , https://github.com/pangeo-data/pangeo/issues/120 ). So, one timely use-case of multidimensional support would be transposing existing archives of HDF files for time series analysis and storing as tileDB on S3 or GCS. Or, as Norman mentioned, build multidimensional VRTs of COGs to sample in Time in addition to X and Y. Another common use-case is 1) open a large multidimensional netCDF file, 2) run some dimensionality-reducing analysis with your favorite Python library, 3) save the resulting 2D Geotiff.

My go-to place for any raster format conversion is gdal, and if python code is involved I first turn to rasterio. So if there is motivation to try to bring this new gdal feature into rasterio I'm very interested to see where it leads!

--Scott

Re: [rasterio] multi-dimensional support

Alan Snow
 

> Another common use-case is 1) open a large multidimensional netCDF file, 2) run some dimensionality-reducing analysis with your favorite Python library, 3) save the resulting 2D Geotiff.

That is definitely true. I do that quite a bit. Here is one example: https://corteva.github.io/rioxarray/html/examples/clip_geom.html

I have also seen others have a similar need: https://gis.stackexchange.com/a/329141/144357

Re: [rasterio] multi-dimensional support

Sean Gillies
 

Hi Scott,

Thank you for the feedback. It's really helpful to hear from someone in the Pangeo project.

On Wed, Aug 28, 2019 at 9:37 PM scott <scottyh@...> wrote:
Hi Norman and Sean,

Thanks for discussion this! Just wanted to chime in based on Norman's suggestion. I'm currently involved in a NASA-funded project to facilitate analysis of Cloud-based data archives, and as part of the Pangeo project we are really pushing for contributions to established Python packages. We've been using rasterio extensively to load single image files and multiband VRTs. xarray.open_rasterio() has been a great example of Python tools working very well together.

Currently it is a bit awkward to work with multidimensional data with subdatasets in gdal/rasterio. Alternatively, there are format-specific libraries and readers out there (xarray.open_dataset(), h5py, satpy), but I agree there would be a lot of value in a standard access pattern through rasterio, which other libraries could then inherit for I/O tasks. Xarray for example does not currently account for crs, which has been the topic of a lot of discussion (https://github.com/pydata/xarray/issues/2288 ), and writing is currently limited to netCDF and Zarr. I think the current state of things illustrates that the extendibility of Python is both an advantage and disadvantage for users, because people (especially newcomers) are confused by which packages to use and often end up with hodgepodge solutions.

I cannot deny that Python software, GIS software in particular, can be a hodgepodge. Some of the reasons are beyond our control: Google has paved the early cow paths and so people will use the gdal package and then mix it with rasterio, probably forever. In some cases they must, because the rasterio project lacks a feature they need.

I think the value in having rasterio expand to handle the geophysical data I/O for other libraries is not so clear. I also believe that it is not necessarily distributed equally. Certainly the opportunity costs will be unequally distributed: the rasterio project will see an increase in bugs, support requests, need for documentation, new for new CI infrastructure, while the libraries that depend on rasterio can move on to other things. Thanks to help from its users, rasterio is currently sustainable, but only just barely on average. I think this is largely due to good fortune, and I'm very concerned about upsetting the balance.

There's also the issue that I won't get any more time at work to manage or develop a new rasterio API and will have to give up some of my own influence over the library's design. There are pros and cons to this. I do think that rasterio is in some ways a refreshing alternative to libraries that strictly implement industry standards.


One argument for incorporating the multidimensional API is that there is a tremendous amount out of netCDF and HDF data out there (in fact all NASA data is archived in these formats) and people are interested in translating to more Cloud-friendly formats (see for example https://github.com/pangeo-data/pangeo/issues/686 , https://github.com/pangeo-data/pangeo/issues/120 ). So, one timely use-case of multidimensional support would be transposing existing archives of HDF files for time series analysis and storing as tileDB on S3 or GCS. Or, as Norman mentioned, build multidimensional VRTs of COGs to sample in Time in addition to X and Y. Another common use-case is 1) open a large multidimensional netCDF file, 2) run some dimensionality-reducing analysis with your favorite Python library, 3) save the resulting 2D Geotiff.

I'm clearly going to have to learn more about multidimensional VRTs and their specification. I'm not able to evaluate this use case.

I still hold that the common use case you mention can be solved with existing packages, including Alan Snow's rioxarray. One can also GDAL's Python bindings.


My go-to place for any raster format conversion is gdal, and if python code is involved I first turn to rasterio. So if there is motivation to try to bring this new gdal feature into rasterio I'm very interested to see where it leads!

Thanks again for speaking up!

I don't think a multidimensional data abstraction library is a bad idea, I'm only hesitant to risk rasterio's sustainability and, it must be said, my own creative outlet and job satisfaction.

--
Sean Gillies

Re: Use of GCPS in Rasterio

vincent.sarago@...
 

We added the `from_gcps` function to create a transformation matrix from gcps (see: https://github.com/mapbox/rasterio/pull/1773
This is a good first step but now when we open a file we still need to do something like 

```
with rasterio.open(src_path) as src_dst: 
    with WarpedVRT(src_dst, src_crs=src_dst.gcps[1], src_transform=from_gcps(src_dst.gcps[0])) as vrt_dst: 
        ...
```

Question: I wonder if it will be possible to add an option (or env variable) for the `open` method to force the use of crs and transform created from gcps. 

Re: Use of GCPS in Rasterio

Sean Gillies
 

Hi Vincent,

I think we can do this without a flag or environment variable if we agree on a convention. How about: when creating the GDAL transformer in the WarpedVRT class, we use the source dataset's transform *or* GCPs transform, unless the user provides the value?

I'll do this right now if you agree.


On Tue, Sep 10, 2019 at 7:29 AM <vincent.sarago@...> wrote:
We added the `from_gcps` function to create a transformation matrix from gcps (see: https://github.com/mapbox/rasterio/pull/1773
This is a good first step but now when we open a file we still need to do something like 

```
with rasterio.open(src_path) as src_dst: 
    with WarpedVRT(src_dst, src_crs=src_dst.gcps[1], src_transform=from_gcps(src_dst.gcps[0])) as vrt_dst: 
        ...
```

Question: I wonder if it will be possible to add an option (or env variable) for the `open` method to force the use of crs and transform created from gcps.

--
Sean Gillies

Re: Use of GCPS in Rasterio

vincent.sarago@...
 

Sean, my main worry is that we need to use `WarpedVRT` so it adds one more layer, and thus not optimal. But it will be better than nothing :-) 

For files with gcps, not only the transform but also the CRS won't be set by default so basically, the WarpedVRT reader it should do something like: 

```
gcps, gcps_crs = src.gcps
in_transform = from_gcps(gcps) if gcps else src.transform
in_crs = gcps_crs if gcps else src.crs

# in https://github.com/mapbox/rasterio/blob/master/rasterio/_warp.pyx#L738
self.src_transform = src_transform if src_transform else in_transform

# in https://github.com/mapbox/rasterio/blob/master/rasterio/_warp.pyx#L736
self.src_crs = src_crs if src_crs else in_crs
```

Re: Use of GCPS in Rasterio

Sean Gillies
 

Vincent.

After a little more thought, I'm thinking that working around GDAL with respect to geotransform and GCPs is not a good idea. GDALGetGeoTransform() does not use GCPs if they exist and the second to last paragraph in https://gdal.org/user/raster_data_model.html#gcps cautions that the meaning of the GCPs is up to the application. And by application, it means the application of the data defined by the provider, not our software project as an application. It could be the case that GCPs in one file are meant to be used differently than GCPs in another file, and if we made Rasterio overly clever we might get in the way of programmers or cause problems with the wrong default behavior.

On Tue, Sep 10, 2019 at 10:52 AM <vincent.sarago@...> wrote:
Sean, my main worry is that we need to use `WarpedVRT` so it adds one more layer, and thus not optimal. But it will be better than nothing :-) 

For files with gcps, not only the transform but also the CRS won't be set by default so basically, the WarpedVRT reader it should do something like: 

```
gcps, gcps_crs = src.gcps
in_transform = from_gcps(gcps) if gcps else src.transform
in_crs = gcps_crs if gcps else src.crs

# in https://github.com/mapbox/rasterio/blob/master/rasterio/_warp.pyx#L738
self.src_transform = src_transform if src_transform else in_transform

# in https://github.com/mapbox/rasterio/blob/master/rasterio/_warp.pyx#L736
self.src_crs = src_crs if src_crs else in_crs

--
Sean Gillies

Handling non-AWS S3 services

Guillaume Lostis
 

Hi all,

I've been working with a non-AWS S3 file storage lately, so I had to tackle the question of how to use rasterio with it. The service (https://www.openio.io/) is S3-compatible, so it works with AWS libraries such as boto3 or awscli.

I've managed to make it work with rasterio, but in a manner that doesn't really satisfy me. I'm writing this message to ask if you would agree to make some changes to AWSSession in order to better handle non-AWS S3 providers.

Here is some context on the problem: since the endpoint_url of the service is different from GDAL's default (s3.amazonaws.com), I currently need to write code along the lines of:

import rasterio

with rasterio.Env(profile_name="openio", AWS_S3_ENDPOINT="my_endpoint.com"):
    with rasterio.open("s3://bucket/file.tiff") as src:
        print(src.shape)

The code works, but I don't like the fact that I have to use a mix of rasterio-esque arguments to use an AWSSession and some GDAL-esque arguments to patch a missing argument in the AWSSession.

The nice thing about AWSSession is that it uses a boto3.Session, which in turn reads my ~/.aws/config and ~/.aws/credentials files in which I've saved my OpenIO credentials and region name under a profile named openio (this way I can easily switch between AWS and OpenIO buckets).

The not-so-nice thing is that boto3.Session objects do not handle the specification of a custom endpoint_url. This is intentional and is done because a Session is made to talk to different services (EC2, S3, ...), which have different URLs (more info in the first few comments of this issue). A boto3.Session.client, however, accepts a custom endpoint_url. For example, to have boto3 work with OpenIO, I do the following:

import boto3

session = boto3.Session(profile_name="oio")
client = session.client("s3", endpoint_url="https://my_endpoint.com")
# use the client to retrieve files, etc.

From what I've understood by reading the code, AWSSession uses a boto3.Session only to handle the credentials retrieval part, and then stores them in a _creds attribute. After that, the boto3.Session is not used for anything else. Since a boto3.Session cannot handle the retrieval of a custom endpoint_url, would it be acceptable to add an endpoint_url argument to the AWSSession? I have tested this patch and it does what I want, because I can run the following code (which, IMO, is nicer than the first one):

import rasterio

session = rasterio.session.AWSSession(profile_name="openio", endpoint_url="my_endpoint.com")
with rasterio.Env(session=session):
    with rasterio.open("s3://bucket/file.tiff") as src:
        print(src.shape)

What do you think of this? Is it in the scope of rasterio's AWSSession, or not?

Thanks,

Guillaume Lostis

Re: Handling non-AWS S3 services

Sean Gillies
 

Hi Guillaume,

On Wed, Sep 11, 2019 at 8:25 AM Guillaume Lostis <g.lostis@...> wrote:

Hi all,

I've been working with a non-AWS S3 file storage lately, so I had to tackle the question of how to use rasterio with it. The service (https://www.openio.io/) is S3-compatible, so it works with AWS libraries such as boto3 or awscli.

I've managed to make it work with rasterio, but in a manner that doesn't really satisfy me. I'm writing this message to ask if you would agree to make some changes to AWSSession in order to better handle non-AWS S3 providers.

Here is some context on the problem: since the endpoint_url of the service is different from GDAL's default (s3.amazonaws.com), I currently need to write code along the lines of:

import rasterio

with rasterio.Env(profile_name="openio", AWS_S3_ENDPOINT="my_endpoint.com"):
    with rasterio.open("s3://bucket/file.tiff") as src:
        print(src.shape)

The code works, but I don't like the fact that I have to use a mix of rasterio-esque arguments to use an AWSSession and some GDAL-esque arguments to patch a missing argument in the AWSSession.

The nice thing about AWSSession is that it uses a boto3.Session, which in turn reads my ~/.aws/config and ~/.aws/credentials files in which I've saved my OpenIO credentials and region name under a profile named openio (this way I can easily switch between AWS and OpenIO buckets).

The not-so-nice thing is that boto3.Session objects do not handle the specification of a custom endpoint_url. This is intentional and is done because a Session is made to talk to different services (EC2, S3, ...), which have different URLs (more info in the first few comments of this issue). A boto3.Session.client, however, accepts a custom endpoint_url. For example, to have boto3 work with OpenIO, I do the following:

import boto3

session = boto3.Session(profile_name="oio")
client = session.client("s3", endpoint_url="https://my_endpoint.com")
# use the client to retrieve files, etc.

From what I've understood by reading the code, AWSSession uses a boto3.Session only to handle the credentials retrieval part, and then stores them in a _creds attribute. After that, the boto3.Session is not used for anything else. Since a boto3.Session cannot handle the retrieval of a custom endpoint_url, would it be acceptable to add an endpoint_url argument to the AWSSession? I have tested this patch and it does what I want, because I can run the following code (which, IMO, is nicer than the first one):

import rasterio

session = rasterio.session.AWSSession(profile_name="openio", endpoint_url="my_endpoint.com")
with rasterio.Env(session=session):
    with rasterio.open("s3://bucket/file.tiff") as src:
        print(src.shape)

What do you think of this? Is it in the scope of rasterio's AWSSession, or not?

Thanks,

Guillaume Lostis

Yes, I think this is well within rasterio's scope. Let's do it. Can you submit a PR?

--
Sean Gillies

Re: Handling non-AWS S3 services

Guillaume Lostis
 

Thanks!

Yes, I will open a PR tomorrow, once I add a test for this argument.

Guillaume Lostis


On Wed, 11 Sep 2019 at 18:47, Sean Gillies <sean.gillies@...> wrote:
Hi Guillaume,

On Wed, Sep 11, 2019 at 8:25 AM Guillaume Lostis <g.lostis@...> wrote:

Hi all,

I've been working with a non-AWS S3 file storage lately, so I had to tackle the question of how to use rasterio with it. The service (https://www.openio.io/) is S3-compatible, so it works with AWS libraries such as boto3 or awscli.

I've managed to make it work with rasterio, but in a manner that doesn't really satisfy me. I'm writing this message to ask if you would agree to make some changes to AWSSession in order to better handle non-AWS S3 providers.

Here is some context on the problem: since the endpoint_url of the service is different from GDAL's default (s3.amazonaws.com), I currently need to write code along the lines of:

import rasterio

with rasterio.Env(profile_name="openio", AWS_S3_ENDPOINT="my_endpoint.com"):
    with rasterio.open("s3://bucket/file.tiff") as src:
        print(src.shape)

The code works, but I don't like the fact that I have to use a mix of rasterio-esque arguments to use an AWSSession and some GDAL-esque arguments to patch a missing argument in the AWSSession.

The nice thing about AWSSession is that it uses a boto3.Session, which in turn reads my ~/.aws/config and ~/.aws/credentials files in which I've saved my OpenIO credentials and region name under a profile named openio (this way I can easily switch between AWS and OpenIO buckets).

The not-so-nice thing is that boto3.Session objects do not handle the specification of a custom endpoint_url. This is intentional and is done because a Session is made to talk to different services (EC2, S3, ...), which have different URLs (more info in the first few comments of this issue). A boto3.Session.client, however, accepts a custom endpoint_url. For example, to have boto3 work with OpenIO, I do the following:

import boto3

session = boto3.Session(profile_name="oio")
client = session.client("s3", endpoint_url="https://my_endpoint.com")
# use the client to retrieve files, etc.

From what I've understood by reading the code, AWSSession uses a boto3.Session only to handle the credentials retrieval part, and then stores them in a _creds attribute. After that, the boto3.Session is not used for anything else. Since a boto3.Session cannot handle the retrieval of a custom endpoint_url, would it be acceptable to add an endpoint_url argument to the AWSSession? I have tested this patch and it does what I want, because I can run the following code (which, IMO, is nicer than the first one):

import rasterio

session = rasterio.session.AWSSession(profile_name="openio", endpoint_url="my_endpoint.com")
with rasterio.Env(session=session):
    with rasterio.open("s3://bucket/file.tiff") as src:
        print(src.shape)

What do you think of this? Is it in the scope of rasterio's AWSSession, or not?

Thanks,

Guillaume Lostis

Yes, I think this is well within rasterio's scope. Let's do it. Can you submit a PR?

--
Sean Gillies

Re: Handling non-AWS S3 services

Guillaume Lostis
 

PR opened here, for reference: https://github.com/mapbox/rasterio/pull/1779

Guillaume Lostis

Rasterio 1.1 status

Sean Gillies
 

Hi all,

Work towards Rasterio 1.1.0 is moving forward. I'd like to have a 1.1.0 release before October 1. See the change log at https://github.com/mapbox/rasterio/blob/master/CHANGES.txt for a summary of what will be new in 1.1.0.

Here is the list of open issues: https://github.com/mapbox/rasterio/milestone/70. We have a PR for #1740 but not yet for #1504 or #1511. I think we will move those to the next milestone if we don't get a contributed PR next week. Does that sound fair?

--
Sean Gillies

Rasterio maint-1.1 branch

Sean Gillies
 

I have created a new branch for maintenance of version 1.1. We will keep the maint-1.0 branch around, but I don't foresee adding any new commits to it.

--
Sean Gillies

Numpy Python Support Cadence

Alan Snow
 

I recently noticed this issue where xarray is dropping python 3.5 support due to numpy's recommended cadence in python support:

https://github.com/pydata/xarray/issues/3293

numpy support schedule:

https://numpy.org/neps/nep-0029-deprecation_policy.html

I am curious how this will impact rasterio.

rasterio cookbook

James McBride
 

I noticed that the cookbook (https://github.com/mapbox/rasterio-cookbook) has been archived. It doesn't look like those examples are included anywhere in the current docs. Is there any interest in incorporating those back in the rasterio docs? And, incorporating more usage examples of the same style as are in the cookbook?

Re: rasterio cookbook

Sean Gillies
 

Hi James,

The cookbook was getting out of date too quickly. I think it was a nice idea, but we didn't figure out how to get the original owners to continue to own the material and keep it up to date. Until we solve that, I would be against reincorporating the cookbook material.



On Thu, Oct 17, 2019 at 8:20 PM James McBride <jdmcbr@...> wrote:
I noticed that the cookbook (https://github.com/mapbox/rasterio-cookbook) has been archived. It doesn't look like those examples are included anywhere in the current docs. Is there any interest in incorporating those back in the rasterio docs? And, incorporating more usage examples of the same style as are in the cookbook?



--
Sean Gillies

Re: Numpy Python Support Cadence

Sean Gillies
 

Alan,

Thanks for bringing this up. I could get behind this. Rasterio uses semantic versioning like the other projects downstream of numpy and so we can use the drop schedule published in that doc.



On Thu, Oct 17, 2019 at 5:53 PM Alan Snow <alansnow21@...> wrote:
I recently noticed this issue where xarray is dropping python 3.5 support due to numpy's recommended cadence in python support:

https://github.com/pydata/xarray/issues/3293

numpy support schedule:

https://numpy.org/neps/nep-0029-deprecation_policy.html

I am curious how this will impact rasterio.



--
Sean Gillies