Topics

[rasterio] multi-dimensional support

Sean Gillies
 

Hi Norman,

On Fri, Aug 23, 2019 at 11:30 AM Norman Barker <norman.barker@...> wrote:
I was one of the stakeholders for subdataset support in GDAL with netCDF and it worked well with what we were trying to achieve back then, serving regularly gridded time series netcdf data through a WCS, I believe others have used subdataset support in the same way. It was possible to make this work by using external indexes and subdatasets. 

I also agree with your comment that Rasterio is a relatively small project and the code needs to have active users.

The main benefit is a common api for multi-dimensional data access within gdal. Currently using gdalinfo against hdf, netcdf or TileDB requires reading the output to understand the available data, or writing a parser for each of these format driver's metadata. These drivers have no common way to advertise through an API the dimensions and attributes they support. Because implementing subdataset support has been a little adhoc the access patterns are slightly different across drivers, the new api enforces a convention.

Killer features? A couple come to mind; Accessing data cubes with a common api to retrieve data along a z dimension, or sliced by time. These use cases would benefit from being supported in rasterio and using xarray/dask to process multi-dimensional data.

But I can import xarray now to get all of these features from a netCDF file (or other formats supported by xarray). As a Python developer, I don't need these features built into rasterio, because Python is extendable. Furthermore, if I were building applications around very very large multi-dimensional datasets, I'm not sure I would need or want the legacy GDAL formats at all. It would be a waste of time to compile those drivers and a waste of bandwidth to copying them around my network. I'm skeptical I would want, for example, to be able to work with ArcInfo ASCII grids (or 90% of GDAL raster formats) and Parquet (for example) files in the same program using only Rasterio and GDAL. The GDAL programs like gdalinfo aren't extendable and so everything has to be built in, but this is not the case for a Python programmer.
 
I will create a strawman for the API changes and if you and the community are interested then I can start on the code.

I would be even more curious to see what you might come up with for an API if you weren't restricted to maintaining compatibility with Rasterio's existing API.

Apologies in advance if my critique is too harsh. I really do think that the GDAL legacy formats and the new multi-dimensional array abstraction are different things and would be better off in separate projects/packages. I'm also pretty comfortable with the idea of Rasterio remaining on the legacy side of things, avoiding the cutting edge, and still providing a decent amount of value to the Python community while making space for another project to step in and grow in the new space.

--
Sean Gillies

Norman Barker
 

Hi Sean,

One of the other reasons I would like to build on top of rasterio for multi-dimensional access is the support for geospatial metadata and coordinate reference systems, that support would need to be built again for any new project.

It is is possible to build GDAL without all of the legacy formats to reduce the network size, though I agree it can still be quite large.

I was at the Pangeo summer meeting last week which is where I heard the most about the use of GDAL/Rasterio for multi-dimensional data, I will let them add their opinion. I would like to build on top of rasterio, though I do take note of your views.

Norman 

On Mon, Aug 26, 2019 at 6:33 PM Sean Gillies <sean.gillies@...> wrote:
Hi Norman,

On Fri, Aug 23, 2019 at 11:30 AM Norman Barker <norman.barker@...> wrote:
I was one of the stakeholders for subdataset support in GDAL with netCDF and it worked well with what we were trying to achieve back then, serving regularly gridded time series netcdf data through a WCS, I believe others have used subdataset support in the same way. It was possible to make this work by using external indexes and subdatasets. 

I also agree with your comment that Rasterio is a relatively small project and the code needs to have active users.

The main benefit is a common api for multi-dimensional data access within gdal. Currently using gdalinfo against hdf, netcdf or TileDB requires reading the output to understand the available data, or writing a parser for each of these format driver's metadata. These drivers have no common way to advertise through an API the dimensions and attributes they support. Because implementing subdataset support has been a little adhoc the access patterns are slightly different across drivers, the new api enforces a convention.

Killer features? A couple come to mind; Accessing data cubes with a common api to retrieve data along a z dimension, or sliced by time. These use cases would benefit from being supported in rasterio and using xarray/dask to process multi-dimensional data.

But I can import xarray now to get all of these features from a netCDF file (or other formats supported by xarray). As a Python developer, I don't need these features built into rasterio, because Python is extendable. Furthermore, if I were building applications around very very large multi-dimensional datasets, I'm not sure I would need or want the legacy GDAL formats at all. It would be a waste of time to compile those drivers and a waste of bandwidth to copying them around my network. I'm skeptical I would want, for example, to be able to work with ArcInfo ASCII grids (or 90% of GDAL raster formats) and Parquet (for example) files in the same program using only Rasterio and GDAL. The GDAL programs like gdalinfo aren't extendable and so everything has to be built in, but this is not the case for a Python programmer.
 
I will create a strawman for the API changes and if you and the community are interested then I can start on the code.

I would be even more curious to see what you might come up with for an API if you weren't restricted to maintaining compatibility with Rasterio's existing API.

Apologies in advance if my critique is too harsh. I really do think that the GDAL legacy formats and the new multi-dimensional array abstraction are different things and would be better off in separate projects/packages. I'm also pretty comfortable with the idea of Rasterio remaining on the legacy side of things, avoiding the cutting edge, and still providing a decent amount of value to the Python community while making space for another project to step in and grow in the new space.

--
Sean Gillies

scott
 

Hi Norman and Sean,

Thanks for discussion this! Just wanted to chime in based on Norman's suggestion. I'm currently involved in a NASA-funded project to facilitate analysis of Cloud-based data archives, and as part of the Pangeo project we are really pushing for contributions to established Python packages. We've been using rasterio extensively to load single image files and multiband VRTs. xarray.open_rasterio() has been a great example of Python tools working very well together.

Currently it is a bit awkward to work with multidimensional data with subdatasets in gdal/rasterio. Alternatively, there are format-specific libraries and readers out there (xarray.open_dataset(), h5py, satpy), but I agree there would be a lot of value in a standard access pattern through rasterio, which other libraries could then inherit for I/O tasks. Xarray for example does not currently account for crs, which has been the topic of a lot of discussion (https://github.com/pydata/xarray/issues/2288 ), and writing is currently limited to netCDF and Zarr. I think the current state of things illustrates that the extendibility of Python is both an advantage and disadvantage for users, because people (especially newcomers) are confused by which packages to use and often end up with hodgepodge solutions.

One argument for incorporating the multidimensional API is that there is a tremendous amount out of netCDF and HDF data out there (in fact all NASA data is archived in these formats) and people are interested in translating to more Cloud-friendly formats (see for example https://github.com/pangeo-data/pangeo/issues/686 , https://github.com/pangeo-data/pangeo/issues/120 ). So, one timely use-case of multidimensional support would be transposing existing archives of HDF files for time series analysis and storing as tileDB on S3 or GCS. Or, as Norman mentioned, build multidimensional VRTs of COGs to sample in Time in addition to X and Y. Another common use-case is 1) open a large multidimensional netCDF file, 2) run some dimensionality-reducing analysis with your favorite Python library, 3) save the resulting 2D Geotiff.

My go-to place for any raster format conversion is gdal, and if python code is involved I first turn to rasterio. So if there is motivation to try to bring this new gdal feature into rasterio I'm very interested to see where it leads!

--Scott

Alan Snow
 

> Another common use-case is 1) open a large multidimensional netCDF file, 2) run some dimensionality-reducing analysis with your favorite Python library, 3) save the resulting 2D Geotiff.

That is definitely true. I do that quite a bit. Here is one example: https://corteva.github.io/rioxarray/html/examples/clip_geom.html

I have also seen others have a similar need: https://gis.stackexchange.com/a/329141/144357

Sean Gillies
 

Hi Scott,

Thank you for the feedback. It's really helpful to hear from someone in the Pangeo project.

On Wed, Aug 28, 2019 at 9:37 PM scott <scottyh@...> wrote:
Hi Norman and Sean,

Thanks for discussion this! Just wanted to chime in based on Norman's suggestion. I'm currently involved in a NASA-funded project to facilitate analysis of Cloud-based data archives, and as part of the Pangeo project we are really pushing for contributions to established Python packages. We've been using rasterio extensively to load single image files and multiband VRTs. xarray.open_rasterio() has been a great example of Python tools working very well together.

Currently it is a bit awkward to work with multidimensional data with subdatasets in gdal/rasterio. Alternatively, there are format-specific libraries and readers out there (xarray.open_dataset(), h5py, satpy), but I agree there would be a lot of value in a standard access pattern through rasterio, which other libraries could then inherit for I/O tasks. Xarray for example does not currently account for crs, which has been the topic of a lot of discussion (https://github.com/pydata/xarray/issues/2288 ), and writing is currently limited to netCDF and Zarr. I think the current state of things illustrates that the extendibility of Python is both an advantage and disadvantage for users, because people (especially newcomers) are confused by which packages to use and often end up with hodgepodge solutions.

I cannot deny that Python software, GIS software in particular, can be a hodgepodge. Some of the reasons are beyond our control: Google has paved the early cow paths and so people will use the gdal package and then mix it with rasterio, probably forever. In some cases they must, because the rasterio project lacks a feature they need.

I think the value in having rasterio expand to handle the geophysical data I/O for other libraries is not so clear. I also believe that it is not necessarily distributed equally. Certainly the opportunity costs will be unequally distributed: the rasterio project will see an increase in bugs, support requests, need for documentation, new for new CI infrastructure, while the libraries that depend on rasterio can move on to other things. Thanks to help from its users, rasterio is currently sustainable, but only just barely on average. I think this is largely due to good fortune, and I'm very concerned about upsetting the balance.

There's also the issue that I won't get any more time at work to manage or develop a new rasterio API and will have to give up some of my own influence over the library's design. There are pros and cons to this. I do think that rasterio is in some ways a refreshing alternative to libraries that strictly implement industry standards.


One argument for incorporating the multidimensional API is that there is a tremendous amount out of netCDF and HDF data out there (in fact all NASA data is archived in these formats) and people are interested in translating to more Cloud-friendly formats (see for example https://github.com/pangeo-data/pangeo/issues/686 , https://github.com/pangeo-data/pangeo/issues/120 ). So, one timely use-case of multidimensional support would be transposing existing archives of HDF files for time series analysis and storing as tileDB on S3 or GCS. Or, as Norman mentioned, build multidimensional VRTs of COGs to sample in Time in addition to X and Y. Another common use-case is 1) open a large multidimensional netCDF file, 2) run some dimensionality-reducing analysis with your favorite Python library, 3) save the resulting 2D Geotiff.

I'm clearly going to have to learn more about multidimensional VRTs and their specification. I'm not able to evaluate this use case.

I still hold that the common use case you mention can be solved with existing packages, including Alan Snow's rioxarray. One can also GDAL's Python bindings.


My go-to place for any raster format conversion is gdal, and if python code is involved I first turn to rasterio. So if there is motivation to try to bring this new gdal feature into rasterio I'm very interested to see where it leads!

Thanks again for speaking up!

I don't think a multidimensional data abstraction library is a bad idea, I'm only hesitant to risk rasterio's sustainability and, it must be said, my own creative outlet and job satisfaction.

--
Sean Gillies