multi-dimensional support


Norman Barker
 

Hi,

I started discussing support for multiple dimensions in https://github.com/mapbox/rasterio/issues/1759 but am moving this to a wider audience.

GDAL has added an implementation of https://gdal.org/development/rfc/rfc75_multidimensional_arrays.html and I would like to add this to rasterio by extending the existing rasterio API.

Is this of interest?

Norman


Alan Snow
 

I think this would be quite useful.

I made a start on it using subdatasets in rioxarray here: https://github.com/corteva/rioxarray/pull/33
However, it is currently missing one-dimensional variables such as time (and probably other items as well).

What you mentioned would be a much nicer solution.


Sean Gillies
 

Hi Norman,

I would need to see a strawman proposal of how rasterio's dataset open/read/write API would be extended before I could support the work.

I'm also a bit concerned about the small number of stakeholders for the new GDAL API. It appears to be only the HDF Group (yes?) with only three GDAL TC members voting to adopt it. The rest of the GDAL community seemed ambivalent. Rasterio is a pretty small project and, in my opinion, can't afford to develop code that isn't going to be widely used.

I find the Python GDAL example of using the new API to be underwhelming: https://github.com/rouault/gdal/blob/rfc75/gdal/doc/source/tutorials/multidimensional_api_tut.rst#in-python. I think that functionality already exists with subdatasets, no? If there are some killer new features, I'd like to see them.


--
Sean Gillies


Howard Butler
 



On Aug 23, 2019, at 9:29 AM, Sean Gillies <sean.gillies@...> wrote:


I'm also a bit concerned about the small number of stakeholders for the new GDAL API. It appears to be only the HDF Group (yes?) with only three GDAL TC members voting to adopt it. The rest of the GDAL community seemed ambivalent.

Most folks are ambivalent about multi-dimensional support in GDAL, and they were ambivalent about subdatasets before that (a deficient implementation in a number of ways, which is what precipitated the RFC). The RFC moved things forward in a positive direction, and it wasn't just about giving HDFLand a clean mapping to GDAL. It was about giving GDALLand the ability to more easily speak to an additional family of raster-like data.

GDAL drivers that speak zarr, TileDB, Arrow, and HDF can now be adapted without the miserable compromises that subdatasets required in usability and data fidelity. That will allow people to bring the GDAL geo goodness to their data without reformatting simply to push it through the tool. I think these generic data structures are seeing much more action because they allow data-level interop without special purpose drivers across multiple software runtimes. The winds are blowing the same direction in point cloud land too.

Rasterio is a pretty small project and, in my opinion, can't afford to develop code that isn't going to be widely used.

A completely reasonable position.

Howard



Norman Barker
 

I was one of the stakeholders for subdataset support in GDAL with netCDF, and it worked well for what we were trying to achieve back then: serving regularly gridded time series netCDF data through a WCS. I believe others have used subdataset support in the same way. It was possible to make this work by using external indexes and subdatasets.

I also agree with your comment that Rasterio is a relatively small project and the code needs to have active users.

The main benefit is a common API for multi-dimensional data access within GDAL. Currently, using gdalinfo against HDF, netCDF or TileDB requires reading the output to understand the available data, or writing a parser for each format driver's metadata. These drivers have no common way to advertise, through an API, the dimensions and attributes they support. Because subdataset support has been implemented in a somewhat ad hoc way, the access patterns differ slightly across drivers; the new API enforces a convention.
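
To make the per-driver parsing point concrete, here is a minimal sketch of the kind of parser a client ends up writing today. It assumes the subdataset naming patterns I understand the netCDF and HDF5 drivers to use (NETCDF:"file":variable and HDF5:"file"://path); the file and variable names are hypothetical, and other drivers use different patterns again.

```python
import re

# One pattern per driver: this is the ad hoc part. The patterns below are
# assumptions based on the netCDF and HDF5 drivers' usual subdataset names.
_PATTERNS = {
    "NETCDF": re.compile(r'^NETCDF:"(?P<path>[^"]+)":(?P<var>.+)$'),
    "HDF5": re.compile(r'^HDF5:"(?P<path>[^"]+)"://(?P<var>.+)$'),
}


def parse_subdataset(name):
    """Split a subdataset name into (driver, file path, variable path)."""
    driver = name.split(":", 1)[0]
    pattern = _PATTERNS.get(driver)
    if pattern is None:
        raise ValueError(f"no parser written for driver {driver!r}")
    match = pattern.match(name)
    if match is None:
        raise ValueError(f"unexpected {driver} subdataset name: {name!r}")
    return driver, match["path"], match["var"]


# Hypothetical names, for illustration only.
print(parse_subdataset('NETCDF:"sst.nc":analysed_sst'))
print(parse_subdataset('HDF5:"data.h5"://group/temp'))
```

Every new driver means another entry in that table, and none of it tells you the dimensions or attributes of the variable. That is the gap a common API would close.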

Killer features? A couple come to mind: accessing data cubes with a common API to retrieve data along a z dimension, or sliced by time. These use cases would benefit from being supported in rasterio, with xarray/dask used to process the multi-dimensional data.
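
As a rough illustration of those two access patterns, the sketch below turns a selection on named dimensions into the start/count pair that a hyperslab-style read takes. The helper name, the dimension names and the cube shape are all hypothetical; this is not a proposed rasterio API, only a way to show what "sliced by time" and "along z" mean in index terms.

```python
def hyperslab(dims, shape, **index):
    """Return (start, count) lists that fix one index on each named dimension.

    Dimensions not mentioned in `index` are read in full.
    """
    unknown = set(index) - set(dims)
    if unknown:
        raise KeyError(f"unknown dimension(s): {sorted(unknown)}")
    start, count = [], []
    for name, size in zip(dims, shape):
        if name in index:
            i = index[name]
            if not 0 <= i < size:
                raise IndexError(f"{name}={i} is outside 0..{size - 1}")
            start.append(i)
            count.append(1)
        else:
            start.append(0)
            count.append(size)
    return start, count


# Hypothetical cube: 12 time steps, 5 vertical levels, 180 x 360 grid.
dims = ("time", "z", "y", "x")
shape = (12, 5, 180, 360)

# One time step across all levels and the full grid.
time_slice = hyperslab(dims, shape, time=3)

# All time steps and all levels at a single grid cell (a profile along z).
profile = hyperslab(dims, shape, y=90, x=180)
```

With subdatasets, each of these requests has to be translated by hand into band numbers and windows for a particular driver; with dimensions exposed by name, the same request works regardless of the underlying format.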

I will create a strawman for the API changes, and if you and the community are interested, I can start on the code.

Norman




Sean Gillies
 

Hi Norman, Howard,

I'm going to move this discussion over to https://rasterio.groups.io/g/dev/messages and continue there.


--
Sean Gillies