Re: [rasterio] multi-dimensional support

Norman Barker
 

Hi Sean,

One of the other reasons I would like to build on top of rasterio for multi-dimensional access is its support for geospatial metadata and coordinate reference systems. That support would have to be rebuilt for any new project.

It is possible to build GDAL without all of the legacy formats to reduce its footprint on the network, though I agree it can still be quite large.

I was at the Pangeo summer meeting last week, which is where I heard the most about the use of GDAL/Rasterio for multi-dimensional data. I will let them add their opinion. I would like to build on top of rasterio, though I do take note of your views.

Norman 

On Mon, Aug 26, 2019 at 6:33 PM Sean Gillies <sean.gillies@...> wrote:
Hi Norman,

On Fri, Aug 23, 2019 at 11:30 AM Norman Barker <norman.barker@...> wrote:
I was one of the stakeholders for subdataset support in GDAL with netCDF, and it worked well for what we were trying to achieve back then: serving regularly gridded time series netCDF data through a WCS. I believe others have used subdataset support in the same way. It was possible to make this work by using external indexes and subdatasets.

I also agree with your comment that Rasterio is a relatively small project and the code needs to have active users.

The main benefit is a common API for multi-dimensional data access within GDAL. Currently, using gdalinfo against HDF, netCDF or TileDB requires reading the output to understand the available data, or writing a parser for the metadata of each of these format drivers. These drivers have no common way to advertise through an API the dimensions and attributes they support. Because subdataset support has been implemented in a somewhat ad hoc way, the access patterns differ slightly across drivers; the new API enforces a convention.
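
As a rough illustration of what "a common way to advertise dimensions and attributes" could mean, here is a minimal pure-Python sketch. The names `Dimension`, `ArrayInfo` and `describe` are hypothetical and invented for this example; they are not part of GDAL or Rasterio, and the real multi-dimensional API may look quite different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dimension:
    """A named axis of an array (hypothetical type, for illustration only)."""
    name: str
    size: int


@dataclass
class ArrayInfo:
    """What a driver could advertise about one array (hypothetical type)."""
    name: str
    dimensions: tuple
    attributes: dict = field(default_factory=dict)

    @property
    def shape(self):
        # The shape follows directly from the advertised dimensions.
        return tuple(d.size for d in self.dimensions)


def describe(info):
    """Render an array description the same way for every format."""
    dims = ", ".join(f"{d.name}={d.size}" for d in info.dimensions)
    return f"{info.name}({dims})"


# Made-up example data: a 4-D temperature variable.
temperature = ArrayInfo(
    name="temperature",
    dimensions=(
        Dimension("time", 12),
        Dimension("z", 5),
        Dimension("y", 180),
        Dimension("x", 360),
    ),
    attributes={"units": "K"},
)

print(describe(temperature))
```

With a structure along these lines, client code could inspect dimensions and attributes directly instead of parsing driver-specific gdalinfo output.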

Killer features? A couple come to mind: accessing data cubes through a common API to retrieve data along a z dimension, or sliced by time. These use cases would benefit from being supported in rasterio, with xarray/dask used to process the multi-dimensional data.
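
To make the two access patterns concrete, here is a small sketch of selecting by named dimension on a toy cube stored as nested lists. The helper names `make_cube` and `isel` are assumptions made for this example (the latter loosely modeled on positional selection as found in xarray); this is not Rasterio or GDAL code, and a real implementation would work on array types rather than lists.

```python
DIMS = ("time", "z", "y", "x")


def make_cube(shape):
    """Build nested lists indexed [time][z][y][x], filled with distinct ints."""
    nt, nz, ny, nx = shape
    return [
        [
            [[((t * nz + z) * ny + y) * nx + x for x in range(nx)]
             for y in range(ny)]
            for z in range(nz)
        ]
        for t in range(nt)
    ]


def isel(cube, dims, **indexers):
    """Fix integer positions along named dimensions.

    Returns the reduced data and the names of the dimensions that remain.
    """
    unknown = set(indexers) - set(dims)
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")

    def take(data, remaining):
        if not remaining:
            return data  # reached a single value
        name, rest = remaining[0], remaining[1:]
        if name in indexers:
            # Drop this dimension by picking one position along it.
            return take(data[indexers[name]], rest)
        # Keep this dimension and recurse into each element.
        return [take(item, rest) for item in data]

    kept = tuple(d for d in dims if d not in indexers)
    return take(cube, dims), kept


cube = make_cube((2, 3, 2, 2))

# Sliced by time: everything at the second time step.
time_slice, time_dims = isel(cube, DIMS, time=1)

# Along z: the vertical profile at one time step and one pixel.
profile, profile_dims = isel(cube, DIMS, time=0, y=1, x=0)

print(time_dims, profile_dims, profile)
```

The point of the sketch is only that both requests go through one call with dimension names, regardless of which format the cube came from.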

But I can import xarray now to get all of these features from a netCDF file (or other formats supported by xarray). As a Python developer, I don't need these features built into rasterio, because Python is extendable. Furthermore, if I were building applications around very large multi-dimensional datasets, I'm not sure I would need or want the legacy GDAL formats at all. It would be a waste of time to compile those drivers and a waste of bandwidth to copy them around my network. I'm skeptical I would want, for example, to be able to work with ArcInfo ASCII grids (or 90% of GDAL raster formats) and Parquet files in the same program using only Rasterio and GDAL. The GDAL programs like gdalinfo aren't extendable, so everything has to be built in, but this is not the case for a Python programmer.
 
I will create a strawman for the API changes and if you and the community are interested then I can start on the code.

I would be even more curious to see what you might come up with for an API if you weren't restricted to maintaining compatibility with Rasterio's existing API.

Apologies in advance if my critique is too harsh. I really do think that the GDAL legacy formats and the new multi-dimensional array abstraction are different things and would be better off in separate projects/packages. I'm also pretty comfortable with the idea of Rasterio remaining on the legacy side of things, avoiding the cutting edge, and still providing a decent amount of value to the Python community while making space for another project to step in and grow in the new space.

--
Sean Gillies
