Thank you for the feedback. It's really helpful to hear from someone in the Pangeo project.
On Wed, Aug 28, 2019 at 9:37 PM scott <scottyh@...
Hi Norman and Sean,
Thanks for discussing this! Just wanted to chime in based on Norman's suggestion. I'm currently involved in a NASA-funded project to facilitate analysis of Cloud-based data archives, and as part of the Pangeo project we are really pushing for contributions to established Python packages. We've been using rasterio extensively to load single image files and multiband VRTs. xarray.open_rasterio() has been a great example of Python tools working very well together.
Currently it is a bit awkward to work with multidimensional data with subdatasets in gdal/rasterio. Alternatively, there are format-specific libraries and readers out there (xarray.open_dataset(), h5py, satpy), but I agree there would be a lot of value in a standard access pattern through rasterio, which other libraries could then inherit for I/O tasks. Xarray for example does not currently account for crs, which has been the topic of a lot of discussion (https://github.com/pydata/xarray/issues/2288 ), and writing is currently limited to netCDF and Zarr. I think the current state of things illustrates that the extendibility of Python is both an advantage and disadvantage for users, because people (especially newcomers) are confused by which packages to use and often end up with hodgepodge solutions.
I cannot deny that Python software, GIS software in particular, can be a hodgepodge. Some of the reasons are beyond our control: Google has paved the early cow paths and so people will use the gdal package and then mix it with rasterio, probably forever. In some cases they must, because the rasterio project lacks a feature they need.
I think the value in having rasterio expand to handle the geophysical data I/O for other libraries is not so clear. I also believe that it is not necessarily distributed equally. Certainly the opportunity costs will be unequally distributed: the rasterio project will see an increase in bugs, support requests, need for documentation, and need for new CI infrastructure, while the libraries that depend on rasterio can move on to other things. Thanks to help from its users, rasterio is currently sustainable, but only just barely on average. I think this is largely due to good fortune, and I'm very concerned about upsetting the balance.
There's also the issue that I won't get any more time at work to manage or develop a new rasterio API and will have to give up some of my own influence over the library's design. There are pros and cons to this. I do think that rasterio is in some ways a refreshing alternative to libraries that strictly implement industry standards.
One argument for incorporating the multidimensional API is that there is a tremendous amount of netCDF and HDF data out there (in fact all NASA data is archived in these formats) and people are interested in translating to more Cloud-friendly formats (see for example https://github.com/pangeo-data/pangeo/issues/686 , https://github.com/pangeo-data/pangeo/issues/120 ). So, one timely use-case of multidimensional support would be transposing existing archives of HDF files for time series analysis and storing as TileDB on S3 or GCS. Or, as Norman mentioned, build multidimensional VRTs of COGs to sample in Time in addition to X and Y. Another common use-case is 1) open a large multidimensional netCDF file, 2) run some dimensionality-reducing analysis with your favorite Python library, 3) save the resulting 2D GeoTIFF.
I'm clearly going to have to learn more about multidimensional VRTs and their specification. I'm not able to evaluate this use case.
I still hold that the common use case you mention can be solved with existing packages, including Alan Snow's rioxarray. One can also use GDAL's Python bindings.
My go-to place for any raster format conversion is gdal, and if python code is involved I first turn to rasterio. So if there is motivation to try to bring this new gdal feature into rasterio I'm very interested to see where it leads!
Thanks again for speaking up!
I don't think a multidimensional data abstraction library is a bad idea; I'm only hesitant to risk rasterio's sustainability and, it must be said, my own creative outlet and job satisfaction.