Re: Flaw in rasterio design?

Sean Gillies

Hi all,

Since I haven't written much about Rasterio's design principles or goals beyond what is at it is only natural that these questions come up.

Rasterio's stated design goals are to eliminate the biggest "gotchas" of the GDAL Python bindings and make GDAL configuration a little more Pythonic. The unstated design principles include minimizing the invention of new concepts and minimizing the risk of project failure. Rasterio's new concept of using a context manager to patch GDAL's runtime configuration has already used up most of our new invention budget, I believe. Whether or not this was the right choice depends on your perspective. It has certainly made my own team's code less brittle and easier to reason about.

Not developing application-specific classes has been one way of minimizing the risk of project failure. Rasterio is going to stay useful for a long time by staying relatively small and behind the cutting edge. We have very limited resources, and if we had used them to build a GIS application framework we would have been at risk (in my view) of missing the mark or being swept away by the new flavor of the month, as we've seen in other domains. In my team's projects that use Rasterio we do have application classes, but they are constantly changing and I'm very glad not to have published them and committed myself to supporting their use, forever, for free.

On the other hand, count me in for helping design raster data protocols and standards. I'm all for them, but I do not think Rasterio is the right place to experiment with them.

On Tue, Oct 15, 2019 at 3:33 AM Dion Häfner <dion.haefner@...> wrote:
There are some libraries that try to implement a raster datatype, which
rasterio explicitly does not. Strangely, none of them is as successful
as rasterio.

I see rasterio as a tool to get the job done with minimal abstraction,
and I have the highest respect for this type of design. I like not
having to wrap my head around yet another container type. But others can
speak more about the original reasoning behind that design choice, which
I don't know.

I agree that in an ideal world with well-funded open source development
we would probably have pandas or xarray for raster data that doesn't
leak GDALisms into its abstractions. Until then, I suggest you come up
with a more concrete suggestion on how to improve the tools we have.


On 15/10/2019 11.14, Amine Aboufirass via Groups.Io wrote:
> Hi Dion,
> As an experienced user you find yourself juggling between dataset
> objects and arrays/transform objects way too often. It slows down the
> workflow and makes rasterio a pain to use. The choice of where to burn a
> file or not should be left up to the user. Dataset objects should be
> abstract and in memory at all times unless the user wishes for them to
> be written as final output.
> I feel like the attachment to these file abstractions is an artefact
> from whatever rasterio is built on top of (GDAL? C++?) and is not
> pythonic at all. Take for instance the pandas dataframe object. All
> operations are based on the in-memory object, or a pointer thereto. The
> choice of whether to write to excel, csv or heaven-knows-what other kind
> of file format is only a method of the dataframe object and not some
> kind of intrinsic part of it.
> In python the abstraction should represent the physical file and not the
> opposite.
> On Tue, Oct 15, 2019 at 11:05 AM Dion Häfner <dion.haefner@...> wrote:
>     Hey Amine,
>     usually there are enough functions to allow you to have your cake and
>     eat it, too.
>     Instead of reproject, I usually find myself using a VRT. An alternative
>     to mask that returns an ndarray is raster_geometry_mask. Some of this
>     you will have to roll yourself, though.
>     I think most of the functions that burn rasters directly (such as mask)
>     are meant for less experienced users that just want to get the job
>     done.
>     Maybe there should ideally always be two versions of each function, one
>     operating at dataset level, one at data (+ metadata) level. But as an
>     experienced user you can pretty much always work around that limitation.
>     Hope that helps,
>     Dion
>     On 15/10/2019 10.51, Amine Aboufirass via Groups.Io wrote:
>      > Dear All,
>      >
>      > Does anyone have any difficulty with input and output datatypes for
>      > rasterio functions in general? Many of the python functions in the
>      > library have inconsistent input and output. For instance, sometimes
>      > they have ndarrays and transforms as input/output and sometimes they
>      > have physical datasets as input output. This is very annoying when
>      > trying to develop a generic workflow. Here is a prime example:
>      >
>      > *rasterio.warp.reproject* takes ndarrays and source/destination
>      > transforms to do its work while *mask.mask* takes a dataset opened
>      > in "r" mode. So then what if I first want to open a dataset, clip
>      > and then reproject in my script? I would have to open the file as a
>      > dataset in "r" mode, clip and then burn it into a file. Finally I
>      > would have to open this new file, extract its contents and transform
>      > and pass that into the *reproject* function and burn a third file.
>      >
>      > At first I thought memory files were a good solution, but in fact
>      > they are not because we still see many functions which require
>      > non-dataset formats to do their work. I really think this is a flaw
>      > in rasterio design. There should be one object type which can be
>      > passed into different functions regardless of what they do.
>      >
>      > In the meantime how can I adapt my code so I don't have to be aware
>      > of input and output datatypes?
>      >

Sean Gillies
