Flaw in rasterio design?


Amine Aboufirass <amine.aboufirass@...>
 

Dear All,

Does anyone else have difficulty with the input and output datatypes of rasterio functions in general? Many of the Python functions in the library have inconsistent inputs and outputs. For instance, sometimes they take or return ndarrays and transforms, and sometimes they take or return physical datasets. This is very annoying when trying to develop a generic workflow. Here is a prime example:

rasterio.warp.reproject takes ndarrays and source/destination transforms to do its work, while mask.mask takes a dataset opened in "r" mode. So what if I first want to open a dataset, clip it, and then reproject it in my script? I would have to open the file as a dataset in "r" mode, clip it, and then burn the result into a file. Finally, I would have to open this new file, extract its contents and transform, pass those into the reproject function, and burn a third file.

At first I thought memory files were a good solution, but in fact they are not because we still see many functions which require non-dataset formats to do their work. I really think this is a flaw in rasterio design. There should be one object type which can be passed into different functions regardless of what they do. 

In the meantime how can I adapt my code so I don't have to be aware of input and output datatypes? 


Dion Häfner
 

Hey Amine,

Usually there are enough functions to allow you to have your cake and eat it, too.

Instead of reproject, I usually find myself using a VRT. An alternative to mask that returns an ndarray is raster_geometry_mask. Some of this you will have to roll yourself, though.

I think most of the functions that burn rasters directly (such as mask) are meant for less experienced users that just want to get the job done. Maybe there should ideally always be two versions of each function, one operating at dataset level, one at data (+ metadata) level. But as an experienced user you can pretty much always work around that limitation.

Hope that helps,
Dion

In reply to Amine Aboufirass's message of 15/10/2019 10.51.


Amine Aboufirass <amine.aboufirass@...>
 

Hi Dion,

As an experienced user, you find yourself juggling dataset objects and array/transform objects way too often. It slows down the workflow and makes rasterio a pain to use. The choice of whether or not to burn a file should be left up to the user. Dataset objects should be abstract and in memory at all times unless the user wishes for them to be written as final output.

I feel like the attachment to these file abstractions is an artefact of whatever rasterio is built on top of (GDAL? C++?) and is not Pythonic at all. Take for instance the pandas DataFrame object. All operations are based on the in-memory object, or a pointer thereto. Writing to Excel, CSV or heaven-knows-what other file format is only a method of the DataFrame object and not some kind of intrinsic part of it.

In Python the abstraction should represent the physical file and not the opposite.

In reply to Dion Häfner's message of Tue, Oct 15, 2019 at 11:05 AM.


Dion Häfner
 

There are some libraries that try and implement a raster datatype, which rasterio explicitly does not. Strangely, none of them is as successful as rasterio.

I see rasterio as a tool to get the job done with minimal abstraction, and I have the highest respect for this type of design. I like not having to wrap my head around yet another container type. But others can speak more about the original reasoning behind that design choice, which I don't know.

I agree that in an ideal world with well-funded open source development we would probably have pandas or xarray for raster data that doesn't leak GDALisms into its abstractions. Until then, I suggest you come up with a more concrete suggestion on how to improve the tools we have.

Best,
Dion

In reply to Amine Aboufirass's message of 15/10/2019 11.14.


Amine Aboufirass <amine.aboufirass@...>
 

If it is an intentional design then there is probably a reason for it. However, it comes at a severe cost in usability. One of the main advantages of Python is that users don't really need to know much about the internal workings of each library in order to use it. Some libraries are better encapsulated than others. Rasterio is definitely not the worst. I remember coming from GDAL's Python bindings to rasterio and breathing a sigh of relief when I realized how much higher-level it is.

However, this design requires users to mess around with things like affine matrices and CRS objects, and to keep track of array shapes. This partly defeats the purpose.

I hope you see my point and also that it is a valid point. I of course enjoy many of rasterio's features, but this one is particularly problematic.

In reply to Dion Häfner's message of Tue, Oct 15, 2019 at 11:33 AM.


Alan Snow
 

Hi Amine,

rasterio is definitely a great tool. But, if you are looking for an interface to it similar to pandas, you may be interested in rioxarray. It wraps the rasterio code and gives you an xarray (n-dimensional pandas) interface.

Here is a link to some examples that show how to do operations such as reproject and clip.
https://corteva.github.io/rioxarray/html/examples/examples.html

Hope this helps,
Alan


Amine Aboufirass <amine.aboufirass@...>
 

Hi Alan,

Thanks for your response. I am now a bit more intrigued by xarray, but I have a couple of questions:
  • Wouldn't you then lose all geospatial information? Things like the affine transform and the CRS are really important.
  • Would I then be able to pass such an xarrayish object to rasterio functions or do similar things based on the rasterio package? I still need access to functions like masking, reprojecting, resampling and warping.
I'll take a look myself as well to check if it can reduce my suffering :P.

Regards,

Amine

In reply to Alan Snow's message of Tue, Oct 15, 2019 at 3:05 PM.


Amine Aboufirass <amine.aboufirass@...>
 

Dear Alan, 

I just had a look. It looks like they are in fact taking those things into consideration. Very exciting. I can't wait to put this to use.

On another note, such a library would not have been possible without rasterio, so where rasterio shines (IMHO) is as a lower-level library upon which more Pythonic libraries are built.

Regards,

Amine

In reply to Amine Aboufirass's message of Tue, Oct 15, 2019 at 3:09 PM.


Sean Gillies
 

Hi all,

Since I haven't written much about Rasterio's design principles or goals beyond what is at https://rasterio.readthedocs.io/en/latest/intro.html#philosophy it is only natural that these questions come up.

Rasterio's stated design goals are to eliminate the biggest "gotchas" of the GDAL Python bindings and make GDAL configuration a little more Pythonic. The unstated design principles include minimizing the invention of new concepts and minimizing the risk of project failure. Rasterio's new concept of using a context manager to patch GDAL's runtime configuration has already used up most of our new invention budget, I believe. Whether or not this was the right choice depends on your perspective. It has certainly made my own team's code less brittle and easier to reason about.

Not developing application-specific classes has been one way of minimizing the risk of project failure. Rasterio is going to stay useful for a long time by staying relatively small and behind the cutting edge. We have very limited resources and if we had used them to build a GIS application framework we would have been at risk (in my view) of missing the mark or being swept away by the new flavor of the month as we've seen in other domains. In my team's projects that use Rasterio we do have application classes, but they are constantly changing and I'm very glad not to have published them and committed myself to supporting their use, forever, for free.

On the other hand, count me in for helping design raster data protocols and standards. I'm all for them, but I do not think Rasterio is the right place to experiment with them.


In reply to Dion Häfner's message of Tue, Oct 15, 2019 at 3:33 AM.

--
Sean Gillies