
Fast X-array rasterization

Philipp Kats
 

Hi! I am trying to produce a map using rasterio's rasterization and datashader, following this gist

It works perfectly well when I rasterize all the features at once into one 2D numpy array.
However, what I want now is to produce a "cube": a separate raster for each feature, burned with a corresponding value (say, a category number). The plan is then to aggregate over this cube - say, take the per-pixel mode.
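
Roughly, the loop I am running looks like this (a minimal sketch rather than the actual gist; `gdf`, `out_shape`, `transform`, and the `category` column are placeholders for my real data):

```python
import numpy as np
from rasterio import features

# placeholders: gdf is a GeoDataFrame with one row per feature and a "category"
# column to burn; out_shape and transform define the target grid
cube = np.zeros((len(gdf),) + out_shape, dtype="float64")

for i, (geom, value) in enumerate(zip(gdf.geometry, gdf["category"])):
    # one full-size rasterization per feature -- this is the slow part
    cube[i] = features.rasterize(
        [(geom, value)],
        out_shape=out_shape,
        transform=transform,
        fill=0,
    )

# aggregate across the feature axis, e.g. a per-pixel median
# (a per-pixel mode would come from scipy.stats.mode along axis 0)
median_map = np.median(cube, axis=0)
```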

However, that takes a lot of time on my machine (~3 hours vs ~30 seconds for a single raster map), so I wonder if there are any suggestions/hacks for speeding it up.

Thank you in advance! 



Brendan Ward
 

Hi Philipp,

How are you looping over each value to create a new raster?  Are you able to share a gist of what you are doing?

Philipp Kats
 

oh sure!
here is the link:

And indeed, I am using geoseries.apply, but it is pretty much the same as looping - and obviously converting N 2D rasters into a 3D array will take some time as well - so I wonder if there is a better solution here...





Philipp Kats
 

So what I am trying to achieve in the end is the same as basic `rasterio.features.rasterize`, but with more flexible aggregation strategies, like median, mean, percentile, and std - currently it supports only `add` and `replace`.
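
For reference, a minimal sketch of the two merge strategies rasterize supports today (`shapes`, `out_shape`, and `transform` are placeholders for my data):

```python
from rasterio import features
from rasterio.enums import MergeAlg

# placeholders: shapes is an iterable of (geometry, value) pairs,
# out_shape and transform describe the target grid

# default behaviour: where features overlap, later values overwrite earlier ones
replaced = features.rasterize(
    shapes, out_shape=out_shape, transform=transform,
    merge_alg=MergeAlg.replace,
)

# alternative: where features overlap, their values are summed per pixel
summed = features.rasterize(
    shapes, out_shape=out_shape, transform=transform,
    merge_alg=MergeAlg.add,
)
```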

Brendan Ward
 

I'm not entirely sure I follow what you are aiming for here yet.  Are you trying to aggregate across features per pixel to calculate various statistics?  Like, what is the average feature size of all features that intersect this pixel?

Thus if the first pixel has features 1, 2, 3, you want to calculate statistics on an attribute of those features?  And if the second pixel has only features 1 and 3, you'd do the same?  So your 3rd dimension, and your basis for statistics, is the set of features per pixel?

How many features do you have?  As written here, I'd expect the performance to be roughly (number of features * time to rasterize a single feature) PLUS the cost of converting the data between formats on each iteration of the loop (i.e., from the internal GDAL representation to numpy).  I'd recommend timing the rasterization of a single feature and checking how many features you have - it might be that 3 hours is perfectly normal for that number of features.

In contrast, when you do all features at once, you are doing the data format conversions only once.

Some ideas:
1) rasterize feature IDs (the index within the gdf) instead of a floating-point value.  This would let rasterize work on an unsigned integer dtype instead of float64 (theoretically resulting in a much smaller memory footprint and less I/O during data format conversion).  This way you only have to rasterize once, and can then use the gdf to create arrays on the fly by mapping each feature ID to the attribute you want (see the first sketch after this list).

2) rasterize onto the smallest possible array for each feature.  The idea is that you figure out your target array dimensions, then calculate a "window" within them covered by the feature's bounds.  Rasterizing just that smaller window is a bit more efficient because you reduce the size of the arrays that need to be converted.  You then paste the result of the windowed rasterization into your target array, keeping the transforms aligned so that it can be pasted in directly (see the second sketch after this list).  The performance gains (if any) are likely limited to cases where each feature individually covers only a small area of the target.

3) do some profiling / timing of individual operations - see if you can find those things that are taking the most time.  I'd be interested in seeing where your code is spending most of its time.
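
A rough sketch of idea 1, assuming a GeoDataFrame `gdf` and a target grid defined by `out_shape` and `transform` (names are illustrative, not from your gist):

```python
import numpy as np
from rasterio import features

# placeholders: gdf, out_shape, transform as above
# burn the positional index of each feature; reserve 0 for "no feature"
id_raster = features.rasterize(
    [(geom, i + 1) for i, geom in enumerate(gdf.geometry)],
    out_shape=out_shape,
    transform=transform,
    fill=0,
    dtype="uint32",
)

# any attribute can now be turned into a raster by a cheap lookup,
# with no further rasterization
values = np.asarray(gdf["some_attribute"], dtype="float64")
lookup = np.concatenate(([np.nan], values))   # index 0 -> nodata
attribute_raster = lookup[id_raster]
```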
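
And a rough sketch of the windowed rasterization in idea 2 (same placeholder names; for simplicity it assumes every feature lies inside the target grid, and pasting uses "replace" semantics):

```python
from math import ceil, floor

import numpy as np
from rasterio import features, windows

# placeholders: gdf, out_shape, transform as above
target = np.zeros(out_shape, dtype="float64")

for geom, value in zip(gdf.geometry, gdf["category"]):
    # fractional window of the target grid covered by this feature's bounds
    win = windows.from_bounds(*geom.bounds, transform=transform)

    # snap to whole pixels so the window aligns with the target grid
    row0, col0 = int(floor(win.row_off)), int(floor(win.col_off))
    row1 = int(ceil(win.row_off + win.height))
    col1 = int(ceil(win.col_off + win.width))
    win = windows.Window(col0, row0, col1 - col0, row1 - row0)
    win_transform = windows.transform(win, transform)

    # rasterize onto the small window only
    small = features.rasterize(
        [(geom, value)],
        out_shape=(row1 - row0, col1 - col0),
        transform=win_transform,
        fill=0,
    )

    # paste the window back into the full-size target
    view = target[row0:row1, col0:col1]
    view[small != 0] = small[small != 0]
```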

Philipp Kats
 

Hi Brendan,

Thanks! 
Indeed, that is what I am trying to achieve, and your hint about ints is great! In fact, I could boil it down to storing binary masks - the feature ID would be encoded in the Z dimension itself, and all pixels in one layer will most likely share the same feature-based value anyway. Moreover, this way I could compute the 3D cube once, and then look up values and run computations on it for any feature property I want. So this will definitely speed up my computations a lot.
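
Something like the following is what I now have in mind (a rough sketch; `some_attribute` and the grid names are placeholders):

```python
import numpy as np
from rasterio import features

# placeholders: gdf, out_shape, transform as before
# boolean mask cube, one layer per feature, computed only once
masks = np.stack([
    features.rasterize([geom], out_shape=out_shape, transform=transform,
                       fill=0, default_value=1, dtype="uint8")
    for geom in gdf.geometry
]).astype(bool)

# per-pixel statistics for any feature attribute, without rasterizing again
# (the float stack is the memory-heavy part; pixels covered by no feature come out NaN)
values = np.asarray(gdf["some_attribute"], dtype="float64")
stack = np.where(masks, values[:, None, None], np.nan)   # (n_features, H, W)
mean_map = np.nanmean(stack, axis=0)   # or np.nanmedian, np.nanpercentile, ...
```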

The second one also sounds legit, but a little trickier to implement - we'll see. 

Still, I wonder if it would make sense to (somehow) use gdal_rasterize bands here...
