Re: Fast X-array rasterization

Philipp Kats

Hi Brendan,

Indeed, that what I am trying to achieve. and your hint to ints is great! In fact, I could boil it down to store binary masks - feature id would be stored in the Z dimension itself, and all pixels on one layer will likely have the same shared feature-based value after all. Moreover, this way I could calculate this 3d shape once - and then lookup values and perform computations on this cube for any feature's property I want. So this will definitely boost my computations by a lot.

The second one also sounds very legit but sounds a little trickier to implement - we'll see. 

Still, I wonder if it would make sense to (somehow) use gdal_rasterize bands here...

On Mon, Sep 10, 2018 at 11:06 PM Brendan Ward <brendan.charles.ward@...> wrote:
I'm not entirely sure I follow what you are aiming for here yet.  Are you trying to aggregate across features per pixel to calculate various statistics?  Like, what is the average feature size of all features that intersect this pixel?

Thus if the first pixel has features 1, 2, 3, you want to calculate statistics on an attribute of those features?  And if the second pixel has only features 1 and 3, you'd do the same?  So your 3rd dimension, and your basis for statistics, is the set of features per pixel?

How many features do you have?  As written here, I'd expect the performance to be something like number of features * time to rasterize a single feature PLUS converting the data between formats on each iteration of the loop (i.e., from internal GDAL representation to numpy).  I'd recommend timing rasterization of a single feature and see how many features you have - it might be that 3 hours is perfectly normal for that number of features.

In contrast, when you do all features at once, you are doing the data format conversions only once.

Some ideas:
1) rasterize feature IDs (index w/in gdf) instead of a floating point like value.  This would let rasterize work on an unsigned integer dtype instead of float64 (theoretically resulting in a much smaller memory footprint and less I/O during data format conversion).
This will make it so that you only have to rasterize once, but then can use gdf to create arrays on the fly by looking up feature ID to the attribute you want.

2) rasterize onto the smallest possible array for each feature.  The idea is that you figure out your target array dimensions, then calculate a "window" within that covered by the feature's bounds.  Then rasterize that smaller window; it will be a bit more efficient because you reduce the size of the arrays that need to be converted.  Then you paste the result of rasterization into your target array.  You'll have to keep the transforms aligned so that the windowed rasterization can be directly pasted into the target.  The performance gains (if any) are likely limited to those cases where your features individually only cover small areas of the target.

3) do some profiling / timing of individual operations - see if you can find those things that are taking the most time.  I'd be interested in seeing where your code is spending most of its time.

Join to automatically receive all group messages.