Re: How to speed up rasterio.dataset.sample()


Sean Gillies
 

Hi,

The sample method is not optimized. It reads data from GDAL's block cache or disk for each coordinate. Unless your coordinates are specially sorted, it is very likely that you will get block cache misses after some number of coordinates and then you'll be reading blocks from disk over again for every coordinate. If you increase GDAL_CACHEMAX to be >= the size of your raster, you should see a pretty good speedup. That's more or less what you would see by copying the dataset into memory using MemoryFile.

Sorting your input coordinates by x and y so that fewer disk reads are required would be a way to speed up processing. If rasterio were ever to optimize sampling, that's what we would investigate.

On Wed, Dec 22, 2021 at 8:29 AM <aleksandar.ilic@...> wrote:


I used some code I found in a tutorial where sampling based on coordinates is sped up.
I checked this code and found that the dataset is created properly and that there are plenty of non-zero pixels (~3.5 million):
src = rasterio.open(SAR_MSI_Stack_msk)
img = src.read()
profile = src.profile  # copy the profile of the original GeoTIFF input file
with rasterio.io.MemoryFile() as memfile:
    with memfile.open(**profile) as dst:
        for i in range(0, src.count):
            dst.write(img[i], i+1)
    dataset = memfile.open()

This code apparently speeds up the sampling from 2 hours to 10 seconds, which is really helpful!

But rasterio.sample returns an array full of zeros:
train_pts['Raster Value'] = [x for x in dataset.sample(coords)]  # all band values are saved as a list in the Raster Value column


--
Sean Gillies
