Speed up reading rasters


Carlos García Rodríguez
 

Hello, I need to read many huge datasets, and read speed is very important to avoid a bottleneck.
I have to read a TIFF file that has 20 bands, using a window of 224 x 224 pixels.
This is how I am doing it now, and it takes approximately 0.8 seconds:

with rasterio.open('./sentinel.tif') as src:
    sentinel1_1 = src.read(window=window)

What I noticed is that if I read only one of the bands, the time required is approximately the same, but reading a TIFF that has only one band takes 10 times less.

Can I do something to speed it up? Maybe read the bands in parallel? I don't really know.

I appreciate your help.

Thank you.


Even Rouault
 

On Sunday 26 April 2020 03:04:56 CEST carlogarro@... wrote:

> Hello, I need to read many huge datasets and the speed time is very
> important to avoid a bottleneck. I have to read a tiff file that has 20
> bands, and a window of 224,224. Now I am doing like this, and it takes
> approx 0.8seconds.
>
> with rasterio.open('./sentinel.tif') as src:
>     sentinel1_1 = src.read(window=window)
>
> What I realized is that if I try to read only one of the bands the required
> time is approx the same, but when reading a tiff of only one band the
> amount of time is 10 times shorter.
>
> Can I do something to speed it up? Maybe read bands in parallel, I don't
> really know.

If you have control over how the TIFF file is created, make sure it uses band interleaving instead of pixel interleaving.

For example, with gdal_translate this can be done with -co INTERLEAVE=BAND

If the file is not tiled, adding tiling might also help.

I see in https://rasterio.readthedocs.io/en/latest/topics/profiles.html a pure rasterio way of creating such a file.
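
A minimal sketch of what such a rewrite might look like in rasterio, assuming the whole raster fits in memory. The helper names band_interleaved_profile and rewrite_band_interleaved, the output path, and the 256-pixel block size are illustrative assumptions, not something taken from the linked page:

```python
def band_interleaved_profile(profile, blocksize=256):
    """Return a copy of a rasterio-style profile set up for a tiled,
    band-interleaved GeoTIFF. The 256 block size is only an example."""
    out = dict(profile)
    out.update(
        driver="GTiff",
        interleave="band",     # same effect as -co INTERLEAVE=BAND
        tiled=True,            # store the data in square tiles, not strips
        blockxsize=blocksize,
        blockysize=blocksize,
    )
    return out


def rewrite_band_interleaved(src_path, dst_path, blocksize=256):
    """Copy src_path to dst_path using the profile built above.

    Reads the whole raster at once, so this is only suitable for files
    that fit in memory.
    """
    import rasterio  # third-party; imported here so the helper above has no dependencies

    with rasterio.open(src_path) as src:
        profile = band_interleaved_profile(src.profile, blocksize)
        with rasterio.open(dst_path, "w", **profile) as dst:
            dst.write(src.read())
```

Calling rewrite_band_interleaved('./sentinel.tif', './sentinel_tiled.tif') would then produce the new file; the second path is a made-up name.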

 

--

Spatialys - Geospatial professional services

http://www.spatialys.com


Carlos García Rodríguez
 

Hello, thank you so much for your recommendation, it sped things up by about 5x. Very useful. Now I am having a problem that I do not understand.

I have the following script, where I access 10 random tiles of my raster. train_data is an array of shape [4822, 2] holding pixel positions in the raster.
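import time
import numpy as np
import rasterio
from rasterio.windows import Window

for i in range(10):
    idx = np.random.randint(4822)  # pick a random entry of train_data
    x_idx = train_data[idx][1]
    y_idx = train_data[idx][0]
    # Window takes (col_off, row_off, width, height)
    window = Window(y_idx, x_idx, 224, 224)
    start_time = time.time()
    with rasterio.open('./sentinel2_tiled.tif') as src:
        sentinel2 = src.read(window=window)
    elapsed = time.time() - start_time  # seconds spent opening the file and reading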

I do not understand why the times to load a window vary so much, as can be seen in the following image. Do you have an explanation?

Thank you once more!


Sean Gillies
 

Hi,

On Mon, Apr 27, 2020 at 2:36 AM <carlogarro@...> wrote:
> I do not understand why the times of loading a window are so different, as can be seen in the following image. Do you have some explanation?

I can't say for sure about the time differences because I don't know much about your data files or your computer. However, know this: GDAL's I/O system caches blocks of raster data in memory; the size of the cache is generally 5% of your computer's memory; and windowed reads may or may not be served directly from the cache, depending on their size and adjacency to previously read data.
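
As a rough sketch of how the block cache could be enlarged from rasterio, assuming the GDAL_CACHEMAX configuration option (GDAL reads small values as megabytes) is passed through rasterio.Env. The helper names cache_options and read_windows and the 1024 MB figure are illustrative assumptions; whether a larger cache actually helps depends on the access pattern:

```python
def cache_options(cache_mb):
    """Build the GDAL config options for a block cache of cache_mb megabytes."""
    return {"GDAL_CACHEMAX": int(cache_mb)}


def read_windows(path, windows, cache_mb=1024):
    """Read several windows from one open dataset with a larger block cache.

    The dataset is opened once and kept open for all reads, so blocks cached
    by one read can be reused by a later, overlapping one.
    """
    import rasterio  # third-party; imported here so cache_options stays dependency-free

    with rasterio.Env(**cache_options(cache_mb)):
        with rasterio.open(path) as src:
            return [src.read(window=w) for w in windows]
```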

--
Sean Gillies


Carlos García Rodríguez
 

So, do you think it would be a good idea to increase the cache size? If so, how do I do it? I have plenty of RAM, so that should not be a problem. On the other hand, I checked whether there is some relation between tile proximity and read time and did not find one. You can see the position of each tile in the image.

On Mon, 27 Apr 2020 at 17:20, Sean Gillies <sean.gillies@...> wrote:
> GDAL's I/O system caches blocks of raster data in memory, the size of the cache is generally 5% of your computer's memory, and windowed reads may or may not be served directly from the cache depending on their size and adjacency to previously read data.


Carlos García Rodríguez
 

I would also like to add that the first tiled read is not necessarily slow... 


On Mon, 27 Apr 2020 at 20:06, Carlos García Rodríguez via groups.io <carlogarro=gmail.com@groups.io> wrote:
> So, do you think it should be a good idea to increase the cache memory? If so, how to do it?


Carlos García Rodríguez
 

I have already fixed it. One of the main problems was that the tiled TIFF was configured with a block size of 256 while I was reading windows of size 224, which explains the disparity in read times.
Also, I found it faster to read with pixel interleaving than with band interleaving, but that depends on the application.
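
To illustrate the mismatch, here is a small sketch that counts how many 256 x 256 blocks a 224 x 224 window overlaps depending on its offset. It assumes square blocks; blocks_touched and snap_to_block are hypothetical helper names, not rasterio functions:

```python
def blocks_touched(col_off, row_off, width, height, block=256):
    """Count the blocks of size block x block that a window overlaps (per band)."""
    first_col = col_off // block
    last_col = (col_off + width - 1) // block
    first_row = row_off // block
    last_row = (row_off + height - 1) // block
    return (last_col - first_col + 1) * (last_row - first_row + 1)


def snap_to_block(offset, block=256):
    """Move an offset back to the start of the block that contains it."""
    return (offset // block) * block


# A 224 x 224 window that starts on a block corner stays inside one block,
# while the same window at an arbitrary offset can straddle four of them.
print(blocks_touched(0, 0, 224, 224))      # 1
print(blocks_touched(100, 100, 224, 224))  # 4
print(blocks_touched(snap_to_block(300), snap_to_block(300), 224, 224))  # 1
```

Since every block a window overlaps has to be read and decoded in full, a window that spans four blocks can cost roughly four times as much as one that fits in a single block.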

Thank you all for your help!