
Re: MemoryFile loses Profile information

Guillaume Lostis
 

Like Ciaran, I'm not very familiar with GDAL's Python bindings, but I've tried running a little snippet which I believe does the same thing as what was done with rasterio: it writes a JP2 to a /vsimem in-memory file and then writes the bytes to a file on disk.

from osgeo import gdal

mem, out = "/vsimem/B01.jp2", "new_B01.jp2"
gdal.Translate(mem, "B01.jp2")

# Taken from https://lists.osgeo.org/pipermail/gdal-dev/2016-August/045030.html
f = gdal.VSIFOpenL(mem, "rb")
gdal.VSIFSeekL(f, 0, 2)  # seek to end
size = gdal.VSIFTellL(f)
gdal.VSIFSeekL(f, 0, 0)  # seek to beginning
data = gdal.VSIFReadL(1, size, f)
gdal.VSIFCloseL(f)  # close the handle before unlinking
gdal.Unlink(mem)

with open(out, "wb") as ff:
    ff.write(data)

When I look at the gdalinfo of new_B01.jp2, it has a CRS and a transform, so I am not able to reproduce the behavior we have with rasterio.

(Side note: interestingly enough, the pixel values of new_B01.jp2 have changed slightly with respect to the original B01.jp2 file, so there is some data loss somewhere in the process. But maybe that is expected of the JP2 format and could be avoided by passing extra arguments to gdal.Translate? I mainly have experience with writing GeoTIFFs; writing JP2s is new to me.)
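A hedged guess at the side note above: the JP2OpenJPEG driver compresses lossily by default, so asking gdal.Translate for reversible, full-quality output should avoid the pixel changes. QUALITY and REVERSIBLE are that driver's creation options; this is a sketch and hasn't been verified against the file in question.

from osgeo import gdal

# Lossless JP2 copy, assuming the JP2OpenJPEG driver is the one in use
gdal.Translate("/vsimem/B01.jp2", "B01.jp2",
               creationOptions=["QUALITY=100", "REVERSIBLE=YES"])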

Guillaume


Re: MemoryFile loses Profile information

ciaran.evans@...
 

Would you say it's worth raising an issue on GDAL then?

I'm also not familiar with the Python GDAL bindings; Rasterio is my window into that, haha.


Re: Read using multithreading

Sean Gillies
 

Hi,

On Wed, Apr 29, 2020 at 2:54 AM Carlos García Rodríguez <carlogarro@...> wrote:

Hello, I have an application that calls rasterio to open rasters in its main process. This application allows me to parallelize the process, so it will call the function where rasterio.open is used many times in parallel.
If I open and close the image every time by doing

with rasterio.open(self.file) as src:
    raster = src.read(window=window)

the process goes fine, but it has to open and close the file at every step (losing a bit of time). Then I tried to leave the image open from the beginning of the process, but it is not stable. If I don't have too many workers, the application might work well, but if I increase the number of workers it crashes with the following error:

RasterioIOError: Caught RasterioIOError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "rasterio/_io.pyx", line 707, in rasterio._io.DatasetReaderBase._read
File "rasterio/shim_rasterioex.pxi", line 133, in rasterio._shim.io_multi_band
File "rasterio/_err.pyx", line 182, in rasterio._err.exc_wrap_int
rasterio._err.CPLE_AppDefinedError: ./sentinel2_tiled.tif, band 1: IReadBlock failed at X offset 58, Y offset 92: TIFFReadEncodedTile() failed.

What do you recommend me to do?

Thank you.

Dataset files can be accessed from multiple threads, but the dataset objects returned by rasterio.open can only be used by a single thread at a time. This is a constraint that we get from GDAL. You could try creating a set of dataset objects and then distribute them to the worker threads, making sure not to share them.
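For what it's worth, a minimal sketch of the "one dataset object per worker thread" idea, assuming a thread pool (not worker processes) and thread-local storage so no handle is ever shared; the path and window list are placeholders taken from the question, and the datasets are left open for brevity.

import threading
from concurrent.futures import ThreadPoolExecutor

import rasterio
from rasterio.windows import Window

path = './sentinel2_tiled.tif'  # placeholder path from the question
thread_local = threading.local()

def read_window(window):
    # Each worker thread opens (and keeps) its own dataset object.
    if not hasattr(thread_local, 'src'):
        thread_local.src = rasterio.open(path)
    return thread_local.src.read(window=window)

windows = [Window(col, row, 224, 224) for col in (0, 224) for row in (0, 224)]
with ThreadPoolExecutor(max_workers=4) as pool:
    chips = list(pool.map(read_window, windows))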

--
Sean Gillies


Re: MemoryFile loses Profile information

Sean Gillies
 



On Wed, Apr 29, 2020 at 1:38 AM <ciaran.evans@...> wrote:

As with Guillaume's answer, it's a Sentinel-2 band and has no side-car file; when I write straight to a file, the georeferencing is maintained.

Since there's no JP2 specific code in MemoryFile and things work with the GeoTIFF driver, I suspect there's a subtle bug involving the vsimem system and one of the JP2 drivers. It would be useful to try to reproduce this with GDAL's Python bindings, but I don't have them installed and won't be able to try that.

--
Sean Gillies


Rasterio.features.geometry_window error when geojson has GeometryCollection

filthyhands@...
 



I'm attempting to isolate areas in a rasterio raster object, using Fiona to open a geojson and then window the parts I want to look at. 

Where a GeoJSON entry is a GeometryCollection, rasterio.features.geometry_window throws a KeyError. It expects each entry in the GeoJSON to hold a 'coordinates' key, but in this GeoJSON the 'coordinates' are inside a list one level below the 'GeometryCollection'.

A workaround could be to dismantle the geometry collections into separate objects. It would work but it's not ideal as the collections are there to group various styles of object together.

Does anyone have experience with this issue, or can suggest another solution? My code etc below.

Linestring: 
[{'type': 'LineString',
  'coordinates': [(144.937247997, -26.3759745069999, 0.0),

Geometry Collection: 
{'type': 'GeometryCollection',
  'geometries': [{'type': 'LineString',
    'coordinates': [(145.619783254, -38.5831492689999, 0.0),


My code:
with fiona.open('./Methane gas detection docs/NationalOnshoreGasPipelines.geojson') as fi:
    shapes = [each['geometry'] for each in fi]

    # isolate a pipeline area
    pipes = rasterio.features.geometry_window(win, shapes, pad_x=0.001, pad_y=0.001)



Error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-183-b9dff7a7db24> in <module>
     10 
     11     # isolate a pipeline area
---> 12     pipes = rasterio.features.geometry_window(win, shapes, pad_x = 0.001, pad_y=0.001)
     13 
     14     # print(shapes)

~\Anaconda3\envs\sat\lib\site-packages\rasterio\features.py in geometry_window(dataset, shapes, pad_x, pad_y, north_up, rotated, pixel_precision)
    419 
    420     if not rotated:
--> 421         all_bounds = [bounds(shape, north_up=north_up) for shape in shapes]
    422         lefts, bottoms, rights, tops = zip(*all_bounds)
    423 

~\Anaconda3\envs\sat\lib\site-packages\rasterio\features.py in <listcomp>(.0)
    419 
    420     if not rotated:
--> 421         all_bounds = [bounds(shape, north_up=north_up) for shape in shapes]
    422         lefts, bottoms, rights, tops = zip(*all_bounds)
    423 

~\Anaconda3\envs\sat\lib\site-packages\rasterio\features.py in bounds(geometry, north_up, transform)
    374 
    375     geom = geometry.get('geometry') or geometry
--> 376     return _bounds(geom, north_up=north_up, transform=transform)
    377 
    378 

rasterio\_features.pyx in rasterio._features._bounds()

KeyError: 'coordinates'
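For reference, a minimal sketch of the "dismantle the collections" workaround described above, reusing the names from the code above (win is the already-open raster); the flatten helper is hypothetical, and the recursive handling of nested collections is an assumption.

import fiona
import rasterio.features

def flatten(geom):
    # Yield plain geometries, unpacking any GeometryCollection into its members.
    if geom['type'] == 'GeometryCollection':
        for part in geom['geometries']:
            yield from flatten(part)
    else:
        yield geom

with fiona.open('./Methane gas detection docs/NationalOnshoreGasPipelines.geojson') as fi:
    shapes = [g for each in fi for g in flatten(each['geometry'])]

# isolate a pipeline area
pipes = rasterio.features.geometry_window(win, shapes, pad_x=0.001, pad_y=0.001)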


Read using multithreading

Carlos García Rodríguez
 

Hello, I have an application that calls rasterio to open rasters in its main process. This application allows me to parallelize the process, so it will call the function where rasterio.open is used many times in parallel.
If I open and close the image every time by doing

with rasterio.open(self.file) as src:
    raster = src.read(window=window)

the process goes fine, but it has to open and close the file at every step (losing a bit of time). Then I tried to leave the image open from the beginning of the process, but it is not stable. If I don't have too many workers, the application might work well, but if I increase the number of workers it crashes with the following error:

RasterioIOError: Caught RasterioIOError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "rasterio/_io.pyx", line 707, in rasterio._io.DatasetReaderBase._read
File "rasterio/shim_rasterioex.pxi", line 133, in rasterio._shim.io_multi_band
File "rasterio/_err.pyx", line 182, in rasterio._err.exc_wrap_int
rasterio._err.CPLE_AppDefinedError: ./sentinel2_tiled.tif, band 1: IReadBlock failed at X offset 58, Y offset 92: TIFFReadEncodedTile() failed.

What do you recommend me to do?

Thank you.


Re: Speed up reading rasters

Carlos García Rodríguez
 

I have already fixed it. One of the main problems was that the tiled TIFF was configured with a block size of 256 and I was reading with windows of size 224; that is where the disparity in read times came from.
Also, I found it faster to read pixel-based instead of band-based, but it depends on the application.

Thank you all for your help!
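For anyone hitting the same thing, a minimal sketch of the alignment idea above, assuming 256x256 internal tiles; the snapping helper is hypothetical, not a rasterio API.

import rasterio
from rasterio.windows import Window

BLOCK = 256  # assumed internal tile size of sentinel2_tiled.tif

def aligned_window(col_off, row_off, width=224, height=224, block=BLOCK):
    # Snap the window origin down to the nearest block boundary so a read
    # touches one tile instead of up to four.
    return Window(col_off - col_off % block, row_off - row_off % block, width, height)

with rasterio.open('./sentinel2_tiled.tif') as src:
    chip = src.read(window=aligned_window(1000, 2000))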


Re: MemoryFile loses Profile information

ciaran.evans@...
 

As with Guillaume's answer, it's a Sentinel-2 band and has no side-car file; when I write straight to a file, the georeferencing is maintained.


Re: MemoryFile loses Profile information

Guillaume Lostis
 

Hi,

I guess the B01.jp2 file mentioned is a Sentinel-2 band, which is standalone; there is no side-car file attached to it.

I have tried downloading a single B01.jp2 file from a SAFE and running Vincent Sarago's snippet on it, and I get the same result: the CRS and transform are lost when the JP2 is written through a MemoryFile.

Guillaume Lostis 

On Wed, 29 Apr 2020, 00:24 Sean Gillies via groups.io, <sean=mapbox.com@groups.io> wrote:
Hi,

On Tue, Apr 28, 2020 at 8:02 AM <ciaran.evans@...> wrote:

I tried this with GDAL 2.4.2 and GDAL 3.0 too

If anyone can point me to where I might look further to diagnose this, I can create an issue with some useful info, whether that's in Rasterio/GDAL :)


To confirm: you're right, there's no need to seek after writing using mem.write() because the dataset API doesn't change the MemoryFile's stream position. It remains at 0, the beginning of the file.

Is it possible that your B01.jp2 has auxiliary files? If so, they can be lost because MemoryFile.read() only returns bytes from the primary file and will overlook auxiliaries. For example, see the code below using rasterio's test data.

$ rio insp tests/data/RGB.byte.tif
>>> from rasterio.io import MemoryFile
>>> profile = src.profile
>>> del profile["tiled"]
>>> del profile["interleave"]
>>> profile["driver"] = "PNG"
>>> with MemoryFile(filename="lolwut.png") as memfile:
...     with memfile.open(**profile) as dataset_1:
...         dataset_1.write(src.read())
...         print(dataset_1.files)
...         print(dataset_1.profile)
...     with memfile.open() as dataset_2:
...         print(dataset_2.files)
...         print(dataset_2.profile)
...     with open("/tmp/lolwut.png", "wb") as f:
...         f.write(memfile.read())
...     with rasterio.open("/tmp/lolwut.png") as dataset_3:
...         print(dataset_3.files)
...         print(dataset_3.profile)
...
[]
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False}
['/vsimem/lolwut.png', '/vsimem/lolwut.png.aux.xml']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False, 'interleave': 'pixel'}
775558
['/tmp/lolwut.png']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0,
       0.0, 1.0, 0.0), 'tiled': False, 'interleave': 'pixel'}

dataset_1.files is an empty list because no files are written to the /vsimem virtual filesystem until dataset_1 is closed.

dataset_2.files shows two files: the primary /vsimem/lolwut.png file and its auxiliary '/vsimem/lolwut.png.aux.xml' file, containing the georeferencing information.

dataset_3.files shows only one file; the auxiliary file has been lost by memfile.read(). MemoryFile isn't any good for multiple-file formats or multiple-file format variants. It's fine for a profile of GeoTIFFs that keeps georeferencing, masks, and overviews in a single file.

--
Sean Gillies


Re: MemoryFile loses Profile information

Sean Gillies
 

Hi,

On Tue, Apr 28, 2020 at 8:02 AM <ciaran.evans@...> wrote:

I tried this with GDAL 2.4.2 and GDAL 3.0 too

If anyone can point me to where I might look further to diagnose this, I can create an issue with some useful info, whether that's in Rasterio/GDAL :)


To confirm: you're right, there's no need to seek after writing using mem.write() because the dataset API doesn't change the MemoryFile's stream position. It remains at 0, the beginning of the file.

Is it possible that your B01.jp2 has auxiliary files? If so, they can be lost because MemoryFile.read() only returns bytes from the primary file and will overlook auxiliaries. For example, see the code below using rasterio's test data.

$ rio insp tests/data/RGB.byte.tif
>>> from rasterio.io import MemoryFile
>>> profile = src.profile
>>> del profile["tiled"]
>>> del profile["interleave"]
>>> profile["driver"] = "PNG"
>>> with MemoryFile(filename="lolwut.png") as memfile:
...     with memfile.open(**profile) as dataset_1:
...         dataset_1.write(src.read())
...         print(dataset_1.files)
...         print(dataset_1.profile)
...     with memfile.open() as dataset_2:
...         print(dataset_2.files)
...         print(dataset_2.profile)
...     with open("/tmp/lolwut.png", "wb") as f:
...         f.write(memfile.read())
...     with rasterio.open("/tmp/lolwut.png") as dataset_3:
...         print(dataset_3.files)
...         print(dataset_3.profile)
...
[]
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False}
['/vsimem/lolwut.png', '/vsimem/lolwut.png.aux.xml']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False, 'interleave': 'pixel'}
775558
['/tmp/lolwut.png']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0,
       0.0, 1.0, 0.0), 'tiled': False, 'interleave': 'pixel'}

dataset_1.files is an empty list because no files are written to the /vsimem virtual filesystem until dataset_1 is closed.

dataset_2.files shows two files: the primary /vsimem/lolwut.png file and its auxiliary '/vsimem/lolwut.png.aux.xml' file, containing the georeferencing information.

dataset_3.files shows only one file; the auxiliary file has been lost by memfile.read(). MemoryFile isn't any good for multiple-file formats or multiple-file format variants. It's fine for a profile of GeoTIFFs that keeps georeferencing, masks, and overviews in a single file.
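As an aside, if one did need to persist a multi-file dataset from /vsimem, a minimal sketch (using the same GDAL VSI calls as Guillaume's snippet above, not rasterio's MemoryFile API) would be to walk dataset.files and copy each entry out; the helper name and output directory are placeholders.

import os
from osgeo import gdal

def dump_vsimem(paths, out_dir="."):
    # paths would be e.g. dataset.files, such as
    # ['/vsimem/lolwut.png', '/vsimem/lolwut.png.aux.xml']
    for path in paths:
        f = gdal.VSIFOpenL(path, "rb")
        gdal.VSIFSeekL(f, 0, 2)   # seek to end to learn the size
        size = gdal.VSIFTellL(f)
        gdal.VSIFSeekL(f, 0, 0)   # back to the beginning
        data = gdal.VSIFReadL(1, size, f)
        gdal.VSIFCloseL(f)
        with open(os.path.join(out_dir, os.path.basename(path)), "wb") as out:
            out.write(data)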

--
Sean Gillies


Re: MemoryFile loses Profile information

ciaran.evans@...
 

I tried this with GDAL 2.4.2 and GDAL 3.0 too

If anyone can point me to where I might look further to diagnose this, I can create an issue with some useful info, whether that's in Rasterio/GDAL :)


Re: Speed up reading rasters

Carlos García Rodríguez
 

I would also like to add that the first tiled read is not necessarily slow... 


On Mon., 27 Apr. 2020 20:06, Carlos García Rodríguez via groups.io <carlogarro=gmail.com@groups.io> wrote:
So, do you think it would be a good idea to increase the cache memory? If so, how do I do it? I have plenty of RAM, so that should not be a problem. On the other hand, I checked whether there is some relation between tile proximity and time and didn't find one. You can see the position of each tile in the image.
On Mon., 27 Apr. 2020 17:20, Sean Gillies <sean.gillies@...> wrote:
Hi,

On Mon, Apr 27, 2020 at 2:36 AM <carlogarro@...> wrote:
Hello, thank you so much for your recommendation, it sped things up 5x. Very useful. Now I am having a problem that I do not understand.

I have the following script, where I access 10 random tiles of my raster. train_data is a [4822, 2] array of pixel positions in the raster.
for i in range(10):
    idx = np.random.randint(4822)
    x_idx = train_data[idx][1]
    y_idx = train_data[idx][0]
    window = Window(y_idx, x_idx, 224, 224)
    start_time = time.time()
    with rasterio.open('./sentinel2_tiled.tif') as src:
        sentinel2 = src.read(window=window)
    end_time = (time.time() - start_time)

I do not understand why the times of loading a window are so different, as can be seen in the following image. Do you have some explanation?



 Thank you once more!

I can't say for sure about the time differences because I don't know much about your data files or your computer. However, know this: GDAL's I/O system caches blocks of raster data in memory; the size of the cache is generally 5% of your computer's memory, and windowed reads may or may not be served directly from the cache depending on their size and adjacency to previously read data.

--
Sean Gillies


Re: Speed up reading rasters

Carlos García Rodríguez
 

So, do you think it would be a good idea to increase the cache memory? If so, how do I do it? I have plenty of RAM, so that should not be a problem. On the other hand, I checked whether there is some relation between tile proximity and time and didn't find one. You can see the position of each tile in the image.

On Mon., 27 Apr. 2020 17:20, Sean Gillies <sean.gillies@...> wrote:
Hi,

On Mon, Apr 27, 2020 at 2:36 AM <carlogarro@...> wrote:
Hello, thank you so much for your recommendation, it sped things up 5x. Very useful. Now I am having a problem that I do not understand.

I have the following script, where I access 10 random tiles of my raster. train_data is a [4822, 2] array of pixel positions in the raster.
for i in range(10):
    idx = np.random.randint(4822)
    x_idx = train_data[idx][1]
    y_idx = train_data[idx][0]
    window = Window(y_idx, x_idx, 224, 224)
    start_time = time.time()
    with rasterio.open('./sentinel2_tiled.tif') as src:
        sentinel2 = src.read(window=window)
    end_time = (time.time() - start_time)

I do not understand why the times of loading a window are so different, as can be seen in the following image. Do you have some explanation?



 Thank you once more!

I can't say for sure about the time differences because I don't know much about your data files or your computer. However, know this: GDAL's I/O system caches blocks of raster data in memory; the size of the cache is generally 5% of your computer's memory, and windowed reads may or may not be served directly from the cache depending on their size and adjacency to previously read data.

--
Sean Gillies


Re: MemoryFile loses Profile information

vincent.sarago@...
 

I can confirm that writing a JPEG2000 from memory loses the geo information:

with rasterio.open("B01.jp2") as src:
    with MemoryFile() as memfile:
        with memfile.open(**src.profile) as mem:
            print(mem.meta)
            mem.write(src.read())
 
        with open("test.jp2", "wb") as f:
            f.write(memfile.read())
 
with rasterio.open("test.jp2") as src:
    print(src.meta)
 
# {'driver': 'JP2OpenJPEG', 'dtype': 'uint16', 'nodata': None, 'width': 1830, 'height': 1830, 'count': 1, 'crs': CRS.from_epsg(32707), 'transform': Affine(60.0, 0.0, 600000.0, 0.0, -60.0, 7100020.0)}
# {'driver': 'JP2OpenJPEG', 'dtype': 'uint16', 'nodata': None, 'width': 1830, 'height': 1830, 'count': 1, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0, 0.0, 1.0, 0.0)}
    


Re: MemoryFile loses Profile information

ciaran.evans@...
 

Revised: the seek call isn't required:

s3_r = boto3.resource('s3')
profile = source_image.profile.copy()

with rasterio.io.MemoryFile() as stack:
    with stack.open(**profile) as stack_out:
        stack_out.write(source_image.read(1), 1)
    s3_r.Object(<my-bucket>, <my-key>).upload_fileobj(stack)

For reference, doing:

s3_r = boto3.resource('s3')
profile = source_image.profile.copy()

with rasterio.io.MemoryFile() as stack:
    profile['driver'] = 'GTiff'
    with stack.open(**profile) as stack_out:
        stack_out.write(source_image.read(1), 1)
    s3_r.Object(<my-bucket>, <my-key>).upload_fileobj(stack)

results in a correctly geocoded GeoTIFF. It appears to be an issue with .jp2s and profiles/MemoryFiles.
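One possible way around this (a sketch, not a tested fix): since writing the JP2 straight to disk keeps the georeferencing, write to a temporary file and upload that instead of going through MemoryFile. source_image is the already-open source dataset from this thread, and the bucket and key strings are placeholders.

import os
import tempfile

import boto3
import rasterio

s3_r = boto3.resource('s3')
profile = source_image.profile.copy()

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'stack.jp2')
    # Writing to disk preserves the JP2 georeferencing, per the thread above.
    with rasterio.open(path, 'w', **profile) as stack_out:
        stack_out.write(source_image.read(1), 1)
    s3_r.Object('my-bucket', 'my-key').upload_file(path)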


Re: Speed up reading rasters

Sean Gillies
 

Hi,

On Mon, Apr 27, 2020 at 2:36 AM <carlogarro@...> wrote:
Hello, thank you so much for your recommendation, it sped things up 5x. Very useful. Now I am having a problem that I do not understand.

I have the following script, where I access 10 random tiles of my raster. train_data is a [4822, 2] array of pixel positions in the raster.
for i in range(10):
    idx = np.random.randint(4822)
    x_idx = train_data[idx][1]
    y_idx = train_data[idx][0]
    window = Window(y_idx, x_idx, 224, 224)
    start_time = time.time()
    with rasterio.open('./sentinel2_tiled.tif') as src:
        sentinel2 = src.read(window=window)
    end_time = (time.time() - start_time)

I do not understand why the times of loading a window are so different, as can be seen in the following image. Do you have some explanation?



 Thank you once more!

I can't say for sure about the time differences because I don't know much about your data files or your computer. However, know this: GDAL's I/O system caches blocks of raster data in memory; the size of the cache is generally 5% of your computer's memory, and windowed reads may or may not be served directly from the cache depending on their size and adjacency to previously read data.
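In case it helps with the "how do I increase the cache" question above, a minimal sketch using rasterio's Env; GDAL_CACHEMAX is GDAL's block-cache setting, and small values are interpreted as megabytes. The path is a placeholder from the question.

import rasterio
from rasterio.windows import Window

# Raise the block cache to roughly 1 GB for the duration of the Env block.
with rasterio.Env(GDAL_CACHEMAX=1024):
    with rasterio.open('./sentinel2_tiled.tif') as src:
        sentinel2 = src.read(window=Window(0, 0, 224, 224))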

--
Sean Gillies


Re: MemoryFile loses Profile information

ciaran.evans@...
 

On further inspection, when I switched to saving my outputs as GeoTIFF, the resultant images are both geo-coded fine (Disk vs. In Memory to S3).

I'm guessing something is going wrong when .jp2 is involved...

Any suggestions on next steps?

.jp2 is both preferable (because of file sizes) and a requirement for the work I'm currently doing...


MemoryFile loses Profile information

ciaran.evans@...
 

Hi there,

I'm currently re-writing a process that was stacking .jp2 bands into one image and writing the resultant stack to disk.

I'm moving this up to AWS, so I want to use MemoryFiles instead.

My code looks something like:
s3_r = boto3.resource('s3')
profile = source_image.profile.copy()

with rasterio.io.MemoryFile() as stack:
    with stack.open(**profile) as stack_out:
        stack_out.write(source_image.read(1), 1)
    stack.seek(0)
    s3_r.Object(<my-bucket>, <my-key>).upload_fileobj(stack)

If I download that image and open it in QGIS, the image has no geo-coding at all.

If I process the same image, but write to disk like:

profile = source_image.profile.copy()

with rasterio.open('thing.jp2', 'w', **profile) as stack_out:
    stack_out.write(source_image.read(1), 1)

I get a geo-coded image and it appears where I'd expect.

If I compare file sizes:
MemoryFile(): 49,211KB
Disk: 49,214KB

I can only assume I'm losing the profile information when using a MemoryFile; can someone direct me to what I'm doing wrong?

Thanks!


Re: Speed up reading rasters

Carlos García Rodríguez
 

Hello, thank you so much for your recommendation, it sped things up 5x. Very useful. Now I am having a problem that I do not understand.

I have the following script, where I access 10 random tiles of my raster. train_data is a [4822, 2] array of pixel positions in the raster.
for i in range(10):
    idx = np.random.randint(4822)
    x_idx = train_data[idx][1]
    y_idx = train_data[idx][0]
    window = Window(y_idx, x_idx, 224, 224)
    start_time = time.time()
    with rasterio.open('./sentinel2_tiled.tif') as src:
        sentinel2 = src.read(window=window)
    end_time = (time.time() - start_time)

I do not understand why the times of loading a window are so different, as can be seen in the following image. Do you have some explanation?



 Thank you once more!


Re: Speed up reading rasters

Even Rouault
 

On Sunday, 26 April 2020 03:04:56 CEST carlogarro@... wrote:

> Hello, I need to read many huge datasets and the speed time is very
> important to avoid a bottleneck. I have to read a tiff file that has 20
> bands, and a window of 224,224. Now I am doing like this, and it takes
> approx 0.8 seconds.
>
> with rasterio.open('./sentinel.tif') as src:
>     sentinel1_1 = src.read(window=window)
>
> What I realized is that if I try to read only one of the bands the required
> time is approx the same, but when reading a tiff of only one band the
> amount of time is 10 times shorter.
>
> Can I do something to speed it up? Maybe read bands in parallel, I don't
> really know.

 

If you have control over how the TIFF file is created, make sure it uses band interleaving instead of pixel interleaving.

For example, with gdal_translate this can be done with -co INTERLEAVE=BAND.

If the file is not tiled, adding tiling might also help.

I see in https://rasterio.readthedocs.io/en/latest/topics/profiles.html a pure rasterio way of creating such a file.
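A minimal sketch of those suggestions on the rasterio side: copy an existing file to a band-interleaved, tiled GeoTIFF using the creation options described in the profiles documentation linked above (interleave, tiled, blockxsize/blockysize). The paths are placeholders.

import rasterio

with rasterio.open('./sentinel.tif') as src:
    profile = src.profile.copy()
    profile.update(driver='GTiff', interleave='band', tiled=True,
                   blockxsize=256, blockysize=256)
    with rasterio.open('./sentinel_band_tiled.tif', 'w', **profile) as dst:
        dst.write(src.read())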

 

--
Spatialys - Geospatial professional services
http://www.spatialys.com