Topics

MemoryFile loses Profile information

ciaran.evans@...
 

Hi there,

I'm currently re-writing a process that was stacking .jp2 bands into one image and writing the resultant stack to disk.

I'm moving this up to AWS so want to use MemoryFile's instead. 

My code looks something like:
s3_r = boto3.resource('s3')
profile = source_image.profile.copy()

with rasterio.io.MemoryFile() as stack:
with stack.open(**profile) as stack_out:
stack_out.write(source_image.read(1), 1)
stack.seek(0)
s3_r.Object(<my-bucket>, <my-key>).upload_fileobj(stack)

If I download that image and open it in QGIS, the image has no geo-coding at all.

If I process the same image, but write to disk like:

profile = source_image.profile.copy()

with rasterio.open('thing.jp2', 'w', **profile) as stack_out:
stack_out.write(source_image.read(1), 1))

I get a geo-coded image and it appears where I'd expect.

If I compare file sizes:
MemoryFile(): 49,211KB
Disk: 49,214KB

I can only assume I'm losing any profile information when using a MemoryFile, can someone direct me to what I'm doing wrong?

Thanks!

ciaran.evans@...
 

On further inspection, when I switched to saving my outputs as GeoTIFF, the resultant images are both geo-coded fine (Disk vs. In Memory to S3).

I'm guessing something is going wrong when .jp2 is involved...

Any suggestions on next steps?

.jp2 is both preferable (because of file sizes) and a requirement for the work I'm currently doing...

ciaran.evans@...
 

Revised: Don't require the seek call:

s3_r = boto3.resource('s3')
profile = source_image.profile.copy()

with rasterio.io.MemoryFile() as stack:
    with stack.open(**profile) as stack_out:
        stack_out.write(source_image.read(1), 1)
    s3_r.Object(<my-bucket>, <my-key>).upload_fileobj(stack)

For reference, doing:

s3_r = boto3.resource('s3')
profile = source_image.profile.copy()

with rasterio.io.MemoryFile() as stack:
    profile['driver'] = 'GTiff'
    with stack.open(**profile) as stack_out:
        stack_out.write(source_image.read(1), 1)
    s3_r.Object(<my-bucket>, <my-key>).upload_fileobj(stack)

results in a correctly geocoded GeoTiff. Appears it is an issue with .jp2s and Profiles/MemoryFiles.

vincent.sarago@...
 

I can confirm that writing a Jpeg2000 from memory losses the geo information

with rasterio.open("B01.jp2") as src:
    with MemoryFile() as memfile:
        with memfile.open(**src.profile) as mem:
            print(mem.meta)
            mem.write(src.read())
 
        with open("test.jp2", "wb") as f:
            f.write(memfile.read())
 
with rasterio.open("test.jp2") as src:
    print(src.meta)
 
# {'driver': 'JP2OpenJPEG', 'dtype': 'uint16', 'nodata': None, 'width': 1830, 'height': 1830, 'count': 1, 'crs': CRS.from_epsg(32707), 'transform': Affine(60.0, 0.0, 600000.0, 0.0, -60.0, 7100020.0)}
# {'driver': 'JP2OpenJPEG', 'dtype': 'uint16', 'nodata': None, 'width': 1830, 'height': 1830, 'count': 1, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0, 0.0, 1.0, 0.0)}
    

ciaran.evans@...
 

I tried this with GDAL 2.4.2 and GDAL 3.0 too

If anyone can point me to where I might look further to diagnose this, I can create an issue with some useful info, whether that's in Rasterio/GDAL :)

Sean Gillies
 

Hi,

On Tue, Apr 28, 2020 at 8:02 AM <ciaran.evans@...> wrote:

I tried this with GDAL 2.4.2 and GDAL 3.0 too

If anyone can point me to where I might look further to diagnose this, I can create an issue with some useful info, whether that's in Rasterio/GDAL :)


To confirm: you're right, there's no need to seek after writing using mem.write() because the dataset API doesn't change the MemoryFile's stream position. It remains at 0, the beginning of the file.

Is it possible that your B01.jp2 has auxiliary files? If so, they can be lost because MemoryFile.read() only returns bytes from the primary file and will overlook auxiliaries. For example, see the code below using rasterio's test data.

$ rio insp tests/data/RGB.byte.tif
>>> from rasterio.io import MemoryFile
>>> profile = src.profile
>>> del profile["tiled"]
>>> del profile["interleave"]
>>> profile["driver"] = "PNG"
>>> with MemoryFile(filename="lolwut.png") as memfile:
...     with memfile.open(**profile) as dataset_1:
...         dataset_1.write(src.read())
...         print(dataset_1.files)
...         print(dataset_1.profile)
...     with memfile.open() as dataset_2:
...         print(dataset_2.files)
...         print(dataset_2.profile)
...     with open("/tmp/lolwut.png", "wb") as f:
...         f.write(memfile.read())
...     with rasterio.open("/tmp/lolwut.png") as dataset_3:
...         print(dataset_3.files)
...         print(dataset_3.profile)
...
[]
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False}
['/vsimem/lolwut.png', '/vsimem/lolwut.png.aux.xml']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False, 'interleave': 'pixel'}
775558
['/tmp/lolwut.png']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0,
       0.0, 1.0, 0.0), 'tiled': False, 'interleave': 'pixel'}

dataset_1.files is an empty list because no files are written to the /vsimem virtual filesystem until dataset_1 is closed.

dataset_2.files show two files: the primary /vsimem/lolwut.png file and its auxiliary  '/vsimem/lolwut.png.aux.xml' file, containing the georeferencing information.

dataset_3.files shows only one file, the auxiliary file has been lost by memfile.read(). MemoryFile isn't any good for multiple-file formats or multiple-file format variants. It's fine for a profile of GeoTIFFs that keep their georeferencing and masks and overviews in a single file.

--
Sean Gillies

Guillaume Lostis
 

Hi,

I guess the B01.jp2 file mentioned is a Sentinel-2 band, which is standalone, there is no side-car file attached to it.

I have tried downloading a single B01.jp2 file from a SAFE and running Vincent Sarago's snippet on it and I get the same result, the CRS and transform are lost when the jp2 is written through a MemoryFile.

Guillaume Lostis 

On Wed, 29 Apr 2020, 00:24 Sean Gillies via groups.io, <sean=mapbox.com@groups.io> wrote:
Hi,

On Tue, Apr 28, 2020 at 8:02 AM <ciaran.evans@...> wrote:

I tried this with GDAL 2.4.2 and GDAL 3.0 too

If anyone can point me to where I might look further to diagnose this, I can create an issue with some useful info, whether that's in Rasterio/GDAL :)


To confirm: you're right, there's no need to seek after writing using mem.write() because the dataset API doesn't change the MemoryFile's stream position. It remains at 0, the beginning of the file.

Is it possible that your B01.jp2 has auxiliary files? If so, they can be lost because MemoryFile.read() only returns bytes from the primary file and will overlook auxiliaries. For example, see the code below using rasterio's test data.

$ rio insp tests/data/RGB.byte.tif
>>> from rasterio.io import MemoryFile
>>> profile = src.profile
>>> del profile["tiled"]
>>> del profile["interleave"]
>>> profile["driver"] = "PNG"
>>> with MemoryFile(filename="lolwut.png") as memfile:
...     with memfile.open(**profile) as dataset_1:
...         dataset_1.write(src.read())
...         print(dataset_1.files)
...         print(dataset_1.profile)
...     with memfile.open() as dataset_2:
...         print(dataset_2.files)
...         print(dataset_2.profile)
...     with open("/tmp/lolwut.png", "wb") as f:
...         f.write(memfile.read())
...     with rasterio.open("/tmp/lolwut.png") as dataset_3:
...         print(dataset_3.files)
...         print(dataset_3.profile)
...
[]
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False}
['/vsimem/lolwut.png', '/vsimem/lolwut.png.aux.xml']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'tiled': False, 'interleave': 'pixel'}
775558
['/tmp/lolwut.png']
{'driver': 'PNG', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0,
       0.0, 1.0, 0.0), 'tiled': False, 'interleave': 'pixel'}

dataset_1.files is an empty list because no files are written to the /vsimem virtual filesystem until dataset_1 is closed.

dataset_2.files show two files: the primary /vsimem/lolwut.png file and its auxiliary  '/vsimem/lolwut.png.aux.xml' file, containing the georeferencing information.

dataset_3.files shows only one file, the auxiliary file has been lost by memfile.read(). MemoryFile isn't any good for multiple-file formats or multiple-file format variants. It's fine for a profile of GeoTIFFs that keep their georeferencing and masks and overviews in a single file.

--
Sean Gillies

ciaran.evans@...
 

As with Guillaume's answer, it's a Sentinel 2 band and has no side-car file, when I write straight to a file, the georeferencing is maintained.

Sean Gillies
 



On Wed, Apr 29, 2020 at 1:38 AM <ciaran.evans@...> wrote:

As with Guillaume's answer, it's a Sentinel 2 band and has no side-car file, when I write straight to a file, the georeferencing is maintained.

Since there's no JP2 specific code in MemoryFile and things work with the GeoTIFF driver, I suspect there's a subtle bug involving the vsimem system and one of the JP2 drivers. It would be useful to try to reproduce this with GDAL's Python bindings, but I don't have them installed and won't be able to try that.

--
Sean Gillies

ciaran.evans@...
 

Would you say it's worth raising an issue on GDAL then?

I'm also not familiar with the Python GDAL bindings, Rasterio is my window in that haha.

Guillaume Lostis
 

Like Ciaran I'm not very familiar with GDAL's python bindings, but I've tried running a little snippet which I believe does the same thing as what was done with rasterio: it writes a JP2 to a vsimem in-memory file, and then writes the bytes to a file on disk.

from osgeo import gdal

mem, out = "/vsimem/B01.jp2", "new_B01.jp2"
gdal.Translate(mem, "B01.jp2")

# Taken from https://lists.osgeo.org/pipermail/gdal-dev/2016-August/045030.html
f = gdal.VSIFOpenL(mem, "rb")
gdal.VSIFSeekL(f, 0, 2)  # seek to end
size = gdal.VSIFTellL(f)
gdal.VSIFSeekL(f, 0, 0)  # seek to beginning
data = gdal.VSIFReadL(1, size, f)
gdal.Unlink(mem)

with open(out, "wb") as ff:
    ff.write(data)

When I look at the gdalinfo of new_B01.jp2, it has a CRS and a transform, so I am not able to reproduce the behavior we have with rasterio.

(Side note: Interestingly enough, the pixel values of new_B01.jp2 have slightly changed with respect to the original B01.jp2 file, so there is some data loss somewhere in the process. But maybe that is expected of the JP2 format and could be avoided by passing extra arguments to gdal.Translate? I mainly have experience with writing GeoTIFFs, writing JP2s is new to me)

Guillaume

Even Rouault
 

On mercredi 29 avril 2020 08:40:07 CEST Guillaume Lostis wrote:

> Like Ciaran I'm not very familiar with GDAL's python bindings, but I've

> tried running a little snippet which I believe does the same thing as what

> was done with `rasterio`: it writes a JP2 to a `vsimem` in-memory file, and

> then writes the bytes to a file on disk.

>

> ```python

> from osgeo import gdal

>

> mem, out = "/vsimem/B01.jp2", "new_B01.jp2"

> gdal.Translate(mem, "B01.jp2")

>

> # Taken from

> https://lists.osgeo.org/pipermail/gdal-dev/2016-August/045030.html f =

> gdal.VSIFOpenL(mem, "rb")

> gdal.VSIFSeekL(f, 0, 2) # seek to end

> size = gdal.VSIFTellL(f)

> gdal.VSIFSeekL(f, 0, 0) # seek to beginning

> data = gdal.VSIFReadL(1, size, f)

> gdal.Unlink(mem)

>

> with open(out, "wb") as ff:

> ff.write(data)

> ```

>

> When I look at the `gdalinfo` of `new_B01.jp2`, it has a CRS and a

> transform, so I am not able to reproduce the behavior we have with

> `rasterio`.

>

> (Side note: Interestingly enough, the pixel values of `new_B01.jp2` have

> slightly changed with respect to the original `B01.jp2` file, so there is

> some data loss somewhere in the process. But maybe that is expected of the

> JP2 format and could be avoided by passing extra arguments to

> `gdal.Translate`?

 

By default, the JP2OpenJPEG driver uses a lossy compression (my personal take on JPEG2000 is that it has very limited interested when used in its lossless profile. Better use more conventional algorithms that are way faster).

See https://gdal.org/drivers/raster/jp2openjpeg.html#lossless-compression for options to pass to get a lossless JPEG2000 as output.

 

So all in all that should be something like

 

gdal.Translate(mem, "B01.jp2",

format = "JP2OpenJPEG", # in case you have several JP2K drivers available

options = ["REVERSIBLE=YES", "QUALITY=100"])

 

 

--

Spatialys - Geospatial professional services

http://www.spatialys.com

Sean Gillies
 

Hi,

On Wed, Apr 29, 2020 at 9:40 AM Guillaume Lostis <g.lostis@...> wrote:

Like Ciaran I'm not very familiar with GDAL's python bindings, but I've tried running a little snippet which I believe does the same thing as what was done with rasterio: it writes a JP2 to a vsimem in-memory file, and then writes the bytes to a file on disk.

from osgeo import gdal

mem, out = "/vsimem/B01.jp2", "new_B01.jp2"
gdal.Translate(mem, "B01.jp2")

# Taken from https://lists.osgeo.org/pipermail/gdal-dev/2016-August/045030.html
f = gdal.VSIFOpenL(mem, "rb")
gdal.VSIFSeekL(f, 0, 2)  # seek to end
size = gdal.VSIFTellL(f)
gdal.VSIFSeekL(f, 0, 0)  # seek to beginning
data = gdal.VSIFReadL(1, size, f)
gdal.Unlink(mem)

with open(out, "wb") as ff:
    ff.write(data)

When I look at the gdalinfo of new_B01.jp2, it has a CRS and a transform, so I am not able to reproduce the behavior we have with rasterio.

(Side note: Interestingly enough, the pixel values of new_B01.jp2 have slightly changed with respect to the original B01.jp2 file, so there is some data loss somewhere in the process. But maybe that is expected of the JP2 format and could be avoided by passing extra arguments to gdal.Translate? I mainly have experience with writing GeoTIFFs, writing JP2s is new to me)

Guillaume

I've only now remembered (not being a regular jp2 user) that JPEG2000 is a create-copy format (see https://gdal.org/drivers/raster/index.html) and as such is not suited for uses cases like

    with rasterio.open("file.jp2", "w", driver="JP2OpenJPEG", ...) as dataset:
        dataset.write(data)

Rasterio tries to abstract over the differences between "create" and "create-copy" formats, but might falter in some cases. We're testing JPEG and PNG in the test suite, but not JPEG2000. If you're constructing a dataset from the bands of multiple other datasets, you should probably be using GeoTIFF as a format, it's well suited for this. And then convert to JPEG2000 before uploading to S3 using rasterio.shutil.copy, which calls on GDAL's GDALCreateCopy and is mostly equivalent to gdal_translate.

--
Sean Gillies

ciaran.evans@...
 

I think the problem is that I'm trying to do the conversion in memory, not on disk. If I've read your latest reply correctly, I'd convert to JPEG2000 on disk and then copy to S3?

Can I do the conversion via rasterio.shutil.copy in memory?