Loading large ENVI rasters into a MemoryFile

nickubels@...
 

Hello,

 

In my project I’m dealing with 217 hyperspectral raster files in ENVI format. Each raster contains 420 bands, which means I have to deal with files that are 30GB+ in size. To keep things manageable I’m working on a high-performance cluster, where I can run the calculations on these rasters in parallel by splitting the work into several tasks. The primary operation is extracting masked regions from these rasters (I’m only interested in the raster parts inside building polygons). As I/O is a bottleneck on the cluster, I was considering loading the raster into RAM to speed up processing, since reading from the network storage isn’t fast enough. I also tried copying the file to the local disk of the node, but when several subtasks are assigned to the same node things quickly grind to a halt.

However, I can’t really get MemoryFile to work. I tried two approaches:

data = open(rasterpath, 'rb').read()
with MemoryFile(data) as memfile:
    with memfile.open() as raster:
        print(raster.profile)


And

data = rasterio.open(rasterpath).read()
with MemoryFile(data) as memfile:
    with memfile.open() as raster:
        print(raster.profile)


In the first case, I’m getting the following stack trace:

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 199, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 188, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/s2912279/bachelorproject/Code/generate_trainingset.py", line 118, in <module>
    with memfile.open() as raster:
  File "/home/s2912279/bachelorproject/Code/venv/lib/python3.6/site-packages/rasterio/env.py", line 366, in wrapper
    return f(*args, **kwds)
  File "/home/s2912279/bachelorproject/Code/venv/lib/python3.6/site-packages/rasterio/io.py", line 130, in open
    return DatasetReader(vsi_path, driver=driver, **kwargs)
  File "rasterio/_base.pyx", line 201, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

In the second case the following stack trace is generated, with very poor performance: it takes a good 25 minutes to load 30GB, whereas the first approach takes about 3 minutes to load all data into RAM.

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 199, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 188, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

 

I doubt that I’m using the MemoryFile functionality correctly. Am I doing something wrong, or am I missing something?

 

Kind regards,

 

Nick

Sean Gillies
 

Hi Nick,

I'm not an expert in the ENVI format or HPC, but I have a suggestion for you. As I understand from reading https://www.gdal.org/frmt_various.html, your ENVI dataset is represented by multiple files, yes? A binary file and a .hdr file? If you've copied only one of these into a rasterio MemoryFile, you won't be able to open it.

There are two different ways to get your dataset into RAM while preserving its format and I've recently added tests to the rasterio test suite to demonstrate that they work:


I think the second is the easier one if you only have two files to deal with, such as the .tif and the .tif.msk in that test case, or a .bin and a .hdr (or something like that) in your case. Here's a code snippet to try:

with MemoryFile(bin_bytes, filename='foo.bin') as memfile, MemoryFile(hdr_bytes, filename='foo.hdr'):
    with memfile.open() as src:
        print(src.profile)

You won't need to bind the header memory file to a name because according to the GDAL docs, the binary file is the one to open.
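A slightly fuller sketch of the same idea, in case it helps to see where bin_bytes and hdr_bytes could come from. The helper names (read_envi_pair, print_profile_from_memory) and the in-memory names foo.bin / foo.hdr are made up for illustration, and the code assumes the header sits next to the data file as either <name>.hdr or <name>.<ext>.hdr. I have not checked it against every rasterio release, so treat it as a starting point rather than a reference implementation.

```python
from pathlib import Path


def read_envi_pair(data_path):
    """Return (data_bytes, header_bytes) for an ENVI raster.

    Assumption: the header lives in the same directory and is named either
    <name>.hdr (extension replaced) or <name>.<ext>.hdr (extension appended).
    """
    data_path = Path(data_path)
    candidates = [data_path.with_suffix(".hdr"), Path(str(data_path) + ".hdr")]
    for header_path in candidates:
        if header_path.is_file():
            # Both reads pull the raw file contents into RAM unchanged.
            return data_path.read_bytes(), header_path.read_bytes()
    raise FileNotFoundError(f"no .hdr file found next to {data_path}")


def print_profile_from_memory(data_path):
    """Open the raster from RAM and print its profile."""
    # Imported here so read_envi_pair can be used without rasterio installed.
    from rasterio.io import MemoryFile

    data_bytes, header_bytes = read_envi_pair(data_path)
    # Hypothetical in-memory names: they only need to share a stem so the
    # ENVI driver can find the header that belongs to the data file.
    with MemoryFile(data_bytes, filename="foo.bin") as memfile, \
            MemoryFile(header_bytes, filename="foo.hdr"):
        with memfile.open() as src:
            print(src.profile)
```

Note that this passes the raw bytes of the files on disk, not the array returned by rasterio.open(rasterpath).read(): MemoryFile expects the contents of a file in a format GDAL can recognize. If the header is not picked up, printing memfile.name for both memory files shows where each one was placed under /vsimem/.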

Hope this helps.


On Tue, Dec 11, 2018 at 7:18 AM <nickubels@...> wrote:


--
Sean Gillies

nickubels@...
 

Hi Sean,

I have tried your suggestion and it indeed works like a charm now. I totally forgot about the header file that was needed to open the binary data file. 

Thanks for your help and your awesome contributions!

Kind regards,

Nick