Create a Dataset from an array


ayr035@...
 

Hi all,

I'm struggling to find a good way to create a dataset object from a numpy array. Does such a workflow exist in rasterio? I'm trying to avoid any unnecessary memory copies as this numpy array could be quite large and I definitely need to avoid writing to disk if at all possible.

I'm trying to do some pre-processing on rasters before a merge and need to wrap an array in a dataset-like object that supports resampling, windowed reading, etc for rasterio.merge. I saw an InMemoryRaster class which seems like it could work, but unfortunately, the read method of InMemoryRaster lacks API compatibility with Dataset.read. Is there a solution that is escaping me? In my ideal world, something like DatasetReader.from_array(arr, **arr_profile) where a dataset could be created directly from an array would be incredibly useful. Are there existing solutions to accomplish this


Sean Gillies
 

Hi,

What you're asking for is possible, though not well abstracted by rasterio. It takes some extra GDAL configuration which I will try to explain.

GDAL's MEM format driver can read from an array that exists in memory. See https://gdal.org/drivers/raster/mem.html#dataset-name-format. So reading an existing numpy array at a very low level in GDAL is a matter of describing the numpy array's address and layout to GDAL. That information can be had from an array's "array interface". For example:

>>> import numpy
>>> arr = numpy.array(range(9), dtype="uint8").reshape(3,3)
>>> arr.__array_interface__
{'data': (22236512, False), 'strides': None, 'descr': [('', '|u1')], 'typestr': '|u1', 'shape': (3, 3), 'version': 3}

The GDAL dataset connection name for this is

>>> descr = "MEM:::DATAPOINTER=22236512,PIXELS=3,LINES=3,BANDS=1,DATATYPE=Byte" 

You can open this with rasterio in "r+" mode and attach a CRS and geo transform.

>>> with rasterio.open(descr, "r+") as dataset:
...     dataset.crs = "EPSG:4326"
...     dataset.transform = Affine.scale(2, -2) * Affine.identity()
...     print(dataset.profile)
...     print(dataset.read())
...
{'driver': 'MEM', 'dtype': 'uint8', 'nodata': None, 'width': 3, 'height': 3, 'count': 1, 'crs': CRS.from_epsg(4326), 'transform': Affine(2.0, 0.0, 0.0,
       0.0, -2.0, 0.0), 'tiled': False}
[[[0 1 2]
  [3 4 5]
  [6 7 8]]]

I don't have any such code in production so can't say anything about the performance or reliability. Obviously you have to be careful with the data pointer to avoid crashes.

On Thu, Aug 5, 2021 at 3:39 PM <ayr035@...> wrote:
Hi all,

I'm struggling to find a good way to create a dataset object from a numpy array. Does such a workflow exist in rasterio? I'm trying to avoid any unnecessary memory copies as this numpy array could be quite large and I definitely need to avoid writing to disk if at all possible.

I'm trying to do some pre-processing on rasters before a merge and need to wrap an array in a dataset-like object that supports resampling, windowed reading, etc for rasterio.merge. I saw an InMemoryRaster class which seems like it could work, but unfortunately, the read method of InMemoryRaster lacks API compatibility with Dataset.read. Is there a solution that is escaping me? In my ideal world, something like DatasetReader.from_array(arr, **arr_profile) where a dataset could be created directly from an array would be incredibly useful. Are there existing solutions to accomplish this


--
Sean Gillies


ayr035@...
 

Hi Sean,

Thank you for the example code. It seems to work reliably enough for my use case. It was way easier than I imagined it being as I thought it might involve hacking the InMemoryRaster class. Would it be a desirable addition to make it easy for a user to do this in rasterio (at least a helper function generate the MEM string given an array)?

Wrapping the array in a Dataset does not seem to have much of a noticeable performance penalty if any at all. The performance of reading directly from the array (making a copy of any slices) and reading from the Dataset were almost identical.

Thanks again.


Sean Gillies
 

I feel it's too early to make a new function for this. First, I'd like to try to use this to eliminate the copying of data inside InMemoryRaster.


On Thu, Aug 19, 2021 at 3:39 PM <ayr035@...> wrote:
Hi Sean,

Thank you for the example code. It seems to work reliably enough for my use case. It was way easier than I imagined it being as I thought it might involve hacking the InMemoryRaster class. Would it be a desirable addition to make it easy for a user to do this in rasterio (at least a helper function generate the MEM string given an array)?

Wrapping the array in a Dataset does not seem to have much of a noticeable performance penalty if any at all. The performance of reading directly from the array (making a copy of any slices) and reading from the Dataset were almost identical.

Thanks again.


--
Sean Gillies