Re: GDALRasterizeGeometries - Function Signature

jdcasta@...
 

Yeah, I thought so. FYI, the code you linked is the same one I embedded above. It looks like it really is just relying on the in-memory raster already having a configured transform associated with it. I did the same to get it to work, like so:

        // Create a one-band in-memory dataset to rasterize into
        GDALDatasetH memDS = GDALCreate( GDALGetDriverByName("MEM"), proc_num_str, cols, rows, 1, GDT_Byte, NULL );
        // Set the geotransform (double[6]) on the in-memory dataset
        reinterpret_cast<GDALDataset*>(memDS)->SetGeoTransform(transform);

Here I set the in-memory raster dataset to use the specified transform (double[6]). Now I can call GDALRasterizeGeometries with NULL for pfnTransformer and pTransformArg and it is handled appropriately.
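
For anyone doing the same from Python: here's a rough rasterio sketch of the same idea, where the transform is passed to the rasterize call explicitly instead of coming from a dataset. The grid origin, shape, and geometry below are made up for illustration:

from rasterio.features import rasterize
from rasterio.transform import from_origin

# Hypothetical 10 m grid; pick the origin and shape to match your data.
transform = from_origin(west=640590.0, north=5839580.0, xsize=10.0, ysize=10.0)

# (geometry, burn value) pairs; the polygon here is arbitrary.
shapes = [
    ({"type": "Polygon", "coordinates": [[
        (640600.0, 5839570.0), (640700.0, 5839570.0),
        (640700.0, 5839470.0), (640600.0, 5839470.0),
        (640600.0, 5839570.0),
    ]]}, 255),
]

burned = rasterize(shapes, out_shape=(100, 100), transform=transform, dtype="uint8")
print(burned.sum())  # number of burned pixels times 255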

Thanks!

Best way of doing concurrent window processing

Pablo Sanfilippo
 

I've been looking for the best way to process large raster files while achieving both maximum processing parallelism/speed and a low memory footprint. To achieve a low memory footprint, it is enough to iterate over windows and read, process, and write in the same loop:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, 'w', **src.profile) as dst:
        for ij, window in src.block_windows():
            array = src.read(window=window)
            result = compute(array)
            dst.write(result, window=window)

This has a low memory footprint: it has only one window in memory at a time. But it doesn't have concurrency.

So I looked at the concurrency examples (the one using asyncio coroutines and the one using ThreadPoolExecutor), which look like this:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]
        data_gen = (src.read(window=window) for window in windows)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            for window, result in zip(windows, executor.map(compute, data_gen)):
                dst.write(result, window=window)

Now we have compute concurrency. The comment on the original example made it seem like read -> compute -> write was being done in a streaming manner, overlapping all those operations:

We map the compute() function over the raster data generator, zip the resulting iterator with the windows list, and as pairs come back we write data to the destination dataset.

Which is not true. A verbose test showed that computing was overlapped with reading and also (later) with writing, but reading and writing were not overlapped with each other, meaning that the whole file was in memory at some point. Monitoring memory usage confirmed this.

This made more sense after reading the documentation of ThreadPoolExecutor.map: "the iterables are collected immediately rather than lazily." So it means that it will consume the data_gen iterator eagerly, putting all the data in memory if necessary.
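
Here is a minimal, rasterio-free sketch that makes the eager consumption visible: all five "producing" lines print before the first result is consumed.

import concurrent.futures
import time

def produce():
    for i in range(5):
        print(f"producing item {i}")
        yield i

def consume(i):
    time.sleep(0.1)
    return i * 2

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # map() drains the input iterable up front to create the futures,
    # which is exactly what pulls every window into memory at once.
    for result in executor.map(consume, produce()):
        print(f"consumed {result}")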

The example that uses asyncio coroutines shows the same behavior.

My next idea was to put the read, compute, and write operations all together in the function passed to ThreadPoolExecutor.map so they can be overlapped too, like so:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        def process(window):
            src_array = src.read(window=window)
            result = compute(src_array)
            dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

This obviously caused race conditions because reading and writing are not thread-safe operations, so the output was filled with GDAL read errors.

So I fixed that by using locks for the read and write operations:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        read_lock = threading.Lock()
        write_lock = threading.Lock()

        def process(window):
            with read_lock:
                src_array = src.read(window=window)
            result = compute(src_array)
            with write_lock:
                dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

That fixed the race conditions. Computing is now done in parallel, reading and writing are done in a streaming way, and the maximum memory footprint is window_array_size * num_threads.

My questions are:

  • Is this the right way to solve this problem? Am I missing something?
  • If this is the right way, would you accept a PR with a topic write-up about this (maybe enhancements in the concurrency topic)?

Re: Best way of doing concurrent window processing

Matthew Perry
 

Pablo,

I think your approach would make a great example and addition to the concurrency docs. Being able to do input, output, and computation in parallel is a huge benefit. The locks seem effective at avoiding race conditions, and I get a decent speedup (~3x on 4 cores) for a few test cases. So while I don't think there is or ever will be a single "right" way to solve concurrency, this is certainly a very viable option.

The only potential issue I see is that the `compute` function is required to release the GIL to get the full benefit. Most numpy array methods do this, but if you've got additional calculations in pure Python or call non-GIL-aware C functions, there's a chance that compute will block. If blocking computation significantly outweighs IO, you might not see much performance benefit, if any. Those users would be better served by some form of multiprocessing.
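
For illustration, two hypothetical compute functions, one on each side of that line:

import numpy as np

def compute_gil_friendly(arr):
    # numpy does the heavy lifting in C and releases the GIL while it
    # works, so several worker threads can make progress at once.
    return np.sqrt(arr.astype("float64"))

def compute_gil_bound(arr):
    # A pure-Python pixel loop holds the GIL the whole time, so the
    # threads in the pool effectively run one at a time.
    out = np.empty(arr.shape, dtype="float64")
    for idx, val in np.ndenumerate(arr):
        out[idx] = val ** 0.5
    return out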

It's hard to know whether arbitrary Python code will hold the GIL, which makes it tricky to predict the performance characteristics of this technique. I think a note in the docs re: the compute function and the GIL would be sufficient to warn users of the pitfalls.

- Matt

Re: Best way of doing concurrent window processing

Sean Gillies
 

Hi Pablo,

I really appreciate your analysis of the concurrency example! Things have changed quite a bit in the newest versions of Python since it was written and it needs an update.

While researching the non-laziness of ThreadPoolExecutor.map I discovered a few interesting tickets in the Python tracker. One of them, https://bugs.python.org/issue34168, has a suggestion from Tim Peters for chunking the input and using a combination of c.f.ThreadPoolExecutor.submit and c.f.as_completed, as in this example: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. The chunk size (not to be confused with c.f.ThreadPoolExecutor.map's chunksize) would set the upper limit on memory consumption. This could be worth a try. I'm going to try it soon myself.
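
A rough sketch of the pattern I mean, with hypothetical read_window/compute/write_window helpers (and the windows list and num_workers) standing in for the real pieces:

import concurrent.futures
import itertools

def chunkify(iterable, size):
    # Yield successive lists of at most `size` items.
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def work(window):
    return window, compute(read_window(window))

with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
    for chunk in chunkify(windows, size=4 * num_workers):
        futures = [executor.submit(work, window) for window in chunk]
        # At most one chunk's worth of result arrays is alive at a time,
        # so the chunk size caps memory consumption.
        for future in concurrent.futures.as_completed(futures):
            window, result = future.result()
            write_window(window, result)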

I feel like introducing locks is a step back from the nice c.f. abstraction to raw threading. However, an example that used threading only could be nice for users that are still on Python 2.7.

Yours,

--
Sean Gillies

rasterio Google Cloud Storage (GCS) access

Madhav Desetty
 

Hi 

I forked rasterio and am trying to add support for rasterio and rio to handle files on Google Cloud Storage. I made changes to path.py and added remote schemes to include gs:// type URLs, similar to the AWS S3 support. I tested it locally on macOS, and the command below works:

rio info gs://pdd-stac/disasters/hurricane-harvey/0831/20170831_172754_101c_3b_Visual.tif

But when I package rasterio in a Docker container and run the same command through the container, I get the following error. Any ideas what I may be doing wrong?

ERROR 6: CPLRSASHA256Sign() not implemented: GDAL must be built against libcrypto++ or libcrypto (openssl)

Re: rasterio Google Cloud Storage (GCS) access

Sean Gillies
 

Hi,

Your GDAL library needs to be built on a curl library that is itself built with SSL support. I think that the GDAL and curl packages in most distros will meet this requirement, but if you are compiling GDAL and curl from source you'll need to ensure SSL support is configured for both curl and GDAL.

Hope this helps!


--
Sean Gillies

Re: Best way of doing concurrent window processing

Pablo Sanfilippo
 

Hi Sean! Oh, that's really interesting. Thanks for sharing this. I'm going to do my own tests and then update the PR I made with the example.



Re: rasterio Google Cloud Storage (GCS) access

Even Rouault
 

On Saturday, December 8, 2018 at 16:05:52 CET, Sean Gillies wrote:
> Hi,
>
> Your GDAL library needs to be built on a curl library that is itself
> built with SSL support. I think that the GDAL and curl packages in most
> distros will meet this requirement, but if you are compiling GDAL and
> curl from source you'll need to ensure SSL support is configured for
> both curl and GDAL.
Actually, this warning is GDAL-specific and has nothing to do with curl (though of course curl without SSL support is rather useless). The service account authentication method of Google OAuth2 requires using an RSA signing key to create signed requests to GCS, and, for that, "GDAL must be built against libcrypto++ or libcrypto (openssl)" (sorry, can't find a better wording :-))

$ ./configure --help | grep crypto
  --with-cryptopp=ARG       Include cryptopp support (ARG=yes, no or path)
  --with-crypto=ARG         Include crypto (from openssl) support (ARG=yes, no or path)

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com

Re: rasterio Google Cloud Storage (GCS) access

Madhav Desetty
 

I didn't build GDAL myself; I am using the base GDAL image from quay.io/mojodna/gdal:v2.3.2. S3 access is working perfectly fine. If GDAL was not built with SSL, AWS S3 wouldn't have worked, correct? Also, is there a development guide that I can use to test rasterio and package it for distribution in Docker images? I looked at the Makefile but am not sure what the best practices are for forking and developing rasterio.

Re: rasterio Google Cloud Storage (GCS) access

Sean Gillies
 

Thanks for the clarification, Even!



--
Sean Gillies

Re: Best way of doing concurrent window processing

Guy Doulberg
 

Hi guys

Wanted to share my 10 cent about this subject.

I am trying to get concurrent reads by creating a VRT per window. VRTs are just XML, so they can be created eagerly in the map execution. Moreover, each VRT is a separate file, so accessing the file parts doesn't require any lock; GDAL handles this for you.

Guy

Re: Best way of doing concurrent window processing

Sean Gillies
 

Hi Pablo,

I don't want to preempt you, but here's what I've come up with: https://gist.github.com/sgillies/b90a79917d7ec5ca0c074b5f6f4857e3#file-cfrw-py. If the window reads are done by the worker threads, the memory usage is proportional to the number of workers plus whatever is in the sequence of complete but unwritten futures. If writing a window is faster than computing on it, this allocation will be small. If computing is faster than writing, one could employ the chunkify() helper (which I got from Peters' example) to cap the number of futures and result arrays.

Opening a dataset for each window adds some overhead. A possible optimization: create a pool of dataset objects up front that is at least as large as the number of workers; workers pop a dataset object from this pool and return it when done reading.
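
As a sketch of the pool idea (with infile and num_workers assumed from the examples earlier in the thread):

import queue

import rasterio

# Open one handle per worker up front; workers borrow and return them,
# so no two threads ever read through the same handle at once.
dataset_pool = queue.Queue()
for _ in range(num_workers):
    dataset_pool.put(rasterio.open(infile))

def read_window(window):
    src = dataset_pool.get()  # blocks until a handle is free
    try:
        return src.read(window=window)
    finally:
        dataset_pool.put(src)  # hand the dataset back to the pool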


--
Sean Gillies

Loading large ENVI rasters into a MemoryFile

nickubels@...
 

Hello,

 

In my project I'm dealing with 217 hyperspectral raster files in ENVI format. These rasters contain 420 bands, which means I have to deal with files that are 30 GB+ in size. To keep things maintainable I'm working on a high-performance cluster where I can perform the calculations on these rasters in parallel by splitting the work into several tasks. The primary operation in this case is grabbing masks from these rasters (I'm only interested in the raster parts inside building polygons). As I/O is a bit of a bottleneck on the cluster, I was considering loading the raster into RAM to speed up processing, since reading from the network storage isn't fast enough. I also tried copying the file to the local disk of the node, but when several of the subtasks are assigned to the same node, things quickly grind to a halt.

However, I can’t really get MemoryFile to work. I tried two approaches:

from rasterio.io import MemoryFile

data = open(rasterpath, 'rb').read()
with MemoryFile(data) as memfile:
    with memfile.open() as raster:
        print(raster.profile)


And

import rasterio
from rasterio.io import MemoryFile

data = rasterio.open(rasterpath).read()
with MemoryFile(data) as memfile:
    with memfile.open() as raster:
        print(raster.profile)


In the first case, I’m getting the following stack trace:

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 199, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 188, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/s2912279/bachelorproject/Code/generate_trainingset.py", line 118, in <module>
    with memfile.open() as raster:
  File "/home/s2912279/bachelorproject/Code/venv/lib/python3.6/site-packages/rasterio/env.py", line 366, in wrapper
    return f(*args, **kwds)
  File "/home/s2912279/bachelorproject/Code/venv/lib/python3.6/site-packages/rasterio/io.py", line 130, in open
    return DatasetReader(vsi_path, driver=driver, **kwargs)
  File "rasterio/_base.pyx", line 201, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

In the second case, the following stack trace is generated, with very poor performance (it takes a good 25 minutes to load 30 GB, whereas the first approach takes about 3 minutes to load all the data into RAM):

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 199, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 188, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

 

I'm doubting whether my way of using the MemoryFile functionality is correct. Is there something I'm doing wrong, or am I missing something?

 

Kind regards,

 

Nick

Re: Best way of doing concurrent window processing

Pablo Sanfilippo
 

> I don't want to preempt you
 
Not at all! This is great stuff. I'm a little short on time right now, but as soon as I can I'll test all of this. I don't know how much overhead opening a dataset per window adds, but a dataset pool sounds like a good idea.


Re: Loading large ENVI rasters into a MemoryFile

Sean Gillies
 

Hi Nick,

I'm not an expert in the ENVI format or HPC, but I have a suggestion for you. As I understand from reading https://www.gdal.org/frmt_various.html, your ENVI dataset is represented by multiple files, yes? A binary file and a .hdr file? If you've copied only one of these into a rasterio MemoryFile, you won't be able to open it.

There are two different ways to get your dataset into RAM while preserving its format and I've recently added tests to the rasterio test suite to demonstrate that they work:


I think the second is the easier one if you only have two files to deal with, such as the .tif and the .tif.msk in that test case, or a .bin and a .hdr (or something like that) in your case. Here's a code snippet to try:

with MemoryFile(bin_bytes, filename='foo.bin') as memfile, \
        MemoryFile(hdr_bytes, filename='foo.hdr'):
    with memfile.open() as src:
        print(src.profile)

You won't need to bind the header memory file to a name because, according to the GDAL docs, the binary file is the one to open.

Hope this helps.




--
Sean Gillies

Re: Loading large ENVI rasters into a MemoryFile

nickubels@...
 

Hi Sean,

I have tried your suggestion and it indeed works like a charm now. I totally forgot about the header file that was needed to open the binary data file. 

Thanks for your help and your awesome contributions!

Kind regards,

Nick

Re: Best way of doing concurrent window processing

Pablo Sanfilippo
 

Some test results. I tried these methods on a 2.2 GB image using sleep(0.05) as the compute function. Results:

Method                            Time       Peak memory
Single-thread window iteration    15:31.97      454668
Simple ThreadPoolExecutor          3:47.76     1359364
Chunked ThreadPoolExecutor         4:21.40      475624
Reused datasets with locks         3:55.75      496556

I guess this really shows the overhead of opening a new dataset for each window read in the chunked ThreadPoolExecutor. The advantage of this is that, unlike the reused-datasets-with-locks method, you can use ProcessPoolExecutor to overcome the GIL when the compute function doesn't release it. It might be worth trying to make a dataset pool, but if I'm thinking about it correctly, it wouldn't work with ProcessPoolExecutor (which in my opinion is the biggest advantage of this method). Actually, thinking aloud: the reused-datasets-with-locks approach is a simple case of a dataset pool (just one dataset).
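
For reference, the kind of ProcessPoolExecutor variant I mean. This is only a sketch: it assumes compute is a picklable module-level function, and infile/outfile come from the examples above. Dataset handles can't cross process boundaries, so each task re-opens the source:

import concurrent.futures

import rasterio

# Hypothetical top-level worker: each task re-opens the source file,
# because open datasets can't be shared across processes.
def process_window(args):
    infile, window = args
    with rasterio.open(infile) as src:
        return window, compute(src.read(window=window))

if __name__ == "__main__":
    with rasterio.open(infile) as src:
        profile = src.profile
        windows = [window for ij, window in src.block_windows()]

    with rasterio.open(outfile, "w", **profile) as dst:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            # Only (path, window) tuples are pickled going out; the
            # result arrays are pickled coming back, one per task.
            for window, result in executor.map(
                    process_window, [(infile, w) for w in windows]):
                dst.write(result, window=window)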


reproject on numpy array

Guy Doulberg
 

Hi guys

I am trying to use the method `rasterio.warp.reproject`.

I create a source numpy array full of zeros and a dest numpy array full of ones, then reproject using nearest resampling and src_nodata=0.
The dest array became all zeros.

I expected it to stay all ones, since the source array didn't have any valid data.

What am I missing?

I am running this code:


import numpy as np
from rasterio.warp import reproject, Resampling

# `raster` and `projected` are datasets I opened earlier in my session.
data = np.zeros((1, 1000, 1000))
dest = np.ones_like(data)
reproject(
    data, dest,
    src_transform=raster.affine, dst_transform=projected.affine,
    src_crs=raster.crs, dst_crs=projected.crs,
    resampling=Resampling.nearest, src_nodata=data[0][0][0],
)
print(raster.affine, "\n", projected.affine, "\n", raster.crs, "\n", projected.crs)
print(data)
print(dest)

And this is the output I am getting:

| 10.00, 0.00, 640590.00|
| 0.00,-10.00, 5839580.00|
| 0.00, 0.00, 1.00| 
 | 0.00, 0.00,-73.41|
| 0.00,-0.00,-37.58|
| 0.00, 0.00, 1.00| 
 +init=epsg:32718 
 +init=epsg:4326
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]


What am I missing?

Re: reproject on numpy array

Alan Snow
 

Hi Guy,

The dest array will be overwritten by the reprojected data based on your source array. So, to get ones in the destination, you would need to modify the source array to have ones in it.

Hope that helps,
Alan

Re: reproject on numpy array

Sean Gillies
 

Hi,

The thing that is tripping Guy up is that reproject has an optional init_dest_nodata keyword argument that defaults to True: the output array is filled with the nodata value (0) before warped pixels are copied over. Guy, if you pass init_dest_nodata=False you will get the results you expect.
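
Here is a small self-contained sketch of the difference; the transforms are illustrative stand-ins, not Guy's exact grids:

import numpy as np
from affine import Affine
from rasterio.warp import reproject, Resampling

src = np.zeros((1, 100, 100))
dest = np.ones_like(src)

reproject(
    src, dest,
    src_transform=Affine(10.0, 0.0, 640590.0, 0.0, -10.0, 5839580.0),
    src_crs="EPSG:32718",
    dst_transform=Affine(0.0001, 0.0, -73.41, 0.0, -0.0001, -37.58),
    dst_crs="EPSG:4326",
    src_nodata=0,
    resampling=Resampling.nearest,
    init_dest_nodata=False,  # don't pre-fill dest with the nodata value
)

# Every source pixel is nodata, so nothing is burned in and the
# destination keeps its ones.
assert (dest == 1.0).all()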





--
Sean Gillies