Best way of doing concurrent window processing

Pablo Sanfilippo
 

I've been looking for the best way to process large raster files while achieving both maximum processing parallelism/speed and a low memory footprint. Achieving a low memory footprint alone is easy: iterate over windows and read, process, and write inside the same loop:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, 'w', **src.profile) as dst:
        for ij, window in src.block_windows():
            array = src.read(window=window)
            result = compute(array)
            dst.write(result, window=window)

This has a low memory footprint: it has only one window in memory at a time. But it doesn't have concurrency.

So I looked at the concurrency examples (the one using asyncio coroutines and the one using ThreadPoolExecutor), which look like this:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]
        data_gen = (src.read(window=window) for window in windows)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            for window, result in zip(windows, executor.map(compute, data_gen)):
                dst.write(result, window=window)

Now we have concurrency in the compute step. The comment on the original example made it seem like read -> compute -> write was being done in a streaming manner, with all of those operations overlapping:

We map the compute() function over the raster data generator, zip the resulting iterator with the windows list, and as pairs come back we write data to the destination dataset.

That is not true. A verbose test showed that computing overlapped with reading and also (later) with writing, but reading and writing never overlapped each other, meaning that the whole file was in memory at some point. Monitoring memory usage confirmed this.

This made more sense after reading the documentation of ThreadPoolExecutor.map: "the iterables are collected immediately rather than lazily". This means it consumes the data_gen iterator eagerly, putting all the data in memory if necessary.

The example that uses asyncio coroutines shows the same behavior.
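This eager consumption is easy to observe in isolation. A minimal sketch (independent of rasterio) with a generator that announces each item it produces:

```python
import concurrent.futures
import time

def data_gen():
    # Announce each item as it is pulled from the generator
    for i in range(4):
        print(f"read {i}")
        yield i

def compute(x):
    time.sleep(0.05)  # simulate per-window work
    return x * 2

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # map() submits one task per item, consuming the whole generator up
    # front: all "read ..." lines print before most results are available
    results = list(executor.map(compute, data_gen()))

print(results)  # [0, 2, 4, 6]
```

With a real raster, each "read" is a window-sized array, so this up-front consumption is exactly what pulls the whole file into memory.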

My next idea was to put the read, compute, and write operations all together in the function passed to ThreadPoolExecutor.map, so they can be overlapped too, like so:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        def process(window):
            src_array = src.read(window=window)
            result = compute(src_array)
            dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

This obviously caused race conditions, because reading and writing are not thread-safe operations, so the output was filled with GDAL read errors.

So I fixed that by using locks for the read and write operations:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        read_lock = threading.Lock()
        write_lock = threading.Lock()

        def process(window):
            with read_lock:
                src_array = src.read(window=window)
            result = compute(src_array)
            with write_lock:
                dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

That fixed the race conditions. Computing is now done in parallel, reading and writing are done in a streaming fashion, and the maximum memory footprint is window_array_size * num_threads.
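As a self-contained illustration of the pattern (with plain dicts as hypothetical stand-ins for the rasterio datasets and a toy compute()), only the read and write steps are serialized, while the computation itself runs in parallel outside the locks:

```python
import concurrent.futures
import threading

# Hypothetical stand-ins for the rasterio datasets: window index -> pixel data
src_data = {i: list(range(i, i + 4)) for i in range(8)}
dst_data = {}

read_lock = threading.Lock()
write_lock = threading.Lock()

def compute(arr):
    # Placeholder for the real per-window processing
    return [x * 2 for x in arr]

def process(window):
    with read_lock:        # serialize "reads"
        arr = src_data[window]
    result = compute(arr)  # no lock held here, so this part runs in parallel
    with write_lock:       # serialize "writes"
        dst_data[window] = result

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # Consuming the iterator also re-raises any exception from a worker
    list(executor.map(process, src_data))

print(dst_data[0])  # [0, 2, 4, 6]
```

Note that consuming the map() iterator (here via list()) matters: otherwise an exception raised inside process() would be silently swallowed.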

My questions are:

  • Is this the right way to solve this problem? Am I missing something?
  • If this is the right way, would you accept a PR with a topic write-up about this (maybe as an enhancement to the concurrency topic)?
