Re: Unclear how to implement image alignment based on GCPs

mmweene15@...
 

Hi, thanks for your response. Let me provide some background. I have two sets of images that capture the same scene but were taken using separate cameras. This causes some degree of GPS discrepancy between the two sets of images, and the task is to align them based on GCPs generated either manually or automatically (irrelevant for this problem). Maybe I have completely misunderstood the purpose of the `reproject` function. I have included a sample of my GCPs below:

`GroundControlPoint(row=array(357), col=array(682), x=array(22.67637871), y=array(-33.94166626), z=0.0, id='45e0b74c-54a4-48c8-b876-1dfcc184c6c8')`

The row and column reference the position in the aligned image that corresponds to the x and y coordinates in the misaligned image. Perhaps my understanding of how to construct this is wrong as well? The documentation doesn't seem to cover this case all that clearly.

Thanks for your help.

Re: Make _boundless_vrt_doc public

Guy Doulberg
 

Hi Sean

I can't find the dev group to discuss this issue. Where can I find it?

Thanks, Guy

Bug: 'JPEG2000' is not a valid Compression

nicholas.maxwell@...
 

Is this the right place to report a bug? I can open an issue on GitHub if you like.

I'm reading a NITF file and getting the error below. Should 'JPEG2000' be added to the Compression enum?

In[14]: src.tags(ns='IMAGE_STRUCTURE')
Out[14]: {'COMPRESSION': 'JPEG2000'}

In[15]: src.compression
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 840, in rasterio._base.DatasetBase.compression.__get__
  File "/lego/lib/python3.7/enum.py", line 307, in __call__
    return cls.__new__(cls, value)
  File "/lego/lib/python3.7/enum.py", line 555, in __new__
    return cls._missing_(value)
  File "/lego/lib/python3.7/enum.py", line 568, in _missing_
    raise ValueError("%r is not a valid %s" % (value, cls.__name__))
ValueError: 'JPEG2000' is not a valid Compression

thanks

Re: Bug: 'JPEG2000' is not a valid Compression

Sean Gillies
 

Hi, 

This is indeed a bug. I would be much obliged if you reported it in the Rasterio issue tracker.

I haven't used NITF in a long time. I see at https://www.gdal.org/frmt_nitf.html that there's a lack of symmetry between the compression creation option name ("IC"), its values ("C3", "M3", "C8"), and what gets stored in the metadata ("COMPRESSION" and "JPEG2000"). This will complicate things a bit more than in the GeoTIFF case, where I mostly work.
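
In the meantime, a minimal workaround sketch (the path here is hypothetical): fall back to the raw IMAGE_STRUCTURE tag, as shown in your In[14] output, when the compression property raises.

import rasterio

with rasterio.open("example.ntf") as src:  # hypothetical NITF path
    try:
        compression = src.compression
    except ValueError:
        # Fall back to the raw metadata tag until the enum is fixed.
        compression = src.tags(ns="IMAGE_STRUCTURE").get("COMPRESSION")
    print(compression)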


On Wed, Nov 7, 2018 at 12:17 PM <nicholas.maxwell@...> wrote:
Is this the right place to report a bug? I can open an issue on GitHub if you like.

I'm reading a NITF file and getting the error below. Should 'JPEG2000' be added to the Compression enum?

In[14]: src.tags(ns='IMAGE_STRUCTURE')
Out[14]: {'COMPRESSION': 'JPEG2000'}

In[15]: src.compression
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 840, in rasterio._base.DatasetBase.compression.__get__
  File "/lego/lib/python3.7/enum.py", line 307, in __call__
    return cls.__new__(cls, value)
  File "/lego/lib/python3.7/enum.py", line 555, in __new__
    return cls._missing_(value)
  File "/lego/lib/python3.7/enum.py", line 568, in _missing_
    raise ValueError("%r is not a valid %s" % (value, cls.__name__))
ValueError: 'JPEG2000' is not a valid Compression

thanks



--
Sean Gillies

Re: Unclear how to implement image alignment based on GCPs

Sean Gillies
 

Hi,

On Tue, Nov 6, 2018 at 10:04 PM <mmweene15@...> wrote:
Hi, thanks for your response. Let me provide some background. I have two sets of images that capture the same scene but were taken using separate cameras. This causes some degree of GPS discrepancy between the two sets of images, and the task is to align them based on GCPs generated either manually or automatically (irrelevant for this problem). Maybe I have completely misunderstood the purpose of the `reproject` function. I have included a sample of my GCPs below:

What an interesting problem. The reproject() function is mainly for cartographic reprojection, but it might be useful in your case.
 

`GroundControlPoint(row=array(357), col=array(682), x=array(22.67637871), y=array(-33.94166626), z=0.0, id='45e0b74c-54a4-48c8-b876-1dfcc184c6c8')`

The row and column reference the position in the aligned image that corresponds to the x and y coordinates in the misaligned image. Perhaps my understanding of how to construct this is wrong as well? The documentation doesn't seem to cover this case all that clearly.

The row, col, x, and y for a GroundControlPoint need to be floats, not arrays of floats. I'm sorry that the documentation of this class is inadequate!
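
For example, a minimal sketch of the conversion, assuming variables row, col, x, and y hold single-element arrays like those in your sample:

from rasterio.control import GroundControlPoint

gcp = GroundControlPoint(
    row=float(row),  # e.g. 357.0
    col=float(col),  # e.g. 682.0
    x=float(x),      # e.g. 22.67637871
    y=float(y),      # e.g. -33.94166626
    z=0.0,
)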


Thanks for your help.

You're welcome! I think that adjusting the GCPs may help, but may not be entirely enough.

--
Sean Gillies

Rasterio result different than gdal_calc

anand@...
 

I am trying to recreate a simple gdal_calc calculation using the code below.

import numpy
import rasterio

def merge_bands(calculationType, band1Image, band2Image, outputFilePath):
    # Read raster bands directly to Numpy arrays.
    # We handle the connections with "with".
    with rasterio.open(band1Image) as dsRed:
        bandRed = dsRed.read(1)

    with rasterio.open(band2Image) as dsNIR:
        bandNIR = dsNIR.read(1)

    # Allow division by zero
    numpy.seterr(divide='ignore', invalid='ignore')

    ndvi = (bandNIR.astype(float) - bandRed.astype(float)) / (bandNIR.astype(float) + bandRed.astype(float))

    kwargs = dsRed.meta
    kwargs.update(dtype=rasterio.float32, count=1, compress='lzw')
    with rasterio.open(outputFilePath, 'w', **kwargs) as dst:
        dst.write_band(1, ndvi.astype(rasterio.float32))

But the result of this script is different than gdal_calc's:

gdal_calc result range: -0.1 to +0.09
rasterio result range: 0.02 to 0.1
 

Re: Rasterio result different than gdal_calc

Sean Gillies
 

Hi,

Can you also include the expressions that you're using with gdal_calc? Without them we can only guess at the cause of the difference in the results. I don't see any big problems with the way you're using Rasterio. I have a couple of comments below:

On Wed, Nov 21, 2018 at 3:35 PM anand via Groups.Io <anand=geospoc.com@groups.io> wrote:
I am trying to recreate a simple gdal_calc calculation using the code below.

import numpy
import rasterio

def merge_bands(calculationType, band1Image, band2Image, outputFilePath):
    # Read raster bands directly to Numpy arrays.
    # We handle the connections with "with".
    with rasterio.open(band1Image) as dsRed:
        bandRed = dsRed.read(1)

You could convert this to a float32 now instead of doing it twice in a statement below.

 
    with rasterio.open(band2Image) as dsNIR:
        bandNIR = dsNIR.read(1)

Same comment.

 
    # Allow division by zero
    numpy.seterr(divide='ignore', invalid='ignore')

Unless you're doing this in gdal_calc too, this could be a cause of differences. I think it would be better to add the keyword argument `masked=True` when you read the source bands. This will give you masked arrays that are safer to work with when calculating the NDVI.
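
A minimal sketch of that suggestion, assuming the same band1Image and band2Image inputs as in your function:

import rasterio

with rasterio.open(band1Image) as dsRed:
    bandRed = dsRed.read(1, masked=True).astype('float32')

with rasterio.open(band2Image) as dsNIR:
    bandNIR = dsNIR.read(1, masked=True).astype('float32')

# Nodata pixels stay masked through the arithmetic instead of
# skewing the result range.
ndvi = (bandNIR - bandRed) / (bandNIR + bandRed)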

    ndvi = (bandNIR.astype(float) - bandRed.astype(float)) / (bandNIR.astype(float) + bandRed.astype(float))

If you didn't specify `astype(float)` here, Numpy would convert to float64 as soon as it performed the division operation, and would return a float64 ndarray.
 
    kwargs = dsRed.meta
    kwargs.update(dtype=rasterio.float32, count=1, compress='lzw')
    with rasterio.open(outputFilePath, 'w', **kwargs) as dst:
        dst.write_band(1, ndvi.astype(rasterio.float32))

But the result of this script is different than gdal_calc's:

gdal_calc result range: -0.1 to +0.09
rasterio result range: 0.02 to 0.1

Like I said, we need to see the gdal_calc expression. I suspect nodata pixels are involved.

--
Sean Gillies

Re: Rasterio result different than gdal_calc

anand@...
 

Hi Sean,

Sure, I will try the changes you have mentioned. Here is the gdal_calc expression I used:

gdal_calc.py -A T1_B5.TIF --A_band=1 -B T1_B4.TIF --B_band=1 --outfile=ndvi_gal_1.TIF --calc='(A.astype(float)-B)/(A.astype(float)+B)' --type='Float32' --co='COMPRESS=LZW'

GDALRasterizeGeometries - Function Signature

jdcasta@...
 

Hi, this is a bit of a strange question about how rasterio uses the GDALRasterizeGeometries function. This C function is called in _features.pyx here:

with InMemoryRaster(image=image, transform=transform) as mem:
    exc_wrap_int(
        GDALRasterizeGeometries(
            mem.handle(), 1, mem.band_ids, num_geoms, geoms, NULL,
            mem.gdal_transform, pixel_values, options, NULL, NULL))

My question has to do with arguments 6 and 7, pfnTransformer and pTransformArg. The Python function's docstring suggests that the affine matrix, transform, is used to convert between the shapes' coordinate system and the pixel coordinates of image. However, looking at the C++ source code for this function, I see this declaration:


CPLErr GDALRasterizeGeometries( GDALDatasetH hDS,
                                int nBandCount,
                                int *panBandList,
                                int nGeomCount,
                                OGRGeometryH *pahGeometries,
                                GDALTransformerFunc pfnTransformer,
                                void *pTransformArg,
                                double *padfGeomBurnValue,
                                char **papszOptions,
                                GDALProgressFunc pfnProgress,
                                void *pProgressArg )

As you can see, rasterio seems to be passing NULL for pfnTransformer and the affine transform as pTransformArg. But pTransformArg is not really meant to hold the affine matrix; it is meant to be the argument passed to a valid pfnTransformer.

I *think* everything still works because of the code below, which checks the arguments. If it sees that pfnTransformer is NULL, it will automatically try to get the transform from the raster dataset. But I have no idea how an in-memory dataset would have that information, so I keep thinking this theory must be wrong.
if( pfnTransformer == nullptr )
{
    bNeedToFreeTransformer = true;

    char** papszTransformerOptions = nullptr;
    double adfGeoTransform[6] = { 0.0 };
    if( poDS->GetGeoTransform( adfGeoTransform ) != CE_None &&
        poDS->GetGCPCount() == 0 &&
        poDS->GetMetadata("RPC") == nullptr )
    {
        papszTransformerOptions = CSLSetNameValue(
            papszTransformerOptions, "DST_METHOD", "NO_GEOTRANSFORM");
    }

    pTransformArg =
        GDALCreateGenImgProjTransformer2( nullptr, hDS,
                                          papszTransformerOptions );
    CSLDestroy( papszTransformerOptions );

    pfnTransformer = GDALGenImgProjTransform;
    if( pTransformArg == nullptr )
    {
        return CE_Failure;
    }
}


In the end I am just trying to figure out how in the world you are passing just the affine transform into GDALRasterizeGeometries as pTransformArg.

PS. I'm used to GitHub formatting, so this has been a little different to type up! In the end I was just trying to write some C++ code similar to rasterio's for rasterizing features in memory.

Re: GDALRasterizeGeometries - Function Signature

Sean Gillies
 

You're right, the transform that Rasterio passes to GDALRasterizeGeometries is not used. Worse, it looks like there is a risk of corruption. I'm going to change Rasterio to pass NULL instead.

When GDALRasterizeGeometries is called without a transformer, the following code is executed: https://github.com/OSGeo/gdal/blob/master/gdal/alg/gdalrasterize.cpp#L686-L710. Therein you can see how a transformer is constructed and configured.


On Mon, Dec 3, 2018 at 8:52 PM <jdcasta@...> wrote:
Hi, this is a bit of a strange question about how rasterio uses the GDALRasterizeGeometries function. This C function is called in _features.pyx here:

with InMemoryRaster(image=image, transform=transform) as mem:
    exc_wrap_int(
        GDALRasterizeGeometries(
            mem.handle(), 1, mem.band_ids, num_geoms, geoms, NULL,
            mem.gdal_transform, pixel_values, options, NULL, NULL))

My question has to do with arguments 6 and 7, pfnTransformer and pTransformArg. The Python function's docstring suggests that the affine matrix, transform, is used to convert between the shapes' coordinate system and the pixel coordinates of image. However, looking at the C++ source code for this function, I see this declaration:


CPLErr GDALRasterizeGeometries( GDALDatasetH hDS,
                                int nBandCount,
                                int *panBandList,
                                int nGeomCount,
                                OGRGeometryH *pahGeometries,
                                GDALTransformerFunc pfnTransformer,
                                void *pTransformArg,
                                double *padfGeomBurnValue,
                                char **papszOptions,
                                GDALProgressFunc pfnProgress,
                                void *pProgressArg )

As you can see, rasterio seems to be passing NULL for pfnTransformer and the affine transform as pTransformArg. But pTransformArg is not really meant to hold the affine matrix; it is meant to be the argument passed to a valid pfnTransformer.

I *think* everything still works because of the code below, which checks the arguments. If it sees that pfnTransformer is NULL, it will automatically try to get the transform from the raster dataset. But I have no idea how an in-memory dataset would have that information, so I keep thinking this theory must be wrong.
if( pfnTransformer == nullptr )
{
    bNeedToFreeTransformer = true;

    char** papszTransformerOptions = nullptr;
    double adfGeoTransform[6] = { 0.0 };
    if( poDS->GetGeoTransform( adfGeoTransform ) != CE_None &&
        poDS->GetGCPCount() == 0 &&
        poDS->GetMetadata("RPC") == nullptr )
    {
        papszTransformerOptions = CSLSetNameValue(
            papszTransformerOptions, "DST_METHOD", "NO_GEOTRANSFORM");
    }

    pTransformArg =
        GDALCreateGenImgProjTransformer2( nullptr, hDS,
                                          papszTransformerOptions );
    CSLDestroy( papszTransformerOptions );

    pfnTransformer = GDALGenImgProjTransform;
    if( pTransformArg == nullptr )
    {
        return CE_Failure;
    }
}


In the end I am just trying to figure out how in the world you are passing just the affine transform into GDALRasterizeGeometries as pTransformArg.

PS. I'm used to GitHub formatting, so this has been a little different to type up! In the end I was just trying to write some C++ code similar to rasterio's for rasterizing features in memory.



--
Sean Gillies

Re: GDALRasterizeGeometries - Function Signature

jdcasta@...
 

Yeah, I thought so. FYI, the code you linked is the same code I embedded above. It looks like it really is just relying on the in-memory raster already having a configured transform associated with it. I did the same to get it to work, like so:

GDALDatasetH memDS = GDALCreate( GDALGetDriverByName("MEM"), proc_num_str,
                                 cols, rows, 0, GDT_Byte, NULL );
// Set the transform on the in-memory dataset.
reinterpret_cast<GDALDataset*>(memDS)->SetGeoTransform(transform);

Here I set the in-memory raster dataset to use the specified transform (a double[6]). Now I can call GDALRasterizeGeometries with NULL for pfnTransformer and pTransformArg, and it is handled appropriately.

Thanks!

Best way of doing concurrent window processing

Pablo Sanfilippo
 

I've been looking for the best way of processing large raster files while achieving both maximum processing parallelism/speed and a low memory footprint. To achieve a low memory footprint, it is enough to iterate over windows and read, process, and write in the same loop:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, 'w', **src.profile) as dst:
        for ij, window in src.block_windows():
            array = src.read(window=window)
            result = compute(array)
            dst.write(result, window=window)

This has a low memory footprint: it has only one window in memory at a time. But it doesn't have concurrency.

So I looked at the concurrency examples (the one using asyncio coroutines and the one using ThreadPoolExecutor), which look like this:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]
        data_gen = (src.read(window=window) for window in windows)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            for window, result in zip(windows, executor.map(compute, data_gen)):
                dst.write(result, window=window)

Now we have computing concurrency. The comment on the original example made it seem like read -> compute -> write was being done in a streaming manner, overlapping all those operations:

We map the compute() function over the raster data generator, zip the resulting iterator with the windows list, and as pairs come back we write data to the destination dataset.

Which is not true. A verbose test showed that computing was overlapped with reading and also (later) with writing, but reading and writing weren't overlapped with each other, meaning that the whole file was in memory at some point. Monitoring of memory usage confirmed this.

This made more sense after reading the documentation of ThreadPoolExecutor.map: the iterables are collected immediately rather than lazily. So it consumes the data_gen iterator eagerly, putting all the data in memory if necessary.

The example that uses asyncio coroutines shows the same behavior.

My next idea was to put the read, compute, and write operations all together in the function being passed to ThreadPoolExecutor.map so they can be overlapped too, like so:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        def process(window):
            src_array = src.read(window=window)
            result = compute(src_array)
            dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

This obviously caused a race condition, because reading and writing are not thread-safe operations, so the output was filled with GDAL read errors.

So I fixed that by using locks for the read and write operations:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        read_lock = threading.Lock()
        write_lock = threading.Lock()

        def process(window):
            with read_lock:
                src_array = src.read(window=window)
            result = compute(src_array)
            with write_lock:
                dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

That fixed the race conditions. Computing is now done in parallel, reading and writing are done in a streaming way, and the maximum memory footprint is window_array_size * num_threads.
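For instance, with hypothetical 1024x1024 single-band float32 windows and 8 worker threads, that bound works out to roughly 1024 × 1024 × 4 bytes × 8 ≈ 32 MB.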

My questions are:

  • Is this the right way to solve this problem? Am I missing something?
  • If this is the right way, would you accept a PR with a topic write up about this (maybe enhancements in the concurrency topic)?

Re: Best way of doing concurrent window processing

Matthew Perry
 

Pablo,

I think your approach would make a great example and addition to the concurrency docs. Being able to do input, output, and computation in parallel is a huge benefit. The locks seem effective at avoiding race conditions, and I get a decent speedup (~3x on 4 cores) for a few test cases. So while I don't think there is, or ever will be, a single "right" way to solve concurrency, this is certainly a very viable option.

The only potential issue I see is that the `compute` function is required to release the GIL to get the full benefit. Most numpy array methods do this, but if you've got additional calculations in pure Python or call non-GIL-aware C functions, there's a chance that compute will block. If blocking computation significantly outweighs IO, you might not see much performance benefit, if any. Those users would be better served by some form of multiprocessing.

It's hard to know whether arbitrary Python code will hold the GIL, which makes it tricky to predict the performance characteristics of this technique. I think a note in the docs about the compute function and the GIL would be sufficient to warn users of any pitfalls.
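
For instance, a toy sketch of the difference, with hypothetical compute functions (not from the example above):

import numpy

def compute_vectorized(array):
    # numpy ufuncs release the GIL in their inner loops, so several
    # threads can run this concurrently.
    return numpy.sqrt(array.astype('float64'))

def compute_pure_python(array):
    # An element-wise Python loop holds the GIL throughout, so threads
    # serialize here and parallelism buys little.
    return numpy.array([[v ** 0.5 for v in row] for row in array])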

- Matt

Re: Best way of doing concurrent window processing

Sean Gillies
 

Hi Pablo,

I really appreciate your analysis of the concurrency example! Things have changed quite a bit in the newest versions of Python since it was written and it needs an update.

While researching the non-laziness of ThreadPoolExecutor.map, I discovered a few interesting tickets in the Python tracker. One of them, https://bugs.python.org/issue34168, has a suggestion from Tim Peters for chunking the input and using a combination of c.f.ThreadPoolExecutor.submit and c.f.as_completed, as in this example: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. The chunk size (not to be confused with c.f.ThreadPoolExecutor.map's chunksize) would set the upper limit on memory consumption. This could be worth a try; I'm going to try it soon myself.
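
A rough sketch of that idea, assuming the windows list and the locked process() function from your last example, plus a new chunk_size knob:

import concurrent.futures

def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
    for chunk in chunked(windows, chunk_size):
        # Only chunk_size windows are in flight at once, bounding memory.
        futures = [executor.submit(process, window) for window in chunk]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # propagate any worker exception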

I feel like introducing locks is a step back from the nice c.f. abstraction to raw threading. However, an example that used threading only could be nice for users that are still on Python 2.7.

Yours,

On Tue, Dec 4, 2018 at 1:26 PM Pablo Sanfilippo <sanfilippopablo@...> wrote:

I've been looking for the best way of processing large raster files while achieving both maximum processing parallelism/speed and a low memory footprint. To achieve a low memory footprint, it is enough to iterate over windows and read, process, and write in the same loop:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, 'w', **src.profile) as dst:
        for ij, window in src.block_windows():
            array = src.read(window=window)
            result = compute(array)
            dst.write(result, window=window)

This has a low memory footprint: it has only one window in memory at a time. But it doesn't have concurrency.

So I looked at the concurrency examples (the one using asyncio coroutines and the one using ThreadPoolExecutor), which look like this:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]
        data_gen = (src.read(window=window) for window in windows)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            for window, result in zip(windows, executor.map(compute, data_gen)):
                dst.write(result, window=window)

Now we have computing concurrency. The comment on the original example made it seem like read -> compute -> write was being done in a streaming manner, overlapping all those operations:

We map the compute() function over the raster data generator, zip the resulting iterator with the windows list, and as pairs come back we write data to the destination dataset.

Which is not true. A verbose test showed that computing was overlapped with reading and also (later) with writing, but reading and writing weren't overlapped with each other, meaning that the whole file was in memory at some point. Monitoring of memory usage confirmed this.

This made more sense after reading the documentation of ThreadPoolExecutor.map: the iterables are collected immediately rather than lazily. So it consumes the data_gen iterator eagerly, putting all the data in memory if necessary.

The example that uses asyncio coroutines shows the same behavior.

My next idea was to put the read, compute, and write operations all together in the function being passed to ThreadPoolExecutor.map so they can be overlapped too, like so:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        def process(window):
            src_array = src.read(window=window)
            result = compute(src_array)
            dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

This obviously caused a race condition, because reading and writing are not thread-safe operations, so the output was filled with GDAL read errors.

So I fixed that by using locks for the read and write operations:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        read_lock = threading.Lock()
        write_lock = threading.Lock()

        def process(window):
            with read_lock:
                src_array = src.read(window=window)
            result = compute(src_array)
            with write_lock:
                dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

That fixed the race conditions. Computing is now done in parallel, reading and writing are done in a streaming way, and the maximum memory footprint is window_array_size * num_threads.

My questions are:

  • Is this the right way to solve this problem? Am I missing something?
  • If this is the right way, would you accept a PR with a topic write up about this (maybe enhancements in the concurrency topic)?



--
Sean Gillies

rasterio Google Cloud Storage (GCS) access

Madhav Desetty
 

Hi 

I forked rasterio and am trying to add support for rasterio and rio to handle files on Google Cloud Storage. I made changes to path.py and added remote schemes to include gs:// URLs, similar to the AWS S3 support. I tested it locally on macOS, and the command below works:

rio info gs://pdd-stac/disasters/hurricane-harvey/0831/20170831_172754_101c_3b_Visual.tif

But when I package rasterio in a Docker container and try the same command through the container, I get the following error. Any ideas what I may be doing wrong?

ERROR 6: CPLRSASHA256Sign() not implemented: GDAL must be built against libcrypto++ or libcrypto (openssl)

Re: rasterio Google Cloud Storage (GCS) access

Sean Gillies
 

Hi,

Your GDAL library needs to be built on a curl library that is itself built with SSL support. I think that the GDAL and curl packages in most distros will meet this requirement, but if you are compiling GDAL and curl from source you'll need to assure something like

./configure --with-ssl

for curl and

./configure --with-curl

for GDAL.

Hope this helps!

On Sat, Dec 8, 2018 at 3:40 PM <madhav@...> wrote:

Hi 

I forked rasterio and am trying to add support for rasterio and rio to handle files on Google Cloud Storage. I made changes to path.py and added remote schemes to include gs:// URLs, similar to the AWS S3 support. I tested it locally on macOS, and the command below works:

rio info gs://pdd-stac/disasters/hurricane-harvey/0831/20170831_172754_101c_3b_Visual.tif

But when I package rasterio in a Docker container and try the same command through the container, I get the following error. Any ideas what I may be doing wrong?

ERROR 6: CPLRSASHA256Sign() not implemented: GDAL must be built against libcrypto++ or libcrypto (openssl)


--
Sean Gillies

Re: Best way of doing concurrent window processing

Pablo Sanfilippo
 

Hi Sean! Oh, that's really interesting. Thanks for sharing this. I'm going to do my own tests and then update the PR I made with the example.


On Fri, Dec 7, 2018 at 8:34 PM Sean Gillies via Groups.Io <sean=mapbox.com@groups.io> wrote:
Hi Pablo,

I really appreciate your analysis of the concurrency example! Things have changed quite a bit in the newest versions of Python since it was written and it needs an update.

While researching the non-laziness of ThreadPoolExecutor.map, I discovered a few interesting tickets in the Python tracker. One of them, https://bugs.python.org/issue34168, has a suggestion from Tim Peters for chunking the input and using a combination of c.f.ThreadPoolExecutor.submit and c.f.as_completed, as in this example: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. The chunk size (not to be confused with c.f.ThreadPoolExecutor.map's chunksize) would set the upper limit on memory consumption. This could be worth a try; I'm going to try it soon myself.

I feel like introducing locks is a step back from the nice c.f. abstraction to raw threading. However, an example that used threading only could be nice for users that are still on Python 2.7.

Yours,

On Tue, Dec 4, 2018 at 1:26 PM Pablo Sanfilippo <sanfilippopablo@...> wrote:

I've been looking for the best way of processing large raster files while achieving both maximum processing parallelism/speed and a low memory footprint. To achieve a low memory footprint, it is enough to iterate over windows and read, process, and write in the same loop:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, 'w', **src.profile) as dst:
        for ij, window in src.block_windows():
            array = src.read(window=window)
            result = compute(array)
            dst.write(result, window=window)

This has a low memory footprint: it has only one window in memory at a time. But it doesn't have concurrency.

So I looked at the concurrency examples (the one using asyncio coroutines and the one using ThreadPoolExecutor), which look like this:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]
        data_gen = (src.read(window=window) for window in windows)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            for window, result in zip(windows, executor.map(compute, data_gen)):
                dst.write(result, window=window)

Now we have computing concurrency. The comment on the original example made it seem like read -> compute -> write was being done in a streaming manner, overlapping all those operations:

We map the compute() function over the raster data generator, zip the resulting iterator with the windows list, and as pairs come back we write data to the destination dataset.

Which is not true. A verbose test showed that computing was overlapped with reading and also (later) with writing, but reading and writing weren't overlapped with each other, meaning that the whole file was in memory at some point. Monitoring of memory usage confirmed this.

This made more sense after reading the documentation of ThreadPoolExecutor.map: the iterables are collected immediately rather than lazily. So it consumes the data_gen iterator eagerly, putting all the data in memory if necessary.

The example that uses asyncio coroutines shows the same behavior.

My next idea was to put the read, compute, and write operations all together in the function being passed to ThreadPoolExecutor.map so they can be overlapped too, like so:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        def process(window):
            src_array = src.read(window=window)
            result = compute(src_array)
            dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

This obviously caused a race condition, because reading and writing are not thread-safe operations, so the output was filled with GDAL read errors.

So I fixed that by using locks for the read and write operations:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        read_lock = threading.Lock()
        write_lock = threading.Lock()

        def process(window):
            with read_lock:
                src_array = src.read(window=window)
            result = compute(src_array)
            with write_lock:
                dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

That fixed the race conditions. Computing is now done in parallel, reading and writing are done in a streaming way, and the maximum memory footprint is window_array_size * num_threads.

My questions are:

  • Is this the right way to solve this problem? Am I missing something?
  • If this is the right way, would you accept a PR with a topic write up about this (maybe enhancements in the concurrency topic)?



--
Sean Gillies

Re: rasterio Google Cloud Storage (GCS) access

Even Rouault
 

On Saturday 8 December 2018 16:05:52 CET Sean Gillies wrote:
Hi,

Your GDAL library needs to be built on a curl library that is itself built
with SSL support. I think that the GDAL and curl packages in most distros
will meet this requirement, but if you are compiling GDAL and curl from
source you'll need to assure something like
Actually, this warning is GDAL-specific and has nothing to do with curl (though of course curl without SSL support is rather useless). The service account authentication method of Google OAuth2 requires using an RSA signing key to create signed requests to GCS, and, for that, "GDAL must be built against libcrypto++ or libcrypto (openssl)" (sorry, can't find a better wording :-))

$ ./configure --help | grep crypto
  --with-cryptopp=ARG       Include cryptopp support (ARG=yes, no or path)
  --with-crypto=ARG         Include crypto (from openssl) support (ARG=yes, no
                            or path)

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com

Re: rasterio Google Cloud Storage (GCS) access

Madhav Desetty
 

I didn't build GDAL; I am using the base GDAL image from quay.io/mojodna/gdal:v2.3.2, and S3 access is working perfectly fine. If GDAL was not built with SSL, AWS S3 wouldn't have worked, correct? Also, is there a development guide that I can use to test rasterio and package it for distribution in Docker images? I looked at the Makefile but am not sure what the best practices are for forking and developing rasterio.

Re: rasterio Google Cloud Storage (GCS) access

Sean Gillies
 

Thanks for the clarification, Even!


On Sat, Dec 8, 2018 at 9:17 AM Even Rouault <even.rouault@...> wrote:
On Saturday 8 December 2018 16:05:52 CET Sean Gillies wrote:
> Hi,
>
> Your GDAL library needs to be built on a curl library that is itself built
> with SSL support. I think that the GDAL and curl packages in most distros
> will meet this requirement, but if you are compiling GDAL and curl from
> source you'll need to assure something like

Actually, this warning is GDAL-specific and has nothing to do with curl (though of course curl without SSL support is rather useless). The service account authentication method of Google OAuth2 requires using an RSA signing key to create signed requests to GCS, and, for that, "GDAL must be built against libcrypto++ or libcrypto (openssl)" (sorry, can't find a better wording :-))

$ ./configure --help | grep crypto
  --with-cryptopp=ARG       Include cryptopp support (ARG=yes, no or path)
  --with-crypto=ARG         Include crypto (from openssl) support (ARG=yes, no
                            or path)

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com





--
Sean Gillies