Re: rasterio Google Cloud Storage (GCS) access

Sean Gillies
 

Thanks for the clarification, Even!


On Sat, Dec 8, 2018 at 9:17 AM Even Rouault <even.rouault@...> wrote:
On Saturday, December 8, 2018 16:05:52 CET, Sean Gillies wrote:
> Hi,
>
> Your GDAL library needs to be built on a curl library that is itself built
> with SSL support. I think that the GDAL and curl packages in most distros
> will meet this requirement, but if you are compiling GDAL and curl from
> source you'll need to ensure something like

Actually, this warning is GDAL specific, and has nothing to do with curl (but
of course curl without SSL support is rather useless). The service account
authentication method of Google OAuth2 requires using an RSA signing key to
create signed requests to GCS, and, for that, "GDAL must be built against
libcrypto++ or libcrypto (openssl)" (sorry, can't find a better wording :-))

$ ./configure --help | grep crypto
  --with-cryptopp=ARG       Include cryptopp support (ARG=yes, no or path)
  --with-crypto=ARG         Include crypto (from openssl) support (ARG=yes, no or path)

Even

--
Spatialys - Geospatial professional services
http://www.spatialys.com





--
Sean Gillies

Re: Best way of doing concurrent window processing

Guy Doulberg
 

Hi guys

Wanted to share my 10 cents on this subject.

I am trying to do the concurrency support for reading by creating a VRT per window. VRTs are just XML, so they can be created eagerly in the map execution; moreover, each VRT is a separate file, so accessing the file parts doesn't require any lock - GDAL handles this for you.
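A minimal sketch of the idea (the source path, single uint8 band, and window size here are only placeholders):

import tempfile

import rasterio
from rasterio.windows import Window

# GDAL VRT that exposes one window of a source file as its own dataset.
VRT_TEMPLATE = """<VRTDataset rasterXSize="{w}" rasterYSize="{h}">
  <VRTRasterBand dataType="Byte" band="1">
    <SimpleSource>
      <SourceFilename relativeToVRT="0">{path}</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="{x}" yOff="{y}" xSize="{w}" ySize="{h}"/>
      <DstRect xOff="0" yOff="0" xSize="{w}" ySize="{h}"/>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>"""

def window_vrt(path, window):
    """Write a one-window VRT to a temp file and return its name."""
    xml = VRT_TEMPLATE.format(
        path=path, x=window.col_off, y=window.row_off,
        w=int(window.width), h=int(window.height))
    with tempfile.NamedTemporaryFile(
            suffix=".vrt", delete=False, mode="w") as f:
        f.write(xml)
    return f.name

# Each worker opens its own VRT, so no shared handle and no lock.
# Use an absolute source path since relativeToVRT is 0.
vrt = window_vrt("/data/input.tif", Window(0, 0, 512, 512))
with rasterio.open(vrt) as src:
    print(src.read(1).shape)  # (512, 512)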

Guy

Re: Best way of doing concurrent window processing

Sean Gillies
 

Hi Pablo,

I don't want to preempt you, but here's what I've come up with: https://gist.github.com/sgillies/b90a79917d7ec5ca0c074b5f6f4857e3#file-cfrw-py. If the window reads are done by the worker threads, the memory usage is proportional to the number of workers plus whatever is in the sequence of complete but unwritten futures. If writing a window is faster than computing on it, this allocation will be small. If computing is faster than writing, one could employ the chunkify() helper (which I got from Peters' example) to cap the number of futures and result arrays.

Opening a dataset for each window adds some overhead. A possible optimization: create a pool of dataset objects up front that is at least as large as the number of workers and then the workers could pop a dataset object from this pool and return it when done reading.
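Here's a rough, untested sketch of that pool idea using a queue.Queue (infile, outfile, and compute are stand-ins for real ones):

import concurrent.futures
import queue
import threading

import rasterio

def pooled_process(infile, outfile, compute, num_workers=4):
    """Read windows via a pool of open datasets, compute, write with a lock."""
    with rasterio.open(infile) as first:
        profile = first.profile
        windows = [window for ij, window in first.block_windows()]

    # Pre-open one reader per worker; workers borrow and return them.
    readers = queue.Queue()
    for _ in range(num_workers):
        readers.put(rasterio.open(infile))

    write_lock = threading.Lock()

    try:
        with rasterio.open(outfile, "w", **profile) as dst:

            def process(window):
                src = readers.get()
                try:
                    data = src.read(window=window)
                finally:
                    readers.put(src)
                result = compute(data)
                with write_lock:
                    dst.write(result, window=window)

            with concurrent.futures.ThreadPoolExecutor(
                max_workers=num_workers
            ) as executor:
                # list() forces any worker exception to propagate here.
                list(executor.map(process, windows))
    finally:
        while not readers.empty():
            readers.get().close()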

On Sat, Dec 8, 2018 at 9:14 AM Pablo Sanfilippo <sanfilippopablo@...> wrote:
Hi Sean! Oh, that's really interesting. Thanks for sharing this. I'm going to do my own tests and then update the PR I made with the example.

On Fri, Dec 7, 2018 at 8:34 PM Sean Gillies via Groups.Io <sean=mapbox.com@groups.io> wrote:
Hi Pablo,

I really appreciate your analysis of the concurrency example! Things have changed quite a bit in the newest versions of Python since it was written and it needs an update.

While researching the non-laziness of ThreadPoolExecutor.map I discovered a few interesting tickets in the Python tracker. One https://bugs.python.org/issue34168 has a suggestion from Tim Peters for chunking the input and using a combination of c.f.ThreadPoolExecutor.submit and c.f.as_completed as in this example: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. The chunk size (not to be confused with c.f.ThreadPoolExecutor.map's chunksize) would set the upper limit on memory consumption. This could be worth a try. I'm going to try it soon myself.
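In rough form the pattern might look like the following (a sketch under my reading of that ticket, not the gist's exact code; process, windows, and num_workers as in your examples):

import concurrent.futures
import itertools

def chunkify(iterable, chunk=100):
    # Yield lists of up to `chunk` items from `iterable`.
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, chunk))
        if not batch:
            return
        yield batch

# Submitting one chunk at a time and draining it with as_completed caps
# the number of live futures and result arrays.
with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
    for batch in chunkify(windows, chunk=100):
        futures = [executor.submit(process, window) for window in batch]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # re-raise any worker exception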

I feel like introducing locks is a step back from the nice c.f. abstraction to raw threading. However, an example that used threading only could be nice for users that are still on Python 2.7.

Yours,

On Tue, Dec 4, 2018 at 1:26 PM Pablo Sanfilippo <sanfilippopablo@...> wrote:

I've been looking for the best way of processing large raster files while achieving both maximum processing parallelism/speed and a low memory footprint. To achieve a low memory footprint, it is enough to iterate over windows and read, process, and write in the same loop:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, 'w', **src.profile) as dst:
        for ij, window in src.block_windows():
            array = src.read(window=window)
            result = compute(array)
            dst.write(result, window=window)

This has a low memory footprint: it has only one window in memory at a time. But it doesn't have concurrency.

So I looked at the concurrency examples (the one using asyncio coroutines and the one using ThreadPoolExecutor), which look like this:

with rasterio.open(infile, 'r') as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]
        data_gen = (src.read(window=window) for window in windows)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            for window, result in zip(windows, executor.map(compute, data_gen)):
                dst.write(result, window=window)

Now we have computing concurrency. The comment on the original example made it seem like read -> compute -> write was being done in a streaming manner, overlapping all those operations:

We map the compute() function over the raster data generator, zip the resulting iterator with the windows list, and as pairs come back we write data to the destination dataset.

Which is not true. A verbose test showed that computing was overlapped with reading and also (later) with writing, but reading and writing weren't overlapped with each other, meaning that the whole file was in memory at some point. Monitoring of memory usage confirmed this.

This made more sense after reading the documentation of ThreadPoolExecutor.map: "the iterables are collected immediately rather than lazily." So it will consume the data_gen iterator eagerly, putting all the data in memory if necessary.

The example that uses asyncio coroutines shows the same behavior.

My next idea was to put the read, compute, and write operations all together in the function being passed to ThreadPoolExecutor.map so they can be overlapped too, like so:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        def process(window):
            src_array = src.read(window=window)
            result = compute(src_array)
            dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

This obviously caused a race condition because reading and writing are not thread-safe operations, so the output was filled with GDAL read errors.

So I fixed that by using locks for the read and write operations:

with rasterio.open(infile) as src:
    with rasterio.open(outfile, "w", **src.profile) as dst:
        windows = [window for ij, window in dst.block_windows()]

        read_lock = threading.Lock()
        write_lock = threading.Lock()

        def process(window):
            with read_lock:
                src_array = src.read(window=window)
            result = compute(src_array)
            with write_lock:
                dst.write(result, window=window)

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(process, windows)

That fixed the race conditions. Computing is now done in parallel, reading and writing is done in a streaming way, and max memory footprint is window_array_size * num_threads.

My questions are:

  • Is this the right way to solve this problem? Am I missing something?
  • If this is the right way, would you accept a PR with a topic write up about this (maybe enhancements in the concurrency topic)?



--
Sean Gillies



--
Sean Gillies

Loading large ENVI rasters into a MemoryFile

nickubels@...
 

Hello,

 

In my project I'm dealing with 217 hyperspectral raster files in ENVI format. These rasters contain 420 bands, which means I have to deal with files that are 30GB+ in size. To keep things manageable I'm working on a high-performance cluster where I can perform the calculations on these rasters in parallel by splitting the work into several tasks. The primary operation in this case is grabbing masks from these rasters (I'm only interested in the raster parts inside building polygons). As I/O is a bit of a bottleneck on the cluster, I was considering loading the raster into RAM to speed up processing, since reading from the network storage isn't fast enough. I also tried copying the file to the local disk of the node, but when several of the subtasks are assigned to the same node things quickly grind to a halt.

However, I can’t really get MemoryFile to work. I tried two approaches:

data = open(rasterpath, 'rb').read()
with MemoryFile(data) as memfile:
    with memfile.open() as raster:
        print(raster.profile)


And

data = rasterio.open(rasterpath).read()
with MemoryFile(data) as memfile:
    with memfile.open() as raster:
        print(raster.profile)


In the first case, I’m getting the following stack trace:

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 199, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 188, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/s2912279/bachelorproject/Code/generate_trainingset.py", line 118, in <module>
    with memfile.open() as raster:
  File "/home/s2912279/bachelorproject/Code/venv/lib/python3.6/site-packages/rasterio/env.py", line 366, in wrapper
    return f(*args, **kwds)
  File "/home/s2912279/bachelorproject/Code/venv/lib/python3.6/site-packages/rasterio/io.py", line 130, in open
    return DatasetReader(vsi_path, driver=driver, **kwargs)
  File "rasterio/_base.pyx", line 201, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

In the second case the following stack trace is generated, with very poor performance (it takes a good 25 minutes to load 30GB, whereas the above takes about 3 minutes to load all the data into RAM).

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 199, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 188, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsimem/9ce098cf-1d79-4a73-88e4-5ada1bd35b1f.' not recognized as a supported file format.

 

I'm doubting whether my way of using the MemoryFile functionality is correct. Is there something I'm doing wrong or am I missing something?

 

Kind regards,

 

Nick

Re: Best way of doing concurrent window processing

Pablo Sanfilippo
 

I don't want to preempt you
 
Not at all! This is great stuff. I'm a little short of time right now, but as soon as I can I'll test all of this. I don't know how much of an overhead opening a dataset per window is, but a dataset pool sounds like a good idea.

On Sun, Dec 9, 2018 at 1:56 PM Sean Gillies via Groups.Io <sean=mapbox.com@groups.io> wrote:
Hi Pablo,

I don't want to preempt you, but here's what I've come up with: https://gist.github.com/sgillies/b90a79917d7ec5ca0c074b5f6f4857e3#file-cfrw-py.
...

Re: Loading large ENVI rasters into a MemoryFile

Sean Gillies
 

Hi Nick,

I'm not an expert in the ENVI format or HPC, but I have a suggestion for you. As I understand from reading https://www.gdal.org/frmt_various.html, your ENVI dataset is represented by multiple files, yes? A binary file and a .hdr file? If you've copied only one of these into a rasterio MemoryFile, you won't be able to open it.

There are two different ways to get your dataset into RAM while preserving its format and I've recently added tests to the rasterio test suite to demonstrate that they work:


I think the second is the easier one if you only have two files to deal with, such as the .tif and the .tif.msk in that test case, or a .bin and a .hdr (or something like that) in your case. Here's a code snippet to try:

with MemoryFile(bin_bytes, filename='foo.bin') as memfile, \
        MemoryFile(hdr_bytes, filename='foo.hdr'):
    with memfile.open() as src:
        print(src.profile)

You won't need to bind the header memory file to a name because according to the GDAL docs, the binary file is the one to open.

Hope this helps.


On Tue, Dec 11, 2018 at 7:18 AM <nickubels@...> wrote:

Hello,

In my project I'm dealing with 217 hyperspectral raster files in ENVI format.
...



--
Sean Gillies

Re: Loading large ENVI rasters into a MemoryFile

nickubels@...
 

Hi Sean,

I have tried your suggestion and it indeed works like a charm now. I totally forgot about the header file that was needed to open the binary data file. 

Thanks for your help and your awesome contributions!

Kind regards,

Nick

Re: Best way of doing concurrent window processing

Pablo Sanfilippo
 

Some test results. I tried these methods on a 2.2GB image using sleep(0.05) as the compute function. Results:

Method                           Time       Peak memory
Single thread window iteration   15:31.97   454668
simple ThreadPoolExecutor        3:47.76    1359364
chunked ThreadPoolExecutor       4:21.40    475624
Reused datasets with locks       3:55.75    496556

I guess this really shows the overhead of opening a new dataset for each window read in the chunked ThreadPoolExecutor. The advantage is that, unlike the reused-datasets-with-locks method, you can use ProcessPoolExecutor to overcome the GIL when the compute function doesn't release it. It might be worth trying to make a dataset pool, but if I'm thinking about it correctly, it wouldn't work with ProcessPoolExecutor (which in my opinion is the biggest advantage of this method). Actually, thinking aloud, the reused dataset with locks is a simple case of a dataset pool (just one dataset).

On Tue, Dec 11, 2018 at 2:22 PM Pablo Sanfilippo via Groups.Io <sanfilippopablo=gmail.com@groups.io> wrote:
I don't want to preempt you

Not at all! This is great stuff.
...

reproject on numpy array

Guy Doulberg
 

Hi guys

I am trying to use the method `rasterio.warp.reproject`.

I created a source numpy array full of zeros and a dest numpy array full of ones, then reprojected using nearest resampling and src_nodata=0.
The dest array became all zeros.

I expected it to be all ones, since the source array didn't have any valid data.

What am I missing?

I am running this code:


import numpy as np
from rasterio.warp import Resampling, reproject

# raster and projected are datasets opened earlier in the session
data = np.zeros((1, 1000, 1000))
dest = np.ones_like(data)
reproject(
    data, dest,
    src_transform=raster.affine, dst_transform=projected.affine,
    src_crs=raster.crs, dst_crs=projected.crs,
    resampling=Resampling.nearest, src_nodata=data[0][0][0])
print(src_transform, "\n", dst_transform, "\n", src_crs, "\n", dst_crs)
print(data)
print(dest)

and this is the output I am getting

| 10.00, 0.00, 640590.00|
| 0.00,-10.00, 5839580.00|
| 0.00, 0.00, 1.00| 
 | 0.00, 0.00,-73.41|
| 0.00,-0.00,-37.58|
| 0.00, 0.00, 1.00| 
 +init=epsg:32718 
 +init=epsg:4326
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]


What am I missing?

Re: reproject on numpy array

Alan Snow
 

Hi Guy,

The dest array will be overwritten by the reprojected data based on your source array. So, to get ones in the destination, you would need to modify the source array to have ones in it.

Hope that helps,
Alan

Re: reproject on numpy array

Sean Gillies
 

Hi,

The thing that is tripping Guy up is that reproject has an optional init_dest_nodata keyword argument that defaults to True: the output array is filled with the nodata value (0) before warped pixels are copied over. Guy, if you pass init_dest_nodata=False you will get the results you expect.
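For example, with the arrays from your script (raster and projected as you defined them):

from rasterio.warp import Resampling, reproject

# With init_dest_nodata=False, destination pixels that receive no valid
# source data keep their existing values (the ones).
reproject(
    data, dest,
    src_transform=raster.affine, dst_transform=projected.affine,
    src_crs=raster.crs, dst_crs=projected.crs,
    resampling=Resampling.nearest, src_nodata=0,
    init_dest_nodata=False)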


On Tue, Dec 18, 2018 at 10:06 AM Alan Snow <alansnow21@...> wrote:
Hi Guy,

The dest array will be overwritten by the reprojected data based on your source array. So, to get ones in the destination, you would need to modify the source array to have ones in it.

Hope that helps,
Alan



--
Sean Gillies

Change the Nodata value in a file.

@cratcliff
 

Hi Everyone,

Using RasterIO, what is the correct way to change (or assign if it isn't already set) the no data value for a single or multi band TIF file?

For example, using the test RGB.byte.tif file, how do I change the nodata value from 0 to 255? So far I have tried:
with rasterio.open('RGB.byte.tif') as src:
    bands = src.read(masked=True)
    bands.fill_value = 255
    # or
    # bands.set_fill_value(255)

    kwargs = src.meta.copy()
    kwargs.update({'nodata': 255})

    with rasterio.open('copy.RGB.byte.tif', 'w', **kwargs) as dst:
        dst.write(bands)
Currently this only changes tif header/metadata information and not the actual data.

Options I've seen to accomplish this include
# 1. masked arrays
bands = np.ma.masked_values(src.read(masked=True), 255)
dst.write(bands)

# 2. loop through bands and change values
for i in range(1, src.count + 1):
    band = src.read(i)
    band = np.where(band == 0, 255, band)
    dst.write(band, i)

# 3. numpy indexing
bands = src.read()
bands[bands == 0] = 255
Or should I be using rasterIO's read_mask(), write_mask() or dataset_mask()?

What is the preferred method? I haven't been able to find this in any of the user documentation.

Thanks,

Re: Change the Nodata value in a file.

Sean Gillies
 

Hi,

I'm sorry this isn't more obviously documented. You can change the nodata value of an existing dataset by opening it in r+ mode and setting the attribute.

with rasterio.open("example.tif", "r+") as dataset:
    dataset.nodata = 255

Same goes for the crs and transform attributes. Hope this helps!
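For example (the values here are only placeholders):

from rasterio.crs import CRS
from rasterio.transform import Affine

with rasterio.open("example.tif", "r+") as dataset:
    dataset.crs = CRS.from_epsg(32618)
    dataset.transform = Affine(10.0, 0.0, 640590.0, 0.0, -10.0, 5839580.0)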


On Wed, Dec 19, 2018 at 6:58 AM <christina.ratcliff@...> wrote:
Hi Everyone,

Using RasterIO, what is the correct way to change (or assign if it isn't already set) the no data value for a single or multi band TIF file?
...



--
Sean Gillies

Re: Change the Nodata value in a file.

Jonas
 

The `r+` trick is very neat for changing the metadata in-place.

But for actually changing the data values to match the new metadata, both your suggestions are fine - either use masked arrays or do the masking yourself. There is no smarter method, and depending on your data type you have to take care to get the masking right (or rely on `read(..., masked=True)` for it), e.g. for floating point comparison or NaN.
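For integer data like RGB.byte.tif, a sketch of doing it yourself (changing the nodata value from 0 to 255) could look like:

import rasterio

with rasterio.open('RGB.byte.tif') as src:
    kwargs = src.meta.copy()
    kwargs.update(nodata=255)
    data = src.read(masked=True)  # masked wherever a pixel equals the old nodata (0)

with rasterio.open('copy.RGB.byte.tif', 'w', **kwargs) as dst:
    dst.write(data.filled(255))  # fill masked pixels with the new nodata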

Btw, this question is ideal for Stack Overflow... -- in this case you could post and accept your own answer, though. :-D

Re: [rasterio-dev] Writing azure blobs does not create blobs

Sean Gillies
 

Hi Niklas,

First off, in my original reply I missed the chance to suggest that we move this discussion to the main (not dev) discussion group, which has more users. The dev group is for discussing and planning new features for the library. I'm going to cross post over to main@rasterio.groups.io and would appreciate it if you would follow up there.

Nothing in the logs jumps out at me, so I'm stumped. I don't have any experience using Azure blob storage and have never used GDAL or Rasterio to write to cloud objects of any kind. In my own applications, I write local files and then upload them to AWS S3 using the AWS Python SDK (boto3 &c). There may be some limitations or required GDAL configuration that I am not aware of.
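For what it's worth, that workflow is roughly the following (bucket and key names are placeholders, and the tiny profile is just for illustration):

import boto3
import numpy as np
import rasterio

profile = dict(driver='GTiff', width=100, height=100, count=1, dtype='uint8')

# Write a local file with rasterio ...
with rasterio.open('/tmp/example.tif', 'w', **profile) as dst:
    dst.write(np.zeros((1, 100, 100), dtype='uint8'))

# ... then upload it with boto3.
boto3.client('s3').upload_file('/tmp/example.tif', 'my-bucket', 'data/example.tif')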

Are you able to write blobs to the same Azure bucket using gdal_translate instead of a Rasterio program?

And of course, we really need to know what versions of Rasterio and GDAL are employed here and where they come from: PyPI or Conda-forge or where else?

On Thu, Jan 3, 2019 at 5:25 AM Niklas Heim <heim.niklas@...> wrote:
Hi Sean,

Thank you for your reply!
After setting the CURL_CA_BUNDLE environment variable I could run the above script without GDAL_HTTP_UNSAFESSL.
The logs with CPL_CURL_VERBOSE enabled are attached below. It looks like the new blob is not created?

DEBUG:rasterio.env:Entering env context: <rasterio.env.Env object at 0x7f36b3c0c630>
DEBUG:rasterio.env:Starting outermost env
DEBUG:rasterio.env:No GDAL environment exists
DEBUG:rasterio.env:New GDAL environment <rasterio._env.GDALEnv object at 0x7f3699eab198> created
DEBUG:rasterio._env:GDAL_DATA not found in environment, set to '/home/nihe/.envs/azure/lib/python3.6/site-packages/rasterio/gdal_data'.
DEBUG:rasterio._env:Started GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Entered env context: <rasterio.env.Env object at 0x7f36b3c0c630>
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Entering env context: <rasterio.env.Env object at 0x7f36b3cc8b70>
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Entered env context: <rasterio.env.Env object at 0x7f36b3cc8b70>
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
*   Trying 52.239.213.4...
* TCP_NODELAY set
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=*.blob.core.windows.net
*  start date: Nov  9 05:42:30 2017 GMT
*  expire date: Nov  9 05:42:30 2019 GMT
*  subjectAltName: host "AZURE-STORAGE-NAME.blob.core.windows.net" matched cert's "*.blob.core.windows.net"
*  issuer: C=US; ST=Washington; L=Redmond; O=Microsoft Corporation; OU=Microsoft IT; CN=Microsoft IT TLS CA 2
*  SSL certificate verify ok.
> GET /products?comp=list&delimiter=%2F&prefix=data%2F&restype=container HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
x-ms-date: Thu, 03 Jan 2019 11:55:58 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
* Found bundle for host AZURE-STORAGE-NAME.blob.core.windows.net: 0x1b25870 [can pipeline]
* Re-using existing connection! (#0) with host AZURE-STORAGE-NAME.blob.core.windows.net
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
> GET /products/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
Range: bytes=0-10161
x-ms-date: Thu, 03 Jan 2019 11:55:59 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 206 Partial Content
< Content-Length: 10162
< Content-Type: application/octet-stream
< Content-Range: bytes 0-10161/10162
< Last-Modified: Thu, 20 Dec 2018 10:25:26 GMT
< Accept-Ranges: bytes
< ETag: "0x8D666657591C03E"
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< x-ms-lease-status: unlocked
< x-ms-lease-state: available
< x-ms-blob-type: BlockBlob
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
/home/nihe/.envs/azure/lib/python3.6/site-packages/rasterio/__init__.py:216: NotGeoreferencedWarning: Dataset has no geotransform set. The identity matrix may be returned.
  s = DatasetReader(path, driver=driver, **kwargs)
DEBUG:rasterio._base:No projection detected.
DEBUG:rasterio._base:Nodata success: 0, Nodata value: -10000000000.000000
DEBUG:rasterio._base:No projection detected.
DEBUG:rasterio._base:Dataset <open DatasetReader name='/vsiaz/products/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='r'> is started.
DEBUG:rasterio.env:Exiting env context: <rasterio.env.Env object at 0x7f36b3cc8b70>
DEBUG:rasterio.env:Cleared existing <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio._env:Stopped GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:No GDAL environment exists
DEBUG:rasterio.env:New GDAL environment <rasterio._env.GDALEnv object at 0x7f3699eab198> created
DEBUG:rasterio._env:GDAL_DATA not found in environment, set to '/home/nihe/.envs/azure/lib/python3.6/site-packages/rasterio/gdal_data'.
DEBUG:rasterio._env:Started GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:Exited env context: <rasterio.env.Env object at 0x7f36b3cc8b70>
DEBUG:rasterio._base:Entering Dataset <open DatasetReader name='/vsiaz/products/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='r'> context.
DEBUG:rasterio._io:Output nodata value read from file: None
DEBUG:rasterio._io:Output nodata values: [None]
DEBUG:rasterio._io:all_valid: True
DEBUG:rasterio._io:mask_flags: ([<MaskFlags.all_valid: 1>],)
DEBUG:rasterio._io:Jump straight to _read()
DEBUG:rasterio._io:Window: None
DEBUG:rasterio._io:IO window xoff=0.0 yoff=0.0 width=100.0 height=100.0
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
* Found bundle for host AZURE-STORAGE-NAME.blob.core.windows.net: 0x1b25870 [can pipeline]
* Re-using existing connection! (#0) with host AZURE-STORAGE-NAME.blob.core.windows.net
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
> GET /products/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
Range: bytes=162-10161
x-ms-date: Thu, 03 Jan 2019 11:55:59 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 206 Partial Content
< Content-Length: 10000
< Content-Type: application/octet-stream
< Content-Range: bytes 162-10161/10162
< Last-Modified: Thu, 20 Dec 2018 10:25:26 GMT
< Accept-Ranges: bytes
< ETag: "0x8D666657591C03E"
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< x-ms-lease-status: unlocked
< x-ms-lease-state: available
< x-ms-blob-type: BlockBlob
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
DEBUG:rasterio._base:Dataset <open DatasetReader name='/vsiaz/products/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='r'> has been stopped.
DEBUG:rasterio._base:Dataset <closed DatasetReader name='/vsiaz/products/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='r'> has been closed.
DEBUG:rasterio._base:Exited Dataset <closed DatasetReader name='/vsiaz/products/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='r'> context.
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Entering env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Entered env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio._io:Path: UnparsedPath(path='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif'), mode: w, driver: GTiff
DEBUG:rasterio._io:Option: ('TILED', b'FALSE')
DEBUG:rasterio._io:Option: ('INTERLEAVE', b'BAND')
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
* Found bundle for host AZURE-STORAGE-NAME.blob.core.windows.net: 0x1b25870 [can pipeline]
* Re-using existing connection! (#0) with host AZURE-STORAGE-NAME.blob.core.windows.net
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
> GET /ready?comp=list&delimiter=%2F&prefix=data%2F&restype=container HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
x-ms-date: Thu, 03 Jan 2019 11:55:59 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
* Found bundle for host AZURE-STORAGE-NAME.blob.core.windows.net: 0x1b25870 [can pipeline]
* Re-using existing connection! (#0) with host AZURE-STORAGE-NAME.blob.core.windows.net
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
> HEAD /ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
x-ms-date: Thu, 03 Jan 2019 11:55:59 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 404 The specified blob does not exist.
< Transfer-Encoding: chunked
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
* Found bundle for host AZURE-STORAGE-NAME.blob.core.windows.net: 0x1b25870 [can pipeline]
* Re-using existing connection! (#0) with host AZURE-STORAGE-NAME.blob.core.windows.net
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
> GET /ready?comp=list&delimiter=%2F&maxresults=1&prefix=data%2F9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif%2F&restype=container HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
x-ms-date: Thu, 03 Jan 2019 11:55:59 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
* Found bundle for host AZURE-STORAGE-NAME.blob.core.windows.net: 0x1b25870 [can pipeline]
* Re-using existing connection! (#0) with host AZURE-STORAGE-NAME.blob.core.windows.net
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
> HEAD /ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif.xml HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
x-ms-date: Thu, 03 Jan 2019 11:55:59 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 404 The specified blob does not exist.
< Transfer-Encoding: chunked
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
* Couldn't find host AZURE-STORAGE-NAME.blob.core.windows.net in the .netrc file; using defaults
* Found bundle for host AZURE-STORAGE-NAME.blob.core.windows.net: 0x1b25870 [can pipeline]
* Re-using existing connection! (#0) with host AZURE-STORAGE-NAME.blob.core.windows.net
* Connected to AZURE-STORAGE-NAME.blob.core.windows.net (52.239.213.4) port 443 (#0)
> GET /ready?comp=list&delimiter=%2F&maxresults=1&prefix=data%2F9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif.xml%2F&restype=container HTTP/1.1
Host: AZURE-STORAGE-NAME.blob.core.windows.net
Accept: */*
x-ms-date: Thu, 03 Jan 2019 11:55:59 GMT
x-ms-version: 2015-02-21
Authorization: SharedKey AZURE-STORAGE-NAME:AZURE-KEY

< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: REQUEST-ID
< x-ms-version: 2015-02-21
< Date: Thu, 03 Jan 2019 11:55:58 GMT
< 
* Connection #0 to host AZURE-STORAGE-NAME.blob.core.windows.net left intact
DEBUG:rasterio._io:Skipped delete for overwrite.  Dataset does not exist: b'/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif'
/home/nihe/.envs/azure/lib/python3.6/site-packages/rasterio/__init__.py:225: NotGeoreferencedWarning: The given matrix is equal to Affine.identity or its flipped counterpart. GDAL may ignore this matrix and save no geotransform without raising an error. This behavior is somewhat driver-specific.
  **kwargs)
DEBUG:rasterio._base:No projection detected.
DEBUG:rasterio._base:Nodata success: 0, Nodata value: -10000000000.000000
DEBUG:rasterio._base:No projection detected.
DEBUG:rasterio.env:Exiting env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio.env:Cleared existing <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio._env:Stopped GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:No GDAL environment exists
DEBUG:rasterio.env:New GDAL environment <rasterio._env.GDALEnv object at 0x7f3699eab198> created
DEBUG:rasterio._env:GDAL_DATA not found in environment, set to '/home/nihe/.envs/azure/lib/python3.6/site-packages/rasterio/gdal_data'.
DEBUG:rasterio._env:Started GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:Exited env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio._base:Entering Dataset <open DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> context.
DEBUG:rasterio._io:Dataset <closed DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> has been stopped.
DEBUG:rasterio._base:Dataset <closed DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> has been closed.
DEBUG:rasterio._base:Exited Dataset <closed DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> context.
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Entering env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio.env:Entered env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio._io:Path: UnparsedPath(path='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif'), mode: w, driver: GTiff
DEBUG:rasterio._io:Option: ('TILED', b'FALSE')
DEBUG:rasterio._io:Option: ('INTERLEAVE', b'BAND')
DEBUG:rasterio._io:Skipped delete for overwrite.  Dataset does not exist: b'/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif'
DEBUG:rasterio._base:No projection detected.
DEBUG:rasterio._base:Nodata success: 0, Nodata value: -10000000000.000000
DEBUG:rasterio._base:No projection detected.
DEBUG:rasterio.env:Exiting env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio.env:Cleared existing <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio._env:Stopped GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:No GDAL environment exists
DEBUG:rasterio.env:New GDAL environment <rasterio._env.GDALEnv object at 0x7f3699eab198> created
DEBUG:rasterio._env:GDAL_DATA not found in environment, set to '/home/nihe/.envs/azure/lib/python3.6/site-packages/rasterio/gdal_data'.
DEBUG:rasterio._env:Started GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:Exited env context: <rasterio.env.Env object at 0x7f3699b73080>
DEBUG:rasterio._base:Entering Dataset <open DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> context.
DEBUG:rasterio._io:Dataset <closed DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> has been stopped.
DEBUG:rasterio._base:Dataset <closed DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> has been closed.
DEBUG:rasterio._base:Exited Dataset <closed DatasetWriter name='/vsiaz/ready/data/9440fcbd-1e7f-48af-92ab-70c4c11c1fa7.tif' mode='w'> context.
DEBUG:rasterio.env:Exiting env context: <rasterio.env.Env object at 0x7f36b3c0c630>
DEBUG:rasterio.env:Cleared existing <rasterio._env.GDALEnv object at 0x7f3699eab198> options
DEBUG:rasterio._env:Stopped GDALEnv <rasterio._env.GDALEnv object at 0x7f3699eab198>.
DEBUG:rasterio.env:Exiting outermost env
DEBUG:rasterio.env:Exited env context: <rasterio.env.Env object at 0x7f36b3c0c630>



--
Sean Gillies

Re: [rasterio-dev] debugging build_overviews

Sean Gillies
 

Hi Guy,

Let's try to answer your questions on the main discussion group. The dev group is for discussing development of the library itself: new API methods, infrastructure, governance, &c. I've cross-posted to main@rasterio.groups.io.

Yes, if you configure and build a debug-enabled version of GDAL (see the GDAL wiki for instructions) and build Rasterio using that build you can run your program under gdb or lldb and set breakpoints in Rasterio and GDAL. However, there is very little logic in the Rasterio code around building overviews:


We do some pre-overview error handling and then call GDALBuildOverviews:


I have seen various reports of overview bugs in the old GDAL tracker, mostly fixed, but maybe there is a lingering one in your case. It may be worth a check in the tracker for open issues or ones that sound familiar.

Simpler than using a debugger: add some logging statements to rasterio/_io.pyx in a custom build of Rasterio to report the overview factors and resampling method passed to GDALBuildOverviews.
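For reference, a typical overview build from the Python side looks like this (the factors and resampling method are only examples, not necessarily yours):

import logging

import rasterio
from rasterio.enums import Resampling

logging.basicConfig(level=logging.DEBUG)  # surface rasterio's debug log messages

with rasterio.open('example.tif', 'r+') as dst:
    dst.build_overviews([2, 4, 8], Resampling.average)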

On Wed, Jan 2, 2019 at 1:59 AM Guy Doulberg <guyd@...> wrote:
Hi guys,

I wonder if you can give me some ideas about something,
I am working on a raster (with mask) that I am creating overviews for.

It looks like in some of the overviews (not all) I am getting artifacts; these artifacts look like the product of using masked values in the resampling algo.

How would you debug such a problem?

If I understand correctly, some of the code that handles building overviews is Cython code (Rasterio) and some is part of the GDAL C code. Can I run a debug tool that lets me traverse these two code bases and add breakpoints in them?

What else can I do?

artifact example:
first plot is of original data, the white spaces are masks: [image not included in digest]

and in an overview at the 8x factor I get the black line: [image not included in digest]

Any advice can help,

Thanks



--
Sean Gillies

Re: [rasterio-dev] Writing azure blobs does not create blobs

Sean Gillies
 

In https://github.com/OSGeo/gdal/issues/1189#issuecomment-452716305, Even reminds us that GDAL does not support writing TIFFs to S3 and I'm pretty sure that this applies to Azure as well. See the note about sequential writing at https://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsiaz.

On Fri, Jan 4, 2019 at 1:40 PM Sean Gillies <sean.gillies@...> wrote:
Hi Niklas,
...

Re: [rasterio-dev] Writing azure blobs does not create blobs

Niklas Heim <heim.niklas@...>
 

Hi Sean,

Thank you for your help, and excuse the late reply! I will find another way of accomplishing this.

Best regards,
Niklas

On Wed, Jan 9, 2019 at 10:12 PM Sean Gillies via Groups.Io <sean=mapbox.com@groups.io> wrote:

In https://github.com/OSGeo/gdal/issues/1189#issuecomment-452716305, Even reminds us that GDAL does not support writing TIFFs to S3 and I'm pretty sure that this applies to Azure as well.
...

Rasterio Resampling documentation needs some key descriptions such as the libraries to call

Eyal Saiet
 

Hello Rasterio community,
while I have been using GDAL-Python for a few years, I am delighted to learn about Rasterio. Both GDAL and Rasterio (in fact, any programming) require detail and exactness. For newcomers to Rasterio, the details are critical for a smooth adoption.

The Resampling documentation link does not have key pieces. First, it misses the required import libraries:
import rasterio
from rasterio.warp import Affine, Resampling, reproject  # this line is key for the resampling code to work
import numpy as np

Second, alongside the code I would highlight the read() method, so that all bands are read from the raster into an array of the correct shape.
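For instance, a self-contained version of the upsampling pattern might read (the file name and 2x factor are my own assumptions):

import numpy as np
import rasterio
from affine import Affine
from rasterio.warp import Resampling, reproject

with rasterio.open('example.tif') as src:
    data = src.read()  # no band index: all bands, shape (count, rows, cols)
    newarr = np.empty(
        shape=(data.shape[0], data.shape[1] * 2, data.shape[2] * 2),
        dtype=data.dtype)
    # Halving the pixel size doubles the resolution.
    dst_transform = src.transform * Affine.scale(0.5)
    reproject(
        data, newarr,
        src_transform=src.transform, dst_transform=dst_transform,
        src_crs=src.crs, dst_crs=src.crs,
        resampling=Resampling.bilinear)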

Thanks for an otherwise, excellent library and excellent project!  

Re: Rasterio Resampling documentation needs some key descriptions such as the libraries to call

Sean Gillies
 

Hi,

I'm glad you're liking the project and appreciate your comments about the documentation. I've made a new ticket to track these issues: https://github.com/mapbox/rasterio/issues/1596.

Yours,


On Fri, Jan 11, 2019 at 3:26 PM Eyal Saiet <ejsaiet@...> wrote:
Hello Rasterio community,
...



--
Sean Gillies