
Re: Asyncio + Rasterio for slow network requests?

kylebarron2@...
 

Sorry for the slow response. As Vincent noted, just moving back to GDAL 2.4 made the process ~8x faster, from ~1.7 s to ~200 ms to read each source tile.

> A constant time regardless of the amount of overlap suggests to me that your source files may lack the proper tiling.

According to the AWS NAIP docs, the COG sources were created with

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co tiled=yes -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=DEFLATE -co PREDICTOR=2 src_dataset dst_dataset

gdaladdo -r average -ro src_dataset 2 4 8 16 32 64

gdal_translate -b 1 -b 2 -b 3 -of GTiff -co TILED=YES -co BLOCKXSIZE=512 -co BLOCKYSIZE=512 -co COMPRESS=JPEG -co JPEG_QUALITY=85 -co PHOTOMETRIC=YCBCR -co COPY_SRC_OVERVIEWS=YES --config GDAL_TIFF_OVR_BLOCKSIZE 512 src_dataset dst_dataset

> asyncio's run_in_executor does the exact same thing as using a thread pool

That makes sense. I ultimately didn't expect to be able to make much progress there, since it's GDAL making the low-level requests.

> Usually, reading a tile from S3 takes something like 10-100ms if you do it right.

Moving back to GDAL 2.4 brought us into that range.

> At the moment, GDAL reads are not thread-safe!

That's really good to know! It means I'll probably shy away from attempting concurrency with GDAL in general.


Re: Asyncio + Rasterio for slow network requests?

Sean Gillies
 

Hi Vincent,

On Mon, Mar 30, 2020 at 7:12 AM <vincent.sarago@...> wrote:
Hi All, 
I'll answer for Kyle but he can jump back if needed. 

The problem Kyle was facing was due to GDAL3 (running on AWS Lambda, CentOS) being extremely slow for image reprojection. 
We hit this in https://github.com/RemotePixel/amazonlinux/issues/16 and thought it was fixed when we updated the sqlite library (https://github.com/RemotePixel/amazonlinux/pull/17), but while that made things a bit faster, there still seems to be a `huge` difference between gdal2/proj5 and gdal3/proj6.

We still went through some testing with async, but because Kyle uses AWS Lambda and https://github.com/vincentsarago/lambda-proxy, which is not async compatible, we just switched to gdal2 and to threading.

FYI, I've updated another tiling project to use async, but I still need to run benchmarks: https://github.com/developmentseed/titiler/blob/master/titiler/api/api_v1/endpoints/tiles.py#L26

Vincent

Thanks for the update. This situation points out a downside of using the warped VRT: it abstracts everything (network, reprojection, caching) and makes diagnosing problems difficult.
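One way to get a rough (not apples-to-apples) split between fetch/decode cost and warp cost is to time a plain windowed read next to a read through a WarpedVRT. A sketch, reusing the NAIP object from Kyle's reproduction:

```py
import time

import rasterio
from rasterio.enums import Resampling
from rasterio.vrt import WarpedVRT
from rasterio.windows import Window

address = "s3://naip-visualization/ca/2018/60cm/rgb/34118/m_3411861_ne_11_060_20180723_20190208.tif"

with rasterio.open(address) as src:
    t0 = time.perf_counter()
    src.read(1, window=Window(0, 0, 512, 512))        # fetch + decode one internal tile
    t1 = time.perf_counter()
    print(f"plain tile read:  {t1 - t0:.2f} s")

    with WarpedVRT(src, crs="EPSG:3857", resampling=Resampling.bilinear) as vrt:
        t2 = time.perf_counter()
        vrt.read(1, window=Window(0, 0, 512, 512))    # same-sized window, but on the warped grid
        t3 = time.perf_counter()
        print(f"warped tile read: {t3 - t2:.2f} s")

# Caveats: the two windows cover slightly different ground, and the first read
# warms the HTTP connection and GDAL's cache, so treat this only as a rough split.
```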

--
Sean Gillies


Re: Asyncio + Rasterio for slow network requests?

vincent.sarago@...
 

Hi All, 
I'll answer for Kyle but he can jump back if needed. 

The problem Kyle was facing was due to GDAL3 (running on AWS Lambda, CentOS) being extremely slow for image reprojection. 
We hit this in https://github.com/RemotePixel/amazonlinux/issues/16 and thought it was fixed when we updated the sqlite library (https://github.com/RemotePixel/amazonlinux/pull/17), but while that made things a bit faster, there still seems to be a `huge` difference between gdal2/proj5 and gdal3/proj6.

We still went through some testing with async, but because Kyle uses AWS Lambda and https://github.com/vincentsarago/lambda-proxy, which is not async compatible, we just switched to gdal2 and to threading.

FYI, I've updated another tiling project to use async, but I still need to run benchmarks: https://github.com/developmentseed/titiler/blob/master/titiler/api/api_v1/endpoints/tiles.py#L26

Vincent 


Re: Asyncio + Rasterio for slow network requests?

Dion Häfner
 

Hey Kyle,

maybe I can help out here.

- asyncio's run_in_executor does the exact same thing as using a thread pool, it's just a different API. Until both GDAL and rasterio explicitly support this, you cannot use "real" asynchronous (non-blocking) IO (see the sketch after this list).

- I can second Sean's comment that multithreading should speed up tile retrieval, and I suspect that something is off with your code and/or your raster. Usually, reading a tile from S3 takes something like 10-100ms if you do it right.

- At the moment, GDAL reads are not thread-safe! This leads to seemingly random failing tile reads (we struggled a lot with this in Terracotta). Now we use a process pool that we spawn at server start, which seems to work OK both performance- and reliability-wise (https://github.com/DHI-GRAS/terracotta/blob/master/terracotta/drivers/raster_base.py).
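To illustrate the first point, here is a minimal sketch (the URLs are placeholders): wrapping blocking rasterio reads in run_in_executor still just dispatches them to a thread pool, so the IO itself stays blocking. Given the thread-safety caveat above, a ProcessPoolExecutor can be dropped in with the same interface.

```py
import asyncio
from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

import rasterio

# Placeholder asset list -- substitute real COG paths/URLs.
URLS = [
    "s3://example-bucket/tile_a.tif",
    "s3://example-bucket/tile_b.tif",
]

def read_band(url):
    # Plain blocking GDAL/rasterio read; it runs inside a worker thread.
    with rasterio.open(url) as src:
        return src.read(1)

async def read_all(urls):
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # run_in_executor only hands the blocking call to the pool;
        # this is thread-based concurrency, not non-blocking IO.
        futures = [loop.run_in_executor(pool, read_band, url) for url in urls]
        return await asyncio.gather(*futures)

arrays = asyncio.run(read_all(URLS))
```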

Best,
Dion

On 24/03/2020 22.12, kylebarron2 via Groups.Io wrote:
I'm trying to improve performance of dynamic satellite imagery tiling, using
[`cogeo-mosaic-tiler`](https://github.com/developmentseed/cogeo-mosaic-tiler)/[`rio-tiler`](https://github.com/cogeotiff/rio-tiler),
which combines source Cloud-Optimized GeoTIFFs into a web mercator tile on the
fly. I'm using AWS Landsat and NAIP imagery stored in S3 buckets, and running
code on AWS Lambda in the same region.
Since NAIP imagery doesn't overlap cleanly with web mercator tiles, at zoom 12 I
have to load on average [6 assets to create one mercator
tile](https://user-images.githubusercontent.com/15164633/77286861-cfc7df00-6c99-11ea-84e9-8ed584b030c0.png).
While profiling the AWS Lambda instance using AWS X-Ray, I found that the
biggest bottleneck was the [base
call](https://github.com/cogeotiff/rio-tiler/blob/6b0d4df0b6aa1454c50312e8d352ed57f0a4e3cb/rio_tiler/utils.py#L449-L455)
to `WarpedVRT.read()`. That call always takes [between 1.7 and 2.0
seconds](https://user-images.githubusercontent.com/15164633/77289999-c5f5aa00-6ca0-11ea-816a-5aaf248a782c.png)
for each tile, regardless of the amount of overlap with the mercator tile.
When testing tile load times on an EC2 t2.nano in the same region, for the first
tile load, CPU time is 120 ms but wall time is 1.1 seconds. That leads me to
believe that the bottleneck is S3 latency.
If the code running on Lambda shares the same 90% proportion spent on latency
for each asset, that would imply that 9 seconds total are spent waiting on
latency.
Using multithreading with a `ThreadPoolExecutor` takes longer than running
single-threaded. Given the situation, it would seem ideal to use `asyncio` for
the COG network requests to improve performance.
Has this been attempted ever with Rasterio? I saw a [Rasterio example of using
async](https://github.com/mapbox/rasterio/blob/master/examples/async-rasterio.py)
to improve performance on a CPU bound function, and plan to try that out, but
I'm pessimistic about that approach directly because I'd think that the `async`
calls would need to be applied on the core fetch calls directly.
Reproduction for tile loading:
```py
import os
from rio_tiler.main import tile
os.environ['CURL_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
os.environ['AWS_REQUEST_PAYER'] = "requester"
address = 's3://naip-visualization/ca/2018/60cm/rgb/34118/m_3411861_ne_11_060_20180723_20190208.tif'
x = 701
y = 1635
z = 12
tilesize = 512
%time data, mask = tile(address, x, y, z, tilesize)
```
```
CPU times: user 119 ms, sys: 20.3 ms, total: 140 ms
Wall time: 1.1 s
```
--

Dion Häfner
PhD Student

Niels Bohr Institute
Physics of Ice, Climate and Earth
University of Copenhagen
Tagensvej 16, DK-2200 Copenhagen, DENMARK

_.~"~._.~"~._.~"~._.~"~._.~"~._.~"~._.~"~._


Re: Asyncio + Rasterio for slow network requests?

Sean Gillies
 

Hi,

First of all, I'm not very familiar with rio-tiler. Hopefully, Vincent will help us out.

On Tue, Mar 24, 2020 at 3:36 PM <kylebarron2@...> wrote:
I'm trying to improve performance of dynamic satellite imagery tiling, using
[`cogeo-mosaic-tiler`](https://github.com/developmentseed/cogeo-mosaic-tiler)/[`rio-tiler`](https://github.com/cogeotiff/rio-tiler),
which combines source Cloud-Optimized GeoTIFFs into a web mercator tile on the
fly. I'm using AWS Landsat and NAIP imagery stored in S3 buckets, and running
code on AWS Lambda in the same region.

Since NAIP imagery doesn't overlap cleanly with web mercator tiles, at zoom 12 I
have to load on average [6 assets to create one mercator
tile](https://user-images.githubusercontent.com/15164633/77286861-cfc7df00-6c99-11ea-84e9-8ed584b030c0.png).
While profiling the AWS Lambda instance using AWS X-Ray, I found that the
biggest bottleneck was the [base
call](https://github.com/cogeotiff/rio-tiler/blob/6b0d4df0b6aa1454c50312e8d352ed57f0a4e3cb/rio_tiler/utils.py#L449-L455)
to `WarpedVRT.read()`. That call always takes [between 1.7 and 2.0
seconds](https://user-images.githubusercontent.com/15164633/77289999-c5f5aa00-6ca0-11ea-816a-5aaf248a782c.png)
for each tile, regardless of the amount of overlap with the mercator tile.

A constant time regardless of the amount of overlap suggests to me that your source files may lack the proper tiling. If the sources are tiled, the number of bytes transferred (and time) would scale roughly with the amount of overlap.

Can you verify that your sources have overviews? If you're accessing 6 sources to fill a web mercator tile, overviews will help dramatically.
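For what it's worth, a quick way to check both with rasterio (this is the NAIP object from your reproduction snippet; note the bucket is requester-pays):

```py
import rasterio

# NAIP object from the reproduction below; the bucket is requester-pays
# (set AWS_REQUEST_PAYER="requester").
address = "s3://naip-visualization/ca/2018/60cm/rgb/34118/m_3411861_ne_11_060_20180723_20190208.tif"

with rasterio.open(address) as src:
    print("internal tile size per band:", src.block_shapes)  # expect [(512, 512), ...]
    print("overview factors, band 1:  ", src.overviews(1))   # expect [2, 4, 8, 16, 32, 64]
```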
 
 
When testing tile load times on an EC2 t2.nano in the same region, for the first
tile load, CPU time is 120 ms but wall time is 1.1 seconds. That leads me to
believe that the bottleneck is S3 latency.
 
If the code running on Lambda shares the same 90% proportion spent on latency
for each asset, that would imply that 9 seconds total are spent waiting on
latency.
 
Using multithreading with a `ThreadPoolExecutor` takes longer than running
single-threaded. Given the situation, it would seem ideal to use `asyncio` for
the COG network requests to improve performance.

I wonder if Vincent can tell us from his experience whether there is a risk of overwhelming GDAL's raster block cache on Lambda when making many concurrent reads? I've seen programs appear to hang when the cache is too small.
 
 
Has this been attempted ever with Rasterio? I saw a [Rasterio example of using
async](https://github.com/mapbox/rasterio/blob/master/examples/async-rasterio.py)
to improve performance on a CPU bound function, and plan to try that out, but
I'm pessimistic about that approach directly because I'd think that the `async`
calls would need to be applied on the core fetch calls directly.

That asyncio example is dated and could be hard to generalize to your problem. I'd love to see a good working example.

You're right that there's only so much we can do in Python to maximize this concurrency. At some level, it's code in GDAL that is making the HTTP requests for parts of the COGs, using a strategy that we can't entirely control from Python.
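What we can control from Python is GDAL's curl/VSI behaviour through configuration options. A sketch (the dataset path is a placeholder and the option values are illustrative starting points, not recommendations):

```py
import rasterio

env = rasterio.Env(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",  # don't list the S3 "directory" on open
    CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif",   # skip probing for sidecar files
    VSI_CACHE=True,                            # cache fetched blocks in memory
    GDAL_CACHEMAX=512,                         # raster block cache size, in MB
)

with env:
    with rasterio.open("s3://example-bucket/example-cog.tif") as src:  # placeholder path
        data = src.read(1, out_shape=(512, 512))  # decimated read; uses overviews if present
```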

--
Sean Gillies


Re: Silencing PROJ errors/warnings

gberardinelli@...
 

The warnings are bare when using rasterio (neither of those two prefixes).

Packages were all installed via conda/conda-forge on Windows 10:

gdal version 3.0.4
rasterio version 1.1.3
proj version 6.3.1

Also, my original post had a typo: I had tried rasterio.Env(CPL_LOG=os.devnull), not "CPL_DEBUG".  Still a shot in the dark, perhaps.

Greg


Re: Silencing PROJ errors/warnings

Sean Gillies
 

Hi Greg,

On Fri, Mar 27, 2020 at 12:29 PM <gberardinelli@...> wrote:
Hi all,

As of GDAL 3 I've started seeing warnings written to stderr on open, such as:

proj_create_from_database: datum not found
It's possible that these datasets have malformed projection information, but in cases where I don't care, I'd like to be able to silence these types of warnings, as they can easily be emitted hundreds of times.  However, I can't find a way to do so.

Using an Env() context with CPL_DEBUG set to os.devnull has no effect, nor does redirecting sys.stderr itself.  It also doesn't look like this is being emitted through the logging module.  So it seems like the underlying library itself may be writing to stderr?  I don't know if this is a rasterio issue, a gdal issue, or a proj issue.

Any pointers are appreciated.

Greg

Do you see a bare "proj_create_from_database: datum not found" or do you see
ERROR 1: PROJ: proj_create_from_database: datum not found
as in https://github.com/OSGeo/gdal/issues/2321?

Which specific GDAL version are you using?

--
Sean Gillies


Silencing PROJ errors/warnings

gberardinelli@...
 

Hi all,

As of GDAL 3 I've started seeing warnings written to stderr on open, such as:

proj_create_from_database: datum not found
It's possible that these datasets have malformed projection information, but in cases where I don't care, I'd like to be able to silence these types of warnings, as they can easily be emitted hundreds of times.  However, I can't find a way to do so.

Using an Env() context with CPL_DEBUG set to os.devnull has no effect, nor does redirecting sys.stderr itself.  It also doesn't look like this is being emitted through the logging module.  So it seems like the underlying library itself may be writing to stderr?  I don't know if this is a rasterio issue, a gdal issue, or a proj issue.
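The only fallback I can think of is suppressing stderr at the file-descriptor level; a generic sketch, assuming the messages really are written directly to stderr by the C library rather than through Python:

```py
import os
from contextlib import contextmanager

@contextmanager
def suppress_c_stderr():
    """Temporarily point file descriptor 2 at os.devnull.

    This silences anything the C libraries write to stderr (not just PROJ),
    so it should only wrap the offending calls.
    """
    saved_fd = os.dup(2)
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)
        yield
    finally:
        os.dup2(saved_fd, 2)
        os.close(devnull_fd)
        os.close(saved_fd)

# usage (dataset path is hypothetical):
# with suppress_c_stderr():
#     with rasterio.open("example.tif") as src:
#         ...
```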

Any pointers are appreciated.

Greg


Re: Applying Color Ramp to Single-Band Grayscale Images

Sean Gillies
 

Hi,

Have you seen https://rasterio.readthedocs.io/en/latest/topics/color.html#writing-colormaps ? It shows how to write a colormap (or color "ramp") to a new file. You can also open an existing file in "r+" mode and call the dataset's write_colormap() method.
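A rough sketch of writing a colormap to a new file (paths, scaling, and colors here are purely illustrative): rasterio colormaps key on integer pixel values, so a float NDVI band first needs to be scaled to something like uint8.

```py
import numpy as np
import rasterio

# Hypothetical input: a single-band float NDVI raster with values in [-1, 1].
with rasterio.open("ndvi.tif") as src:
    ndvi = src.read(1)
    profile = src.profile

# Scale to 0-255 so the colormap (keyed on integer values) can apply.
scaled = np.clip((ndvi + 1.0) / 2.0 * 255.0, 0, 255).astype("uint8")
profile.update(dtype="uint8", count=1, nodata=None)

# Illustrative red -> yellow -> green ramp.
colormap = {}
for v in range(256):
    t = v / 255.0
    if t < 0.5:
        r, g = 255, int(510 * t)          # red toward yellow
    else:
        r, g = int(510 * (1.0 - t)), 255  # yellow toward green
    colormap[v] = (r, g, 0, 255)

with rasterio.open("ndvi_color.tif", "w", **profile) as dst:
    dst.write(scaled, 1)
    dst.write_colormap(1, colormap)
```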

On Tue, Mar 24, 2020 at 4:12 PM <nathan.raley@...> wrote:
Does anyone have any methods of applying a color ramp to a grayscale image? 

I have an NDVI-calculated image I am trying to apply a color ramp to for the decimal-based value ranges, but have no idea how to accomplish this.  I was wondering if this was possible via any of the rasterio libraries, and, if so, how would one go about accomplishing this?  I have other band-based calculations I'd like to attempt, but until I can figure out how to apply the color ramps, I am at a standstill.

Thanks,

--
Sean Gillies


Re: [EXTERNAL] [rasterio] Applying Color Ramp to Single-Band Grayscale Images

nathan.raley@...
 

Actually, that second link looked like it used the format with decimal-based values, so that might work.  I'll give it a try.  Thanks!


Re: [EXTERNAL] [rasterio] Applying Color Ramp to Single-Band Grayscale Images

nathan.raley@...
 

Thanks Trent,

I have looked into that, but the part about using the colormap.txt file is still a bit confusing.  I haven't been able to find sufficient documentation on how to create that file to apply a color, or a ramp between two color ranges, to the values.  I did find a method in the Python wrapper library for GDAL that supports creating a color ramp for ranges using a specified color; however, it only supports whole-number values and not the float values you get from an NDVI image.


Re: [EXTERNAL] [rasterio] Applying Color Ramp to Single-Band Grayscale Images

Hare, Trent M
 

Nathan,
  Not sure about using rasterio, but have you tried gdaldem using "color-relief"?


There's an example that shows creating a colorized image and adding hillshade (the hillshade step isn't needed here).
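In case it helps, the color file for gdaldem color-relief takes one "pixel_value R G B [A]" entry per line ("nv" is the nodata entry) and accepts floating-point pixel values; a sketch with illustrative NDVI breakpoints and colors:

```
nv    0    0    0    0
-1.0  120  120  120
0.0   255  255  190
0.3   145  205  100
0.6   30   145  40
1.0   0    90   0
```

gdaldem interpolates colors between these breakpoints by default, so it behaves like a ramp:

gdaldem color-relief ndvi.tif colormap.txt ndvi_color.tif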

-Trent


From: main@rasterio.groups.io <main@rasterio.groups.io> on behalf of nathan.raley via Groups.Io <nathan.raley@...>
Sent: Tuesday, March 24, 2020 3:12 PM
To: main@rasterio.groups.io <main@rasterio.groups.io>
Subject: [EXTERNAL] [rasterio] Applying Color Ramp to Single-Band Grayscale Images
 
Does anyone have any methods of applying a color ramp to a grayscale image? 

I have an NDVI-calculated image I am trying to apply a color ramp to for the decimal-based value ranges, but have no idea how to accomplish this.  I was wondering if this was possible via any of the rasterio libraries, and, if so, how would one go about accomplishing this?  I have other band-based calculations I'd like to attempt, but until I can figure out how to apply the color ramps, I am at a standstill.

Thanks,


Applying Color Ramp to Single-Band Grayscale Images

nathan.raley@...
 

Does anyone have any methods of applying a color ramp to a grayscale image? 

I have an NDVI-calculated image I am trying to apply a color ramp to for the decimal-based value ranges, but have no idea how to accomplish this.  I was wondering if this was possible via any of the rasterio libraries, and, if so, how would one go about accomplishing this?  I have other band-based calculations I'd like to attempt, but until I can figure out how to apply the color ramps, I am at a standstill.

Thanks,


Asyncio + Rasterio for slow network requests?

kylebarron2@...
 

I'm trying to improve performance of dynamic satellite imagery tiling, using
[`cogeo-mosaic-tiler`](https://github.com/developmentseed/cogeo-mosaic-tiler)/[`rio-tiler`](https://github.com/cogeotiff/rio-tiler),
which combines source Cloud-Optimized GeoTIFFs into a web mercator tile on the
fly. I'm using AWS Landsat and NAIP imagery stored in S3 buckets, and running
code on AWS Lambda in the same region.
 
Since NAIP imagery doesn't overlap cleanly with web mercator tiles, at zoom 12 I
have to load on average [6 assets to create one mercator
tile](https://user-images.githubusercontent.com/15164633/77286861-cfc7df00-6c99-11ea-84e9-8ed584b030c0.png).
While profiling the AWS Lambda instance using AWS X-Ray, I found that the
biggest bottleneck was the [base
call](https://github.com/cogeotiff/rio-tiler/blob/6b0d4df0b6aa1454c50312e8d352ed57f0a4e3cb/rio_tiler/utils.py#L449-L455)
to `WarpedVRT.read()`. That call always takes [between 1.7 and 2.0
seconds](https://user-images.githubusercontent.com/15164633/77289999-c5f5aa00-6ca0-11ea-816a-5aaf248a782c.png)
for each tile, regardless of the amount of overlap with the mercator tile.
 
When testing tile load times on an EC2 t2.nano in the same region, for the first
tile load, CPU time is 120 ms but wall time is 1.1 seconds. That leads me to
believe that the bottleneck is S3 latency.
 
If the code running on Lambda shares the same 90% proportion spent on latency
for each asset, that would imply that 9 seconds total are spent waiting on
latency.
 
Using multithreading with a `ThreadPoolExecutor` takes longer than running
single-threaded. Given the situation, it would seem ideal to use `asyncio` for
the COG network requests to improve performance.
 
Has this been attempted ever with Rasterio? I saw a [Rasterio example of using
async](https://github.com/mapbox/rasterio/blob/master/examples/async-rasterio.py)
to improve performance on a CPU bound function, and plan to try that out, but
I'm pessimistic about that approach directly because I'd think that the `async`
calls would need to be applied on the core fetch calls directly.
 
 
Reproduction for tile loading:
```py
import os
from rio_tiler.main import tile
os.environ['CURL_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
os.environ['AWS_REQUEST_PAYER'] = "requester"
address = 's3://naip-visualization/ca/2018/60cm/rgb/34118/m_3411861_ne_11_060_20180723_20190208.tif'
x = 701
y = 1635
z = 12
tilesize = 512
%time data, mask = tile(address, x, y, z, tilesize)
```
```
CPU times: user 119 ms, sys: 20.3 ms, total: 140 ms
Wall time: 1.1 s
```


Re: Overwrite raster in place

Sean Gillies
 

It's not possible to change the shape of a raster dataset, the number of bands, or the data type in place.
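The usual workaround is to read the window you want to keep and write it to a new dataset; a sketch (input/output paths are hypothetical, and the width matches the example below):

```py
import rasterio
from rasterio.windows import Window

# Copy everything except the last column into a new file.
with rasterio.open("larger.tif") as src:
    window = Window(0, 0, src.width - 1, src.height)   # e.g. 891 -> 890 columns
    profile = src.profile
    profile.update(
        width=window.width,
        transform=src.window_transform(window),        # keep georeferencing consistent
    )
    data = src.read(window=window)

with rasterio.open("trimmed.tif", "w", **profile) as dst:
    dst.write(data)
```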

On Wed, Mar 18, 2020 at 1:40 PM himat15 via Groups.Io <himat15=yahoo.com@groups.io> wrote:
I have two rasters, and one of them has one more column than the other. So I want to remove the last column in the larger one.
How can I change the raster width in place without having to do a with open(f, 'r') and then another with open(f, 'w')?
I tried using r+, but when trying to do ds.width = 890, I got AttributeError: attribute 'width' of 'rasterio._base.DatasetBase' objects is not writable

with rio.open(f, 'r+') as ds:
        print(ds.profile) 
 
        ds_arr = ds.read(1) # Do I need to actually modify the np array here and then overwrite the file later?
        print(ds_arr.shape)
        
        profile = ds.profile
        
        width = profile['width']
        height = profile['height']
        
        if width == 891:
            ds.width = 890 # error
            
Profile: {'driver': 'GTiff', 'dtype': 'int16', 'nodata': -1.0, 'width': 891, 'height': 3081, 'count': 1, 'crs': CRS.from_epsg(32629), 'transform': Affine(15.0, 0.0, 232053.930561042,
       0.0, -15.0, 1438436.08069127), 'tiled': False, 'interleave': 'band'}



--
Sean Gillies


Overwrite raster in place

himat15@...
 

I have two rasters, and one of them has one more column than the other. So I want to remove the last column in the larger one.
How can I change the raster width in place without having to do a with open(f, 'r') and then another with open(f, 'w')?
I tried using r+, but when trying to do ds.width = 890, I got AttributeError: attribute 'width' of 'rasterio._base.DatasetBase' objects is not writable

with rio.open(f, 'r+') as ds:
        print(ds.profile) 
 
        ds_arr = ds.read(1) # Do I need to actually modify the np array here and then overwrite the file later?
        print(ds_arr.shape)
        
        profile = ds.profile
        
        width = profile['width']
        height = profile['height']
        
        if width == 891:
            ds.width = 890 # error
            
Profile: {'driver': 'GTiff', 'dtype': 'int16', 'nodata': -1.0, 'width': 891, 'height': 3081, 'count': 1, 'crs': CRS.from_epsg(32629), 'transform': Affine(15.0, 0.0, 232053.930561042,
       0.0, -15.0, 1438436.08069127), 'tiled': False, 'interleave': 'band'}


Re: Rasterio Python Library Merging R, G, B Bands to Create a GeoReferenced GeoTiff

Sean Gillies
 

Hi,

On Wed, Mar 11, 2020 at 4:17 PM <nathan.raley@...> wrote:
I am attempting to use rasterio to merge 3 bands from Sentinel-2 data, sourced as .JP2 files, into a georeferenced GeoTiff.  However, I am running into 2 problems doing this.

First of all, here is my code for the merge:
    #Setup the bands
    b4 = rio.open(path + Bands.B04.value[0])
    b3 = rio.open(path + Bands.B03.value[0])
    b2 = rio.open(path + Bands.B02.value[0])
   
    #Setup the path for the constructed RGB image
    tif = path + "RGB.tif"
   
    #Create the RGB image
    with rio.open(tif, 'w', driver='Gtiff', width=b4.width, height=b4.height, count=3, crs=b4.crs, tranform=b4.transform, dtype=b4.dtypes[0], photometric="RGB", tile=True, compress='lzw', bigtiff=True) as rgb:
        rgb.write(b2.read(1), 1)
        rgb.write(b3.read(1), 2)
        rgb.write(b4.read(1), 3)
        rgb.close()
I have a couple of comments on the code. You need to pass `tiled=True` to rasterio.open() ("tiled", not "tile"), and you should pass blockxsize and blockysize parameters as well; these must be multiples of 16. Also, you're passing "tranform" instead of "transform", so that parameter will be ignored. This may explain why your output has no meaningful bounds.

I've found a pattern that makes dealing with a dataset's creation parameters a bit easier. In your case, it goes like this:

    rgb_profile = b4.profile
    rgb_profile.update(driver="GTiff", count=3, photometric="RGB", compress="LZW", bigtiff="IF_NEEDED", tiled=True, blockxsize=256, blockysize=256)
    with rasterio.open(tif, "w", **rgb_profile) as rgb:
        rgb.write(b2.read(1), 1)
        rgb.write(b3.read(1), 2)
        rgb.write(b4.read(1), 3)
 


Now, how do I copy the source projection information and bounds in the new GeoTiff I am creating?  Second, why is it showing up as all black when viewing the image in most common image browsers, but just fine in mapping software such as ArcMap or ArcGIS Pro?

Any ideas ?

I think it's likely that the common image browsers don't support some of the features of TIFF that GDAL uses. BIGTIFF is one of these.

--
Sean Gillies


Re: Decimated and windowed read in rasterio

umbertofilippo@...
 
Edited

Thanks to @Sean Gillies.

Posting here the solution I also posted on gis.stackexchange.com (https://gis.stackexchange.com/a/353869/9518).

Maybe it would be useful for others in search of a solution for a similar problem.

Basically, it is not possible to simultaneously make a windowed AND decimated read, so the answer to my question is NO.

However, there is an undocumented (AFAIK) parameter called overview_level which can be passed to rasterio.open().

This parameter takes the 0-based index of an overview level. So, assuming overviews are present in a source raster, one can do

src = rasterio.open(raster, overview_level=0)

to create a Rasterio dataset from the first level of overview of source raster.

Then, simply do a windowed read in the usual way:

for ji, src_window in src.block_windows(1):
    arr = src.read(1, window=src_window)


Re: Decimated and windowed read in rasterio

umbertofilippo@...
 

The overview_level did the trick!

Thank you so much, I was after this for a long time.

Umberto Minora


Re: Decimated and windowed read in rasterio

umbertofilippo@...
 

On Thu, Mar 12, 2020 at 05:29 PM, Sean Gillies wrote:
Depending on the math, you may or may not get exactly the overview level you are looking for. If you absolutely need a particular overview level, I suggest passing an overview_level keyword argument to rasterio.open().
This is very interesting; I could not find any reference to overview_level in the documentation.
Definitely going to test it!
My aim is to read a specific overview (the greatest, i.e. coarsest, level) from a raster and resample it by implementing a SUMMING algorithm.
Although this might sound like I am not supposed to use windows, this process is part of a more complex program where the user can also request the original resolution, and for very big datasets reading with windows is much better.
I'll come back with my findings; meanwhile, can you please point me to any documentation, if present, where the open method is used along with the overview_level parameter?
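For reference, here is roughly what I have in mind (a sketch; the path, dtype, and 2x2 aggregation factor are my own assumptions):

```py
import rasterio

path = "input.tif"  # hypothetical raster with overviews

with rasterio.open(path) as full:
    coarsest = len(full.overviews(1)) - 1           # 0-based index of the last overview

# Open the coarsest overview as its own dataset and aggregate 2x2 blocks by summing.
with rasterio.open(path, overview_level=coarsest) as src:
    for ji, window in src.block_windows(1):
        arr = src.read(1, window=window).astype("float64")
        h, w = arr.shape
        h2, w2 = h - h % 2, w - w % 2                # trim to even dimensions
        summed = arr[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).sum(axis=(1, 3))
        # ... accumulate or write `summed` here
```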

Thank you so much,

Umberto Minora
