Topics

Occasional "not recognized as a supported file format" errors when reading from S3


Dion Häfner
 

Dear rasterio group,

 

(I initially posted this at https://github.com/mapbox/rasterio/issues/1686)

Lately, we have encountered a strange bug in Terracotta. It basically always leads to errors like these:

(from DHI-GRAS/terracotta#139)

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 213, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 205, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsis3/italy-composite/rasters/italy_2018_red.tif' not recognized as a supported file format.

or

(from DHI-GRAS/terracotta#10 (comment))

Traceback (most recent call last):
  File "rasterio/_io.pyx", line 698, in rasterio._io.DatasetReaderBase._read
  File "rasterio/shim_rasterioex.pxi", line 133, in rasterio._shim.io_multi_band
  File "rasterio/_err.pyx", line 182, in rasterio._err.exc_wrap_int
rasterio._err.CPLE_AppDefinedError: IReadBlock failed at X offset 0, Y offset 0: '/vsis3/bucket/prefix/tile330.tif' does not exist in the file system, and is not recognized as a supported dataset name.

The errors occur on different versions of rasterio, although anecdotally it wasn't a problem pre-1.0.15. It also seems to occur both during rasterio.open, and when actually reading tiles via WarpedVRT.read.

The problem is that we have only observed it with huge raster files, and we haven't been able to reproduce this reliably, or in a way where I could share it with you.

Does anyone have any intuition why this might be happening / what we could look at to debug this?

 

Cheers,
Dion

 


vincent.sarago@...
 

This is not something I've been seen for now. does this happens with Rasterio wheels and rasterio build on gdal source ? 


Dion Häfner
 

We've only tried rasterio wheels.

On 23/05/2019 21.23, vincent.sarago via Groups.Io wrote:
This is not something I've been seen for now. does this happens with Rasterio wheels and rasterio build on gdal source ? 


Sean Gillies
 

Hi Dion,

That error comes from these lines in GDALOpenEx when the function fails to return successfully: https://github.com/OSGeo/gdal/blob/fe0b5dd644abb4ac0c9869db28e9cf977181fbce/gdal/gcore/gdaldataset.cpp#L3455. The way the errors are re-raised obscures the problem.

First, are the /vsis3/ identifiers for the datasets correct? Is Rasterio mangling them? Can you access the data successfully any time at all?

Are you using any advanced HTTP features? HTTP/2? The Rasterio wheels use a slightly old version of curl and that project is ever fixing bugs.

Is multithreading involved? I recommend that you consult https://gdal.org/gdal_vrttut.html#gdal_vrttut_mt as soon as it is back online. Multithreaded access to VRTs is complicated.





If the error occurs every time 


On Fri, May 24, 2019 at 8:13 AM Dion Häfner <dion.haefner@...> wrote:

We've only tried rasterio wheels.

On 23/05/2019 21.23, vincent.sarago via Groups.Io wrote:
This is not something I've been seen for now. does this happens with Rasterio wheels and rasterio build on gdal source ? 



--
Sean Gillies


Dion Häfner
 

Hey Sean,

thank you very much for the detailed assessment. I think this already sheds some light on some of the problems.

To answer your questions:

- The /vsis3/ identifiers are correct and work most of the time, we only see these errors from time to time.

- This is just a default read from a private S3 bucket. I don't know which protocol boto (or GDAL?) uses under the hood.

- In the first case, there is indeed multithreading involved, so this is a hot lead. I will have a look at the documentation as soon as I can find it somewhere :) It working most of the time would of course suggest some sort of race condition.

In the second case, the setup is actually more complicated. We experimented with "cloud-optimized" VRT files, that we created by

1. gdalbuildvrt on many small-ish COG

2. gdaladdo on the resulting VRT into an external .vrt.ovr file

We could then serve these COG mosaics up as a single dataset via Terracotta, but apparently the read from the VRT would fail when we zoomed in too far.

Thank you for your time!

Dion


On 24/05/2019 20.49, Sean Gillies via Groups.Io wrote:
Hi Dion,

That error comes from these lines in GDALOpenEx when the function fails to return successfully: https://github.com/OSGeo/gdal/blob/fe0b5dd644abb4ac0c9869db28e9cf977181fbce/gdal/gcore/gdaldataset.cpp#L3455. The way the errors are re-raised obscures the problem.

First, are the /vsis3/ identifiers for the datasets correct? Is Rasterio mangling them? Can you access the data successfully any time at all?

Are you using any advanced HTTP features? HTTP/2? The Rasterio wheels use a slightly old version of curl and that project is ever fixing bugs.

Is multithreading involved? I recommend that you consult https://gdal.org/gdal_vrttut.html#gdal_vrttut_mt as soon as it is back online. Multithreaded access to VRTs is complicated.





If the error occurs every time 

On Fri, May 24, 2019 at 8:13 AM Dion Häfner <dion.haefner@...> wrote:

We've only tried rasterio wheels.

On 23/05/2019 21.23, vincent.sarago via Groups.Io wrote:
This is not something I've been seen for now. does this happens with Rasterio wheels and rasterio build on gdal source ? 


--
Sean Gillies


Sean Gillies
 

I found the VRT multi-threading guidance at https://gdal.org/drivers/raster/vrt.html#multi-threading-issues. It moved during the site migration.

This smells like a GDAL bug to me, perhaps a manifestation of https://github.com/OSGeo/gdal/issues/1244 or https://github.com/OSGeo/gdal/issues/1031.


On Mon, May 27, 2019 at 2:50 AM Dion Häfner <dion.haefner@...> wrote:

Hey Sean,

thank you very much for the detailed assessment. I think this already sheds some light on some of the problems.

To answer your questions:

- The /vsis3/ identifiers are correct and work most of the time, we only see these errors from time to time.

- This is just a default read from a private S3 bucket. I don't know which protocol boto (or GDAL?) uses under the hood.

- In the first case, there is indeed multithreading involved, so this is a hot lead. I will have a look at the documentation as soon as I can find it somewhere :) It working most of the time would of course suggest some sort of race condition.

In the second case, the setup is actually more complicated. We experimented with "cloud-optimized" VRT files, that we created by

1. gdalbuildvrt on many small-ish COG

2. gdaladdo on the resulting VRT into an external .vrt.ovr file

We could then serve these COG mosaics up as a single dataset via Terracotta, but apparently the read from the VRT would fail when we zoomed in too far.

Thank you for your time!

Dion

On 24/05/2019 20.49, Sean Gillies via Groups.Io wrote:
Hi Dion,

That error comes from these lines in GDALOpenEx when the function fails to return successfully: https://github.com/OSGeo/gdal/blob/fe0b5dd644abb4ac0c9869db28e9cf977181fbce/gdal/gcore/gdaldataset.cpp#L3455. The way the errors are re-raised obscures the problem.

First, are the /vsis3/ identifiers for the datasets correct? Is Rasterio mangling them? Can you access the data successfully any time at all?

Are you using any advanced HTTP features? HTTP/2? The Rasterio wheels use a slightly old version of curl and that project is ever fixing bugs.

Is multithreading involved? I recommend that you consult https://gdal.org/gdal_vrttut.html#gdal_vrttut_mt as soon as it is back online. Multithreaded access to VRTs is complicated.

On Fri, May 24, 2019 at 8:13 AM Dion Häfner <dion.haefner@...> wrote:

We've only tried rasterio wheels.

On 23/05/2019 21.23, vincent.sarago via Groups.Io wrote:
This is not something I've been seen for now. does this happens with Rasterio wheels and rasterio build on gdal source ? 



--
Sean Gillies


Jonas
 
Edited

Here is some more insight on the issue. Remember, we are seeing this when reading small chunks from a cloud-optimized GeoTIFF on S3 with our Terracotta tile server.

On our latest encounters of the issue, we are getting it consistently on the same map tile but not on adjacent tiles that are reading the exact same GeoTIFF.
So this challenges the assumption that this is a race condition, does it not?

Btw, we also tried the solution from https://github.com/OSGeo/gdal/issues/1244#issuecomment-487164897 (CPL_VSIL_CURL_NON_CACHED="/vsis3/") to no avail.


Jonas
 

OK, sorry, forget what I said - the same map tile that earlier was failing is now served fine. :/


Sean Gillies
 

Dion, Jonas:

There is a new report in the FIona tracker that may be related: https://github.com/Toblerity/Fiona/issues/761. There I advised turning on CPL_CURL_VERBOSE=YES to see more details about the HTTP requests that are failing. It will cause the volume of your logs to explode and it may be a challenge to find the signal in the noise, but I believe we will find some clues in the HTTP responses.


On Wed, Jun 12, 2019 at 1:55 AM Jonas <j08lue@...> wrote:
OK, sorry, forget what I said - the same map tile that earlier was failing is now served fine. :/



--
Sean Gillies


Dion Häfner
 

Hey Sean,

I don't know if you've seen our update on the rasterio issue tracker:

https://github.com/mapbox/rasterio/issues/1686#issuecomment-541737735

As it turns out, `GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR` was the culprit. Do you know if that is related to a known bug in GDAL, or whether this is something that should be reported separately?

For now, we have disabled the flag in Terracotta, and everything seems to work fine, but we did take a performance hit from that. So in the long run, it would be great if that could be fixed.

Best,
Dion

On 15/06/2019 16.51, Sean Gillies via Groups.Io wrote:
Dion, Jonas:
There is a new report in the FIona tracker that may be related: https://github.com/Toblerity/Fiona/issues/761. There I advised turning on CPL_CURL_VERBOSE=YES to see more details about the HTTP requests that are failing. It will cause the volume of your logs to explode and it may be a challenge to find the signal in the noise, but I believe we will find some clues in the HTTP responses.
On Wed, Jun 12, 2019 at 1:55 AM Jonas <j08lue@gmail.com <mailto:j08lue@gmail.com>> wrote:
OK, sorry, forget what I said - the same map tile that earlier was
failing is now served fine. :/
--
Sean Gillies