Reading from S3


hughes.lloyd@...
 

I am trying to read a GeoTIFF from my private S3 bucket (mapping), but I am receiving the following error message:

CPLE_OpenFailedError: '/vsis3/mapping/bahamas/S1A_20190821T231100.tif' does not exist in the file system, and is not recognized as a supported dataset name
The code I am using to open the GeoTiff is:

import rasterio
from rasterio.session import AWSSession
with rasterio.Env(session=AWSSession(aws_secret_access_key=S3_SECRET, aws_access_key_id=S3_KEY, region_name="us-west-1")) as env:
    rasterio.open("s3://mapping/bahamas/S1A_20190821T231100.tif")

I am using rasterio version 1.0.25 and the application is single-threaded. The file does exist and I can access it using s3fs and awscli. What am I missing?


hughes.lloyd@...
 

I am trying to read a GeoTIFF from a private AWS S3 bucket. I have configured GDAL and the appropriate files ~/.aws/config and ~/.aws/credentials. I am using a non-standard AWS region as well, so I needed to set the AWS_S3_ENDPOINT environment variable.

I am able to read the GeoTIFF information using both gdalinfo and rio:

$ gdalinfo /vsis3/s1-image-dataset/test.tif
Driver: GTiff/GeoTIFF
Files: /vsis3/s1-image-dataset/test.tif
Size is 33959, 38507
Coordinate System is:
PROJCS["WGS 84 / UTM zone 17N",
....

and using rio:

$ rio info s3://s1-image-dataset/test.tif
{"bounds": [689299.5634174921, 2622862.3065700093, 1028889.5634174921, 3007932.3065700093], "colorinterp": ["gray"], "compress": "deflate", "count": 1, "crs": "EPSG:32617", "descriptions": [null], "driver": "GTiff" ....

However, when I try to read it in a script using the rasterio Python API, I receive the following error:

CPLE_OpenFailedError: '/vsis3/s1-image-dataset/test.tif' not recognized as a supported file format.

The code that produces the error is:

import rasterio
path = "s3://s1-image-dataset/test.tif"
with rasterio.Env(AWS_S3_ENDPOINT='s3.<my region>.amazonaws.com'):
    with rasterio.open(path) as f:
        img = f.read()

This is using Python 3.7, rasterio 1.0.25, and GDAL 2.4.2.

The problem only occurs when running this in a Jupyter Notebook (Pangeo, to be precise), and it appears that rasterio exits the environment prematurely:

---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_shim.pyx in rasterio._shim.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: '/vsis3/s1-image-dataset/test.tif' does not exist in the file system, and is not recognized as a supported dataset name.






Sean Gillies
 

Hi,

The following log message catches my eye:
env: AWS_S3_ENDPOINT="us-west-1"
If that is set in your notebook's environment, it will override the value you pass to Env() in your program, and it looks to be incorrect.
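
To see which AWS-related variables the notebook kernel actually has set, one option is to list them from inside the notebook before entering rasterio.Env(). This is a minimal sketch using only the standard library; the helper name aws_environment and the masking rule are illustrative assumptions, not part of rasterio:

```python
import os

# Variable-name fragments whose values should not be printed in full.
SECRET_MARKERS = ("SECRET", "TOKEN", "ACCESS_KEY")


def aws_environment(environ=None):
    """Return the AWS_* variables from an environment mapping, masking secrets."""
    environ = os.environ if environ is None else environ
    found = {}
    for name, value in sorted(environ.items()):
        if not name.startswith("AWS_"):
            continue
        if any(marker in name for marker in SECRET_MARKERS):
            value = "****"
        found[name] = value
    return found


if __name__ == "__main__":
    # Print what the current process (for example the notebook kernel) sees.
    for name, value in aws_environment().items():
        print(f"{name}={value}")
```

A value such as AWS_S3_ENDPOINT=us-west-1 in that listing would point at the kernel's environment rather than at the script.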

On Thu, Sep 5, 2019 at 8:17 AM <hughes.lloyd@...> wrote:

...

DEBUG:rasterio.env:Entering env context: <rasterio.env.Env object at 0x7f97fb41d898>
DEBUG:rasterio.env:Starting outermost env
DEBUG:rasterio.env:No GDAL environment exists
DEBUG:rasterio.env:New GDAL environment <rasterio._env.GDALEnv object at 0x7f97fb41d908> created
DEBUG:rasterio._env:GDAL_DATA found in environment: '/srv/conda/envs/notebook/share/gdal'.
DEBUG:rasterio._env:PROJ_LIB found in environment: '/srv/conda/envs/notebook/share/proj'.
DEBUG:rasterio._env:Started GDALEnv <rasterio._env.GDALEnv object at 0x7f97fb41d908>.
DEBUG:rasterio.env:Entered env context: <rasterio.env.Env object at 0x7f97fb41d898>
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f97fb41d908> options
DEBUG:rasterio.env:Entering env context: <rasterio.env.Env object at 0x7f97fb3c5898>
DEBUG:rasterio.env:Got a copy of environment <rasterio._env.GDALEnv object at 0x7f97fb41d908> options
DEBUG:rasterio.env:Entered env context: <rasterio.env.Env object at 0x7f97fb3c5898>
DEBUG:rasterio._base:Sharing flag: 32
DEBUG:rasterio.env:Exiting env context: <rasterio.env.Env object at 0x7f97fb3c5898>
DEBUG:rasterio.env:Cleared existing <rasterio._env.GDALEnv object at 0x7f97fb41d908> options
DEBUG:rasterio._env:Stopped GDALEnv <rasterio._env.GDALEnv object at 0x7f97fb41d908>.
DEBUG:rasterio.env:No GDAL environment exists
DEBUG:rasterio.env:New GDAL environment <rasterio._env.GDALEnv object at 0x7f97fb41d908> created
DEBUG:rasterio._env:GDAL_DATA found in environment: '/srv/conda/envs/notebook/share/gdal'.
DEBUG:rasterio._env:PROJ_LIB found in environment: '/srv/conda/envs/notebook/share/proj'.
DEBUG:rasterio._env:Started GDALEnv <rasterio._env.GDALEnv object at 0x7f97fb41d908>.
DEBUG:rasterio.env:Exited env context: <rasterio.env.Env object at 0x7f97fb3c5898>
DEBUG:rasterio.env:Exiting env context: <rasterio.env.Env object at 0x7f97fb41d898>
DEBUG:rasterio.env:Cleared existing <rasterio._env.GDALEnv object at 0x7f97fb41d908> options
DEBUG:rasterio._env:Stopped GDALEnv <rasterio._env.GDALEnv object at 0x7f97fb41d908>.
DEBUG:rasterio.env:Exiting outermost env
DEBUG:rasterio.env:Exited env context: <rasterio.env.Env object at 0x7f97fb41d898>
env: AWS_ACCESS_KEY_ID="XXXXXX"
env: AWS_SECRET_ACCESS_KEY="XXXXXXXX"
env: AWS_S3_ENDPOINT="us-west-1"
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_shim.pyx in rasterio._shim.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: '/vsis3/s1-image-dataset/test.tif' does not exist in the file system, and is not recognized as a supported dataset name.

--
Sean Gillies


Guillaume Lostis
 

Hi,

To add to the previous message: if you want to specify a non-default region, the environment variable you're looking for is probably AWS_REGION (or AWS_DEFAULT_REGION starting with GDAL 2.3) rather than AWS_S3_ENDPOINT (see https://gdal.org/user/virtual_file_systems.html#vsis3-aws-s3-files-random-reading).

Also, I have successfully used rasterio on private AWS S3 buckets without having to touch any environment variable, so unless I misunderstand your case, no extra configuration should be necessary.
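
Before involving GDAL, it can help to confirm which region setting is visible at all. The sketch below checks the two environment variables named above and then the shared ~/.aws/config file. The function name resolve_region and the lookup order are assumptions made for illustration; the precedence GDAL and boto3 actually apply may differ.

```python
import configparser
import os
from pathlib import Path


def resolve_region(environ=None, config_path=None, profile="default"):
    """Return (source, region) for the first region setting found, else None.

    Lookup order (an assumption for illustration): AWS_REGION,
    AWS_DEFAULT_REGION, then the `region` key in the AWS config file.
    """
    environ = os.environ if environ is None else environ
    for name in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        if environ.get(name):
            return name, environ[name]

    config_path = Path(config_path or Path.home() / ".aws" / "config")
    parser = configparser.ConfigParser()
    parser.read(config_path)  # a missing file is skipped without an error
    # Non-default profiles are stored as "[profile NAME]" in ~/.aws/config.
    section = "default" if profile == "default" else f"profile {profile}"
    if parser.has_option(section, "region"):
        return str(config_path), parser.get(section, "region")
    return None


if __name__ == "__main__":
    print(resolve_region())
```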

Best,

Guillaume Lostis


hughes.lloyd@...
 

My bucket is hosted in the "us-gov-west-1" region, and if I don't set
AWS_S3_ENDPOINT=s3.us-gov-west-1.amazonaws.com
then neither gdalinfo nor rio works. Both throw errors about the file not being found because they continue to access the standard endpoint, even when I specify the correct region (I've tried ~/.aws/config, AWS_REGION, and AWS_DEFAULT_REGION). You can see from the following that the endpoint remains incorrect:
WARNING:rasterio._env:CPLE_AppDefined in HTTP response code on https://s1-image-dataset.s3.amazonaws.com/test.tif: 403
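
The check being made by eye on that warning can also be written down: parse the URL and see whether its host ends with the region-specific endpoint. A small sketch, assuming the s3.<region>.amazonaws.com form used in this thread; the function names are hypothetical:

```python
from urllib.parse import urlparse


def regional_endpoint(region):
    """Build the region-specific S3 host name in the form used in this thread."""
    return f"s3.{region}.amazonaws.com"


def request_uses_region(url, region):
    """Return True if the request URL's host ends with the regional endpoint."""
    host = urlparse(url).hostname or ""
    return host.endswith(regional_endpoint(region))


# URL copied from the warning above: the host has no region component.
url = "https://s1-image-dataset.s3.amazonaws.com/test.tif"
print(request_uses_region(url, "us-gov-west-1"))  # False
```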


hughes.lloyd@...
 

The issue doesn't exist outside of Jupyter notebooks. It seems that once I am inside a notebook, rasterio does not behave the same way, even when the environment variables are identical.

I would be interested to know whether you have managed to read from S3 from inside a notebook.


Sean Gillies
 

Hi Hughes,

Yes, I've been able to read raster data from S3 in a Jupyter notebook.

What do you make of the observation I made earlier today about the
env: AWS_S3_ENDPOINT="us-west-1"
log message from your notebook? I think this might be the key.

--
Sean Gillies


hughes.lloyd@...
 

Hi Sean,

env: AWS_S3_ENDPOINT="us-west-1"
This was indeed an error, although changing it did not fix the problem. I have included "fresh" logs below to show that the problem persists. Furthermore, as I stated above, I have to set the AWS_S3_ENDPOINT environment variable; otherwise rasterio does not work with the "us-gov-west-1" region, as the following example shows:

$ aws configure
AWS Access Key ID [****************]:
AWS Secret Access Key [****************]:
Default region name [us-gov-west-1]:
Default output format [None]:

$ rio info "s3://s1-image-dataset/test.tif"
WARNING:rasterio._env:CPLE_AppDefined in HTTP response code on https://s1-image-dataset.s3.amazonaws.com/test.tif: 403 <- Region is not in endpoint even though it is configured!
Traceback (most recent call last):
File "rasterio/_base.pyx", line 216, in rasterio._base.DatasetBase.__init__
File "rasterio/_shim.pyx", line 64, in rasterio._shim.open_dataset
File "rasterio/_err.pyx", line 205, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_AWSError: The AWS Access Key Id you provided does not exist in our records.

$ export AWS_S3_ENDPOINT='s3.us-gov-west-1.amazonaws.com'
$ rio info "s3://s1-image-dataset/test.tif"
{"bounds": [689299.5634174921, 2622862.3065700093, 1028889.5634174921, 3007932.3065700093], ....

Now that it works in the terminal when AWS_S3_ENDPOINT is set, let's turn back to the notebook example, where only the ~/.aws/config and ~/.aws/credentials files are configured, and not AWS_S3_ENDPOINT:

import rasterio
path = "s3://s1-image-dataset/test.tif"
with rasterio.Env() as env:
    with rasterio.open(path) as f:
        print(f.meta)

This gives the following DEBUG log

So let's try passing the region to an AWSSession, along with the key and secret:

import rasterio
from rasterio.session import AWSSession
path = "s3://s1-image-dataset/test.tif"
with rasterio.Env(AWSSession(aws_access_key_id="XXXX", aws_secret_access_key="XXXX", region_name="us-gov-west-1")) as env:
    with rasterio.open(path) as f:
        print(f.meta)

The same problem persists (only when running this in a Jupyter Notebook, though).

After upgrading to rasterio 1.0.26, the following works, but I have to specify AWS_S3_ENDPOINT explicitly; otherwise it does not work, as shown above:

import rasterio
from rasterio.session import AWSSession
path = "s3://s1-image-dataset/test.tif"
with rasterio.Env(AWSSession(aws_access_key_id="XXXX", aws_secret_access_key="XXXX", region_name="us-gov-west-1"), AWS_S3_ENDPOINT='s3.us-gov-west-1.amazonaws.com') as env:
    with rasterio.open(path) as f:
        print(f.meta)

Perhaps this is not a bug, but it seems counterintuitive to have to specify the AWS endpoint explicitly when the region must also be specified, since the two are related.
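
Until the region alone is enough, one way to avoid stating the region twice is to derive the endpoint from it in a small wrapper. This is only a sketch of the workaround described in this message, not a rasterio feature: the names endpoint_for and open_s3_raster are hypothetical, the endpoint form is the one used in this thread, and AWSSession requires boto3 to be installed.

```python
from contextlib import contextmanager


def endpoint_for(region):
    """Endpoint host in the s3.<region>.amazonaws.com form used in this thread."""
    return f"s3.{region}.amazonaws.com"


@contextmanager
def open_s3_raster(path, region, **session_kwargs):
    """Open an s3:// path with the region and its derived endpoint set together.

    Extra keyword arguments (for example aws_access_key_id and
    aws_secret_access_key) are passed through to AWSSession.
    """
    # Imported lazily so endpoint_for() is usable without rasterio installed.
    import rasterio
    from rasterio.session import AWSSession

    session = AWSSession(region_name=region, **session_kwargs)
    # Keep the Env open for as long as the dataset is in use.
    with rasterio.Env(session, AWS_S3_ENDPOINT=endpoint_for(region)):
        with rasterio.open(path) as dataset:
            yield dataset


# Example usage (needs rasterio, boto3, and valid credentials):
# with open_s3_raster("s3://s1-image-dataset/test.tif", "us-gov-west-1") as f:
#     print(f.meta)
```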
 
 


scott
 

Hughes,

Have you tried setting `os.environ['AWS_S3_ENDPOINT']='s3.us-gov-west-1.amazonaws.com'` before opening the file?
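
If the variable should only be set while the file is being read, the assignment can be wrapped in a context manager that restores the previous value afterwards. This is a standard-library sketch; the name temporary_environ is hypothetical, and whether GDAL picks up a change made this way after it has already been initialised is not something the sketch verifies.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_environ(**variables):
    """Set environment variables inside a with-block, then restore the old values."""
    previous = {name: os.environ.get(name) for name in variables}
    os.environ.update(variables)
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                # The variable did not exist before, so remove it again.
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


with temporary_environ(AWS_S3_ENDPOINT="s3.us-gov-west-1.amazonaws.com"):
    # rasterio.open("s3://s1-image-dataset/test.tif") would go here.
    print(os.environ["AWS_S3_ENDPOINT"])
```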

This reminds me of a previous (but resolved) issue with requester pays configuration: https://github.com/mapbox/rasterio/issues/692#issuecomment-362434388 

Scott


Sean Gillies
 

Hughes,

On Fri, Sep 6, 2019 at 3:48 AM <hughes.lloyd@...> wrote:
Hi Sean,

...

Perhaps this is not a bug, but it seems counterintuitive to have to specify the AWS endpoint explicitly when the region must also be specified, since the two are related.


The AWS_S3_ENDPOINT config option is intended to allow GDAL users to work with S3-compatible systems like https://min.io/index.html. It shouldn't be needed for GovCloud; specifying the region should suffice, as you expect. I'm going to dig into rasterio and ask on gdal-dev. I'll follow up here soon.
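
For comparison, the intended use of AWS_S3_ENDPOINT with a self-hosted, S3-compatible server might look roughly like the sketch below. The host and port are hypothetical, the helper name is an assumption, and the choice of AWS_HTTPS and AWS_VIRTUAL_HOSTING values depends on how the server is set up.

```python
def s3_compatible_options(host, port, use_https=False):
    """Build GDAL config options for a self-hosted, S3-compatible server."""
    return {
        "AWS_S3_ENDPOINT": f"{host}:{port}",
        "AWS_HTTPS": "YES" if use_https else "NO",
        # Path-style URLs (endpoint/bucket/key) instead of bucket.endpoint/key.
        "AWS_VIRTUAL_HOSTING": "FALSE",
    }


# Hypothetical local server; the options would be passed as rasterio.Env(**options).
options = s3_compatible_options("localhost", 9000)
print(options)
```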

--
Sean Gillies


Sean Gillies
 

Hughes, would you be willing to run

CPL_CURL_VERBOSE=1 rio info "s3://s1-image-dataset/test.tif"

on your computer after unsetting AWS_S3_ENDPOINT and show us the output after sanitizing it (replace your key with xxxxx, but otherwise leave the Authorization headers readable)? If you do this, we'll see information about the HTTP requests that are made and can see if GDAL is failing to navigate a redirect or something like that.
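
The sanitizing step can be scripted so the key is not missed by hand. In AWS Signature Version 4 the access key ID is the first component of the Credential= field in the Authorization header, so a regular expression can replace just that part. A minimal sketch; the function name and the sample header line are made up for illustration:

```python
import re

# Matches the access key ID: everything after "Credential=" up to the first "/".
CREDENTIAL = re.compile(r"(Credential=)[^/\s]+")


def sanitize(text, placeholder="xxxxx"):
    """Replace the access key ID in Authorization headers with a placeholder."""
    return CREDENTIAL.sub(lambda match: match.group(1) + placeholder, text)


# Made-up header line in the Signature Version 4 layout.
sample = (
    "Authorization: AWS4-HMAC-SHA256 "
    "Credential=AKIAEXAMPLEKEY/20190906/us-gov-west-1/s3/aws4_request,"
    "SignedHeaders=host;x-amz-date,Signature=0123abcd"
)
print(sanitize(sample))
```

The rest of the header, including the date, region, and signed header names, is left readable.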


--
Sean Gillies