Supporting many cloud storage systems

Sean Gillies
 

Hi all,

Until very recently we didn't have any native support for cloud storage other than S3. Yes, we can set the environment variables that GDAL expects and open files "mounted" at, for example, /vsifoo/bar/raster.tif. By native I mean support for "s3" and "foo" URIs and integration with the Python SDKs for specific cloud storage systems.

In a comment on a PR that would add native support for OpenStack Swift,
https://github.com/mapbox/rasterio/pull/1543#issuecomment-437074381, I am thinking out loud about the costs of maintaining such native support. As you know, Rasterio is a growing but small project. Testing against all the cloud storage systems and curating tokens for them in our CI is something I don't have time for. Having champions for each cloud would help, but that's not entirely reliable, as people come and go from open source projects.

Does anyone object to the inclusion of code in Rasterio that supports cloud storage systems but doesn't actually fetch any bits from Alibaba or Swift or whatever? I'm not saying that the code would be untested: our CI can install the cloud providers' Python SDKs and test using mock services.

Yes? No?

--
Sean Gillies

Madhav Desetty
 

Yes. I am willing to contribute and champion Google Cloud Storage. We at Airbus use Google Cloud Platform, so there is a lot of interest in adding Google Cloud Storage support. I am looking at marblecutter-virtual, which uses rasterio to stream COGs, and this is of great interest to me.

Guy Doulberg
 

Hi Sean,

Having more native support for cloud storage would be great. The project I am part of uses Azure Blob Storage.

But I don't feel rasterio is lacking anything in Azure support, because it uses the vsiaz support added in GDAL.

So I wanted to ask: why is adding support for more cloud services a rasterio issue and not a GDAL issue?

Thanks, Guy

Sean Gillies
 

Hi Guy,

Indeed, you can set a couple of environment variables and run `with rasterio.open("/vsiaz/example.tif")` and things should work.

However, I'm thinking about applications where Rasterio is used in combination with classes from a cloud's Python SDK (like https://docs.microsoft.com/en-us/python/api/overview/azure/storage?view=azure-python for Azure) and where resources are identified by their official identifiers. These identifiers would be https: URLs for Azure blobs (described in https://docs.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata) and ARNs or s3: URLs for AWS objects.

In a Python application using the Azure SDK, I want to be able to do something like this (following from the example in https://docs.microsoft.com/en-us/python/api/overview/azure/storage?view=azure-python):

blob_url = blob_service.make_blob_url('mycontainername', 'myblobname')
with rasterio.open(blob_url) as src:
    src.read()

I don't think a Rasterio user should need to think about the /vsiaz/ details that are specific to GDAL at all.
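To make the idea concrete, here is a minimal sketch of the kind of translation that would have to happen behind such a call. It is only an illustration, not Rasterio's actual implementation: the function name to_vsi_path is hypothetical, and it assumes that Azure blob URLs look like https://ACCOUNT.blob.core.windows.net/CONTAINER/BLOB and that GDAL gets the storage account and credentials from its configuration (for example AZURE_STORAGE_ACCOUNT) rather than from the path.

```python
from urllib.parse import urlparse


def to_vsi_path(uri: str) -> str:
    """Translate a cloud object identifier into a GDAL /vsi path.

    Hypothetical helper for illustration only; not part of Rasterio.
    """
    parts = urlparse(uri)

    if parts.scheme == "s3":
        # s3://bucket/key -> /vsis3/bucket/key
        return f"/vsis3/{parts.netloc}{parts.path}"

    if parts.scheme == "https" and parts.netloc.endswith(".blob.core.windows.net"):
        # https://account.blob.core.windows.net/container/blob
        #   -> /vsiaz/container/blob
        # The account name is dropped here on the assumption that GDAL
        # reads it from configuration such as AZURE_STORAGE_ACCOUNT.
        return f"/vsiaz{parts.path}"

    # Plain paths and unrecognized schemes are passed through unchanged.
    return uri


azure_url = "https://myaccount.blob.core.windows.net/mycontainername/myblobname"
print(to_vsi_path(azure_url))
print(to_vsi_path("s3://mybucket/bar/raster.tif"))
```

With something like this in place, the user keeps working with the identifiers the cloud SDK hands out, and the /vsiaz/ or /vsis3/ form stays an internal detail.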

On Sun, Dec 9, 2018 at 12:22 AM Guy Doulberg <guyd@...> wrote:


--
Sean Gillies