Proposal: Allow other cloud object store providers


Sean Gillies
 

Hi Ashley,

On Fri, Jun 25, 2021 at 6:07 PM <ashley.sommer@...> wrote:

> Thanks for your reply, Sean.
>
> > are you proposing that potentially any cloud platform would be made first class in rasterio, concretely, as in GDAL and as with rasterio/S3 today?
>
> Yes, that's right.
>
> > What would you think if rasterio were to take the opposite approach and require users to write ~5 lines of code themselves to adapt output of, say, keystonev3 and swiftclient to a standard interface in rasterio
>
> Yeah, I actually had the same thought too, but wasn't sure if it would be received well.
>
> That is actually kind of how you can do it in rasterio already. You can manually create a `keystonev3` or `swiftclient` auth session, and populate it with your credentials. You can then manually create a rasterio `SwiftSession`, and give it that auth session. Then pass the pre-configured `SwiftSession` into `rasterio.Env()` and rasterio will use that as the cloud session. It's a bit of boilerplate code, but it already works across all of the existing `Session` subclasses for the different cloud platforms.
>
> Should that be the "standard" way of doing it? Could it be cleaner? Would AWS S3 still be a special case?

Yes, I think that should be the standard way. AWS would have to remain a special case until we deprecate the special parameters properly, but we can already do what you outlined above for AWS.

> My end goal is to be able to use OpenStack Swift ObjectStore as a storage backend for an opendatacube project. Opendatacube only supports AWS S3 for now, because it relies on rasterio's "first-class" interface to S3. I was told, if I want to get other cloud providers working natively in opendatacube, we need them to be fully supported by rasterio first (as you mentioned, they're already fully supported by GDAL).
>
> Ashley Sommer

I'm not familiar with opendatacube, but I think it should be possible to use Swift now without modifying either opendatacube or rasterio. GDAL already supports four different authentication mechanisms (see https://gdal.org/user/virtual_file_systems.html#vsiswift), and as far as I know all of those options can be set using similarly named environment variables, not only through the GDAL API as rasterio does here: https://github.com/mapbox/rasterio/blob/db03b66e81b489d3f5f01c9edfb6fc720250a2c1/rasterio/env.py#L233-L246. For example, one could call rasterio.session.SwiftSession(...).get_credential_options() and then update os.environ with the result.
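A minimal sketch of that last step, using only the standard library. The `exported_options` helper is hypothetical, and the dictionary holds placeholder values standing in for whatever `get_credential_options()` would return; the two keys are the option names listed in the GDAL /vsiswift/ documentation.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def exported_options(options: Dict[str, str]) -> Iterator[None]:
    """Temporarily copy configuration options into os.environ.

    Hypothetical helper: previous values are restored on exit so the
    credentials do not leak into the rest of the process.
    """
    previous = {key: os.environ.get(key) for key in options}
    os.environ.update(options)
    try:
        yield
    finally:
        for key, old_value in previous.items():
            if old_value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value


# Placeholder values (assumptions), standing in for the result of
# rasterio.session.SwiftSession(...).get_credential_options().
swift_options = {
    "SWIFT_STORAGE_URL": "https://swift.example.com/v1/AUTH_demo",
    "SWIFT_AUTH_TOKEN": "example-token",
}

with exported_options(swift_options):
    # Code that opens /vsiswift/... paths would run here.
    token_inside = os.environ["SWIFT_AUTH_TOKEN"]
```

Whether opendatacube picks these variables up unchanged is untested here; the sketch only shows the environment handling.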

I hope this is a useful suggestion and not a red herring,

--
Sean Gillies


ashley.sommer@...
 

Thanks for your reply, Sean.

> are you proposing that potentially any cloud platform would be made first class in rasterio, concretely, as in GDAL and as with rasterio/S3 today?

Yes, that's right.

> What would you think if rasterio were to take the opposite approach and require users to write ~5 lines of code themselves to adapt output of, say, keystonev3 and swiftclient to a standard interface in rasterio

Yeah, I actually had the same thought too, but wasn't sure if it would be received well.

That is actually kind of how you can do it in rasterio already. You can manually create a `keystonev3` or `swiftclient` auth session, and populate it with your credentials. You can then manually create a rasterio `SwiftSession`, and give it that auth session. Then pass the pre-configured `SwiftSession` into `rasterio.Env()` and rasterio will use that as the cloud session. It's a bit of boilerplate code, but it already works across all of the existing `Session` subclasses for the different cloud platforms.

Should that be the "standard" way of doing it? Could it be cleaner? Would AWS S3 still be a special case?

My end goal is to be able to use OpenStack Swift ObjectStore as a storage backend for an opendatacube project. Opendatacube only supports AWS S3 for now, because it relies on rasterio's "first-class" interface to S3. I was told, if I want to get other cloud providers working natively in opendatacube, we need them to be fully supported by rasterio first (as you mentioned, they're already fully supported by GDAL).

Ashley Sommer


Sean Gillies
 

Hi,

Thanks for bringing this up! To be sure that I understand, are you proposing that potentially any cloud platform would be made first class in rasterio, concretely, as in GDAL and as with rasterio/S3 today? AWS is special in rasterio because it's what I use at work and is where most of the important public raster datasets were hosted 2-3 years ago. GDAL makes all cloud platforms first class because people or organizations paid the maintainer to do it and because it's a less complex approach than making GDAL extendable.  

What would you think if rasterio were to take the opposite approach and require users to write ~5 lines of code themselves to adapt output of, say, keystonev3 and swiftclient to a standard interface in rasterio?
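To make the idea concrete, here is a rough sketch of what such a user-written adapter might look like. Everything in it is hypothetical: `CredentialSource` is an imagined standard interface, not an existing rasterio API, and `FakeConnection` stands in for a swiftclient connection (whose `get_auth()` is assumed to return a storage URL and token pair) so that the example runs offline.

```python
from typing import Dict, Protocol, Tuple


class CredentialSource(Protocol):
    """Hypothetical standard interface: anything that yields GDAL options."""

    def get_credential_options(self) -> Dict[str, str]:
        ...


class SwiftAdapter:
    """User-written adapter around an object exposing get_auth()."""

    def __init__(self, connection) -> None:
        self._connection = connection

    def get_credential_options(self) -> Dict[str, str]:
        # Assumption: get_auth() returns (storage_url, auth_token).
        storage_url, auth_token = self._connection.get_auth()
        return {
            "SWIFT_STORAGE_URL": storage_url,
            "SWIFT_AUTH_TOKEN": auth_token,
        }


class FakeConnection:
    """Stand-in for a real, already authenticated Swift connection."""

    def get_auth(self) -> Tuple[str, str]:
        return ("https://swift.example.com/v1/AUTH_demo", "example-token")


def gdal_options(source: CredentialSource) -> Dict[str, str]:
    """What the library side would call on any provider's adapter."""
    return source.get_credential_options()


options = gdal_options(SwiftAdapter(FakeConnection()))
```

The library would only ever see the interface, so supporting another provider would mean writing another small adapter rather than changing rasterio.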

On Sun, Jun 13, 2021 at 10:03 PM <ashley.sommer@...> wrote:

Hi Everyone,

I understand that AWS S3 is by far the most common cloud object store provider, especially in the geospatial community. Unfortunately some users don't get to choose which object store provider they're given to use.

Rasterio already has Session support for five different cloud platforms: GSSession, OSSSession, AzureSession, SwiftSession and AWSSession. (In this case, sessions are "classes that configure access to secured resources".)

However, it seems that only AWS is a first-class citizen in rasterio, for the following reasons:

  • In the docs, on the topic of the use of objects on the cloud as virtual files, only support for AWS S3 is shown: https://rasterio.readthedocs.io/en/latest/topics/vsi.html
    • Other cloud providers are not even mentioned. I didn't know other sessions existed until I looked into the code.
  • When installing rasterio, you can add support for credentializing AWS S3 (with boto3) using the package "extras" syntax: `pip install rasterio[s3]`.
    • No other cloud providers have their own addon available at install-time.
  • When configuring a rasterio context manager with `rasterio.Env()`, you can pass in an AWSSession or AWS credentials to credentialize your context.
    • If you don't pass in credentials, `Env` will make a session using `Session.aws_or_dummy()`, but doesn't check whether other provider sessions should be used.

I propose making the current object-store implementation less AWS-focused, with:
  • Add the ability to install required packages for other cloud providers, using additional "extras" options:
    • e.g., `rasterio[gs]`, `rasterio[azure]`, `rasterio[swift]`
  • Modify the `rasterio.Env()` context manager to be cloud-platform agnostic, i.e., no "aws-or-dummy" behaviour; instead, choose the session based on which supporting packages are installed and which environment variables are present.
  • Document all of the different cloud providers that Rasterio supports, and how to configure and use them.

I understand this is a large piece of work, and a tall order for an open-source project. I'm usually one to scratch my own itch; my specific requirement out of this is to be able to use rasterio with my existing OpenStack Swift object store. I want to be able to use Swift by passing pre-configured application credentials and using the OpenStack `keystonev3` and `swiftclient` libraries to configure and credentialize the context used for GDAL. Of course I could simply open a PR with some code to patch that support into the existing SwiftSession, but I think a cleaner solution would involve looking at the issues highlighted above, to make a more generalized fix overall, on top of the extra Swift features that I require.

Thanks for reading.



--
Sean Gillies


ashley.sommer@...
 

Hi Everyone,

I understand that AWS S3 is by far the most common cloud object store provider, especially in the geospatial community. Unfortunately some users don't get to choose which object store provider they're given to use.

Rasterio already has Session support for five different cloud platforms: GSSession, OSSSession, AzureSession, SwiftSession and AWSSession. (In this case, sessions are "classes that configure access to secured resources".)

However, it seems that only AWS is a first-class citizen in rasterio, for the following reasons:

  • In the docs, on the topic of the use of objects on the cloud as virtual files, only support for AWS S3 is shown: https://rasterio.readthedocs.io/en/latest/topics/vsi.html
    • Other cloud providers are not even mentioned. I didn't know other sessions existed until I looked into the code.
  • When installing rasterio, you can add support for credentializing AWS S3 (with boto3) using the package "extras" syntax: `pip install rasterio[s3]`.
    • No other cloud providers have their own addon available at install-time.
  • When configuring a rasterio context manager with `rasterio.Env()`, you can pass in an AWSSession or AWS credentials to credentialize your context.
    • If you don't pass in credentials, `Env` will make a session using `Session.aws_or_dummy()`, but doesn't check whether other provider sessions should be used.

I propose making the current object-store implementation less AWS-focused, with:
  • Add the ability to install required packages for other cloud providers, using additional "extras" options:
    • e.g., `rasterio[gs]`, `rasterio[azure]`, `rasterio[swift]`
  • Modify the `rasterio.Env()` context manager to be cloud-platform agnostic, i.e., no "aws-or-dummy" behaviour; instead, choose the session based on which supporting packages are installed and which environment variables are present.
  • Document all of the different cloud providers that Rasterio supports, and how to configure and use them.
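As a rough illustration of the second point, the selection logic could look something like the sketch below. The `PROVIDERS` table and `detect_provider` function are hypothetical and not part of rasterio; the environment variable names follow GDAL's documented configuration options, and which Python package (if any) each provider would need is an assumption made for the example.

```python
import importlib.util
from typing import Callable, Dict, Mapping, Optional, Tuple


def module_installed(name: str) -> bool:
    """Return True if a top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


# provider -> (environment variables that must all be set,
#              supporting module, or None if none is needed)
# The module column is an assumption for illustration only.
PROVIDERS: Dict[str, Tuple[Tuple[str, ...], Optional[str]]] = {
    "aws": (("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"), "boto3"),
    "gs": (("GOOGLE_APPLICATION_CREDENTIALS",), None),
    "azure": (("AZURE_STORAGE_ACCOUNT",), None),
    "oss": (("OSS_ACCESS_KEY_ID", "OSS_SECRET_ACCESS_KEY"), None),
    "swift": (("SWIFT_STORAGE_URL", "SWIFT_AUTH_TOKEN"), None),
}


def detect_provider(
    environ: Mapping[str, str],
    is_installed: Callable[[str], bool] = module_installed,
) -> Optional[str]:
    """Pick the first provider whose variables and package are present.

    Returns None when nothing matches, in which case the caller would
    fall back to an unauthenticated ("dummy") session.
    """
    for provider, (variables, module) in PROVIDERS.items():
        if not all(name in environ for name in variables):
            continue
        if module is not None and not is_installed(module):
            continue
        return provider
    return None


swift_env = {"SWIFT_STORAGE_URL": "https://swift.example.com/v1/AUTH_demo",
             "SWIFT_AUTH_TOKEN": "example-token"}
chosen = detect_provider(swift_env)
```

With this ordering AWS wins whenever credentials for several providers are present, which keeps today's behaviour as the default; a real implementation would need to decide how to handle that ambiguity.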

I understand this is a large piece of work, and a tall order for an open-source project. I'm usually one to scratch my own itch; my specific requirement out of this is to be able to use rasterio with my existing OpenStack Swift object store. I want to be able to use Swift by passing pre-configured application credentials and using the OpenStack `keystonev3` and `swiftclient` libraries to configure and credentialize the context used for GDAL. Of course I could simply open a PR with some code to patch that support into the existing SwiftSession, but I think a cleaner solution would involve looking at the issues highlighted above, to make a more generalized fix overall, on top of the extra Swift features that I require.

Thanks for reading.