Tags: amazon-s3, apache-spark, google-cloud-storage, boto, jets3t

How does one use Spark with Google Cloud Storage's "Interoperability Mode"?


Google offers S3-compatible access to its Cloud Storage service in the form of something called "Interoperability Mode".

We're running Spark on a closed network, and our connection to the internet goes through a proxy. Google's own Hadoop connector for Cloud Storage has no configuration settings for a proxy, so we have to use the built-in s3a connector, which lets us set, via core-site.xml, all the properties needed to reach the internet through the proxy and to point at the appropriate Google endpoint:

<!-- example core-site.xml entries -->
<property>
  <name>fs.s3a.access.key</name>
  <value>....</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>....</value>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>https://storage.googleapis.com</value>
</property>

<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxyhost</value>
</property>

<property>
  <name>fs.s3a.proxy.port</name>
  <value>12345</value>
</property>
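
For what it's worth, the same fs.s3a.* properties can also be passed to Spark programmatically through the spark.hadoop.* prefix instead of editing core-site.xml on every node. A minimal PySpark sketch (keys, proxy host, and port below are placeholders, not our real values):

    from pyspark import SparkConf, SparkContext

    # Same fs.s3a.* settings as in core-site.xml, forwarded into the Hadoop
    # configuration via Spark's "spark.hadoop." prefix.
    conf = (
        SparkConf()
        .setAppName("gcs-interop-test")
        .set("spark.hadoop.fs.s3a.access.key", "GOOG....")
        .set("spark.hadoop.fs.s3a.secret.key", "....")
        .set("spark.hadoop.fs.s3a.endpoint", "https://storage.googleapis.com")
        .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
        .set("spark.hadoop.fs.s3a.proxy.host", "proxyhost")
        .set("spark.hadoop.fs.s3a.proxy.port", "12345")
    )
    sc = SparkContext(conf=conf)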

However, unlike boto, which works fine through the proxy in our environment with similar settings, Spark throws a com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception when it tries to use our proxy:

 com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception: 
   The provided security credentials are not valid.
   (Service: Amazon S3; Status Code: 403; Error Code: InvalidSecurity; 
   Request ID: null), S3 Extended Request ID: null
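
For comparison, here is a minimal boto 2 sketch of the kind of setup that works through the proxy; the keys, bucket name, and proxy values are placeholders rather than our actual configuration:

    from boto.s3.connection import S3Connection, OrdinaryCallingFormat

    # Point boto's S3 client at the GCS interoperability endpoint and route
    # it through the proxy. All values here are placeholders.
    conn = S3Connection(
        aws_access_key_id="GOOG....",
        aws_secret_access_key="....",
        host="storage.googleapis.com",
        is_secure=True,
        proxy="proxyhost",
        proxy_port=12345,
        calling_format=OrdinaryCallingFormat(),
    )

    bucket = conn.get_bucket("my-bucket")
    for key in bucket.list(prefix="some/prefix/"):
        print(key.name)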

What am I doing wrong here, or is this simply unsupported?

In the same vein, I'm curious whether this version of Spark even uses the jets3t library; I'm finding conflicting information.


Solution

  • I eventually figured this out: you have to remove some specific offending jars from the classpath. I've detailed my solution in a gist for future me (a quick smoke test is sketched after the link). :)

    https://gist.github.com/chicagobuss/6557dbf1ad97e5a09709
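
    The gist has the specifics, but once the conflicting jars are off the classpath, a quick way to verify the setup is to read an object through an s3a:// URI; the bucket and prefix below are made up:

        # Hypothetical smoke test: read a few lines from a GCS bucket via s3a.
        rdd = sc.textFile("s3a://my-gcs-bucket/some/prefix/part-00000")
        for line in rdd.take(5):
            print(line)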