Google offers "s3-compatible" access to their Cloud Storage service in the form of something called "Interoperability Mode".
We're running Spark on a closed network, and our only route to the internet is through a proxy. Google's own Hadoop connector for Cloud Storage doesn't expose any proxy settings, so we have to use Spark's built-in s3a connector, which does let you set everything you'd need (the proxy details plus the appropriate Google endpoint) via core-site.xml:
<!-- example core-site.xml snippet -->
<property>
  <name>fs.s3a.access.key</name>
  <value>....</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>....</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://storage.googleapis.com</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxyhost</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>12345</value>
</property>
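For completeness, the same properties can also be supplied programmatically instead of through core-site.xml. A minimal sketch, assuming PySpark and the "spark.hadoop." prefix (which Spark copies into the Hadoop configuration); the key, secret, proxy host, and port are placeholders:

from pyspark import SparkConf, SparkContext

# Rough sketch: the same s3a settings passed as "spark.hadoop.*"
# properties rather than via core-site.xml. All values are placeholders.
conf = (
    SparkConf()
    .set("spark.hadoop.fs.s3a.access.key", "....")
    .set("spark.hadoop.fs.s3a.secret.key", "....")
    .set("spark.hadoop.fs.s3a.endpoint", "https://storage.googleapis.com")
    .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
    .set("spark.hadoop.fs.s3a.proxy.host", "proxyhost")
    .set("spark.hadoop.fs.s3a.proxy.port", "12345")
)
sc = SparkContext(conf=conf)

# Access then goes through the s3a scheme, e.g.:
rdd = sc.textFile("s3a://my-bucket/some/path")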
However, unlike boto, which works fine through the same proxy with equivalent settings, Spark throws a com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception as soon as it tries to go through the proxy:
com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception:
The provided security credentials are not valid.
(Service: Amazon S3; Status Code: 403; Error Code: InvalidSecurity;
Request ID: null), S3 Extended Request ID: null
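For reference, the kind of boto call that does work through the proxy in this setup is roughly the following (a sketch only; the key, secret, bucket, and proxy values are placeholders):

import boto

# Rough sketch of the boto side: same HMAC key/secret, Google's
# interoperability endpoint as the S3 host, and explicit proxy settings.
# All values are placeholders.
conn = boto.connect_s3(
    aws_access_key_id="....",
    aws_secret_access_key="....",
    host="storage.googleapis.com",
    proxy="proxyhost",
    proxy_port=12345,
)

bucket = conn.get_bucket("my-bucket")
for key in bucket.list():
    print(key.name)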
What am I doing wrong here, or is this simply unsupported?
In the same vein, I'm curious whether this version of Spark is even using the jets3t library; I'm finding conflicting information.
I eventually figured this out: you have to remove some specific offending JARs from the classpath. I've detailed my solution in a gist for future me. :)
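For anyone hitting the same thing and trying to work out which JARs to pull: one way to see where a conflicting class is actually being loaded from is to ask the JVM through the py4j gateway. This is only an illustrative check, not the fix itself; `jar_for` is a throwaway helper, and `spark._jvm` assumes a PySpark session (use `sc._jvm` on older versions):

# Illustrative diagnostic: report which JAR on the classpath supplied a
# given class, e.g. the shaded Cloudera AWS SDK class from the stack trace.
def jar_for(jvm, class_name):
    cls = jvm.java.lang.Class.forName(class_name)
    src = cls.getProtectionDomain().getCodeSource()
    return src.getLocation().toString() if src is not None else "(bootstrap classloader)"

print(jar_for(spark._jvm, "org.apache.hadoop.fs.s3a.S3AFileSystem"))
print(jar_for(spark._jvm, "com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception"))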