We're migrating our EMR cluster to Dataproc, and we rely on Oozie to run our workflows. The first challenge is how to load the workflow.xml
from a Cloud Storage bucket. We used to do it using S3:
oozie.coord.application.path=s3://my_workflow/workflows/daily
Using the same approach with GCS does not work:
oozie.coord.application.path=gs://my_workflow/workflows/daily
When I try to run the Oozie job, I get this error:
gs URI scheme not supported
Do I have to manually configure the scheme in Oozie? I'm using the Dataproc initialization action to deploy Oozie.
I reproduced your problem. It seems the Oozie init action doesn't support loading workflow.xml from GCS yet. You can file a bug against the init action, but for now you might have to put the file in HDFS.
Regarding the fix, it needs:
1) In /etc/oozie/conf/oozie-site.xml, add
<property>
<name>oozie.service.HadoopAccessorService.supported.filesystems</name>
<value>hdfs,gs</value>
<description>...</description>
</property>
2) In /etc/oozie/conf/hadoop-conf/core-site.xml, add
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: uris.</description>
</property>
<property>
<name>google.cloud.auth.service.account.enable</name>
<value>false</value>
<description>
Whether to use a service account for GCS authorization.
Setting this property to `false` will disable use of service accounts for
authentication.
</description>
</property>
3) Copy gcs-connector.jar from /usr/lib/hadoop/lib/ to /usr/lib/oozie/lib.
4) Restart the Oozie service with
sudo systemctl restart oozie.service
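If you want to script this, here is a rough bash sketch of the check for step 1 and the copy for step 3. The function names are made up for illustration, the default paths are the ones from the steps above, and the XML lookup is a simple line-based match (it assumes each <name> and <value> sits on its own line), not a real XML parser.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; default paths come from the steps above.

# Print the <value> that follows a given <name> in a Hadoop-style XML config.
# Line-based: assumes <name> and <value> are on separate lines.
get_hadoop_property() {
  local file="$1" name="$2"
  awk -v name="${name}" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")   # strip the tags, keep the text
      print
      exit
    }
  ' "${file}"
}

# Succeed if "gs" is listed in Oozie's supported filesystems (step 1).
oozie_supports_gs() {
  local file="${1:-/etc/oozie/conf/oozie-site.xml}"
  local value
  value="$(get_hadoop_property "${file}" \
    oozie.service.HadoopAccessorService.supported.filesystems)"
  value="${value// /}"                    # tolerate "hdfs, gs"
  case ",${value}," in
    *,gs,*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Copy the GCS connector jar into Oozie's lib directory (step 3).
copy_gcs_connector() {
  local src_dir="${1:-/usr/lib/hadoop/lib}"
  local dst_dir="${2:-/usr/lib/oozie/lib}"
  local jar="${src_dir}/gcs-connector.jar"

  if [[ ! -f "${jar}" ]]; then
    echo "connector jar not found: ${jar}" >&2
    return 1
  fi
  cp "${jar}" "${dst_dir}/"
  echo "copied ${jar} to ${dst_dir}"
}
```

On a cluster node you would source this in a root shell, run oozie_supports_gs and copy_gcs_connector with no arguments, and then restart Oozie with the systemctl command from step 4.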