Tags: google-cloud-platform, oozie, google-cloud-dataproc, oozie-workflow

How to use GCS bucket as workflow file source for Oozie in Dataproc


We're migrating our EMR cluster to Dataproc, and we rely on Oozie to run our workflows. The first challenge is how to load the workflow.xml from a Cloud Storage bucket. We used to do it with S3:

oozie.coord.application.path=s3://my_workflow/workflows/daily

Trying to use the same approach on GCS does not work at all:

oozie.coord.application.path=gs://my_workflow/workflows/daily

When I try to run the Oozie job, I get this error:

gs URI scheme not supported

Do I have to configure the scheme on Oozie manually? I'm using the Dataproc initialization action to deploy Oozie.
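
For context, this is how we submit the job; the Oozie endpoint below is just the default local URL, not necessarily ours:

# job.properties
oozie.coord.application.path=gs://my_workflow/workflows/daily

oozie job -oozie http://localhost:11000/oozie -config job.properties -run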


Solution

  • I reproduced your problem. It seems the Oozie init action doesn't support loading workflow.xml from GCS yet. You could file a bug against the init action, but for now you might have to put the file in HDFS.
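
    A rough sketch of that workaround (the HDFS target path is just an example): copy the application files into HDFS and point oozie.coord.application.path there instead.

    hdfs dfs -mkdir -p /user/$(whoami)/workflows/daily
    hdfs dfs -put coordinator.xml workflow.xml /user/$(whoami)/workflows/daily/
    # then in job.properties:
    # oozie.coord.application.path=hdfs://<namenode>/user/<user>/workflows/daily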

    As for a proper fix, it requires the following changes:

    1) In /etc/oozie/conf/oozie-site.xml, add

    <property>
      <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
      <value>hdfs,gs</value>
      <description>...</description>
    </property>
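
    If you'd rather script this edit (for example in a custom init action), the bdconfig utility that ships on Dataproc images can set Hadoop-style XML properties; the exact flags below follow the pattern used in the public init actions, so treat them as an assumption for your image version:

    sudo bdconfig set_property \
        --configuration_file /etc/oozie/conf/oozie-site.xml \
        --name 'oozie.service.HadoopAccessorService.supported.filesystems' \
        --value 'hdfs,gs' \
        --clobber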
    

    2) In /etc/oozie/conf/hadoop-conf/core-site.xml, add

    <property>
      <name>fs.AbstractFileSystem.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      <description>The AbstractFileSystem for gs: uris.</description>
    </property>
    <property>
      <name>google.cloud.auth.service.account.enable</name>
      <value>false</value>
      <description>
        Whether to use a service account for GCS authorization.
        Setting this property to `false` will disable use of service accounts for
        authentication.
      </description>
    </property>
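
    The same utility can append these two entries to Oozie's copy of core-site.xml (same caveat about the exact bdconfig flags):

    sudo bdconfig set_property \
        --configuration_file /etc/oozie/conf/hadoop-conf/core-site.xml \
        --name 'fs.AbstractFileSystem.gs.impl' \
        --value 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS' \
        --clobber
    sudo bdconfig set_property \
        --configuration_file /etc/oozie/conf/hadoop-conf/core-site.xml \
        --name 'google.cloud.auth.service.account.enable' \
        --value 'false' \
        --clobber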
    

    3) Copy gcs-connector.jar from /usr/lib/hadoop/lib/ to /usr/lib/oozie/lib.
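
    Depending on the image, the jar name may carry a version suffix, so a glob is a safe way to copy it:

    sudo cp /usr/lib/hadoop/lib/gcs-connector*.jar /usr/lib/oozie/lib/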

    4) Restart the Oozie service with

    sudo systemctl restart oozie.service
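
    After the restart, a quick way to confirm the server picked up the new filesystem list and then resubmit (the -configuration admin sub-command exists on recent Oozie releases; adjust the URL if your server isn't on the default port):

    oozie admin -oozie http://localhost:11000/oozie -configuration | grep supported.filesystems
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run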