apache-spark  google-cloud-platform  google-cloud-storage  google-cloud-dataproc  sre

Manage Dataproc cluster access using service account and IAM roles


I am a beginner in cloud and would like to limit my Dataproc cluster's access to specific GCS buckets in my project.

Let's say I have created a service account named 'data-proc-service-account@my-cloud-project.iam.gserviceaccount.com', and then I create a Dataproc cluster and assign this service account to it.

Now I have created two GCS bucket locations:

'gs://my-test-bucket/spark-input-files/'
'gs://my-test-bucket/spark-output-files/'

These buckets hold the input files that need to be accessed by Spark jobs running on my Dataproc cluster, and also act as locations where my Spark jobs can write output files.

I think I have to go and edit my bucket permissions as shown in the given link: Edit Bucket Permission

I want my Spark jobs to only read files from this specific bucket, 'gs://my-test-bucket/spark-input-files/', and if they write to a GCS bucket, to only write to 'gs://my-test-bucket/spark-output-files/'.

The question here is (most likely a question related to SRE resources):

What IAM permissions need to be added to my Dataproc service account data-proc-service-account@my-cloud-project.iam.gserviceaccount.com on the IAM console page?

And what read/write permissions need to be added for the given specific buckets, which I believe have to be configured by adding a member and assigning the right permissions to it (as shown in the link mentioned above)?

Do I need to add my Dataproc service account as a member and assign the two roles below (sketched as gsutil commands right after the list)? Will this work?

Storage Object Creator  for bucket 'gs://my-test-bucket/spark-output-files/'
Storage Object Viewer   for bucket 'gs://my-test-bucket/spark-input-files/'
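
If that is the right approach, I imagine the bindings would look something like the gsutil commands below (I believe these roles apply to the whole bucket, i.e. all of gs://my-test-bucket, so please correct me if restricting access to a folder needs something different):

    # Read access for the input files
    gsutil iam ch \
      "serviceAccount:data-proc-service-account@my-cloud-project.iam.gserviceaccount.com:roles/storage.objectViewer" \
      gs://my-test-bucket

    # Write (object create) access for the output files
    gsutil iam ch \
      "serviceAccount:data-proc-service-account@my-cloud-project.iam.gserviceaccount.com:roles/storage.objectCreator" \
      gs://my-test-bucket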

Also let me know if I have missed anything or if something better can be done.


Solution

  • According to the Dataproc IAM doc:

    To create a cluster with a user-specified service account, the specified service
    account must have all permissions granted by the Dataproc Worker role. Additional
    roles may be required depending on configured features.
    

    The dataproc.worker role has a list of GCS-related permissions, including things like storage.objects.get and storage.objects.create, and these apply to any bucket.
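
    You can print that permission list yourself with a plain gcloud command:

        # Show the predefined Dataproc Worker role, including its includedPermissions list
        gcloud iam roles describe roles/dataproc.worker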

    What you want to do is give your service account almost identical permissions to the dataproc.worker role, but limit all the storage.xxx.xxx permissions to the Dataproc staging bucket. Then, in addition, add write access to your output bucket and read access to your input bucket. A rough sketch of these steps follows below.
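
    In the sketch, names like dataprocWorkerNoGcs and the three bucket names are placeholders, and it assumes gcloud's value() formatter joins the role's permission list with semicolons (which it does in current releases). Note also that these bucket-level roles apply to a whole bucket, not to a folder inside it, so ideally the staging, input and output locations are separate buckets. Some permissions from predefined roles are not supported in custom roles, so you may need to drop a few from the list.

        # Build a permission list from roles/dataproc.worker, dropping the storage.* entries
        PERMS=$(gcloud iam roles describe roles/dataproc.worker \
                  --format="value(includedPermissions)" \
                | tr ';' '\n' | grep -v '^storage\.' | paste -sd, -)

        # Create a custom role with those permissions and grant it at the project level
        gcloud iam roles create dataprocWorkerNoGcs \
          --project=my-cloud-project \
          --title="Dataproc Worker without GCS" \
          --permissions="${PERMS}"

        gcloud projects add-iam-policy-binding my-cloud-project \
          --member="serviceAccount:data-proc-service-account@my-cloud-project.iam.gserviceaccount.com" \
          --role="projects/my-cloud-project/roles/dataprocWorkerNoGcs"

        # Re-grant GCS access only where it is needed: full object access on the Dataproc
        # staging bucket, read on the input bucket, write (object create) on the output bucket
        SA="serviceAccount:data-proc-service-account@my-cloud-project.iam.gserviceaccount.com"
        gsutil iam ch "${SA}:roles/storage.objectAdmin"   gs://my-dataproc-staging-bucket
        gsutil iam ch "${SA}:roles/storage.objectViewer"  gs://my-input-bucket
        gsutil iam ch "${SA}:roles/storage.objectCreator" gs://my-output-bucket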

    Alternatively, you can use a different service account than the Dataproc service account when you run your Spark job. This job-specific service account will only need read access to the input bucket and write access to the output bucket. Assuming you are using the GCS connector (which comes pre-installed on Dataproc clusters) to access GCS, you can follow the instructions found here. But in this case you will have to distribute the service account key across the worker nodes or put it in GCS/HDFS. A submission sketch follows.
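
    In the sketch below, the two auth properties are the ones the GCS connector uses for JSON key file authentication (double-check the exact keys for your connector version against the docs linked above). The cluster name, region, jar, class and key file path are placeholders, and the key file must already exist at that path on the worker nodes.

        # Submit the Spark job so the GCS connector on the workers authenticates
        # with the job-specific service account key instead of the cluster's account
        gcloud dataproc jobs submit spark \
          --cluster=my-cluster \
          --region=us-central1 \
          --class=com.example.MySparkJob \
          --jars=gs://my-test-bucket/jars/my-spark-job.jar \
          --properties="spark.hadoop.google.cloud.auth.service.account.enable=true,spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/on/each/worker/job-sa-key.json"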