I am a beginner in cloud and would like to limit my dataproc cluster
access to a given gcs buckets
in my project.
Lets says I have created a service account
named as 'data-proc-service-account@my-cloud-project.iam.gserviceaccount.com'
and then I create a dataproc cluster and assign service account to it.
Now I have created two gcs bucket named as
'gs://my-test-bucket/spark-input-files/'
'gs://my-test-bucket/spark-output-files/'
These buckets holds some input files which needs to be accessed by spark jobs running on my dataproc cluster and also act as a location wherein my spark jobs can write some output files.
I think I have to go and edit my bucket permission as shown in given link. Edit Bucket Permission
I want that my spark jobs can only read files from this specific bucket 'gs://my-test-bucket/spark-input-files/'
.
and if they are writing to a gcs bucket, they can only write to ''gs://my-test-bucket/spark-output-files/'
Question here is: (most likely a question related to SRE resource)
What all IAM permission needs to be added to my data proc service account
data-proc-service-account@my-cloud-project.iam.gserviceaccount.com
on IAM
console page.
and what all read/write permissions needs to be added for given specific buckets, Which I believe has to be configured via adding member and assigning right permission to it. (as shown in the link mentioned above)
Do I need to add my data proc service account as a member and can add below these two roles. will this work?
Storage Object Creator for bucket 'gs://my-test-bucket/spark-output-files/
Storage Object Viewer for bucket 'gs://my-test-bucket/spark-input-files/'
Also let me know in case I have missed anything or something better can be done.
According to the Dataproc IAM doc:
To create a cluster with a user-specified service account, the specified service
account must have all permissions granted by the Dataproc Worker role. Additional
roles may be required depending on configured features.
The dataproc.worker
role has a list of GCS related permissions, including things like storage.objects.get
and storage.objects.create
. And these apply to any buckets.
What you want to do, is to give your service account almost identical permissions to dataproc.worker
role, but limit all the storage.xxx.xxx
permissions to the Dataproc staging bucket. Then in addition, add write access to your output bucket and read access to your input bucket.
Or you can use a different service account than the Dataproc service account when you run your Spark job. This job specific service account will only need the read access to input bucket and write access to output bucket. Assuming you are using the GCS connector (which comes pre-installed on Dataproc clusters) to access GCS, you can follow the instructions found here. But in this case you will have to distribute the service account key across worker nodes or put it in GCS/HDFS.