I am trying to create a Dataproc cluster in my GCP project from an Airflow DAG using the DataprocCreateClusterOperator. I am using ClusterGenerator to generate the config for the cluster. However, I want to set the --public-ip-address flag that I normally pass via the gcloud CLI (see: https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create#--public-ip-address), and it does not seem to be an accepted input for ClusterGenerator (source: https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/dataproc.py#L122).
How would one set the equivalent of --public-ip-address when using ClusterGenerator, or just the plain API-format config?
My code is currently like this:
from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)
from google.api_core.retry import Retry

DATAPROC_CLUSTER_CONFIG = ClusterGenerator(
    project_id=GCP_PROJECT,
    region=GCP_REGION,
    master_machine_type="n2-standard-4",
    master_disk_type="pd-standard",
    master_disk_size=500,
    worker_machine_type="n2-standard-4",
    worker_disk_type="pd-standard",
    worker_disk_size=500,
    num_workers=2,
    image_version="2.2-ubuntu22",
    storage_bucket=DATAPROC_BUCKET,
    properties={
        "dataproc:pip.packages": "sentence-transformers==3.0.1,pydantic==2.8.2"
    },
    internal_ip_only=False,
    enable_component_gateway=True,
).make()

with models.DAG() as dag:
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=GCP_PROJECT,
        cluster_config=DATAPROC_CLUSTER_CONFIG,
        region=GCP_REGION,
        cluster_name=DATAPROC_CLUSTER_NAME,
        retry=Retry(maximum=100.0, initial=10.0, multiplier=1.0),
    )
In the end, I found out that the ClusterGenerator class does not work, by reading this issue: https://github.com/apache/airflow/issues/17089
You get the same settings on your Dataproc cluster as with --public-ip-address by setting internal_ip_only=False and enable_component_gateway=True. However, in my code above, ClusterGenerator was not working.
You need to specify the config yourself, like this:
DATAPROC_CLUSTER_CONFIG = {
    "config_bucket": DATAPROC_BUCKET,
    "gce_cluster_config": {
        "internal_ip_only": False,
    },
    "software_config": {
        "properties": {
            "dataproc:pip.packages": "sentence-transformers==3.0.1,pydantic==2.8.2",
        }
    },
    "endpoint_config": {
        "enable_http_port_access": True
    }
}
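
If the same config is needed in more than one DAG, the dict above could be wrapped in a small helper. This is only a sketch: the function name build_cluster_config and the bucket name "my-dataproc-bucket" are made up for illustration, and the keys are just the ones from the dict above.

```python
from typing import Any, Dict


def build_cluster_config(bucket: str, pip_packages: str) -> Dict[str, Any]:
    """Build a Dataproc cluster config dict in the plain API format.

    Hypothetical helper (not part of Airflow); it only reproduces the
    hand-written dict with the bucket and pip packages as parameters.
    """
    return {
        "config_bucket": bucket,
        # False lets the cluster VMs receive external (public) IP addresses.
        "gce_cluster_config": {"internal_ip_only": False},
        "software_config": {
            "properties": {"dataproc:pip.packages": pip_packages},
        },
        # Plain-dict counterpart of enable_component_gateway=True.
        "endpoint_config": {"enable_http_port_access": True},
    }


# "my-dataproc-bucket" is a placeholder, not a real bucket.
DATAPROC_CLUSTER_CONFIG = build_cluster_config(
    bucket="my-dataproc-bucket",
    pip_packages="sentence-transformers==3.0.1,pydantic==2.8.2",
)
```

The resulting dict is passed as cluster_config= to DataprocCreateClusterOperator exactly as in the question's DAG.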