airflow, gcloud, google-cloud-dataproc, dataproc

Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow


I am trying to create a Dataproc cluster in my GCP project from an Airflow DAG using the DataprocCreateClusterOperator, with ClusterGenerator generating the cluster config. However, I want to set the equivalent of the --public-ip-address flag that I normally pass on the gcloud CLI (see: https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create#--public-ip-address), but it does not appear to be an available parameter on ClusterGenerator (source: https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/dataproc.py#L122).

How would one set --public-ip-address when using ClusterGenerator, or with the plain API cluster-config format?

My code is currently like this:

from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)
from google.api_core.retry import Retry

DATAPROC_CLUSTER_CONFIG = ClusterGenerator(
    project_id=GCP_PROJECT,
    region=GCP_REGION,
    master_machine_type="n2-standard-4",
    master_disk_type="pd-standard",
    master_disk_size=500,
    worker_machine_type="n2-standard-4",
    worker_disk_type="pd-standard",
    worker_disk_size=500,
    num_workers=2,
    image_version="2.2-ubuntu22",
    storage_bucket=DATAPROC_BUCKET,
    properties={
        "dataproc:pip.packages": "sentence-transformers==3.0.1,pydantic==2.8.2"
    },
    internal_ip_only=False,
    enable_component_gateway=True,
).make()

with models.DAG(dag_id="create_dataproc_cluster_dag") as dag:  # dag_id is required
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=GCP_PROJECT,
        cluster_config=DATAPROC_CLUSTER_CONFIG,
        region=GCP_REGION,
        cluster_name=DATAPROC_CLUSTER_NAME,
        retry=Retry(maximum=100.0, initial=10.0, multiplier=1.0),
    )

Solution

  • Finally, I found out by reading this issue (https://github.com/apache/airflow/issues/17089) that the config produced by the ClusterGenerator class does not work here.

    You get the same settings on your Dataproc cluster as with --public-ip-address by adding internal_ip_only=False and enable_component_gateway=True. However, in my previous code those settings were never applied, because of the ClusterGenerator problem.

    You need to specify the config yourself like:

    DATAPROC_CLUSTER_CONFIG = {
        "config_bucket": DATAPROC_BUCKET,
        "gce_cluster_config": {
            "internal_ip_only": False,
        },
        "software_config": {
            "properties": {
            "dataproc:pip.packages": "sentence-transformers==3.0.1,pydantic==2.8.2",
            }
        },
        "endpoint_config": {
            "enable_http_port_access": True
        }
    }
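    Since the manual dict is just a plain Python structure, it can also be built with a small helper, which keeps the two relevant toggles (public IPs and the Component Gateway) in one place. This is only a sketch; the function name, bucket name, and package pins are placeholders, and the resulting dict is passed to DataprocCreateClusterOperator via cluster_config exactly as in the code above:

    ```python
    def make_cluster_config(bucket: str, pip_packages: dict) -> dict:
        """Build a Dataproc ClusterConfig dict by hand (hypothetical helper)."""
        return {
            "config_bucket": bucket,
            # internal_ip_only=False gives the cluster VMs external IP addresses,
            # the equivalent of gcloud's --public-ip-address flag.
            "gce_cluster_config": {"internal_ip_only": False},
            "software_config": {
                "properties": {
                    # Join {name: version} pairs into the comma-separated
                    # format that dataproc:pip.packages expects.
                    "dataproc:pip.packages": ",".join(
                        f"{name}=={version}" for name, version in pip_packages.items()
                    ),
                }
            },
            # enable_http_port_access=True is the Component Gateway toggle
            # (what ClusterGenerator calls enable_component_gateway).
            "endpoint_config": {"enable_http_port_access": True},
        }

    DATAPROC_CLUSTER_CONFIG = make_cluster_config(
        "my-dataproc-bucket",  # placeholder bucket name
        {"sentence-transformers": "3.0.1", "pydantic": "2.8.2"},
    )
    ```

    The resulting dict is identical to the literal config shown above, so it can be dropped straight into the operator's cluster_config argument.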