Tags: elasticsearch, google-cloud-platform, google-cloud-dataflow, dataflow

Dataflow PubSub to Elasticsearch Template proxy


We need to create a Dataflow job that ingests from PubSub to Elasticsearch, but the job cannot make outbound internet connections to reach Elastic Cloud.

Is there a way to pass proxy parameters to the Dataflow VMs at creation time?

I found the article below, but the proxy parameters are part of a Maven app, and I'm not sure how to use that here.

https://leifengblog.net/blog/run-dataflow-jobs-in-a-shared-vpc-on-gcp/

Thanks


Solution

  • To reach an external endpoint you’ll need to configure internet access and firewall settings. Depending on your use case, your VMs may also need access to other resources; you can check in this document which method you’ll need to configure for Dataflow. Before choosing a method, please also check the document on how to specify a network or a subnetwork.

    In GCP you can enable Private Google Access on a subnetwork, and the VMs in that subnetwork will be able to reach the Google Cloud endpoints (Dataflow, BigQuery, etc.) even if they only have private IPs. There is no need to set up a proxy for those APIs. See this document and the first sketch after this answer.

    For instance, for Java pipelines I normally run the Dataflow workers with private IPs only, and they are able to reach Pub/Sub, BigQuery, Bigtable, etc.

    For Python pipelines, if you have external dependencies the workers will need to reach PyPI, and for that you need internet connectivity. If you want to use private IPs in Python pipelines, you can ship those external dependencies in a custom container so the workers don't need to download them (see the Python sketch below).

    To build the template with Maven: after you write your pipeline, you must create and stage your template file (mvn); you can follow this example, and there is a Maven sketch at the end of this answer.
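
    Below is a minimal sketch of enabling Private Google Access on an existing subnetwork with gcloud; the subnetwork name and region are placeholders for your own values:

    ```sh
    # Enable Private Google Access so workers with only private IPs can reach Google APIs
    gcloud compute networks subnets update my-dataflow-subnet \
        --region=us-central1 \
        --enable-private-ip-google-access
    ```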
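
    And a hedged sketch of launching the Google-provided PubSub to Elasticsearch Flex Template on a specific subnetwork with public IPs disabled. The template path, the parameter names (inputSubscription, connectionUrl, apiKey, errorOutputTopic), and every project/bucket/subnet value are illustrative assumptions; check them against the template's documentation, and the gcloud flags against a recent gcloud release, before running:

    ```sh
    # Run the PubSub to Elasticsearch Flex Template with private-IP workers on a chosen subnetwork
    gcloud dataflow flex-template run "pubsub-to-elasticsearch" \
        --region=us-central1 \
        --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/PubSub_to_Elasticsearch \
        --subnetwork=https://www.googleapis.com/compute/v1/projects/MY_PROJECT/regions/us-central1/subnetworks/my-dataflow-subnet \
        --disable-public-ips \
        --parameters=inputSubscription=projects/MY_PROJECT/subscriptions/my-subscription,connectionUrl=https://my-deployment.es.example.io:9243,apiKey=MY_API_KEY,errorOutputTopic=projects/MY_PROJECT/topics/es-errors
    ```

    Note that reaching Elastic Cloud itself is an external connection, so the workers still need a route to the internet (the internet access and firewall setup from the document linked above) even when they run with private IPs only.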
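
    For the Python case, a sketch of launching a pipeline with private-IP workers and a prebuilt custom container so they don't need to download dependencies from PyPI; the script name, image, and resource names are placeholders, and the options shown (--no_use_public_ips, --subnetwork, --sdk_container_image) assume a reasonably recent Apache Beam SDK:

    ```sh
    # Launch a Python pipeline on Dataflow with private IPs and a custom SDK container
    python my_pipeline.py \
        --runner=DataflowRunner \
        --project=MY_PROJECT \
        --region=us-central1 \
        --temp_location=gs://MY_BUCKET/temp \
        --subnetwork=https://www.googleapis.com/compute/v1/projects/MY_PROJECT/regions/us-central1/subnetworks/my-dataflow-subnet \
        --no_use_public_ips \
        --sdk_container_image=gcr.io/MY_PROJECT/beam-python-custom:latest
    ```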
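
    Finally, in the spirit of the linked article, a sketch of staging a classic Java template with Maven while passing the network options on the command line; the main class, bucket, and VPC/subnetwork names are placeholders:

    ```sh
    # Build and stage a classic Dataflow template, pinning it to a VPC subnetwork with private IPs
    mvn compile exec:java \
        -Dexec.mainClass=com.example.PubSubToElasticsearchPipeline \
        -Dexec.args="--runner=DataflowRunner \
                     --project=MY_PROJECT \
                     --region=us-central1 \
                     --stagingLocation=gs://MY_BUCKET/staging \
                     --templateLocation=gs://MY_BUCKET/templates/pubsub-to-elasticsearch \
                     --network=my-shared-vpc \
                     --subnetwork=https://www.googleapis.com/compute/v1/projects/HOST_PROJECT/regions/us-central1/subnetworks/my-dataflow-subnet \
                     --usePublicIps=false"
    ```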