google-cloud-platform google-cloud-sql google-cloud-data-fusion

Private Data Fusion instance can't write to private CloudSQL instance


I am learning GCP and have a question about pipeline execution in Data Fusion. I have a private Data Fusion instance and a private Cloud SQL instance running Postgres. I peered both of them to the same VPC (the VPC contains only one subnet). I also set up a proxy VM for them by following this guide, although I had to change a few things, such as the driver. The guide says to use the “normal” JDBC driver instead of the Cloud SQL version. That didn't work for me, but the CloudSQL-Postgres JDBC driver did. The connection seems to be successful; at least that's what GCP is telling me.

The problem arises when I try to run a pipeline. The pipeline is only meant to transfer a CSV file from Google Cloud Storage into a table in the Postgres Cloud SQL instance. I get no errors during pipeline setup either. The error appears only once I deploy and run the pipeline, and it mentions “Schema validation failed”.

Exact error message: “Exception while trying to validate schema of database table public."<table_name>" for connection jdbc:postgresql:///<db_name>.”

It's also worth mentioning that when I convert the entire schema to “string”, both in the database and in Data Fusion, the pipeline runs successfully in “preview” mode but crashes again when I actually deploy and run it.

So my question is: what could the problem be? I also read that if I deploy a Dataproc cluster in a new VPC (which I did), I need to set the appropriate firewall rule (according to this document). I did set this rule.

Maybe I configured the VM incorrectly, or deployed the pipeline incorrectly. I am also able to transfer data from an on-premises database into GCP storage without any problems. I only get this error when I use Cloud SQL as a sink.

Any help is highly appreciated.


Solution

  • I was following the guide too closely. During the VM setup, the guide says to run the following:

    gcloud compute firewall-rules create allow-private-cdf \
    --allow=tcp:22,tcp:${SQL_PORT} \
    --source-ranges=$CDF_IP_RANGE --network=$NETWORK --project=$PROJECT
    

    That rule only allows traffic from the Data Fusion IP range. A second firewall rule is needed so that the Dataproc cluster in the VPC can reach the proxy VM. After I added it, the pipeline worked as expected.
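
A minimal sketch of what such an additional rule might look like, modeled on the guide's command above. The rule name `allow-dataproc-to-proxy`, the variable `DATAPROC_IP_RANGE`, and all default values are assumptions for illustration and do not come from the guide; substitute the subnet range your Dataproc cluster actually uses. The `run` helper is also hypothetical: it only prints the command by default, so it can be reviewed before anything is created.

```shell
# Placeholder values (assumptions, not from the original post); override via the environment.
PROJECT="${PROJECT:-my-project}"
NETWORK="${NETWORK:-my-vpc}"
SQL_PORT="${SQL_PORT:-5432}"                            # PostgreSQL default port
DATAPROC_IP_RANGE="${DATAPROC_IP_RANGE:-10.128.0.0/20}" # subnet range used by the Dataproc cluster

# Dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical rule name; lets the Dataproc workers reach the proxy VM on the SQL port.
run gcloud compute firewall-rules create allow-dataproc-to-proxy \
  --allow="tcp:${SQL_PORT}" \
  --source-ranges="${DATAPROC_IP_RANGE}" \
  --network="${NETWORK}" \
  --project="${PROJECT}"
```

Setting `DRY_RUN=0` before running the script executes the `gcloud` command instead of printing it. If the proxy VM carries a network tag, narrowing the rule with `--target-tags` would keep it from applying to every instance in the VPC.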