
How to connect AI Platform Training job to Cloud SQL PSQL DB?


I have a simple Python program that connects to a PostgreSQL (PSQL) DB on Google Cloud Platform. When I run it locally, it connects successfully via the host address (public IP), port, username, and password, but only if I manually tell the DB to allow my local IP address.

When I package this in a Docker image and run it locally, it also connects successfully (again, only if I manually tell the DB to allow my local IP address).

Here is where it fails: if I stop telling the DB to allow my local IP address, the connection fails.

It also fails after I push my Docker image to Google Container Registry and then use an AI Platform Training job to pull the container and run the code:

gcloud ai-platform jobs submit training $JOB_NAME   --region $REGION   --master-image-uri $IMAGE_URI  --   app.py --user_arg='Y'

I communicate with the image via flags and I am sure the image is responding properly. However, when I try to connect to the PSQL DB, I'm getting the error:

psycopg2.OperationalError: could not connect to server: Connection timed out.
Is the server running on host ... and accepting TCP/IP connections on port ...?

I do not want to use the Cloud SQL proxy to solve this problem, nor do I want to set up any sort of static IP and manually "allow" it in the DB settings.

I want to facilitate the connection via IAM service accounts. I gave all of the service accounts the following roles: Cloud SQL Admin, Cloud SQL Editor, Cloud SQL Client, Cloud SQL Instance User, Cloud SQL Service Agent.

As you can tell, I gave the permissions to every account that I can, and it still isn't connecting. Any help would be appreciated!

Also, when I call gcloud ai-platform jobs submit training ... I know that some service account creates an instance to execute the job. I think it is this instance that can't connect. I've read so many gcloud docs already and I am baffled. Maybe I missed something obvious :(


Solution

  • When you run a job on AI Platform, it runs serverless: on Google's side, not in your project. The VM(s) it creates are therefore not in your project (you won't see them on the Compute Engine page) and thus not in your VPC.

    So opening port 5432 is useless, because the job is not on the same network. The only solution is to keep a public IP on your database (with no authorized networks, only the public IP) and to use the Cloud SQL proxy (even if you don't want to).

    Of course, the other solution is to authorize the 0.0.0.0/0 network on the PSQL public IP, but that opens the database to every IP address, so I absolutely do not recommend it!
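
    As a rough sketch of the proxy route on the Python side: assuming the Cloud SQL proxy is bundled in the image and started before app.py so that it listens on 127.0.0.1:5432, the script connects to localhost instead of the public IP. The proxy setup, the DB_* environment variable names, and the helper names below are all assumptions for illustration, not something taken from the question. The proxy itself authenticates with the credentials of whatever service account the job runs as, which needs the Cloud SQL Client role.

```python
import os


def proxy_conn_params(env=None):
    """Build psycopg2 connection kwargs that target a local Cloud SQL proxy.

    The DB_* variable names are hypothetical; adapt them to your job.
    """
    env = os.environ if env is None else env
    return {
        # The proxy listens locally, so the public IP is never used here.
        "host": env.get("DB_HOST", "127.0.0.1"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        # Fail fast instead of hanging until the TCP timeout.
        "connect_timeout": 10,
    }


def connect(env=None):
    # Imported lazily so the helper above can be used without psycopg2.
    import psycopg2

    return psycopg2.connect(**proxy_conn_params(env))
```

    With this in place, app.py would call connect() once the proxy is up; no IP has to be added to the authorized networks.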

    However, one remark: IMO, using the database directly in your training job is not the right pattern. A training job needs speed, efficiency, and low latency, and a database is not well suited to that.

    The correct pattern could be