postgresqlgoogle-bigquerygoogle-cloud-sqlread-replication

Diagnosing sudden GCP CloudSQL read replica stalling


A few weeks ago our CloudSQL postgres 9.6 read replica started intermittently stopping replication completely until it gets rebooted. After rebooting, it'll catch up and run happily for a few hours before needing to be rebooted again. The read replica is mainly used to feed BigQuery via GCP Federated Query Sync.

What I've looked at so far:

What can I look at to further diagnose why the read replica is stalling out?

We'd like to avoid pg_audit since our DB is heavy on PII. Is there an alternative that doesn't output parameterized values?

Our current plan is to spin up another replica that isn't connected to BigQuery which will hopefully tell us whether the problem is a locking on the primary's side or something on BigQuery's.


Solution

  • After GCP failed to respond to us, we spun up a new replica, moved all connections over to it, and spun down the old. Now it's behaving just fine. Seems like something on CloudSQL's backend got into a bad state.