[SOLVED] Diagnosing sudden GCP CloudSQL read replica stalling

Diagnosing sudden GCP CloudSQL read replica stalling

A few weeks ago our CloudSQL postgres 9.6 read replica started intermittently stopping replication completely until it gets rebooted. After rebooting, it'll catch up and run happily for a few hours before needing to be rebooted again. The read replica is mainly used to feed BigQuery via GCP Federated Query Sync.

What I've looked at so far:

We made no changes to either the primary or replica
No changes in schema
The day it started happening doesn't coincide with any deployments
It's only 250GB which is only 25% of the drive
There's no spike in CPU/memory when it stalls out
There are some long-running queries on the primary's side as well as BigQuery, but they don't coincide with the stall outs and are running/replicating just fine when the replica is behaving
There are no errors in either the primary or replica's logs
hot_standby_feedback is enabled and max_standby_streaming_delay and max_standby_archive_delay are both -1.

What can I look at to further diagnose why the read replica is stalling out?

We'd like to avoid pg_audit since our DB is heavy on PII. Is there an alternative that doesn't output parameterized values?

Our current plan is to spin up another replica that isn't connected to BigQuery which will hopefully tell us whether the problem is a locking on the primary's side or something on BigQuery's.

Solution

After GCP failed to respond to us, we spun up a new replica, moved all connections over to it, and spun down the old. Now it's behaving just fine. Seems like something on CloudSQL's backend got into a bad state.