postgresql prometheus grafana

How to measure the replication in PostgreSQL with Grafana


I'm looking for a way to monitor replication among the nodes of a PostgreSQL cluster in Grafana.

I have reviewed the metrics available in my version and found the following:

    pg_stat_replication_reply_time
    pg_stat_replication_pg_wal_lsn_diff
    pg_stat_replication_pg_current_wal_lsn_bytes

Currently I'm working with the pg_stat_replication_reply_time metric, which, according to my research, measures how long it takes for writes to get from the primary node to the standby.


This is my first time working with these monitoring tools. Can you tell me whether the metric I'm using is the right one?


Solution

  • I personally prefer using either the pg_stat_replication_pg_wal_lsn_diff or the pg_replication_slots_pg_wal_lsn_diff gauge. The second is better (especially for alerts), but it only works if you're using replication slots. Both show how far (in bytes) a replica is behind the master. Since they are gauges, you don't need any functions to use them; just put the name into the query box:

    pg_stat_replication_pg_wal_lsn_diff
    # or, if you have variables
    pg_stat_replication_pg_wal_lsn_diff{instance="$instance"}
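
If you want to sanity-check the numbers outside Grafana, the same gauge can be read through the Prometheus HTTP API. Below is a minimal Python sketch, not a definitive implementation. It assumes that Prometheus is reachable at a base URL you supply, and that each series carries an application_name or instance label. The label names depend on your exporter's queries, so treat them as assumptions. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Gauge name taken from the answer above.
METRIC = "pg_stat_replication_pg_wal_lsn_diff"


def build_query_url(base_url, instance=None):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    query = METRIC
    if instance is not None:
        # Same selector as the Grafana query with an instance variable.
        query = f'{METRIC}{{instance="{instance}"}}'
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_lag_bytes(payload):
    """Turn an instant-query response into {replica name: lag in bytes}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    lags = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        # Assumption: replicas are identified by application_name,
        # falling back to the instance label if it is absent.
        name = labels.get("application_name") or labels.get("instance", "unknown")
        _timestamp, value = series["value"]  # the value arrives as a string
        lags[name] = float(value)
    return lags


def fetch_lag_bytes(base_url, instance=None, timeout=5.0):
    """Query Prometheus and return the current lag per replica, in bytes."""
    url = build_query_url(base_url, instance)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_lag_bytes(json.load(resp))
```

For example, fetch_lag_bytes("http://localhost:9090") would return the current lag per replica if a Prometheus server is listening at that (hypothetical) address.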
    

    As for the other metrics you mentioned, here's what I know: