After reading the PgPool documentation I was left confused about which option suits my use case best. I need a main database instance that serves all queries, and one or more replicas (standbys) of the main one to be used in disaster recovery scenarios. What is very important for me is that all transactions committed successfully on the master node are guaranteed to eventually be replicated to the replicas, so that when a failover occurs, the replica instance has every transaction up to and including the latest one applied to it.
For asynchronous replication, the PgPool documentation does not say whether this guarantee holds; it only mentions that some potential data loss can occur, which is too vague for me to draw any conclusions.
To combat this data loss, the documentation suggests using synchronous streaming replication, which ensures, before a transaction is committed on the main node, that all replicas have also applied the change. This method is therefore slower than the asynchronous one, but if it rules out data loss it could be viable.
Is synchronous replication the only method that allows me to achieve my use case, or would asynchronous replication also do the trick? Also, what exactly constitutes the potential data loss in asynchronous replication?
Asynchronous replication means that the primary server does not wait for a standby server before reporting a successful COMMIT to the client. As a consequence, if the primary server fails, it is possible that the client believes a certain transaction to be committed while none of the standby servers has yet received the WAL information. In a high availability setup, where you promote a standby in case of loss of the primary server, that means you could potentially lose committed transactions, although it typically takes only a split second for the information to reach the standby.
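To make that window concrete, you can see it on a running primary by querying the pg_stat_replication view; any non-zero replay lag is committed data that a freshly promoted standby would be missing. A minimal sketch, assuming PostgreSQL 10 or later:

```sql
-- Run on the primary: show how much WAL each standby has not yet replayed.
-- A non-zero replay_lag_bytes is exactly the set of committed transactions
-- that would be lost if the primary failed right now.
SELECT application_name,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```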
With synchronous replication, the primary waits until the first available synchronous standby server has reported that it has received and persisted the WAL information before reporting a successful COMMIT to the client (the details, such as which standby server is chosen, how many of them have to report back, and when exactly WAL counts as received by the standby, are configurable). So no transaction that has been reported committed to the client can get lost, even if the primary server is gone for good.
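As a rough sketch of what enabling this looks like on the primary (the name standby1 is a placeholder and has to match the application_name the standby uses when connecting):

```sql
-- On the primary: require one named standby to confirm that WAL is flushed
-- to disk before COMMIT returns ('standby1' is a placeholder name).
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = on;  -- wait for remote flush (default)
SELECT pg_reload_conf();                   -- apply without a restart
```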
While it is technically simple to configure synchronous replication, it poses an architectural and design problem, so that asynchronous replication is often the better choice:
- Synchronous replication slows down all data modification drastically. To work reasonably well, the network latency between primary and standby has to be very low. You usually cannot reasonably use synchronous replication between different data centers.
- Synchronous replication reduces the availability of the whole system, because a failure of the standby server prevents any data modification from succeeding. For that reason, you need at least two synchronous standby servers: one that is kept synchronized and one as a stand-in (see the configuration sketch after this list).
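One common way to satisfy that requirement is a quorum-style setting, sketched below with two hypothetical standby names; COMMIT then waits for whichever of the two confirms first, so losing a single standby does not block writes:

```sql
-- On the primary: wait for ANY one of two standbys, so that the failure of
-- either standby does not stop commits ('standby1'/'standby2' are placeholders).
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)';
SELECT pg_reload_conf();
```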
Even with synchronous replication, it is not guaranteed that reading from the standby after writing to the primary will give you the new data, because by default the primary doesn't wait for WAL to be replayed on the standby. If you want that, you need to set synchronous_commit = remote_apply on the primary server. But since queries on the standby can conflict with WAL replay, you will either have to deal with replication (and commit) delay or canceled queries on the standby. So using synchronous replication as a tool for horizontal scaling is reasonably possible only if you can deal with data modifications not being immediately visible on the standby.
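Note that synchronous_commit can also be set per session or per transaction, so you only pay the remote_apply price where read-your-writes on the standby actually matters. A sketch (the accounts table is just an example):

```sql
-- In the session that needs read-your-writes semantics on the standby:
-- COMMIT now also waits until the synchronous standby has *replayed* the WAL,
-- so a subsequent read on that standby is guaranteed to see this write.
SET synchronous_commit = remote_apply;
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;  -- example table
COMMIT;  -- returns only after the standby has applied the change
```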