We are using pgBackRest to back up our database to Amazon S3. We do a full backup once a week and an incremental backup every other day. Our database is around 1 TB in size; a full backup is around 600 GB, and an incremental backup is still around 400 GB!
We found out that even read access (pure `SELECT` statements) on the database causes the underlying data files (in /usr/local/pgsql/data/base/xxxxxx) to change. This results in large incremental backups and also in very high storage costs on Amazon S3.
Usually it is the files with low segment suffixes (e.g. 391089.1) that change on read access.
On an `UPDATE`, we see changes in one or more files; the segment number seems to correlate with the age of the rows in the table.
Some more facts:
We see the phenomenon on multiple servers.
Can someone explain why PostgreSQL changes data files on pure read access? We tested this on an isolated database with no other clients accessing it.
This is normal. Some cases I can think of right away are:
- a `SELECT` or other SQL statement setting a *hint bit*

  This is a shortcut for subsequent statements that access the data, so they don't have to consult the commit log any more.
- a `SELECT ... FOR UPDATE` writing a row lock
- autovacuum removing dead row versions

  These are leftovers from `DELETE` or `UPDATE`.
- autovacuum freezing old visible row versions

  This is necessary to prevent data corruption if the transaction ID counter wraps around.
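You can watch the first case (hint bits) happen yourself. The following is a sketch, assuming superuser access, the `pageinspect` contrib extension, and a throwaway table named `t` (not from the question):

```sql
-- pageinspect ships with PostgreSQL's contrib modules; requires superuser
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE t (id int);
INSERT INTO t VALUES (1);

-- inspect the tuple header flags before and after a plain read
SELECT t_infomask FROM heap_page_items(get_raw_page('t', 0));

SELECT * FROM t;  -- a pure read

-- the t_infomask value should now include the HEAP_XMIN_COMMITTED
-- hint bit (0x0100), which means the page was dirtied and the data
-- file on disk will change at the next checkpoint
SELECT t_infomask FROM heap_page_items(get_raw_page('t', 0));
```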
The only way to fairly reliably prevent PostgreSQL from modifying a table in the future is to:
- never perform an `INSERT`, `UPDATE` or `DELETE` on it
- run `VACUUM (FREEZE)` on the table and make sure that there are no concurrent transactions
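A minimal sketch of that second step, assuming a table named `t` and using the `pg_visibility` contrib extension (my choice, not mentioned above) to verify the result:

```sql
-- freeze all visible row versions so later reads have no hint bits
-- or freezing left to write
VACUUM (FREEZE) t;

-- optional check: every page should now be all-visible and all-frozen
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('t'::regclass);
```

If any page shows `all_frozen = false`, a concurrent transaction probably held back the freeze, so repeat the `VACUUM (FREEZE)` once the database is quiet.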