sql | postgresql | checksum

Is there a more elegant way to detect changes in a large SQL table without altering it?


Suppose you have a reasonably large (for local definitions of “large”), but relatively stable table.

Right now, I want to take a checksum of some kind (any kind) of the contents of the entire table.

The naïve approach might be to walk the entire table, taking a checksum (say, MD5) of the concatenation of every column in each row, and then perhaps concatenating those per-row checksums and taking the MD5 of the result.
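
Something like this, perhaps, done server-side (a sketch against a hypothetical my_table with a unique id column; the ORDER BY is what makes the result deterministic):

    -- Hash each row's text representation, aggregate in a fixed order,
    -- then hash the aggregate to get one checksum for the whole table:
    SELECT md5(string_agg(md5(t::text), '' ORDER BY t.id)) AS table_checksum
    FROM my_table AS t;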

From the client side, that might be optimized a little by streaming each column's value into the MD5 routine as it's read, incrementally updating the digest instead of building one giant concatenated string.

The reason for this is that at some point in the future we want to re-connect to the database and verify that no other users have mutated the table; that includes INSERT, UPDATE, and DELETE.

Is there a nicer way to determine whether any changes have occurred to a particular table? Or a more efficient/faster way?

Update/clarification:

(This is for Postgres, if it helps. I'd prefer to avoid poking transaction journals or anything like that, but if there's a way to do so, I'm not against the idea.)


Solution

  • Update for postgres 10+

    Now that logical replication is available in postgres, it's generally the best option for continuous change capture. Use that.
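
    Depending on your needs, either full logical replication (publications/subscriptions) or plain logical decoding will do. For change detection, a decoding slot you can peek at over SQL is the simplest sketch (the slot name below is made up, and the server must run with wal_level = logical):

        -- Create a logical replication slot using the built-in test_decoding plugin:
        SELECT pg_create_logical_replication_slot('table_watch', 'test_decoding');

        -- Peek at decoded row changes without consuming them from the slot:
        SELECT * FROM pg_logical_slot_peek_changes('table_watch', NULL, NULL);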

    Adding columns and triggers is really quite safe

    While I realise you've said it's a large table in a production DB that you can't modify, I want to explain how you can make a very low-impact change.

    In PostgreSQL, an ALTER TABLE ... ADD COLUMN of a nullable column takes only moments and doesn't require a table re-write. It does require an exclusive lock, but the main consequence of that is that it can take a long time before the ALTER TABLE can actually proceed; it won't hold anything else up while it waits for a chance to get the lock.
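
    Concretely, it's a one-liner (big_table is a stand-in name; the optional lock_timeout just bounds how long you're willing to queue for the lock):

        -- Optional (9.3+): give up rather than queue indefinitely for the lock.
        SET lock_timeout = '5s';

        -- Nullable, no default: a catalog-only change, so no table rewrite.
        ALTER TABLE big_table ADD COLUMN modified_at timestamptz;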

    The same is true of creating a trigger on the table.

    This means that it's quite safe to add a modified_at or created_at column, plus an associated trigger function to maintain it, to a live table that's in intensive real-world use. Rows added before the column was created will have it null, which makes perfect sense, since you don't know when they were added/modified. Your trigger will set the modified_at field whenever a row changes, so the values will get progressively filled in.
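
    A minimal sketch, again with stand-in names:

        -- Trigger function stamping each inserted/updated row:
        CREATE OR REPLACE FUNCTION set_modified_at() RETURNS trigger AS $$
        BEGIN
            NEW.modified_at := current_timestamp;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER big_table_set_modified
        BEFORE INSERT OR UPDATE ON big_table
        FOR EACH ROW EXECUTE PROCEDURE set_modified_at();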

    For your purposes it's probably more useful to have a trigger-maintained side-table that tracks the timestamp of the last change (insert/update/delete) anywhere in the table. That'll save you from storing a whole bunch of per-row timestamps on disk, and it'll let you discover when deletes have happened. A single-row side-table whose row you update on each change, using a FOR EACH STATEMENT trigger, will be quite low-cost. It's not a good idea for most tables because of contention: it essentially serializes all transactions that attempt to write to the table on that row's update lock. In your case that might well be fine, since the table is large and rarely updated.
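
    Something along these lines (all names invented for illustration):

        -- Single-row side-table recording the last change to big_table:
        CREATE TABLE big_table_last_change (
            last_changed timestamptz NOT NULL
        );
        INSERT INTO big_table_last_change VALUES (current_timestamp);

        CREATE OR REPLACE FUNCTION note_big_table_change() RETURNS trigger AS $$
        BEGIN
            UPDATE big_table_last_change SET last_changed = current_timestamp;
            RETURN NULL;  -- ignored for statement-level AFTER triggers
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER big_table_change_stamp
        AFTER INSERT OR UPDATE OR DELETE ON big_table
        FOR EACH STATEMENT EXECUTE PROCEDURE note_big_table_change();

    Re-checking for changes later is then a single-row read: SELECT last_changed FROM big_table_last_change;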

    A third alternative is to have the side-table accumulate a running log of the timestamps of insert/update/delete statements, or even of the individual rows. This allows your client to read the change-log table instead of the main table and make small updates to its cached data, rather than invalidating and re-reading the whole cache. The downside is that you need a way to periodically purge old and unwanted change-log records.
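
    A statement-level version might look like this (names invented again; TG_OP tells the trigger function which operation fired it):

        CREATE TABLE big_table_changelog (
            changed_at timestamptz NOT NULL DEFAULT current_timestamp,
            op         text        NOT NULL  -- 'INSERT', 'UPDATE' or 'DELETE'
        );

        CREATE OR REPLACE FUNCTION log_big_table_change() RETURNS trigger AS $$
        BEGIN
            INSERT INTO big_table_changelog (op) VALUES (TG_OP);
            RETURN NULL;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER big_table_changelog_trg
        AFTER INSERT OR UPDATE OR DELETE ON big_table
        FOR EACH STATEMENT EXECUTE PROCEDURE log_big_table_change();

        -- Periodic purge; the retention period here is arbitrary:
        DELETE FROM big_table_changelog
        WHERE changed_at < current_timestamp - interval '30 days';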

    So... there's really no operational reason why you can't change the table. There may well be business policy reasons that prevent you from doing so even though you know it's quite safe, though.

    ... but if you really, really, really can't:

    Another option is to use the existing "md5agg" extension: http://llg.cubic.org/pg-mdagg/ . Or, if you built your PostgreSQL install from source, to apply the patch currently circulating on pgsql-hackers that adds an "md5_agg" aggregate to the next release.

    Logical replication

    The bi-directional replication project for PostgreSQL has produced functionality that allows you to listen for and replay logical changes (row inserts/updates/deletes) without requiring triggers on tables. The pg_receivellog tool would likely suit your purposes well when wrapped with a little scripting.

    The downside is that you'd have to run a patched PostgreSQL 9.3, so I'm guessing that if you can't change a table, running a bunch of experimental code that's likely to change incompatibly in the future isn't going to be high on your priority list ;-). It's included in the stock release of 9.4, though; see "changeset extraction".

    Testing the relfilenode timestamp won't work

    You might think you could look at the modified timestamp(s) of the file(s) that back the table on disk. This won't be very useful: table data is flushed to those files asynchronously by the background writer and the checkpointer, long after the transactions that changed it committed, and the files are also touched by activity that isn't a logical change at all, such as VACUUM and hint-bit setting. The timestamps therefore bear little relationship to when the table's contents actually changed.

    Examine the transaction logs

    In theory you could attempt to decode the transaction logs with pg_xlogdump and find records that affect the table of interest. You'd have to try to exclude activity caused by VACUUM, full-page writes after hint-bit setting, and of course the huge amount of activity from every other table in the entire database cluster.

    The performance impact of this is likely to be huge, since every change to every database on the entire system must be examined.

    All in all, adding a trigger on a table is trivial in comparison.