Tags: postgresql, partitioning, postgresql-10, vacuum, table-partitioning

Optimal Postgres partitioning based on date


I am looking for ways to improve the deletion of data in my (partitioned) Postgres tables without worsening access performance.

Using: Postgres 10.2

Omitting some unrelated columns, I have the table transactions with these columns:

transaction_id PK
location
type
user_id
transaction_date

Some important points about the current table:

  1. In production, it has around 100 million rows
  2. The table is partitioned (manually) based on user_id (modulo 100). This means that the transactions for, say, the user with user_id 3 go to transactions_3, and those for the user with user_id 2356 go to transactions_56.
  3. We route the inserts manually, since Postgres 10 doesn't support this type of partitioning out of the box and since we already know the user for whom the transaction has to be inserted (the same applies upon retrieval).
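A minimal sketch of what such a home-grown scheme looks like (the column list follows the description above, but the literal values are made up for illustration):

```sql
-- One ordinary table per remainder class; the application, not Postgres,
-- computes user_id % 100 and picks the target table.
CREATE TABLE transactions_56 (
    transaction_id   bigint PRIMARY KEY,
    location         text,
    type             text,
    user_id          bigint,
    transaction_date date
);

-- user_id 2356: 2356 % 100 = 56, so the application inserts here:
INSERT INTO transactions_56 (transaction_id, location, type, user_id, transaction_date)
VALUES (42, 'somewhere', 'purchase', 2356, '2022-01-15');
```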

What works well: insertions and retrieval. Since we already know the user, we know which table to look at and hence don't have to go through the 100 partitions to find it.

What does not: we have a process which deletes old data quite often, based on the users' subscriptions. This frequently leads to space issues, because deleted data doesn't get freed up immediately: plain VACUUM may not be enough when a table contains large numbers of dead row versions as a result of massive update or delete activity, as is the case here.

How we would like to improve this is to store the data in partitions based on the transaction date, and then drop whole partitions when the subscription is over. This would make the space usable again immediately.

In short, our main goal is to improve the deletion process so that space is recovered immediately, while at the same time making sure that access performance does not get worse.
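What we have in mind could be sketched with declarative range partitioning; the monthly granularity and all names here are assumptions. (Note that on v10 a declaratively partitioned table cannot have a primary key at all; from v11 on it can, but the partition key must be part of it.)

```sql
CREATE TABLE transactions (
    transaction_id   bigint,
    location         text,
    type             text,
    user_id          bigint,
    transaction_date date NOT NULL
) PARTITION BY RANGE (transaction_date);

CREATE TABLE transactions_2022_01 PARTITION OF transactions
    FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');

-- Dropping a partition returns its disk space immediately, no VACUUM needed:
DROP TABLE transactions_2022_01;
```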

I have a few questions regarding this:

  1. If we have the table partitioned based on date, I would think that access at least is going to be slower, since a lookup by transaction_id now has to scan all the partitions to find the row?
  2. Is it really possible to achieve this, keeping the retrieval of transactions as fast as before while improving the deletion process? If so, how?
  3. I would think that partitioning on both the date and the user is not really a possible/good solution, due to the large number of tables that would be created? (We need to keep data for a maximum of 2 years.)
  4. Do we need to move to a newer Postgres for this, say Postgres 14 (it being the latest)? I understand that it is always good to upgrade to the latest version, but I wanted to know whether this is really possible without the Postgres upgrade.

Hoping to get some guidance here on the way forward.


Solution

  • First: upgrading PostgreSQL would be a very good idea, not only because declarative hash partitioning was introduced in v11, but also because of the numerous improvements in partitioning performance and features since v10.
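For comparison, declarative hash partitioning (available from v11 on) replaces the manual modulo-100 routing; this is a sketch with assumed names:

```sql
CREATE TABLE transactions (
    transaction_id   bigint,
    user_id          bigint,
    transaction_date date
) PARTITION BY HASH (user_id);

-- One partition per remainder; repeat for REMAINDER 1 .. 99.
CREATE TABLE transactions_p0 PARTITION OF transactions
    FOR VALUES WITH (MODULUS 100, REMAINDER 0);
```

Note that declarative hash partitioning applies a hash function to user_id rather than using user_id % 100 directly, so rows land in different partitions than under the manual scheme; Postgres routes both inserts and lookups by user_id automatically.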

    I have the feeling that the partitioning scheme you are using right now (home-grown hash partitioning) does not help you much. You cannot get rid of a customer with a simple DROP TABLE (which would be nice), and deleting 10 million rows in a partition is no more fun than deleting them in a single large table. On the contrary: the relative bloat left behind once autovacuum is done will be greater. The only saving grace is that autovacuum will work more efficiently, since it can treat each partition on its own.

    To answer your questions:

    1. Yes, partitioning makes most queries slower; hopefully not much slower. That is the price you are paying.

    2. No, your queries will get somewhat slower (proportional to the number of partitions, so keep that moderate).

    3. You can partition on both criteria, since a partition can again be a partitioned table. But I question if that is really a good idea, since I doubt that your current partitioning scheme is really beneficial.

    4. Yes, use at least v12, ideally v14.
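The composite partitioning from item 3 would look like this sketch (monthly ranges and all names are assumptions); with 24 months times 100 hash partitions that is already 2400 tables, which illustrates the concern about table count:

```sql
CREATE TABLE transactions (
    transaction_id   bigint,
    user_id          bigint,
    transaction_date date NOT NULL
) PARTITION BY RANGE (transaction_date);

-- Each range partition is itself a partitioned table:
CREATE TABLE transactions_2022_01 PARTITION OF transactions
    FOR VALUES FROM ('2022-01-01') TO ('2022-02-01')
    PARTITION BY HASH (user_id);

CREATE TABLE transactions_2022_01_p0 PARTITION OF transactions_2022_01
    FOR VALUES WITH (MODULUS 100, REMAINDER 0);
```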