postgresqlquery-optimizationdatabase-performancedatabase-optimizationpostgresql-11

Move rows older that x days to archive table or partition table in Postgres 11


I would like to speed up the queries on my big table that contains lots of old data.

I have a table named post that has the date column created_at. The table has over ~31 million rows and ~30 million rows older than 30 days.

Actually, I want this:

Any detailed and concrete solution in PostgresSQL 11.15?

My ideas:

Thanks


Solution

  • This is to split your data into a post and post_archive table. It's a common approach, and I've done it (with SQL Server).

    Before you do anything else, make sure you have an index on your created_at column on your post table. Important.

    Next, you need to use a common expression to mean "thirty days ago". This is it.

     (CURRENT_DATE - INTERVAL '30 DAY')::DATE
    

    Next, back everything up. You knew that.

    Then, here's your process to set up your two tables.

    1. CREATE TABLE post_archive AS TABLE post; to populate your archive table.

    2. Do these two steps to repopulate your post table with the most recent thirty days. It will take forever to DELETE all those rows, so we'll truncate the table and repopulate it. That's also good because it's like starting from scratch with a much smaller table, which is what you want. This takes a modest amount of downtime.

      TRUNCATE TABLE post;
      INSERT INTO post SELECT * FROM post_archive
       WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
      
    3. DELETE FROM post_archive WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE; to remove the most recent thirty days from your archive table.

    Now, you have the two tables.

    Your next step is the daily row-migration job. PostgreSQL lacks a built-in job scheduler like SQL Server's Job or MySQL's EVENT so your best bet is a cronjob.

    It's probably wise to do the migration daily if that fits with your business rules. Why? Many-row DELETEs and INSERTs cause big transactions, and that can make your RDBMS server thrash. Smaller numbers of rows are better.

    The SQL you need is something like this:

    INSERT INTO post_archive SELECT * FROM post 
     WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE; 
    DELETE FROM post 
     WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
    

    You can package this up as a shell script. On UNIX-derived systems like Linux and FreeBSD the shell script file might look like this.

    #!/bin/sh
    psql postgres://username:password@hostname:5432/database << SQLSTATEMENTS
    INSERT INTO post_archive SELECT * FROM post 
     WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE; 
    DELETE FROM post 
     WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
    SQLSTATEMENTS
    

    Then run the shell script from cron a few minutes after 3am each day.

    Some notes:

    Cronjobs are quite reliable on Linux and FreeBSD.