Currently, I'm storing a large amount of time-series data into the following PostgreSQL table:
create table view (
user_id uuid,
product_id integer,
banner_id integer,
campaign_id integer,
price integer,
created_at timestamp
);
I get around 80 million entries in this table per day and then aggregate them daily into another table. I've decided to reduce my aggregation window to a smaller tumbling window of 1 minute, aligned to the top of the hour. This was done to save storage space, since this table keeps getting bigger and bigger.
This means that I will aggregate every 1 minute and store the results of the calculation into another table. I am, however, afraid that my constant deletions (every 1 minute) will lock the table and prevent insertions from happening. Here's the deletion query:
delete from view where created_at between '2023-01-01 13:58:00' and '2023-01-01 13:59:00'
I run it after my aggregation query, which looks roughly like this:
select * from view where created_at between '2023-01-01 13:58:00' and '2023-01-01 13:59:00'
I have around ~900 inserts per second on the table. The tumbling window allows me to not worry about race conditions or missing any data, but I'm concerned the deletes might affect the table.
Therefore, my question is: will these frequent DELETEs lock the table and block my INSERTs?
DELETE does not block INSERT
I don't see how a DELETE could block an INSERT. Sure, it competes for general resources like RAM and I/O (like any other process on the same server). And it leads to some table (and index) bloat and background work for autovacuum. But no blocking.
... since this table is getting bigger and bigger.
This should not happen. Whether you delete daily or every minute, deleted rows are eventually cleaned up by autovacuum, and the space is reused. After growing by some factor, depending on your autovacuum settings and write patterns, the table should stop growing further. Bigger with only daily deletes, but all the same. If your table does not stop growing, something's not right in your DB.
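To verify whether dead tuples are actually being cleaned up, you can inspect the cumulative statistics view. A minimal sketch, using the table name from the question:

```sql
-- Check dead-tuple accumulation and the last autovacuum run.
-- If n_dead_tup keeps climbing and last_autovacuum is old or null,
-- autovacuum is not keeping up with the delete rate.
select relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
from   pg_stat_user_tables
where  relname = 'view';
```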
That said, your case sounds like a textbook example for range partitioning with the timestamp as partition key. Attach a new partition for every next day (ahead of time, by an automated process). Postgres will redirect entries to the partition for the current day. Drop (or detach) yesterday's partition once you are done with it.
Not only is dropping a partition hugely faster than deleting millions of rows. It also removes most work for autovacuum, keeps storage to a minimum (no bloat in table or indexes), and keeps inserted rows in physical storage order (which typically helps performance of aggregating data). All earlier points (including your question, really) become irrelevant.
There are detailed instructions for Partition Maintenance in the manual.
Unless there is more going on that would stand against it, switch to a partitioned table at your earliest opportunity.
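A minimal sketch of declarative range partitioning for this case (the table name `view_p` and partition naming scheme are placeholders, not anything from your setup):

```sql
-- Parent table, partitioned by the timestamp column.
create table view_p (
    user_id     uuid,
    product_id  integer,
    banner_id   integer,
    campaign_id integer,
    price       integer,
    created_at  timestamp not null
) partition by range (created_at);

-- Create tomorrow's partition ahead of time (automate this):
create table view_p_2023_01_02 partition of view_p
    for values from ('2023-01-02') to ('2023-01-03');

-- Once yesterday's data is aggregated, drop the whole partition.
-- This is near-instant and leaves no bloat behind:
drop table view_p_2023_01_01;
```

Inserts into `view_p` are routed to the matching partition automatically; your insert statements don't change.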
Would creating an unlogged table help in this scenario?
Sure. Not writing to WAL saves a lot of I/O and storage. But you buy that at the risk of data loss. The manual:
However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers.
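If the raw data is disposable once aggregated, that trade-off may be acceptable. A sketch, assuming a throwaway staging table (`view_staging` is a hypothetical name):

```sql
-- Skip WAL entirely for the short-lived raw data:
create unlogged table view_staging (like view including all);

-- An existing table can also be switched either way
-- (this rewrites the whole table):
alter table view_staging set logged;
```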
Don't name a table "view". (Probably just obfuscating the real name here.)
Don't use BETWEEN for timestamps. See:
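The problem is that BETWEEN is inclusive on both ends, so a row stamped exactly 13:59:00 is caught by both the 13:58 and the 13:59 window, getting aggregated (and deleted) twice. Use a half-open range instead:

```sql
-- Half-open interval: includes 13:58:00, excludes 13:59:00,
-- so adjacent windows never overlap.
select *
from   view
where  created_at >= '2023-01-01 13:58:00'
and    created_at <  '2023-01-01 13:59:00';
```

The same pattern applies to the DELETE.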