I am using PostgreSQL 15 hosted on AWS Aurora Serverless v2. I have a table with 200 million rows and I need to update the value of one column for all of those rows. The table has a primary key id generated as a random UUID. It doesn't have any other index.
When I run update Table set column=value
on this table, it consumes all of the database's CPU/memory, which impacts production traffic. Since the table doesn't have any distribution key, I can't update the rows page by page.
I am looking for a way to update it slowly. I don't mind if it takes a few days or a few weeks.
Do it in small batches
create table already_updated as select id from your_table limit 0;--empty copy of just the id column, used to track processed rows
create unique index on already_updated(id)with(fillfactor=100);
This processes 10k rows at a time, working through the table in chunks, in the order it's written on disk according to the hidden system column ctid:
with selected_batch as(
select ctid
from your_table
where id not in (table already_updated)
order by ctid
limit 1e4)
,updated_batch as(
update your_table
set col1='new_value'
from selected_batch
where selected_batch.ctid=your_table.ctid
returning your_table.id)
insert into already_updated table updated_batch;
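The statement above handles a single batch; something still has to re-run it until nothing is left. A minimal sketch of that outer loop, written here as a PL/pgSQL procedure (the name run_batched_update is made up), committing after every batch so its row locks are released, with the pg_sleep() throttle discussed further down:
create procedure run_batched_update()
language plpgsql as $proc$
declare
  batch_rows bigint;
begin
  loop
    --one batch, exactly as above
    with selected_batch as(
    select ctid
    from your_table
    where id not in (table already_updated)
    order by ctid
    limit 1e4)
    ,updated_batch as(
    update your_table
    set col1='new_value'
    from selected_batch
    where selected_batch.ctid=your_table.ctid
    returning your_table.id)
    insert into already_updated table updated_batch;

    get diagnostics batch_rows = row_count;
    exit when batch_rows = 0;--nothing left to update
    commit;                  --release this batch's row locks
    perform pg_sleep(1.2);   --throttle between batches
  end loop;
end $proc$;

call run_batched_update();
Since already_updated keeps track of what's done, you can interrupt the call at any point and simply re-run it later.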
If that's really a static value, the same for all 200M records, you don't need the separate progress-tracking table or the consistency of id - the absence of 'new_value' means the row needs the update:
with cte as(select ctid
from your_table
where col1<>'new_value'--use "is distinct from" instead of <> if col1 can be null
limit 1e4)
update your_table
set col1='new_value'
from cte--,pg_sleep(1.2)--not necessarily advised, extends lock time
where cte.ctid=your_table.ctid;
Same thing if it's not entirely the same for all rows, but it's still possible to tell a not-yet-updated row from one that's already processed in a where clause. If you're doing this on a regular basis to update a field that's calculated based on other fields of that table, consider making it a generated column:
alter table your_table
drop column if exists col1
, add column col1 numeric generated always as (col2*col3) stored;
That'll update on its own whenever either of the other two columns does.
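A self-contained illustration of that, on a throwaway table (gen_demo is just a made-up name, assuming numeric columns):
create table gen_demo(col2 numeric, col3 numeric
  , col1 numeric generated always as (col2*col3) stored);
insert into gen_demo(col2,col3) values (2,3);--col1 is stored as 6
update gen_demo set col3=10;                 --col1 is recomputed to 20, no separate update of it needed
select * from gen_demo;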
To add a delay between batches, it's best to run a separate select pg_sleep(1.2); in between them, at the end of each loop iteration (as in the procedure sketch above). You can add it straight into the update query like I showed above, but apart from slowing down the query, it will extend the amount of time the target rows are locked for update.
To monitor progress, you can select count(*) from your_table; once, then keep subtracting the batch size (10k in the examples above) after each loop. You can also estimate the progress of the current update based on previous ones: if it takes 10s to update 10k rows the first time around, you can interpolate and assume that after running for 4s, the second batch processed 4k rows. Or you can sneak a nextval() into your update and call it from outside to see the progress (sequences are subject to dirty reads). That slows the query down a bit:
begin;
drop sequence if exists progress;
create sequence progress;
select nextval('progress');--prime the sequence; the polling query below subtracts this initial call
commit;
create extension if not exists dblink;
select dblink_connect('another_client','');--open a second connection (empty string: libpq connection defaults)
select dblink_send_query('another_client',  --asynchronous: returns immediately while the update keeps running
$q$ with cte as(select ctid
from test
where z is null
limit 5000)
update test
set z='new_value'
from cte
where cte.ctid=test.ctid
and nextval('progress')<>0;--always true, but advances the sequence once per updated row
$q$);
select elapsed
, progress
from generate_series(1,8)n
cross join lateral(
select .002*n*'1s'::interval as elapsed
, nextval('progress')-n-1 as progress--subtract this poll's own nextval() calls and the priming one
, pg_sleep(.002*n)) as poll;--the subquery alias is required on PostgreSQL 15 and older
elapsed | progress |
---|---|
00:00:00.002 | 187 |
00:00:00.004 | 391 |
00:00:00.006 | 752 |
00:00:00.008 | 1391 |
00:00:00.01 | 2238 |
00:00:00.012 | 3166 |
00:00:00.014 | 4296 |
00:00:00.016 | 5000 |
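If you'd rather not have the act of checking advance the counter (which is why the query above subtracts n), you can also read the sequence's backing relation from any session; as long as only the running update calls nextval(), the number of rows done is roughly last_value minus the priming call:
select last_value - 1 as rows_updated_so_far--the -1 accounts for the initial nextval() in the setup
from progress;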
View, without having to update
You can also consider hiding the table behind a view if you need an additional column with a static value that's always the same for all rows, or if relatively few rows need to diverge from it:
demo at db<>fiddle
create table test(id,x)as select generate_series(1,5e5),(random()*1e7)::int;
alter table test add primary key(id);
create table default_value_for_test(id,y)as values(0,'default value');
alter table default_value_for_test
add primary key(id)
,add constraint only1row check(id=0);
create table diverged_values_for_test(id,y)as values(1,'val1'),(4,'val4');
alter table diverged_values_for_test
add primary key(id)
,add foreign key(id)references test(id)
on update cascade on delete cascade;
create view v_test as
select t.*, coalesce(diverged_.y,default_.y) as y
from test as t(id,x)
left join diverged_values_for_test as diverged_(id,y)using(id)
cross join default_value_for_test as default_(id,y);
explain analyze verbose
update default_value_for_test set y='newer static value';
Update on public.default_value_for_test (actual time=0.048..0.049 rows=0 loops=1)
  -> Seq Scan on public.default_value_for_test (actual time=0.009..0.010 rows=1 loops=1)
Execution Time: 0.081 ms
It took under .1 ms to update the tiny single-row table that holds what you want to see as the additional, static column of the big one. Alternatively, you could also keep the default value as a literal constant in the view definition and, whenever you need to change it, use create or replace view to update it.
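A sketch of that literal-constant variant, using a separate hypothetical view name (v_test_static) so it doesn't clash with the demo above; note that create or replace view requires the column names and types to stay the same:
create view v_test_static as
select t.*, 'static value'::text as y
from test as t;

--"updating" every row is then just a view redefinition:
create or replace view v_test_static as
select t.*, 'newer static value'::text as y
from test as t;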
You're also free to add exceptions and let the value diverge from that for some rows - if there's little traffic on the column, it'll take some time before it's viable to instead make diverged_ an actual additional column in your table with 200M records, and make default_ an actual default on that column. Inserts and updates on diverged_ should also be way faster than updates on the main table.
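For example, letting one more row diverge is a single-row insert into the small table, and reads through the view pick it up immediately (continuing the demo schema above):
insert into diverged_values_for_test values (2,'val2');--row 2 now diverges from the default
select * from v_test where id in (1,2,3) order by id;  --rows 1 and 2 show their own values, row 3 the default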
If you instead tried to save it in all 500k rows of the test table right now, it would obviously take more time and space:
alter table test add column z text default 'another static value';
explain analyze verbose
update test set z='another static value';
Update on public.test (actual time=4894.635..4894.636 rows=0 loops=1)
  -> Seq Scan on public.test (actual time=0.009..143.322 rows=500000 loops=1)
Execution Time: 4894.668 ms
The cost of having your table split in three and moved behind a view is incurred on reads: the three tables obviously need to be joined together to get the full picture, but the price of cross joining the single cell to every row is negligible, and the cost of the left join to diverged_ depends on how many rows have diverged.
For batch updates, you can also declare a cursor and use update .. where current of your_cursor.
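A minimal sketch of that, in plain SQL inside a single transaction - keep in mind the transaction and the locks it takes stay open until the final commit:
begin;

declare your_cursor cursor for
  select id from your_table where col1 is distinct from 'new_value';

--repeat this pair, pausing as long as you like between repetitions,
--and stop once fetch no longer returns a row:
fetch next from your_cursor;
update your_table set col1='new_value' where current of your_cursor;

close your_cursor;
commit;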