I have a table with 10 TB
of data.
I want to run COUNT(1)/COUNT(*)
to compare the EXACT COUNT between the main table and the archive table.
I tried the following SQL statements but didn't find an optimal solution; the queries keep running for more than 15 minutes and counting...
My attempts:
Try 1: Did not work, the query keeps running.
SELECT COUNT(*) FROM large_table WHERE column_date <= '2023-01-01 00:00:00';
Try 2: This runs fast, but how do I apply a condition, and I'm also not sure it gives the EXACT count.
SELECT reltuples AS estimate FROM pg_class where relname = 'large_table';
Try 3: This also keeps running; no result after more than 10 minutes. Using EXPLAIN ANALYZE to get the exact count.
do $$
declare
  r record;
  count integer;
begin
  FOR r IN EXECUTE 'EXPLAIN ANALYZE SELECT * FROM large_table where column_date <=
                      ''2023-01-01 00:00:00'';'
  LOOP
    count := substring(r."QUERY PLAN" FROM ' rows=([[:digit:]]+) loops');
    EXIT WHEN count IS NOT NULL;
  END LOOP;
  raise info '%', count;
end;
$$;
As already established by @Laurenz Albe, Postgres unfortunately really does need to scan and count all of that. If you're not satisfied with the estimates found in pg_class
and really need the exact count, the sad answer is you'll have to wait, or collect your own counts.
You could use tally tables to speed this up, at the price of some overhead on that table. Even if it's hard to predict the ranges of column_date
you'll need to query, they let you retrieve exact counts for most of the range in near-constant time, down to the resolution of your tally tables, and then only count the remainder below that resolution.
E.g. to count everything between two microsecond-precise timestamps, you can cheaply sum the yearly/monthly/daily counts in between them, then add the records between each bound and midnight on its date.
demo at db<>fiddle
create table large_table_tally_daily as
select date_trunc('day',column_date) as column_date
,count(*)
from large_table
group by 1;
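The demo doesn't add one, but a primary key on the day column (an assumption on my part, not part of the fiddle) would keep both the trigger updates and the range sums below index-backed:
alter table large_table_tally_daily
  add primary key (column_date);  -- one row per day, so this is safe and keeps lookups cheap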
create function trgf_large_table_tally_daily_delete_insert() returns trigger as $f$
begin
  -- difftab is the statement-level transition table: the rows just
  -- inserted into, or deleted from, large_table by this statement
  with cte as(
    select date_trunc('day',column_date) as column_date
          ,(case when TG_OP='DELETE' then -1 else 1 end) * count(*) as diff
    from difftab
    group by 1)
  update large_table_tally_daily as a
     set count=a.count+cte.diff
    from cte
   where a.column_date=cte.column_date;
  return null;
end $f$ language plpgsql;
create trigger trg_large_table_tally_daily_delete
after delete on large_table
referencing old table as difftab
for each statement
execute function trgf_large_table_tally_daily_delete_insert();
create trigger trg_large_table_tally_daily_insert
after insert on large_table
referencing new table as difftab
for each statement
execute function trgf_large_table_tally_daily_delete_insert();
/* similar triggers are also needed for UPDATE and TRUNCATE; a sketch follows */
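Here's roughly what those could look like (the names are made up to follow the same pattern, not part of the demo): an UPDATE trigger gets both transition tables, so it subtracts rows at their old day and adds them back at their new one, while TRUNCATE can simply empty the tally.
create function trgf_large_table_tally_daily_update() returns trigger as $f$
begin
  -- like the insert/delete trigger above, this assumes the affected days
  -- already have rows in the tally table
  with diffs as(
      select date_trunc('day',column_date) as column_date, -count(*) as diff
      from old_difftab group by 1
      union all
      select date_trunc('day',column_date), count(*)
      from new_difftab group by 1)
     ,agg as(
      select column_date, sum(diff) as diff from diffs group by 1)
  update large_table_tally_daily as a
     set count=a.count+agg.diff
    from agg
   where a.column_date=agg.column_date;
  return null;
end $f$ language plpgsql;

create trigger trg_large_table_tally_daily_update
  after update on large_table
  referencing old table as old_difftab new table as new_difftab
  for each statement
  execute function trgf_large_table_tally_daily_update();

create function trgf_large_table_tally_daily_truncate() returns trigger as $f$
begin
  -- emptying the base table empties the tally as well
  truncate large_table_tally_daily;
  return null;
end $f$ language plpgsql;

create trigger trg_large_table_tally_daily_truncate
  after truncate on large_table
  for each statement
  execute function trgf_large_table_tally_daily_truncate();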
create view large_table_tally_yearly as
select date_trunc('year',column_date) as column_date
, sum(count)
from large_table_tally_daily
group by 1
order by 1;
table large_table_tally_yearly;
column_date | sum |
---|---|
2019-01-01 00:00:00+00 | 7554 |
2020-01-01 00:00:00+00 | 140586 |
2021-01-01 00:00:00+00 | 139782 |
2022-01-01 00:00:00+00 | 140437 |
2023-01-01 00:00:00+00 | 139971 |
2024-01-01 00:00:00+00 | 131670 |
insert into large_table select from generate_series(1,100);
table large_table_tally_yearly;
column_date | sum |
---|---|
2019-01-01 00:00:00+00 | 7555 |
2020-01-01 00:00:00+00 | 140613 |
2021-01-01 00:00:00+00 | 139803 |
2022-01-01 00:00:00+00 | 140451 |
2023-01-01 00:00:00+00 | 139988 |
2024-01-01 00:00:00+00 | 131690 |
delete from large_table
where ctid in(select ctid
from large_table tablesample bernoulli(0.046)repeatable(.42)
limit 200);
table large_table_tally_yearly;
column_date | sum |
---|---|
2019-01-01 00:00:00+00 | 7552 |
2020-01-01 00:00:00+00 | 140564 |
2021-01-01 00:00:00+00 | 139765 |
2022-01-01 00:00:00+00 | 140418 |
2023-01-01 00:00:00+00 | 139949 |
2024-01-01 00:00:00+00 | 131652 |
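The monthly counts mentioned earlier would just be another rollup over the daily tally, something like this (large_table_tally_monthly is a made-up name, not in the demo):
create view large_table_tally_monthly as
select date_trunc('month',column_date) as column_date
      ,sum(count)
from large_table_tally_daily
group by 1
order by 1;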
With daily resolution, the demo shows this ugly thing can get you the answer about 100x faster (4x with yearly resolution) in 0.9ms on 700k rows, compared to 98.9ms using a direct count(*):
with cte as(select '2020-06-07 12:34:56.123456'::timestamptz as lower_bound
           ,'2023-01-02 01:23:45.678901'::timestamptz as upper_bound)
select -- full days strictly inside the range, read straight from the tally
       (select sum(count)
        from large_table_tally_daily cross join cte
        where column_date>=lower_bound
          and column_date+'1 day'::interval<upper_bound)
       -- partial day at the upper bound: from its midnight (or the lower bound) up to the bound
      +(select count(*) from large_table cross join cte
        where column_date>=greatest( date_trunc('day',upper_bound)
                                    ,lower_bound)
          and column_date<upper_bound )
       -- partial day at the lower bound: from the bound up to the next midnight (or the upper bound)
      +(select count(*) from large_table cross join cte
        where column_date>=lower_bound
          and column_date<least( date_trunc('day',lower_bound)+'1 day'::interval
                                ,upper_bound) );
Counts for the 937 days in between are retrieved in near constant time and queries adding the remainders have very high specificity.
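That specificity only pays off if the remainder lookups on large_table themselves are index-backed, so this assumes an index on column_date exists (the index name below is made up):
create index concurrently if not exists large_table_column_date_idx
    on large_table (column_date);  -- serves the narrow sub-day remainder scans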
It's handy if you're dealing with things like logs or sensor readings that just keep flowing in and never get updated. It's not really worth it if the entire time range in those 10TB sees lots of random traffic where latency matters. And the more columns you need to take into account, the less maintainable it gets.
As a side note, you can use explain(format json)
if you plan to parse its results (keep in mind the row count it reports is still the planner's estimate, not an exact count):
do $d$
declare
  r json;
  count integer;
begin
  EXECUTE 'EXPLAIN(format json)SELECT*FROM large_table where column_date <=
             ''2023-01-01 00:00:00'';' INTO r;
  -- path: first (and only) plan in the array -> top "Plan" node -> its "Plan Rows" estimate
  count := r#>>'{0,Plan,Plan Rows}';
  raise info '%',count;
end $d$;