I'm wondering about the performance of Postgres' IN
operator. Specifically, does it hash the values it's passed?
Example:
SELECT * FROM table WHERE col IN (1, 2, 3)
versus
SELECT * FROM table WHERE col in (1, 1, 1, 2, 2, 3, 3, 3, 3)
Are these two similar in performance? That is, does IN
hash both lists its given?
Another example with a subquery (note the DISTINCT
keyword):
SELECT *
FROM table
WHERE col IN (SELECT DISTINCT col FROM another_table WHERE ...)
Is this any better (or worse) than doing:
SELECT *
FROM table
WHERE col IN (SELECT col FROM another_table WHERE ...)
From my understanding:
SELECT DISTINCT ...
in the subquery is actually extra work, and not performant. I'm not sure why though, I was just told this.DISTINCT ...
we also construct our own hash table (similar to IN
)?With a simple test program (of course too simple to take into account the additional data columns, indices, and other size-modifying factors),
that we'll at least voluntarily make scan the whole table by removing all kind of filters:
create temp table another_table (col bigint);
insert into another_table
select random() * 500000 from generate_series(1, 1000000);
create temp table tabl (col bigint);
insert into tabl
select random() * 500000 from generate_series(1, 1000000);
select count(*) n, count(distinct col) n_distinct from another_table;
select count(*) n, count(distinct col) n_distinct from tabl;
explain analyze
SELECT max(col)
FROM tabl
WHERE col IN (SELECT DISTINCT col FROM another_table);
explain analyze
SELECT max(col)
FROM tabl
WHERE col IN (SELECT col FROM another_table);
you will see that they have an identical plan, and notably both do Group Key: another_table.col
before entering an Hash Join.
So even if you don't explicit it, internally PostgreSQL (17.5 here) will DISTINCT
.
Even when limiting the results to WHERE col < 999
, and adding an index, will it HashAggregate
before pushing values to the IN
.