sqlpostgresqlperformance

Does SQL's IN operator hash its input?


I'm wondering about the performance of Postgres' IN operator. Specifically, does it hash the values it's passed?

Example:

SELECT * FROM table WHERE col IN (1, 2, 3)

versus

SELECT * FROM table WHERE col in (1, 1, 1, 2, 2, 3, 3, 3, 3)

Are these two similar in performance? That is, does IN hash both lists its given?

Another example with a subquery (note the DISTINCT keyword):

SELECT *
FROM table
WHERE col IN (SELECT DISTINCT col FROM another_table WHERE ...)

Is this any better (or worse) than doing:

SELECT *
FROM table
WHERE col IN (SELECT col FROM another_table WHERE ...)

From my understanding:


Solution

  • With a simple test program (of course too simple to take into account the additional data columns, indices, and other size-modifying factors),
    that we'll at least voluntarily make scan the whole table by removing all kind of filters:

    create temp table another_table (col bigint);
    insert into another_table
        select random() * 500000 from generate_series(1, 1000000);
    create temp table tabl (col bigint);
    insert into tabl
        select random() * 500000 from generate_series(1, 1000000);
        
    select count(*) n, count(distinct col) n_distinct from another_table;
    select count(*) n, count(distinct col) n_distinct from tabl;
    
    explain analyze
        SELECT max(col)
        FROM tabl
        WHERE col IN (SELECT DISTINCT col FROM another_table);
    
    explain analyze
        SELECT max(col)
        FROM tabl
        WHERE col IN (SELECT col FROM another_table);
    

    you will see that they have an identical plan, and notably both do Group Key: another_table.col before entering an Hash Join.

    So even if you don't explicit it, internally PostgreSQL (17.5 here) will DISTINCT.

    Even when limiting the results to WHERE col < 999, and adding an index, will it HashAggregate before pushing values to the IN.