postgresqlindexingpartial-index

Efficiently indexing sparse nullable column in Postgres


I need to index a Postgres column that consists of mostly NULL values. I don't want the NULL values to be stored in the index (to make the index smaller, and to speed up row insertion). However, adding a partial index on IS NOT NULL only lets me search efficiently for all non-null values, not for a specific non-null value.

How can I define an index on the value in this column that excludes only NULL values from the index?


Solution

  • partial index on IS NOT NULL only lets me search efficiently for all non-null values, not for a specific non-null value.

    That sounds like one of these:

    create index on tbl(col is not null);
    create index on tbl(col is not null, col);
    create index on tbl(col, col is not null);
    

    But none of the above is a partial index. That would be

    create index on tbl(col)where(col is not null);
    

    And it does let you "search efficiently for specific non-null values" while being exactly "an index on the value in this column that excludes only NULL values from the index".


    demo at db<>fiddle

    select setseed(.42);
    
    create table tbl(col,r)as 
      select case when .5<random() then n::int end as col--50% null `col` values
            ,random() as r
      from generate_series(1,7e5)n 
      where .1<random()
      order by random();
    
    create index idx1_normal_no_deduplication on tbl(col)
      with(deduplicate_items=false);
    create index idx2_partial_no_deduplication on tbl(col)
      with(deduplicate_items=false)
      where(col is not null);
    create index idx3_normal_deduplicated on tbl(col)
      with(deduplicate_items=true);
    create index idx4_partial_deduplicated on tbl(col)
      with(deduplicate_items=true)
      where(col is not null);
    

    You can check how much smaller the partial index is compared to the normal one:

    select indexrelname
         , pg_relation_size(indexrelid)
         , pg_size_pretty(pg_relation_size(indexrelid))
    from pg_stat_all_indexes i 
      join pg_class c 
      on i.relid=c.oid 
    where i.relname='tbl'
    order by 2;
    
    indexrelname pg_relation_size pg_size_pretty
    idx2_partial_no_deduplication 7118848 6952 kB
    idx4_partial_deduplicated 7118848 6952 kB
    idx3_normal_deduplicated 9289728 9072 kB
    idx1_normal_no_deduplication 14180352 14 MB

    Which makes sense considering 50% of col are null, and all other values are unique. It also shows that even with deduplication enabled, the partial index can still have much smaller footprint than the deduplicated, non-partial. With deduplication off, you get 50% smaller index and there's still 25% to be saved even with it on. After all, equal values aren't skipped entirely, they are still there but better compacted.
    It goes without saying that with less null values the difference will be less pronounced.

    Leaving only the partial, deduplicated variant:

    explain analyze verboseselect from tbl where col<8e4;
    
    Index Only Scan using idx4_partial_deduplicated on public.tbl (cost=0.42..9804.20 rows=209863 width=0) (actual time=0.060..90.552 rows=36110 loops=1)
    explain analyze verbose select from tbl where col=8e4;
    
    Gather (cost=1000.42..7904.08 rows=3148 width=0) (actual time=81.817..82.158 rows=0 loops=1)
    Workers Planned: 3
    Workers Launched: 3
    -> Parallel Index Only Scan using idx4_partial_deduplicated on public.tbl (cost=0.42..6589.28 rows=1015 width=0) (actual time=50.420..50.421 rows=0 loops=4)

    Considering col is null disables index use since those cases are precisely what the index ignores (also, because it might be easier to seq scan if the conditions describe big enough part of the set):

    explain analyze verbose select from tbl where col=8e4 or col is null;
    
    Seq Scan on public.tbl (cost=0.00..12539.83 rows=314781 width=0) (actual time=0.014..102.891 rows=314082 loops=1)