postgresqltsvector

Should I store empty tsvector values or NULL values?


When storing a tsvector value in a column, for records with no search terms, should I store an empty tsvector or a NULL value?

Does it matter?

Is there any difference in terms of performance, or storage overhead from storing empty vectors?

In other words, when updating the vector based on a value of, say, a nullable title column, do I need to always compute this as to_tsvector(coalesce(title,'')) (since to_tsvector returns NULL when given a NULL argument) or is it enough to do to_tsvector(title)?


Solution

  • Logic aspects of your question

    First off, the semantics of SQL NULL is that of UNKNOWN, whereas some data types also have an "empty" value. Those data types include:

    There are many more, including TSVECTOR. The semantics of an empty collection of something is always subtly different from that of a NULL value, which is an UNKNOWN collection (often just used as an absent collection, though). The distiction manifests specifically when it comes to using operators, e.g.

    In that sense, the decision should be foremost a logic one, not a storage one, based on this question: Will you still work with the TSVECTOR value of a record, even if the record doesn't have any search terms (pro empty TSVECTOR)? Or does the feature not apply at all to that particular record (pro NULL value)? For the @@ operator, it may not be so relevant, but it definitely is for the || operator, and others.

    The answer isn't obvious, nor is there a clear right / wrong way in general.

    Performance aspects of your question

    If this is a highly performance sensitive situation in your application (e.g. you have a lot of empty TSVECTOR values), then maybe, this benchmark could help you with the decision?

    I ran the below benchmark on PostgreSQL 14.1 in Docker to get this result:

    RUN 1, Statement 1: 2.91145
    RUN 1, Statement 2: 1.00000 -- The fastest run is 1. The others are multiples of 1
    RUN 2, Statement 1: 2.80509
    RUN 2, Statement 2: 1.05232
    RUN 3, Statement 1: 2.78001
    RUN 3, Statement 2: 1.00202
    RUN 4, Statement 1: 2.74319
    RUN 4, Statement 2: 1.00524
    RUN 5, Statement 1: 2.75808
    RUN 5, Statement 2: 1.00045
    

    The fact that NULL is involved probably leads to a shortcut in the @@ operator's algorithm, which produces a 2.7x performance improvement over querying an empty TSVECTOR in the benchmark. So, there do seem to be benefits of using NULL in terms of performance.

    Obviously, that's just a benchmark, which doesn't necessarily reflect real-world use-cases, but it should give you a hint of a potential difference.

    Benchmark code

    For reproduction or adaptations, here's a benchmark, based on this technique.

    DO $$
    DECLARE
      v_ts TIMESTAMP;
      v_repeat CONSTANT INT := 10000;
      rec RECORD;
      run INT[];
      stmt INT[];
      elapsed DECIMAL[];
      min_elapsed DECIMAL;
      i INT := 1;
    
      -- Store the vector in a local variable to avoid re-computing it in the benchmark
      v tsvector = to_tsvector('');
    BEGIN
    
      -- Repeat the whole benchmark several times to avoid warmup penalty
      FOR r IN 1..5 LOOP
        v_ts := clock_timestamp();
    
        FOR i IN 1..v_repeat LOOP
          FOR rec IN (
            -- Statement 1
            SELECT v @@ to_tsquery('cat & rat')
          ) LOOP
            NULL;
          END LOOP;
        END LOOP;
    
        run[i] := r;
        stmt[i] := 1;
        elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
                     - EXTRACT(EPOCH FROM v_ts));
        i := i + 1;
        v_ts := clock_timestamp();
    
        FOR i IN 1..v_repeat LOOP
          FOR rec IN (
            -- Statement 2
            SELECT NULL @@ to_tsquery('cat & rat') 
          ) LOOP
            NULL;
          END LOOP;
        END LOOP;
    
        run[i] := r;
        stmt[i] := 2;
        elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
                     - EXTRACT(EPOCH FROM v_ts));
        i := i + 1;
      END LOOP;
    
      SELECT min(t.elapsed)
      INTO min_elapsed
      FROM unnest(elapsed) AS t(elapsed);
    
      FOR i IN 1..array_length(run, 1) LOOP
        RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i], 
          CAST(elapsed[i] / min_elapsed AS DECIMAL(10, 5));
      END LOOP;
    END$$;