postgresqln-gramtext-analysis

How to extract n-gram word sequences from text in Postgres


I am hoping to use Postgres to extract sequences of words from Text. For example the whole word trigrams for the following sentence

"ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium"

would be

I have been doing this with R but I am hoping Postgres would be able to handle it more efficiently.

I have seen a similar question asked here n-grams from text in PostgreSQL but I don't understand how to use pg_trgm to extract word sequences


Solution

  • The function below assumes that a word consists of alphanumeric characters (any others are removed) and a space works as a separator.

    create or replace function word_ngrams(str text, n int)
    returns setof text language plpgsql as $$
    declare
        i int;
        arr text[];
    begin
        arr := regexp_split_to_array(str, '[^[:alnum:]]+');
        for i in 1 .. cardinality(arr)- n+ 1 loop
            return next array_to_string(arr[i : i+n-1], ' ');
        end loop;
    end $$;
    

    Find all three-word phrases:

    select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 3)
    
            word_ngrams         
    ----------------------------
     ed ut perspiciatis
     ut perspiciatis unde
     perspiciatis unde omnis
     unde omnis iste
     omnis iste natus
     iste natus error
     natus error sit
     error sit voluptatem
     sit voluptatem accusantium
    (9 rows)
    

    Find all six-word phrases:

    select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 6)
    
                     word_ngrams                 
    ---------------------------------------------
     ed ut perspiciatis unde omnis iste
     ut perspiciatis unde omnis iste natus
     perspiciatis unde omnis iste natus error
     unde omnis iste natus error sit
     omnis iste natus error sit voluptatem
     iste natus error sit voluptatem accusantium
    (6 rows)