I am hoping to use Postgres to extract sequences of words from Text. For example the whole word trigrams for the following sentence
"ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium"
would be
"ed ut perspiciatis"
"ut perspiciatis unde"
"perspiciatis unde omnis"
...I have been doing this with R but I am hoping Postgres would be able to handle it more efficiently.
I have seen a similar question asked here n-grams from text in PostgreSQL but I don't understand how to use pg_trgm to extract word sequences
The function below assumes that a word consists of alphanumeric characters (any others are removed) and a space works as a separator.
create or replace function word_ngrams(str text, n int)
returns setof text language plpgsql as $$
declare
i int;
arr text[];
begin
arr := regexp_split_to_array(str, '[^[:alnum:]]+');
for i in 1 .. cardinality(arr)- n+ 1 loop
return next array_to_string(arr[i : i+n-1], ' ');
end loop;
end $$;
Find all three-word phrases:
select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 3)
word_ngrams
----------------------------
ed ut perspiciatis
ut perspiciatis unde
perspiciatis unde omnis
unde omnis iste
omnis iste natus
iste natus error
natus error sit
error sit voluptatem
sit voluptatem accusantium
(9 rows)
Find all six-word phrases:
select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 6)
word_ngrams
---------------------------------------------
ed ut perspiciatis unde omnis iste
ut perspiciatis unde omnis iste natus
perspiciatis unde omnis iste natus error
unde omnis iste natus error sit
omnis iste natus error sit voluptatem
iste natus error sit voluptatem accusantium
(6 rows)