postgresql pg-trgm

How is similarity calculated in the Postgres pg_trgm module?


Can somebody explain exactly how the similarity function in the Postgres pg_trgm module calculates its result?

e.g. similarity('sage', 'message') = 0.3

1) "  s"," sa",age,"ge ",sag
2) "  m"," me",age,ess,"ge ",mes,sag,ssa

n1: cardinality(1) = 5
n2: cardinality(2) = 8
nt: cardinality(1 intersect 2) = 3

I can't see how these three quantities combine to give 0.3. I would have expected the score to be based on a common string similarity metric such as the Sørensen-Dice coefficient:

2*nt / (n1 + n2) = 6/13 ≈ 0.46

The pg_trgm similarity score seems unusually low to me.


Solution

  • The formula can be found in contrib/pg_trgm/trgm.h (see the macro CALCSML) and is as follows:

    nt / (n1 + n2 - nt)
    

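    In your case that is 3 / (5 + 8 - 3) = 3/10 = 0.3. This is the Jaccard index (size of the intersection divided by size of the union) rather than the Dice coefficient, which is why the score comes out lower than you expected.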
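
To check the arithmetic outside the database, here is a rough Python sketch of the same calculation. It is not the pg_trgm source: the function names are made up for illustration, and the word-splitting rule (lowercase, ASCII letters and digits only, each word padded with two leading spaces and one trailing space) is a simplifying assumption inferred from the trigram lists in the question. The real module's handling of non-ASCII text and locales is more involved.

```python
import re


def trigrams(text):
    """Approximate the trigram set for a string (hypothetical helper).

    Assumption: words are runs of ASCII letters/digits, lowercased, and
    padded with two spaces in front and one space behind.
    """
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        # Slide a 3-character window across the padded word.
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """nt / (n1 + n2 - nt): shared trigrams over the union of trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    nt = len(ta & tb)
    union = len(ta) + len(tb) - nt
    return nt / union if union else 0.0


def dice(a, b):
    """2*nt / (n1 + n2): the Dice coefficient from the question, for comparison."""
    ta, tb = trigrams(a), trigrams(b)
    total = len(ta) + len(tb)
    return 2 * len(ta & tb) / total if total else 0.0


print(sorted(trigrams("sage")))
print(similarity("sage", "message"))
print(dice("sage", "message"))
```

For 'sage' and 'message' this gives the same counts as in the question (5 and 8 trigrams, 3 in common), so similarity() yields 3/10 and dice() yields 6/13.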