Can somebody explain exactly how the similarity function is calculated in the PostgreSQL pg_trgm module?
e.g. similarity('sage', 'message') = 0.3
1) "  s"," sa",age,"ge ",sag
2) "  m"," me",age,ess,"ge ",mes,sag,ssa
n1: cardinality(1) = 5
n2: cardinality(2) = 8
nt: cardinality(1 intersect 2) = 3
I can't see how to get a formula from these 3 quantities that equals 0.3. I would have expected it to be based on a common string similarity metric (e.g. Dice-Sorensen),
i.e. 2*nt / (n1 + n2) = 6/13 = 0.46
The pg_trgm similarity score seems unusually low to me.
The formula can be found in contrib/pg_trgm/trgm.h (see the macro CALCSML) and is as follows:
nt / (n1 + n2 - nt)
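In your case that is 3 / (5+8-3) = 0.3. This is the Jaccard index (intersection over union) rather than the Dice-Sorensen coefficient, and the Jaccard value is never higher than the Dice value for the same sets, which is why the score looks low.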
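To check the arithmetic outside the database, here is a minimal Python sketch of the same calculation. It is not the pg_trgm C code: the function names are made up for illustration, and it assumes a single lowercase alphanumeric word padded with two spaces in front and one behind, which matches the trigram sets listed in the question.

```python
def trigrams(word):
    # Assumption: one alphanumeric word; pad with two leading spaces and
    # one trailing space, then take every 3-character window as a set.
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trgm_similarity(a, b):
    # nt / (n1 + n2 - nt): shared trigrams over the size of the union.
    ta, tb = trigrams(a), trigrams(b)
    nt = len(ta & tb)
    union = len(ta) + len(tb) - nt
    return nt / union if union else 0.0


def dice_similarity(a, b):
    # 2*nt / (n1 + n2): the Dice-Sorensen value, shown for comparison only.
    ta, tb = trigrams(a), trigrams(b)
    total = len(ta) + len(tb)
    return 2 * len(ta & tb) / total if total else 0.0


if __name__ == "__main__":
    print(sorted(trigrams("sage")))             # 5 trigrams
    print(trgm_similarity("sage", "message"))   # 3 / 10
    print(dice_similarity("sage", "message"))   # 6 / 13
```

For 'sage' and 'message' this gives 3/10 for the pg_trgm-style formula and 6/13 for Dice-Sorensen, matching the numbers above. Strings containing several words or non-alphanumeric characters are split differently by pg_trgm, so the sketch should not be expected to agree with the database in those cases.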