How is the similarity calculated in Postgres pg_trgm module

Can somebody explain exactly how the similarity function is calculated in the Postgres pg_trgm module?

e.g. similarity('sage', 'message') = 0.3

1) "  s"," sa",age,"ge ",sag
2) "  m"," me",age,ess,"ge ",mes,sag,ssa

n1: cardinality(1) = 5
n2: cardinality(2) = 8
nt: cardinality(1 intersect 2) = 3

I can't see how to combine these three quantities into a formula that yields 0.3. I would have expected it to be based on a common string similarity metric (e.g. Sørensen-Dice),

i.e. 2*nt / (n1 + n2) = 6/13 ≈ 0.46

The pg_trgm similarity score seems unusually low to me.

Dander answered 19/2, 2018 at 19:3

The formula can be found in contrib/pg_trgm/trgm.h (see the macro CALCSML) and is as follows:

nt / (n1 + n2 - nt)

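In your case that is 3 / (5+8-3) = 0.3. This is the Jaccard index (size of the intersection divided by size of the union) rather than the Dice coefficient, which is why the score comes out lower than the 0.46 you expected.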
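
If you want to check the arithmetic outside the database, here is a minimal Python sketch that imitates the calculation. It is not the pg_trgm source: the function names `trigrams` and `similarity` are my own, and it assumes plain ASCII input, lowercasing, splitting on non-alphanumeric characters, and padding each word with two leading spaces and one trailing space, as described in the pg_trgm documentation.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, roughly the way pg_trgm builds it.

    Assumption: ASCII input only; each word is lowercased and padded with
    two spaces in front and one space behind before being sliced.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """nt / (n1 + n2 - nt), i.e. shared trigrams over distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    nt = len(ta & tb)
    union = len(ta) + len(tb) - nt
    return nt / union if union else 0.0


print(sorted(trigrams("sage")))                    # 5 trigrams
print(round(similarity("sage", "message"), 2))     # 0.3
```

For 'sage' and 'message' this gives the same n1 = 5, n2 = 8 and nt = 3 as in the question, so the result is 3/10 = 0.3.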

Sniffy answered 20/2, 2018 at 9:24