Function inconsistent with description
Related subject on pgsql-bugs mailing list.
The substring similarity algorithm described by the author compares trigram arrays of a query string and a text. The problem is that a trigram array is optimised (duplicated trigrams are eliminated) and loses information about individual words of the text.
The query illustrates the issue:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select
t as "text",
show_trgm(t) as "text trigrams",
show_trgm('sage') as "string trigrams",
cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
from data;
text | text trigrams | string trigrams | common trgms
------------------+-----------------------------------------------------------+-----------------------------+--------------
message | {" m"," me",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 3
message s | {" m"," s"," me"," s ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 4
message sag | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
message sag sag | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
message sag sage | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
(5 rows)
The trigram arrays in last three rows are the same and contain all trigrams of the query string.
Obviously, the implementation is not consistent with the description of the function (the description was changed in later releases of the documentation):
Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring.
My function used in the above query:
create or replace function public.array_intersect(anyarray, anyarray)
returns anyarray language sql immutable
as $$
select case
when $1 is null then $2
else
array(
select unnest($1)
intersect
select unnest($2)
)
end;
$$;
Workaround
You can easily write your own function to get more expected results:
create or replace function my_word_similarity(text, text)
returns real language sql immutable as $$
select max(similarity($1, word))
from regexp_split_to_table($2, '[^[:alnum:]]') word
$$;
Compare:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;
t | word_similarity | my_word_similarity
------------------+-----------------+--------------------
message | 0.6 | 0.3
message s | 0.8 | 0.3
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
New function in Postgres 11+
There is a new function in Postgres 11+ strict_word_similarity()
which gives results expected by the author of the question:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, word_similarity('sage', t), strict_word_similarity('sage', t)
from data;
t | word_similarity | strict_word_similarity
------------------+-----------------+------------------------
message | 0.6 | 0.3
message s | 0.8 | 0.36363637
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
sage
is a substring ofmessage test sag
if I'm not mistaken. – Hydrousword_similarity('sage', 'message sag sag')
return? – Kalakalaazar