Postgres word_similarity not comparing words
Asked Answered
M

1

9

"Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the first string is identical to one of the words of the second string)."

That's the definition of word_similarity(a,b), as I understand it, it will look for the WORD a inside the text b, splitting b by words and getting the score of the highest match word.

However, I'm seeing some inconsistencies where the word matching is not really by word, looks like all trigrams are scrambled and compared?

Example:

select word_similarity('sage', 'message sag')

Returns 1, clearly neither 'message' or 'sag' should match with 'sage', but if we combine the possible trigrams from 'message sag', we would then find that all the trigrams from 'sage' would match, but that's not really what should happen since the function description talks about word by word... Is it because both words are next to each other?

The following, will return a 0.6 score:

select word_similarity('sage', 'message test sag') 

Edit: Fiddle to play around http://sqlfiddle.com/#!17/b4bab/1

Mcgovern answered 27/10, 2017 at 1:54 Comment(4)
I don't know this function at all, but sage is a substring of message test sag if I'm not mistaken.Hydrous
Yes, but the word_similarity states it checks against WORDS. Also, word_similarity('sage', 'message') --> 0.6 word_similarity('sage', 'message s') --> 0.8 word_similarity('sage', 'message sag') --> 1.0 word_similarity('sage', 'message test sag') --> 0.6 looks odd and wrong, not comparing word trigrams at all.Mcgovern
What does word_similarity('sage', 'message sag sag') return?Kalakalaazar
Returns 1 as well. sqlfiddle.com/#!17/b4bab/1Mcgovern
U
13

Function inconsistent with description

Related subject on pgsql-bugs mailing list.

The substring similarity algorithm described by the author compares trigram arrays of a query string and a text. The problem is that a trigram array is optimised (duplicated trigrams are eliminated) and loses information about individual words of the text.

The query illustrates the issue:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select 
    t as "text", 
    show_trgm(t) as "text trigrams", 
    show_trgm('sage') as "string trigrams", 
    cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
from data;

       text       |                       text trigrams                       |       string trigrams       | common trgms 
------------------+-----------------------------------------------------------+-----------------------------+--------------
 message          | {"  m"," me",age,ess,"ge ",mes,sag,ssa}                   | {"  s"," sa",age,"ge ",sag} |            3
 message s        | {"  m","  s"," me"," s ",age,ess,"ge ",mes,sag,ssa}       | {"  s"," sa",age,"ge ",sag} |            4
 message sag      | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sag  | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sage | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
(5 rows)    

The trigram arrays in last three rows are the same and contain all trigrams of the query string.

Obviously, the implementation is not consistent with the description of the function (the description was changed in later releases of the documentation):

Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring.


My function used in the above query:

create or replace function public.array_intersect(anyarray, anyarray)
returns anyarray language sql immutable
as $$
    select case 
        when $1 is null then $2
        else
            array(
                select unnest($1)
                intersect
                select unnest($2)
            )
        end;
$$;

Workaround

You can easily write your own function to get more expected results:

create or replace function my_word_similarity(text, text)
returns real language sql immutable as $$
    select max(similarity($1, word))
    from regexp_split_to_table($2, '[^[:alnum:]]') word
$$;

Compare:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;

        t         | word_similarity | my_word_similarity
------------------+-----------------+--------------------
 message          |             0.6 |                0.3
 message s        |             0.8 |                0.3
 message sag      |               1 |                0.5
 message sag sag  |               1 |                0.5
 message sag sage |               1 |                  1
(5 rows)

New function in Postgres 11+

There is a new function in Postgres 11+ strict_word_similarity() which gives results expected by the author of the question:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), strict_word_similarity('sage', t)
from data;

        t         | word_similarity | strict_word_similarity
------------------+-----------------+------------------------
 message          |             0.6 |                    0.3
 message s        |             0.8 |             0.36363637
 message sag      |               1 |                    0.5
 message sag sag  |               1 |                    0.5
 message sag sage |               1 |                      1
(5 rows)
Unrighteous answered 27/10, 2017 at 12:45 Comment(9)
That's a good solution, I would expect the original function to work that way, I was hoping to find some explanation of it's behavior, even when comparing 1 word to another word_similarity returns a different value from similarity, where they should return the same value if both strings have 1 word... I'm also expecting performance to be worse when using a custom function instead of the native/extension function?Mcgovern
In my ad-hoc benchmark the custom function is up to 2 times slower than the original one. Is this a good price for correct results? Anyway, I think the issue may be reported as a bug.Unrighteous
Yes I suppose it's not that bad, I just refuse to think it's really a bug since the extension has been out for quite long, but sadly there's no useful info about the function behavior.Mcgovern
I've been using pg_trgm for a long time and it's a good product in my opinion. However, word_similarity() is a relatively new function (from Postgres 9.6). To be honest, I did not know about its existence until yesterday.Unrighteous
word_similarity is also strictly related to the "%>" and "<%" operators to perform the same kind of word/text search, while taking advantage of gin/gist indexes, and it probably suffers from the same issue which is harder to work around. Your work around is good, I hope you don't mind if I leave the question open for a while just in case someone knows what's wrong with the functionMcgovern
It's surely ok, I'm curious too.Unrighteous
Reported as a bug and seems to be a bug after all! postgresql-archive.org/…Mcgovern
if i have a gin index on some column can the new function my_word_similarity take advantage of that ? i guess not but i want to make sureLeatheroid
This is now fixed as of PostgreSQL 11 using the new function strict_word_similarity(). The fix is based on this SO post as submitted by @CristianoCoelhoMogerly

© 2022 - 2024 — McMap. All rights reserved.