PostgreSQL, trigrams and similarity
Asked Answered
P

3

21

Just testing PostgreSQL 9.6.2 on my Mac and playing with Ngrams. Assuming there is a GIN trigram index on winery field.

The limit for similarity (I know this is deprecated):

SELECT set_limit(0.5);

I'm building a trigram search on 2,3M row table.

My select code:

SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity 
FROM usr_wines 
WHERE status=1 AND winery % 'chateau chevla blanc'  
ORDER BY similarity DESC;

My results (329 ms on my mac):

Chateau ChevL Blanc 0,85
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc,  0,736842
Chateau Blanc   0,736842
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc (7)    0,666667
Chateau Cheval Blanc Cbo    0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64

Well, I don't understand how can "Chateau blanc" have a similarity > to "Chateau Cheval Blanc" in this case ? I understand that the 2 words are the exact same "chateau" and "blanc", but there is no other word "cheval".

Also why "Chateau ChevL Blanc" is first ? A letter "a" is missing !

Well, my goal is to match all possible duplicates when I give a winery name, even if it's mispelled. What did I miss ?

Physiotherapy answered 1/4, 2017 at 12:40 Comment(1)
With regards to "Chateau ChevL Blanc" out-scoring "Chateau Cheval Blanc", the pg_trgm definition of similarity as the size of the intersection with respect to that of the union is much harsher on mis-spellings than omissions or additions, as they both shrink the intersection and grow the union. In practice if you are implementing your own scoring method it is probably worth considering somthing like the size of the intersection with respect to the larger of the two sets.Pavonine
O
70

The concept of trigram similarity relies on having any sentence divided into "trigrams" (sequences of three consecutive letters), and treating the result as a SET (i.e.: the order doesn't matter, and you don't have repeated values). Before the sentence is considered, two blank spaces are added at the beginning, and one at the end, and single spaces are replaced by double ones.

Trigrams are a special case of N-grams.

The trigram set corresponding to "Chateau blanc" is found by finding all sequences of three letters that appear on it:

  chateau  blanc
---                 => '  c'
 ---                => ' ch'
  ---               => 'cha'
   ---              => 'hat'
    ---             => 'ate'
     ---            => 'tea'
      ---           => 'eau'
       ---          => 'au '
        ---         => 'u  '
         ---        => '  b'
          ---       => ' bl'
           ---      => 'bla'
            ---     => 'lan'
             ---    => 'anc'
              ---   => 'nc '

Sorting them, and taking out repetitions gets you:

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'

This can be computed by PostgreSQL by means of the function show_trgm:

SELECT show_trgm('Chateau blanc') AS A

A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

... which has 14 trigrams. (Check pg_trgm).

And the trigram set corresponding to "Chateau Cheval Blanc" is:

SELECT show_trgm('Chateau Cheval Blanc') AS B 

B = [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

... which has 19 trigrams

If you count how many trigrams have both sets in common, you find that they have the following ones:

A intersect B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

and the ones they have in total are:

A union B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

That is, both sentences have 14 trigrams in common, and 19 in total.
The similarity is computed as:

 similarity = 14 / 19

You can check it with:

SELECT 
    cast(14.0/19.0 as real) AS computed_result, 
    similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg

and you'll see that you get: 0.736842

... which explains how similarity is computed, and why you get the values you get.


NOTE: You can compute the intersection and union by means of:

SELECT 
   array_agg(t) AS in_common
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    INTERSECT 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
    ORDER BY t
) AS trigrams_in_common ;

SELECT 
   array_agg(t) AS in_total
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    UNION 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;

And this is a way to explore the similarity of different pair of sentences:

WITH p AS
(
    SELECT 
      'This is just a sentence I''ve invented'::text AS f1,
      'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM 
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM 
            (SELECT * FROM t1 UNION     SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
FROM
    p 
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;

You can check it at Rextester

Ogre answered 1/4, 2017 at 15:21 Comment(8)
It's a very nice and detailled explanation joanolo. Thanks ! So I will continue my tests to match duplicates.Physiotherapy
Does full text search with vectors can be my friend to match duplicates OR do I need to continue using trigrams ?Physiotherapy
Full text search can help you find word by word duplicates (not necessarily in the same order); but won't allow for misspellings.Ogre
Why is 'u ' removed from the trigrams?Ovate
Stephan, I really don't know.Ogre
I tried to find the corresponding part in the pg_trgm code but could not find it.Ovate
Your "Chateau Cheval Blanc" example is incorrect, it excludes the trigrams "al ", "eva" and "val" and erroneously includes "evl", "la " and "vla" Did you run the function with the typo "Chateau Chevla Blanc"?Tletski
Some answers on this platform are just so amazing. Thank you @OgreBouffant
S
6

The trigram algorithm should be the more accurate the less is the difference in length of compared strings. You can modify the algorithm to compensate the effect of length difference.

The following exemplary function reduces the similarity by 1% for the difference of 1 character in string lenghts. This means that it favors strings of the same (similar) length.

create or replace function corrected_similarity(str1 text, str2 text)
returns float4 language sql as $$
    select similarity(str1, str2)* (1- abs(length(str1)-length(str2))/100.0)::float4
$$;

select 
    winery, 
    similarity(winery, 'chateau chevla blanc') as similarity,
    corrected_similarity(winery, 'chateau chevla blanc') as corrected_similarity
from usr_wines 
where winery % 'chateau chevla blanc'  
order by corrected_similarity desc;

          winery          | similarity | corrected_similarity 
--------------------------+------------+----------------------
 Chateau ChevL Blanc      |       0.85 |               0.8415
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Blanc,           |   0.736842 |             0.692632
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Cheval Blanc (7) |   0.666667 |                 0.64
 Chateau Du Cheval Blanc  |       0.64 |               0.6208
 Chateau Du Cheval Blanc  |       0.64 |               0.6208
 Chateau Cheval Blanc Cbo |       0.64 |               0.6144
(14 rows)

In a similar way you can correct the standard similarity by, for example, how many initial characters are identical (thought the function will be a bit more complicated).

Sailesh answered 1/4, 2017 at 19:24 Comment(3)
Thanks for this suggestion !!! It's very well done. I tried too and it works. However it's currently 10 times slower than the basic search. I suppose set_limit() can't be used with a custom function.Physiotherapy
Naturally, the custom function is an additional cost so it seems that it can be used to correct earlier obtained results (in such a way that it is executed against a small dataset).Sailesh
I've played on occasion with a sllight variation of your custom function: I would change the correction factor, and not use a /100.0, but have (1.0 - abs(length(str1) - length(str2))::float4 / (length(str1) + length(str2))::float4)::float4Ogre
J
5

Sometimes you want the opposite of klin's answer. There are applications where a large difference in string lengths should not lead to such a significant score penalty.

For example, imagine an autocomplete results form with trigram-match suggestions that improve as you type.

Here's an alternative way to score matching that still uses trigrams but favors substring matches a bit more than usual.

The formula for similarity instead starts with

the number of common trigrams
-------------------------------------------
the number of trigrams in the shortest word    <-- key difference

and can go up from there based on a good standard similarity score.

CREATE OR REPLACE FUNCTION substring_similarity(string_a TEXT, string_b TEXT) RETURNS FLOAT4 AS $$
DECLARE
  a_trigrams TEXT[];
  b_trigrams TEXT[];
  a_tri_len INTEGER;
  b_tri_len INTEGER;
  common_trigrams TEXT[];
  max_common INTEGER;
BEGIN
  a_trigrams = SHOW_TRGM(string_a);
  b_trigrams = SHOW_TRGM(string_b);
  a_tri_len = ARRAY_LENGTH(a_trigrams, 1);
  b_tri_len = ARRAY_LENGTH(b_trigrams, 1);
  IF (NOT (a_tri_len > 0) OR NOT (b_tri_len > 0)) THEN
    IF (string_a = string_b) THEN
      RETURN 1;
    ELSE
      RETURN 0;
    END IF;
  END IF;
  common_trigrams := ARRAY(SELECT UNNEST(a_trigrams) INTERSECT SELECT UNNEST(b_trigrams));
  max_common = LEAST(a_tri_len, b_tri_len);
  RETURN COALESCE(ARRAY_LENGTH(common_trigrams, 1), 0)::FLOAT4 / max_common::FLOAT4;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION corrected_similarity(string_a TEXT, string_b TEXT) 
RETURNS FLOAT4 AS $$
DECLARE
  base_score FLOAT4;
BEGIN
  base_score := substring_similarity(string_a, string_b);
  -- a good standard similarity score can raise the base_score
  RETURN base_score + ((1.0 - base_score) * SIMILARITY(string_a, string_b));
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION is_minimally_substring_similar(string_a TEXT, string_b TEXT) RETURNS BOOLEAN AS $$
BEGIN
  RETURN corrected_similarity(string_a, string_b) >= 0.5;
END;
$$ LANGUAGE plpgsql;

CREATE OPERATOR %%% (
  leftarg = TEXT,
  rightarg = TEXT,
  procedure = is_minimally_substring_similar,
  commutator = %%%
);

Now you can use this in the same way as the standard similarity query:

SELECT * FROM table WHERE name %%% 'chateau'
ORDER BY corrected_similarity(name, 'chateau') DESC;

Performance

Performance is acceptable for a search space of 100K records but probably wouldn't be great for a search space of millions. For that, you may want to use a modified build of the pg_trgm module, code on github.

Jephthah answered 9/1, 2019 at 1:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.