Compare Names using Levenshtein distance
Asked Answered
L

0

7

In my application i need to identify a person by searching for their lastname and firstname. One requirement is to accept spelling errors to a certain degree.

My attempts to identify a person given a firstname and lastname were:

  1. sql query using soundex
  2. sql query using levenshtein-distance (LD) which was calculated with this LD-function

The screenshot contains some test records and the result of my sql-query which includes the soundex-value for each column and the LD

compare records using levenshtein distance

My current query looks like this

SELECT t2.*
        , t1.Firstname + ' ' + t1.Lastname as SourceName
        , 'Torsten Mueller' as TargetName
        , dbo.FUNC_LEVENSHTEIN(t1.Firstname +' '+ t1.Lastname
                            , 'Torsten Mueller', 8) as LEVENSHTEIN_Distance     
 FROM #TestSoundex t1
 LEFT JOIN #TestSoundex t2 ON t1.Id = t2.Id
 WHERE t1.Soundex_Firstname = SOUNDEX('Torsten')
       AND t1.Soundex_Lastname = SOUNDEX('Mueller')

As you can see i filter the result by soundex first and calculate the levenshtein-distance for the remaining records. In this sample below the levenshtein-distance ranges from 0 (both strings are equal) to 3.

SourceName TargetName Levenshtein Distance
Thorsten Müller Torsten Mueller 3
Torsten Müller Torsten Mueller 2
Thorsten Mueller Torsten Mueller 1
Torsten Mueller Torsten Mueller 0

In this talk from a stanford professor (video no longer public) the calculation for the distance is explained:

I N T E * N TION 
| | | | | | | 
* E X E C U TION
d s s   i s

Each deletion d, insertion i adds 1 point, a substitution s adds 2 points. The LD-function i use returns 5 points for the sample above, but only 3 instead of 4 for the distance between Thorsten Müller and Torsten Mueller. I

+1 point to delete h, 
+1 point instead of 2 to substitute ü 
+1 point to insert e

So i added some samples

Samples with Umlaut

Questions

My impression is that neither soundex nor LD are sufficient to uniquely identify a person-record given firstname and lastname and taking into account that there might be spelling mismatches.

  • Can you explain how this LD-Function handles Umlaute ü,ö,äso i can better understand the calculation?
  • What would you recommend as a max value for the distance to find the right match for firstname and lastname given the string s and t, should it be based on the length of both strings numberOrCharacters(s+t)/2 = max?

Source Code

This is the function i use from the linked answer. I only changed the name of the function from edit_distance_within to FUNC_LEVENSHTEIN

    SET QUOTED_IDENTIFIER ON 
    GO
    SET ANSI_NULLS ON 
    GO
    
    CREATE FUNCTION FUNC_LEVENSHTEIN(@s nvarchar(4000), @t nvarchar(4000), @d int)
    RETURNS int
    AS
    BEGIN
      DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
        @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
      SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
      WHILE @j <= @tl
        SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
      WHILE @i <= @sl
      BEGIN
        SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
        WHILE @j <= @tl
        BEGIN
          SET @c = @c + 1
          SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
          IF @c > @c1 SET @c = @c1
          SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
          IF @c > @c1 SET @c = @c1
          IF @c < @cmin SET @cmin = @c
          SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
        END
        IF @cmin > @d BREAK
        SELECT @cv1 = @cv0, @i = @i + 1
      END
      RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
    END
    GO

Source to test function above

This is another test

    CREATE TABLE #TestLevenshteinDistance(
        Id int  IDENTITY(1,1) NOT NULL,
        SourceName nvarchar(100) NULL,  
        Soundex_SourceName varchar(4) NULL,    
        Targetname nvarchar(100) NULL, 
        Soundex_TargetName varchar(4) NULL, 
        );      
       
    INSERT INTO #TestLevenshteinDistance 
        (    SourceName,          
             Soundex_SourceName,
             Targetname,
             Soundex_TargetName) 
    VALUES 
       ('Intention',SOUNDEX('Intention'), 'Execution', SOUNDEX('Execution')),    
       ('Karsten' , SOUNDEX('Karsten'), 'Torsten', SOUNDEX('Torsten')); 
             
    
    SELECT t1.*
            , dbo.FUNC_LEVENSHTEIN(t1.SourceName, t1.Targetname, 8) as LEVENSHTEIN_Distance
            FROM #TestLevenshteinDistance t1

Update 2024

Based on the comment i tested DIFFERENCE in this sql online demo

Demo compare string with difference and Levenshtein Distance

The Difference function seems to be less strict than LD. The Difference value for two very

  • similar Soundex strings is 4.
  • different Soundex strings is 0.

I added 'Foo' and 'Bar' and did not expect that the difference is just 2. I assumed a difference of 1 (less similar).

Luxor answered 24/4, 2018 at 21:2 Comment(3)
Is Full-Text Index an option? My experience of trying to use either SOUNDEX or the Levenshtein distance was ... not great. It worked in some cases, and not in others. Full-Text Index was far more reliable.Soil
I have no experience with Full-Text search. It seems difference is similar to LDLuxor
See my Update 2024 aboveLuxor

© 2022 - 2024 — McMap. All rights reserved.