In my application i need to identify a person by searching for their lastname and firstname. One requirement is to accept spelling errors to a certain degree.
My attempts to identify a person given a firstname and lastname were:
- sql query using soundex
- sql query using levenshtein-distance (LD) which was calculated with this LD-function
The screenshot contains some test records and the result of my sql-query which includes the soundex-value for each column and the LD
My current query looks like this
SELECT t2.*
, t1.Firstname + ' ' + t1.Lastname as SourceName
, 'Torsten Mueller' as TargetName
, dbo.FUNC_LEVENSHTEIN(t1.Firstname +' '+ t1.Lastname
, 'Torsten Mueller', 8) as LEVENSHTEIN_Distance
FROM #TestSoundex t1
LEFT JOIN #TestSoundex t2 ON t1.Id = t2.Id
WHERE t1.Soundex_Firstname = SOUNDEX('Torsten')
AND t1.Soundex_Lastname = SOUNDEX('Mueller')
As you can see i filter the result by soundex first and calculate the levenshtein-distance for the remaining records. In this sample below the levenshtein-distance ranges from 0 (both strings are equal) to 3.
SourceName | TargetName | Levenshtein Distance |
---|---|---|
Thorsten Müller | Torsten Mueller | 3 |
Torsten Müller | Torsten Mueller | 2 |
Thorsten Mueller | Torsten Mueller | 1 |
Torsten Mueller | Torsten Mueller | 0 |
In this talk from a stanford professor (video no longer public) the calculation for the distance is explained:
I N T E * N TION
| | | | | | |
* E X E C U TION
d s s i s
Each deletion d
, insertion i
adds 1 point, a substitution s
adds 2 points.
The LD-function
i use returns 5 points for the sample above, but only 3 instead of 4 for the distance between Thorsten Müller
and Torsten Mueller
.
I
+1 point to delete h,
+1 point instead of 2 to substitute ü
+1 point to insert e
So i added some samples
Questions
My impression is that neither soundex nor LD are sufficient to uniquely identify a person-record given firstname
and lastname
and taking into account that there might be spelling mismatches.
- Can you explain how this LD-Function handles Umlaute
ü,ö,ä
so i can better understand the calculation? - What would you recommend as a max value for the
distance
to find the right match for firstname and lastname given the strings
andt
, should it be based on the length of both stringsnumberOrCharacters(s+t)/2 = max
?
Source Code
This is the function i use from the linked answer. I only changed the name of the function from edit_distance_within
to FUNC_LEVENSHTEIN
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
CREATE FUNCTION FUNC_LEVENSHTEIN(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
@cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
WHILE @j <= @tl
SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
WHILE @i <= @sl
BEGIN
SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
WHILE @j <= @tl
BEGIN
SET @c = @c + 1
SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
IF @c > @c1 SET @c = @c1
SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
IF @c > @c1 SET @c = @c1
IF @c < @cmin SET @cmin = @c
SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
END
IF @cmin > @d BREAK
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO
Source to test function above
This is another test
CREATE TABLE #TestLevenshteinDistance(
Id int IDENTITY(1,1) NOT NULL,
SourceName nvarchar(100) NULL,
Soundex_SourceName varchar(4) NULL,
Targetname nvarchar(100) NULL,
Soundex_TargetName varchar(4) NULL,
);
INSERT INTO #TestLevenshteinDistance
( SourceName,
Soundex_SourceName,
Targetname,
Soundex_TargetName)
VALUES
('Intention',SOUNDEX('Intention'), 'Execution', SOUNDEX('Execution')),
('Karsten' , SOUNDEX('Karsten'), 'Torsten', SOUNDEX('Torsten'));
SELECT t1.*
, dbo.FUNC_LEVENSHTEIN(t1.SourceName, t1.Targetname, 8) as LEVENSHTEIN_Distance
FROM #TestLevenshteinDistance t1
Update 2024
Based on the comment i tested DIFFERENCE
in this sql online demo
The Difference function seems to be less strict than LD. The Difference value for two very
- similar Soundex strings is 4.
- different Soundex strings is 0.
I added 'Foo' and 'Bar' and did not expect that the difference is just 2. I assumed a difference of 1 (less similar).