Most efficient way to calculate Levenshtein distance

I just implemented a best-match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found that the overwhelming majority of the time is spent calculating the distance between the query and the possible results. I currently calculate the Levenshtein distance using a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same.

Here's my implementation:

public int calculate(String root, String query)
{
  // Rows/columns 0 hold the characters themselves and rows/columns 1 hold the
  // base-case distances, so the DP proper starts at index 2.
  int[][] arr = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = root.charAt(i - 2);
    arr[i][1] = i - 1;
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = query.charAt(i - 2);
    arr[1][i] = i - 1;
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      // Cheapest of deletion, insertion and substitution.
      arr[i][j] = min(arr[i - 1][j] + 1, arr[i][j - 1] + 1, arr[i - 1][j - 1] + diff);
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return Math.min(n1, Math.min(n2, n3));
}
Homogenous answered 6/7, 2010 at 2:27 Comment(3)
Yes, the other things are efficient enough. I profiled my code and found that the bottleneck was calculating the Levenshtein distance, which is why I'm trying to optimize that bit right now. I'm implementing the improvement mentioned in the Wikipedia article and I'll follow it up with an implementation of the VP-tree to see which one is more efficient.Homogenous
about "using a 2-D array, which makes the implementation an O(n^2) operation": calculating a Levenshtein distance between two sequences with no constraints is already an O(n^2) operation irrespective of how much memory you use -- using a 2-D array just slows you down and wastes memory; only O(n) memory is required.Tinstone
@John Machin I know this is ancient but could you provide an example or some link to how that O(n) space solution would be implemented?Contravallation
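
For illustration, the O(n)-memory idea is to keep only the previous and current rows of the matrix. A minimal Java sketch (the method name is invented; this is not code from anyone in the thread):

public static int levenshteinTwoRow(String a, String b)
{
  int[] prev = new int[b.length() + 1]; // row i - 1 of the full matrix
  int[] curr = new int[b.length() + 1]; // row i of the full matrix

  for (int j = 0; j <= b.length(); j++)
  {
    prev[j] = j; // distance from the empty prefix of a to b[0..j)
  }
  for (int i = 1; i <= a.length(); i++)
  {
    curr[0] = i; // distance from a[0..i) to the empty prefix of b
    for (int j = 1; j <= b.length(); j++)
    {
      int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
      curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
    }
    int[] tmp = prev; prev = curr; curr = tmp; // reuse the two arrays
  }
  return prev[b.length()];
}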

The Wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation -- the most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n*k) instead of O(n^2) (basically by giving up as soon as the minimum possible distance becomes > k).

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst-case behavior (the matches might come in decreasing order of distance, so you never get to bail out any sooner), but the average case should improve.
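
A sketch of how that could look in Java (the method names levenshteinBounded and closestMatch are invented for illustration; this is not code from the question): the bounded version tracks the minimum of the current DP row and gives up as soon as it exceeds k, and the search loop passes the best distance found so far as the bound.

// Returns the exact distance if it is <= k, and some value > k otherwise.
public static int levenshteinBounded(String a, String b, int k)
{
  if (Math.abs(a.length() - b.length()) > k) return k + 1; // the length difference alone exceeds k

  int[] prev = new int[b.length() + 1];
  int[] curr = new int[b.length() + 1];
  for (int j = 0; j <= b.length(); j++) prev[j] = j;

  for (int i = 1; i <= a.length(); i++)
  {
    curr[0] = i;
    int rowMin = curr[0];
    for (int j = 1; j <= b.length(); j++)
    {
      int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
      curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
      rowMin = Math.min(rowMin, curr[j]);
    }
    if (rowMin > k) return k + 1; // no way to finish this row below the bound: give up
    int[] tmp = prev; prev = curr; curr = tmp;
  }
  return prev[b.length()];
}

public static String closestMatch(Iterable<String> dictionary, String query)
{
  String best = null;
  int bestDist = Integer.MAX_VALUE - 1; // -1 so that bestDist + 1 cannot overflow in the helper
  for (String candidate : dictionary)
  {
    int d = levenshteinBounded(query, candidate, bestDist);
    if (d < bestDist) { bestDist = d; best = candidate; } // progressively shrink the bound
  }
  return best;
}

The bail-out is valid because the minimum of a row never decreases from one row to the next, so once it exceeds k the final distance must exceed k as well.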

I believe that, if you need substantially better performance, you may have to accept a stronger compromise and compute an approximate distance (settling for "a reasonably good match" rather than necessarily the optimal one).

Classy answered 6/7, 2010 at 2:46 Comment(12)
I just found the same and was about to post it here. Thanks!Homogenous
You can always build a small set of candidate matches using whatever "approximate distance" metric you come up with, then use the real Levenshtein distance to rank those.Apostolic
I read up on the optimization that the above post talks of, but I can't honestly say I get it. I'm not sure how to implement it either. Can someone offer some help?Homogenous
@Dharmesh, not in comments -- way too cramped -- but a separate question on finding closest neighbor (the main optimization you can perform being to not fully compute the distance in most cases -- and that optimization is only possible because you want the nearest neighbor, not for all distance computation tasks!) ideally with some samples of typical/interesting roots and sets of queries, should elicit lots of help (be sure to tag it as "algorithm" too!).Classy
@Alex: that optimisation is also possible if the OP wanted to find all "likely" matches, "likely" being defined as e.g. distance(query,candidate)/max(len(query),len(candidate)) < some_threshold_fractionTinstone
@Alex: I realized that the trick is to not compute the entire distance. I just don't fully understand what part of the entire distance I should compute and that is what I wanted help with. I'll make the new question with the name you suggested. If you could offer some more help there, I'd really appreciate it.Homogenous
@John: I don't think I need all the likely matches. I just need the distance of the closest match.Homogenous
I posted the new question under the name "Finding closest neighbour using optimized Levenshtein Algorithm". Here's the link: #3195769Homogenous
@Dharmesh, I see you're getting excellent answers on the new question so hopefully your doubts about what to compute and how are now being quelled without the need for further input from me.Classy
@Alex: Yeah, I got a lot of good help there. Some more wouldn't hurt though, if you have the time.Homogenous
Thanks to your idea I just contributed to apache-commons-lang, how weird that they didn't do this already? github.com/apache/commons-lang/pull/118Coons
Hello :), I have posted exactly this question, about how to put a bound on the Levenshtein distance, here (#59687489) and nobody has answered yet. Do you have any code which does this?Algiers

According to a comment on this blog, Speeding Up Levenshtein, you can use VP-trees and achieve O(n log n). Another comment on the same blog points to a Python implementation of VP-trees and Levenshtein. Please let us know if this works.
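
To give a rough idea of the approach (a sketch only; the class and method names are invented, and the Python code referenced above is the real example): a VP-tree picks a vantage point, splits the remaining words by their distance to it, and uses the triangle inequality to skip whole subtrees while answering a nearest-neighbour query.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A minimal VP-tree over a word list, with Levenshtein distance as the metric.
public class VpTreeSketch
{
  private static class Node
  {
    String point;          // vantage point stored at this node
    int radius;            // median distance from point to the words below it
    Node inside, outside;  // words within radius / beyond radius
  }

  private final Node root;
  private String best;
  private int bestDist;

  public VpTreeSketch(List<String> words)
  {
    root = build(new ArrayList<>(words));
  }

  private Node build(List<String> words)
  {
    if (words.isEmpty()) return null;
    Node node = new Node();
    node.point = words.remove(words.size() - 1); // arbitrary vantage point
    if (!words.isEmpty())
    {
      words.sort(Comparator.comparingInt((String w) -> distance(node.point, w)));
      int mid = words.size() / 2;
      node.radius = distance(node.point, words.get(mid));
      node.inside = build(new ArrayList<>(words.subList(0, mid + 1)));
      node.outside = build(new ArrayList<>(words.subList(mid + 1, words.size())));
    }
    return node;
  }

  // Closest word to the query by Levenshtein distance.
  public String nearest(String query)
  {
    best = null;
    bestDist = Integer.MAX_VALUE;
    search(root, query);
    return best;
  }

  private void search(Node node, String query)
  {
    if (node == null) return;
    int d = distance(query, node.point);
    if (d < bestDist) { bestDist = d; best = node.point; }
    if (d < node.radius)
    {
      search(node.inside, query);                                    // more promising half first
      if (d + bestDist >= node.radius) search(node.outside, query);  // triangle-inequality pruning
    }
    else
    {
      search(node.outside, query);
      if (d - bestDist <= node.radius) search(node.inside, query);
    }
  }

  // Plain two-row Levenshtein distance.
  private static int distance(String a, String b)
  {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++)
    {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++)
      {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
  }
}

Note that, as the comments below point out, the n in O(n log n) is the number of words in the dictionary; each individual distance computation is still quadratic in the word lengths.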

Monorail answered 6/7, 2010 at 2:53 Comment(2)
I realize this is an old thread, but you got me confused for a minute. You're not talking about the same n! In the OP, n is the word length and we're interested in the time to compute the distance between 2 words, while for you n is the number of words in the dictionary, and n log n is the number of times you call the distance function. If you combine both it's W*k*D*log D, with D the dictionary size, W the word size and k the distance threshold.Eli
As of 2023-01-01, the link "Speeding Up Levenshtein" is broken, and I don't see a good fix for it. The Python code linked to shows an example of applying VP-Trees to the spell-check problem. It uses Levenshtein distance as the distance function applied to the nodes in the VP-Tree. The Levenshtein distance is still O(m*n), where m and n are the sizes of the words being compared.Kneeland

I modified the Levenshtein distance VBA function found on this post to use a one-dimensional array. It performs much faster.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost

        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance2 = D(LD)
End Function

I have tested this function with string 's1' of length 11,304 and 's2' of length 5,665 (> 64 million character comparisons). With the above single-dimension version of the function, the execution time is ~24 seconds on my machine. The original two-dimensional function that I referenced in the link above requires ~37 seconds for the same strings. I have optimized the single-dimension function further, as shown below, and it requires ~10 seconds for the same strings.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...

ReDim sss1(1 To L1)                                         'Size character array S1
ReDim sss2(1 To L2)                                         'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost
        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance = D(LD)
End Function
Beauchamp answered 9/4, 2015 at 16:55 Comment(0)

The Wikipedia article discusses your algorithm, and various improvements. However, it appears that at least in the general case, O(n^2) is the best you can get.

There are, however, some improvements if you can restrict your problem: e.g. if you are only interested in the distance when it is smaller than d, the complexity drops to O(dn). This might make sense, as a match whose distance is close to the string length is probably not very interesting. See if you can exploit the specifics of your problem...

Tigon answered 6/7, 2010 at 2:47 Comment(0)

Commons-lang has a pretty fast implementation. See http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm.

Here's my translation of that into Scala:

// The code below is based on code from the Apache Commons lang project.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to You under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the
 * License. You may obtain a copy of the License at
 * 
 * http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
  // Inside impl n <= m!
  val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
  val d = new Array[Int](n + 1) // cost array, horizontally

  @tailrec def fillP(i: Int) {
    p(i) = i
    if (i < n) fillP(i + 1)
  }
  fillP(0)

  @tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
    d(0) = j
    @tailrec def eachI(i: Int) {
      val a = d(i - 1) + 1
      val b = p(i) + 1
      d(i) = if (a < b) a else {
        val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
        if (b < c) b else c
      }
      if (i < n)
        eachI(i + 1)
    }
    eachI(1)

    if (j < m)
      eachJ(j + 1, t.charAt(j), p, d)
    else
      d(n)
  }
  eachJ(1, t.charAt(0), d, p)
}

val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
  if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}

}

Merras answered 24/8, 2011 at 6:57 Comment(0)

I know this is very late but it is relevant to the discussion at hand.

As mentioned by others, if all you want to do is check whether the edit distance between two strings is within some threshold k, you can reduce the time complexity to O(kn); a more precise expression would be O((2k+1)n). You take a strip that spans k cells on either side of the diagonal (so the strip is 2k+1 cells wide) and compute only the values of the cells lying on this strip.
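
A Java sketch of that strip (illustrative only; the method name is invented): only the cells with |i - j| <= k are computed, and everything outside the strip is treated as larger than k.

// Levenshtein distance restricted to a strip of width 2k+1 around the diagonal.
// Returns the exact distance if it is <= k, and k + 1 otherwise.
public static int levenshteinBanded(String a, String b, int k)
{
  if (Math.abs(a.length() - b.length()) > k) return k + 1;

  int inf = k + 1; // stands in for "infinity": anything above k is equally uninteresting
  int[] prev = new int[b.length() + 1];
  int[] curr = new int[b.length() + 1];
  for (int j = 0; j <= b.length(); j++) prev[j] = (j <= k) ? j : inf;

  for (int i = 1; i <= a.length(); i++)
  {
    java.util.Arrays.fill(curr, inf);
    if (i <= k) curr[0] = i;
    int lo = Math.max(1, i - k);
    int hi = Math.min(b.length(), i + k);
    for (int j = lo; j <= hi; j++) // only the (at most) 2k+1 strip cells in this row
    {
      int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
      int best = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
      curr[j] = Math.min(best, inf); // never let values grow past "infinity"
    }
    int[] tmp = prev; prev = curr; curr = tmp;
  }
  return Math.min(prev[b.length()], k + 1);
}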

Interestingly, there's been an improvement by Li et al., and this has been further reduced to O((k+1)n).

Peckham answered 30/12, 2012 at 19:54 Comment(3)
Please explain how this happens.Penley
You can look up the paper. There's a link in my answer.Peckham
That paper represents the cutting edge of 2012. It is now 2016. Are there any practical reference implementations of this work, or has new research superseded it?Psychometry
