Similarity String Comparison in Java
U

12

148

I want to compare several strings to each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to other strings? For example:

  • "The quick fox jumped" -> "The fox jumped"
  • "The quick fox jumped" -> "The fox"

This comparison should report that the first pair is more similar than the second.

I guess I need some method such as:

double similarityIndex(String s1, String s2)

Is there such a thing somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file to the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, the descriptions are abbreviated when the values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries in the system, so I can get the generated keys. It has drawbacks, since the results still have to be checked manually, but it would save a lot of work.

Ultun answered 5/6, 2009 at 9:54 Comment(0)
G
91

Yes, there are many well-documented algorithms, such as:

  • Cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity
  • etc etc

A good summary ("Sam's String Metrics") can be found here (original link dead, so it links to Internet Archive)
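To give a feel for one of the set-based measures listed above, here is a minimal sketch of Dice's coefficient computed over character bigrams. The class name BigramDice and its method names are made up for illustration, and lower-casing the input and using bigrams (rather than words or trigrams) are assumptions, not something the answer prescribes.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: Dice's coefficient over character bigrams (hypothetical helper class). */
final class BigramDice {

  /** Counts each two-character substring of an already lower-cased string. */
  static Map<String, Integer> bigrams(String lower) {
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i + 1 < lower.length(); i++) {
      counts.merge(lower.substring(i, i + 2), 1, Integer::sum);
    }
    return counts;
  }

  /** Returns 2 * shared / (bigrams in a + bigrams in b), a value in [0, 1]. */
  static double similarity(String a, String b) {
    String la = a.toLowerCase();
    String lb = b.toLowerCase();
    int total = Math.max(la.length() - 1, 0) + Math.max(lb.length() - 1, 0);
    if (total == 0) {
      // Neither string has a bigram; treating equal strings as identical is an assumption.
      return la.equals(lb) ? 1.0 : 0.0;
    }
    Map<String, Integer> left = bigrams(la);
    Map<String, Integer> right = bigrams(lb);
    int shared = 0;
    for (Map.Entry<String, Integer> e : left.entrySet()) {
      // A bigram that repeats counts as shared only as often as it occurs in both strings.
      shared += Math.min(e.getValue(), right.getOrDefault(e.getKey(), 0));
    }
    return 2.0 * shared / total;
  }

  public static void main(String[] args) {
    System.out.println(similarity("The quick fox jumped", "The fox jumped"));
    System.out.println(similarity("The quick fox jumped", "The fox"));
  }
}
```

With the strings from the question, the first pair scores higher than the second, which is the ordering the question asks for. Unlike an edit distance, this measure ignores where in the string the shared bigrams occur.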

Also check these projects:

Glyceric answered 5/6, 2009 at 9:59 Comment(5)
+1 The simmetrics site doesn't seem active anymore. However, I found the code on sourceforge: sourceforge.net/projects/simmetrics Thanks for the pointer.Epiphany
The "you can check this" link is broken.Teetotal
That's why Michael Merchant posted the correct link above.Politic
The jar for simmetrics on sourceforge is a bit outdated, github.com/mpkorstanje/simmetrics is the updated github page with maven artifactsDihydrostreptomycin
To add to @MichaelMerchant 's comment, the project is also available on github. Not very active there either though but a bit more recent than sourceforge.Gid
C
195

The common way of calculating the similarity between two strings in a 0%-100% fashion, as used in many libraries, is to measure how much (in %) you'd have to change the longer string to turn it into the shorter:

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below


Computing the editDistance():

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several ways to implement this step, and each may suit a specific scenario better. The most common is the Levenshtein distance algorithm, which we'll use in the example below (for very large strings, other algorithms are likely to perform better).

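Here are two options to calculate the edit distance: use the LevenshteinDistance class from Apache Commons Text, or implement the algorithm yourself. The working example contains both.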


Working example:

See online demo here.

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
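
For the use case in the question (finding which entry in one list is closest to an abbreviated entry in another), the same score can drive a simple best-match search. The following is only a sketch under a few assumptions: the class name BestMatch and the method mostSimilar are hypothetical, the comparison is case-insensitive as in the example above, and every candidate is compared in turn, which is fine for small lists but may be too slow for large ones.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: pick the candidate with the highest Levenshtein-based similarity. */
final class BestMatch {

  /** Levenshtein distance using two rows of the dynamic-programming table. */
  static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // turning "" into the first j characters of b takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev; // the row just filled becomes the previous row
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /** Same formula as above: 1 - distance / length of the longer string. */
  static double similarity(String s1, String s2) {
    String a = s1.toLowerCase();
    String b = s2.toLowerCase();
    int longerLength = Math.max(a.length(), b.length());
    if (longerLength == 0) {
      return 1.0; // both strings are empty
    }
    return (longerLength - editDistance(a, b)) / (double) longerLength;
  }

  /** Returns the candidate most similar to query, or null if the list is empty. */
  static String mostSimilar(String query, List<String> candidates) {
    String best = null;
    double bestScore = -1.0;
    for (String candidate : candidates) {
      double score = similarity(query, candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("The fox", "The fox jumped", "A lazy dog");
    System.out.println(mostSimilar("The quick fox jumped", candidates));
  }
}
```

In practice you would probably also keep the score and reject matches below some threshold, since the best candidate can still be a poor one.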
Cephalopod answered 15/4, 2013 at 14:58 Comment(2)
Levenshtein distance method is available in org.apache.commons.lang3.StringUtils.Grassgreen
@Grassgreen Now it is part of commons-text: commons.apache.org/proper/commons-text/javadocs/api-release/org/…Guillema
R
16

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};
Ricard answered 1/11, 2010 at 15:33 Comment(0)
I
14

There are indeed a lot of string similarity measures out there:

  • Levenshtein edit distance;
  • Damerau-Levenshtein distance;
  • Jaro-Winkler similarity;
  • Longest Common Subsequence edit distance;
  • Q-Gram (Ukkonen);
  • n-Gram distance (Kondrak);
  • Jaccard index;
  • Sorensen-Dice coefficient;
  • Cosine similarity;
  • ...

You can find explanation and java implementation of these here: https://github.com/tdebatty/java-string-similarity
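
As an illustration of one entry in the list, here is a minimal plain-Java sketch of the longest common subsequence (LCS) distance, which allows only insertions and deletions. It does not use the linked library; the class name LcsSimilarity and the normalisation into a 0 to 1 score are my own assumptions.

```java
/** Sketch: LCS-based distance and similarity (hypothetical helper class). */
final class LcsSimilarity {

  /** Length of the longest common subsequence, via the classic DP table. */
  static int lcsLength(String a, String b) {
    int[][] table = new int[a.length() + 1][b.length() + 1];
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        if (a.charAt(i - 1) == b.charAt(j - 1)) {
          table[i][j] = table[i - 1][j - 1] + 1;
        } else {
          table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
        }
      }
    }
    return table[a.length()][b.length()];
  }

  /** Number of single-character insertions and deletions needed to turn a into b. */
  static int distance(String a, String b) {
    return a.length() + b.length() - 2 * lcsLength(a, b);
  }

  /** Normalised score in [0, 1]; 1.0 means the strings are equal. */
  static double similarity(String a, String b) {
    int total = a.length() + b.length();
    if (total == 0) {
      return 1.0; // both strings are empty
    }
    return 2.0 * lcsLength(a, b) / total;
  }

  public static void main(String[] args) {
    System.out.println(distance("kitten", "sitting"));
    System.out.println(similarity("The quick fox jumped", "The fox jumped"));
  }
}
```

Because substitutions are not allowed, this distance is never smaller than the Levenshtein distance for the same pair of strings.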

Iodate answered 7/8, 2015 at 11:26 Comment(0)
F
14

You can achieve this using the Apache Commons Text library. Take a look at these two classes within it: LevenshteinDistance and FuzzyScore.


Deprecated version of the above:

Apache Commons Lang -> StringUtils.getLevenshteinDistance and StringUtils.getFuzzyDistance

Frontage answered 10/4, 2017 at 21:17 Comment(1)
As of October 2017, the linked methods are deprecated. Use the classes LevenshteinDistance and FuzzyScore from the Commons Text library instead.Newsreel
L
11

You could use Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance

Libertylibia answered 5/6, 2009 at 9:58 Comment(3)
Levenshtein is great for a few strings, but will not scale to comparisons between a large number of strings.Paris
I've used Levenshtein in Java with some success. I haven't done comparisons over huge lists, so there may be a performance hit. Also, it's a bit simple and could use some tweaking to raise the threshold for shorter words (like 3 or 4 chars), which tend to be seen as more similar than they should be (it's only 3 edits from cat to dog). Note that the edit distances suggested below are pretty much the same thing: Levenshtein is one particular edit distance.Knp
Here's an article showing how to combine Levenshtein with an efficient SQL query: literatejava.com/sql/fuzzy-string-search-sqlAbstract
H
5

Thanks to the first answerer. I think that code called computeEditDistance(s1, s2) twice. Because that call is expensive, I changed the code so the distance is computed only once:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}
Henebry answered 18/10, 2014 at 13:9 Comment(0)
L
3

Theoretically, you can compare edit distances.

Lagniappe answered 5/6, 2009 at 9:59 Comment(0)
O
3

This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.

Oreste answered 5/6, 2009 at 10:0 Comment(0)
F
3

Sounds like a plagiarism finder to me if your string turns into a document. Maybe searching with that term will turn up something good.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it's clean and easy to port.

Farthingale answered 5/6, 2009 at 10:1 Comment(0)
B
0

You can use this "Levenshtein Distance" algorithm without any library:

public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    int n = s.length();
    int m = t.length();

    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }

    if (n > m) {
        // swap the input strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    final int[] p = new int[n + 1];
    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t
    int upper_left;
    int upper;

    char t_j; // jth character of t
    int cost;

    for (i = 0; i <= n; i++) {
        p[i] = i;
    }

    for (j = 1; j <= m; j++) {
        upper_left = p[0];
        t_j = t.charAt(j - 1);
        p[0] = j;

        for (i = 1; i <= n; i++) {
            upper = p[i];
            cost = s.charAt(i - 1) == t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            p[i] = Math.min(Math.min(p[i - 1] + 1, p[i] + 1), upper_left + cost);
            upper_left = upper;
        }
    }

    return p[n];
}

From Here

Bonds answered 14/8, 2022 at 17:21 Comment(0)
S
-1

You can also use the Z algorithm to find similarity within a string. See https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/
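
For reference, a minimal sketch of the Z algorithm follows. Note that it answers a different question from the one asked: for each position it gives the length of the longest prefix of the string that also starts at that position, so it compares a string with its own suffixes rather than two separate strings. The class name ZAlgorithm and the summing helper are hypothetical, and I am assuming the linked challenge asks for the sum of these values.

```java
/** Sketch: Z array of a string and the sum of its entries (hypothetical helper class). */
final class ZAlgorithm {

  /** z[i] is the length of the longest common prefix of s and s.substring(i). */
  static int[] zArray(String s) {
    int n = s.length();
    int[] z = new int[n];
    if (n == 0) {
      return z;
    }
    z[0] = n; // the whole string matches itself
    int left = 0;
    int right = 0; // [left, right) is the rightmost segment known to match a prefix
    for (int i = 1; i < n; i++) {
      if (i < right) {
        // Reuse the value computed for the mirrored position inside the segment.
        z[i] = Math.min(right - i, z[i - left]);
      }
      while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
        z[i]++; // extend the match character by character
      }
      if (i + z[i] > right) {
        left = i;
        right = i + z[i];
      }
    }
    return z;
  }

  /** Sum of the Z values, i.e. the total prefix overlap between s and all its suffixes. */
  static long suffixSimilaritySum(String s) {
    long sum = 0;
    for (int value : zArray(s)) {
      sum += value;
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(suffixSimilaritySum("ababaa"));
  }
}
```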

Shantae answered 10/5, 2020 at 9:11 Comment(0)
