how to implement near matches of strings in java?
Asked Answered
H

3

6

Hello fellow programmers,

I would like to ask for some help with regards to near matches of strings.

Currently, I have a program that stores strings of description, users can search for description by typing it completely or partially.

I would like to implement a near match search. For example the actual description is "hello world" but user erroneously enter a search "hello eorld". The programs should be able to return "hello world" to user.

I've tried looking at pattern and matches to implement it, but it requires a regex to match strings, whereby my description does not have a regular pattern. I've also tried string.contains, but it doesn't seems to work either. Below is part of the code i tried to implement.

    ArrayList <String> list = new ArrayList<String>();
    list.add("hello world");
    list.add("go jogging at london");
    list.add("go fly kite");
    Scanner scan = new Scanner(System.in);

    for(int i = 0; i < list.size(); i++){
      if(list.get(i).contains(scan.next())) {
         System.out.println(list.get(i));
      }
    }

Could fellow programmers help me with this??

Haig answered 2/11, 2012 at 14:16 Comment(0)
A
2

You can use LCS(Longest Common Subsequence) see these: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

public class LCS {

    public static void main(String[] args) {
        String x = StdIn.readString();
        String y = StdIn.readString();
        int M = x.length();
        int N = y.length();

        // opt[i][j] = length of LCS of x[i..M] and y[j..N]
        int[][] opt = new int[M+1][N+1];

        // compute length of LCS and all subproblems via dynamic programming
        for (int i = M-1; i >= 0; i--) {
            for (int j = N-1; j >= 0; j--) {
                if (x.charAt(i) == y.charAt(j))
                    opt[i][j] = opt[i+1][j+1] + 1;
                else 
                    opt[i][j] = Math.max(opt[i+1][j], opt[i][j+1]);
            }
        }

        // recover LCS itself and print it to standard output
        int i = 0, j = 0;
        while(i < M && j < N) {
            if (x.charAt(i) == y.charAt(j)) {
                System.out.print(x.charAt(i));
                i++;
                j++;
            }
            else if (opt[i+1][j] >= opt[i][j+1]) i++;
            else                                 j++;
        }
        System.out.println();

    }

}

Other solution is Aho–Corasick string matching algorithm see this : Fast algorithm for searching for substrings in a string

Albertype answered 2/11, 2012 at 14:20 Comment(1)
Although I have no idea how this method going to work, i will go and look at it and figure my way to implement it. Thanks SjB :DHaig
A
3

The Levenshtein distance is able to qualify the difference between two strings

Here is an implementation taken form here:

public class LevenshteinDistance {
   private static int minimum(int a, int b, int c) {
      return Math.min(Math.min(a, b), c);
   }

   public static int computeLevenshteinDistance(
      CharSequence str1,
      CharSequence str2 )
   {
      int[][] distance = new int[str1.length() + 1][str2.length() + 1];

      for (int i = 0; i <= str1.length(); i++)
         distance[i][0] = i;
      for (int j = 1; j <= str2.length(); j++)
         distance[0][j] = j;

      for (int i = 1; i <= str1.length(); i++)
         for (int j = 1; j <= str2.length(); j++)
            distance[i][j] =
               minimum(
                  distance[i - 1][j] + 1,
                  distance[i][j - 1] + 1,
                  distance[i - 1][j - 1] +
                     ((str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0 : 1));

      return distance[str1.length()][str2.length()];
   }
}
Almoner answered 2/11, 2012 at 14:26 Comment(2)
Regarding your implementation: I would add some empty-string tests. If str1 is empty, the distance is str2.length() (and the other way around)Expletive
To find "similar" strings, this seems to be the best solution. Making both strings lower case usually improves it even more.Ito
A
2

You can use LCS(Longest Common Subsequence) see these: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

public class LCS {

    public static void main(String[] args) {
        String x = StdIn.readString();
        String y = StdIn.readString();
        int M = x.length();
        int N = y.length();

        // opt[i][j] = length of LCS of x[i..M] and y[j..N]
        int[][] opt = new int[M+1][N+1];

        // compute length of LCS and all subproblems via dynamic programming
        for (int i = M-1; i >= 0; i--) {
            for (int j = N-1; j >= 0; j--) {
                if (x.charAt(i) == y.charAt(j))
                    opt[i][j] = opt[i+1][j+1] + 1;
                else 
                    opt[i][j] = Math.max(opt[i+1][j], opt[i][j+1]);
            }
        }

        // recover LCS itself and print it to standard output
        int i = 0, j = 0;
        while(i < M && j < N) {
            if (x.charAt(i) == y.charAt(j)) {
                System.out.print(x.charAt(i));
                i++;
                j++;
            }
            else if (opt[i+1][j] >= opt[i][j+1]) i++;
            else                                 j++;
        }
        System.out.println();

    }

}

Other solution is Aho–Corasick string matching algorithm see this : Fast algorithm for searching for substrings in a string

Albertype answered 2/11, 2012 at 14:20 Comment(1)
Although I have no idea how this method going to work, i will go and look at it and figure my way to implement it. Thanks SjB :DHaig
C
2

The Levenstein Distance might be useful for this problem. Apache Commons Lang StringUtils has an implementation for it.
Also, the difference method from StringUtils might be interesting, if you want to find out how the strings differ.

Centric answered 2/11, 2012 at 14:25 Comment(1)
I just started typing this :-)Expletive

© 2022 - 2024 — McMap. All rights reserved.