Fast algorithm for searching for substrings in a string

Asked 19/11, 2009 at 18:38 Answered 19/11, 2009 at 19:39

Solved java algorithm string search

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

What I would like to do is:

Given an input string - INSTR:

"BCDEFGH"

And a set of candidate strings - CAND:

"AB", "CDE", "FG", "H", "IJ"

Find any CAND strings that match as substrings within INSTR

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.

I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.

Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.

Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.

Update I only need to know that there was a match - and i don't need to know what the match was.

Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this

In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:

AhoCorsick: 1

RabinKarp: 1.8

Naive Search (check each pattern & use string.contains): 50

*Some resources describing the algos mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html*

Sible answered 19/11, 2009 at 18:38 Comment(6)

How long are the input strings in relation to the candidate strings? – Baudelaire 19/11, 2009 at 18:43

They are short. Input strings are typically less then 40 characters, as are the candidate strings. – Sible 19/11, 2009 at 18:47

but there could be many thousand candidate strings - and i want to do this repeatedly across many input strings millions of times. – Sible 19/11, 2009 at 18:47

Given the details a FSM is probably your best choice. – Wordy 19/11, 2009 at 18:53

Curious what you expect as output. Is knowing that one was found good enough or do you need to know which were found? Do you need to know how many times each substring was found? – Rossetti 19/11, 2009 at 19:34

Yes, one match is good enough - with a simple true/false and no other info required. – Sible 19/11, 2009 at 20:15

Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.

If the input is not too large, you don't want to repeat the search many times and you do not have many patterns, it might be a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times.

Implementations:

Presentations:

http://www.slideshare.net/taka111/ahocorasick-string-matching-algorithm-15078438

Frustrated answered 19/11, 2009 at 18:42 Comment(1)

A collection of several algorithms (including Aho-Corasick) could be found on stringsearchalgorithms.amygdalum.net – Nates 31/3, 2017 at 5:19

Convert the set of candidate strings into a deterministic finite state automaton and then run through the input string in linear time. Converting a single string into a DFS is well-covered in the standard books. You can convert a set of strings by first constructing a non-deterministic automaton and then determinizing it. That can create exponential blow-up in the worst case in the size of the automaton but the search afterwards is fast; especially if the target string is long and the candidates short that's going to work well.

Bernete answered 19/11, 2009 at 18:39 Comment(6)

Is this as soln as relevant given that the input strings and candidate strings are all very short e.g. < 50 chars? – Sible 19/11, 2009 at 18:49

@Sible I think it depends on what you mean when you write above that "I want to do this repeatedly across many input strings". The DFS does not depend on the input string, so if the set of candidate strings is constant across multiple input strings then that's equivalent to one long input string and thus the solution is definitely relevant. If all strings are short AND the candidates change every time then it might not be the optimal solution. – Bernete 19/11, 2009 at 18:53

How does this compare to the Rabin-Karp multiple pattern search algo suggested above? Given the large number of possible substrings, and the relatively short inputstring length this would have seemed like a good solution? – Sible 19/11, 2009 at 19:9

@Sible Rabin-Karp is definitely another good solution, it just has different asymptotic behavior. The DFS solution has best asymptotic complexity for matching because it is O(n) in the length of the input string only. Rabin-Karp has extra terms in the complexity. However it is simpler to implement from scratch. – Bernete 19/11, 2009 at 20:1

plz, give a link or name of book where description how to convert single string to DFS is. I cannot find it.... – Torsi 6/5, 2012 at 23:54

It is possible to convert a string to a finite state machine, yet experts in string search would point out that there are several ways to do that. Knuth-Morris-Pratt (both single pattern searches) or Aho-Corasick (Multi-Pattern-Search) are all based on a deterministic finite automaton. Shift-And is based on a nondeterministic finite automaton (and is also efficient if being used with bit-parallelism). So this answer is strictly a duplicate of the accepted one. – Nates 15/8, 2016 at 6:39

This is what regular expressions are for. As noted above, finite state automata are what you need, but that is exactly how a standard regexp-matcher is implemented.

In java you could write something like:

StringBuilder sb = new StringBuilder();
bool first = true;
for (String subStr : substrings) {
    if (first)
        first = false;
    else
        sb.append('|');
    sb.append(escape(subStr));
}
Pattern p = Pattern.compile(sb.toString());

the method escape should escape any characters which have special meanings in a regexp.

Engels answered 19/11, 2009 at 18:53 Comment(8)

I can't speak for why it was down voted but I can say that due to the way Java regex is implemented, this regex can be way less efficient than searching for each substring individually. – Rossetti 19/11, 2009 at 19:32

I like this solution and upvoted it. However, I'd like to point out two potential problems: (1) Given a thousand or so search strings, it's possible the pattern compiler will explode. I worry memory use grows exponentially with the complexity of the match expression. (2) I believe the FSM/DFS built by the pattern compiler will on occasion back up. If it does, then one of the specialized algorithms that moves strictly forward may end up being faster yet. – Actin 19/11, 2009 at 19:41

I don't claim that my solution is perfect. It may very well be adequate though. YMMV. – Biliary 19/11, 2009 at 21:19

The way regex executes a pattern like "a|b|c|d" is to start at position 0, try all options, move to position 1, try all options, move to position 2, try all options, etc.. Faster ways to do this without a big regex OR are pretty trivial to write I'd think. It will never be any worse to search all of INSTR for CAND1, then search for CAND2, etc. and will often be much better. Furthermore, that kind of search is much easier to optimize. – Rossetti 20/11, 2009 at 2:0

@PSpeed: I can't imagine any real regexp library being implemented like that. Anyway, if speed is really important, you might try [this library][0]. [0]: brics.dk/automaton – Biliary 20/11, 2009 at 10:51

It is though. You can look at the code for Java's regex. The issue is that the OR nodes don't know what the "a", "b", "c", etc. nodes are doing. Also, you are implicitly asking it the wrong question since "a|b|c|d" is asking for the first match of one of those... which means at the very least you will have to search for all of them before you know (unless you hit on the first character). The poster wants to know if any match at all and it's faster to test "a", "b", "c", etc. separately as compared to the regex. It always will be. – Rossetti 20/11, 2009 at 14:58

@Jørgen, for what it's worth, I had the exact same reaction when our regex-guru explained this to me a while back. :) – Rossetti 20/11, 2009 at 15:4

Java's regex may be very poorly implemented; I'm not familiar with it. But barring that, regex should give the optimal solution. The compiled regex should grow linearly in the pattern length (with any luck, sublinear), certainly not exponentially, and search time is linear. – Philomena 6/7, 2010 at 3:54

Rabin-Karp multiple pattern search appears to be the fastest.

Ambush answered 19/11, 2009 at 18:48 Comment(0)

You might want to look into Aho-Corasick algorithm and related algorithms. I don't know of any libraries that implement this, offhand, but this is the classic way of solving this problem.

Jens answered 19/11, 2009 at 18:46 Comment(1)

Thx. Java implementation here: hkn.eecs.berkeley.edu/~dyoo/java/index.html – Sible 20/11, 2009 at 11:33

Also check the Boyer-Moore algorithm for single-string pattern matching.

Inordinate answered 19/11, 2009 at 18:50 Comment(0)

We can take advantage of the small size (< 50 char) of the strings to build a super fast algo for this case, at the cost of memory.

We can hash all possible substrings of INSTR in a hash one time that will cost O(n^2) time. Then regardless of the number of CAND strings, the lookup will be O(1). Worth it for a very large number of CAND strings.

If INSTR is large, then we can build a suffix array and not sort it, so that the top item is the longest (=N) and bottom item is the last char of INSTR. Now for each CAND string, only search from the top as long as length(CAND) <= length(suffix). Each of those comparisons will be O(n).

Posthorse answered 19/11, 2009 at 19:39 Comment(2)

I'm a little hazy on this so I could be off base here, but would the hash time be O(n+1)(n/2) instead of O(n^2) since thats how many different substrings that there should be? – Alyssa 21/1, 2010 at 0:1

Big-O ignores coefficients. Drop the 1 and 2 out of your expression and you are left with O((n)(n)) which is the same as O(n^2). – Craig 15/7, 2015 at 15:14

import java.util.Scanner;

public class StringMatch 
{
    static int temp,i=0,j=0; static boolean flag=true,matcher=false;

    static String str=null,mstr=null;static char astr[],amstr[];

    static void getter(){
        Scanner sc = new Scanner(System.in);
        str = sc.nextLine();
        //String str="today is Monday"; 
        astr=str.toCharArray();
         mstr = sc.nextLine();
        //String mstr="is"; 
         amstr=mstr.toCharArray();
    }

    static void stringMatch(){
        while(i<astr.length){
            if(astr[i]==amstr[j]){
            while((j!=amstr.length)&&flag){temp=i;
                if(astr[i]!=amstr[j]) {flag=false;matcher=false;}
                else{matcher=true;}
                i++;j++;
                //System.out.println(i+"\t"+j);
            }if(matcher==true)break;i=temp;}i++;j=0;flag=true;

        }
        if(matcher==true) {System.out.println("true");}
        else    {System.out.println("false");}
    }

    public static void main(String[] args) {

    StringMatch.getter();
    StringMatch.stringMatch();

    }
}

Unending answered 19/11, 2009 at 18:38 Comment(0)

Here are some implementation of fast String search algorithms in Java.

Negligee answered 19/11, 2009 at 18:38 Comment(2)

where? Did you forgot to copy paste the link? – Snaky 22/12, 2015 at 13:13

If you click on th "Here" text you will be redirected to the website with the algorithms. – Negligee 23/12, 2015 at 16:30

Another solution is to use a suffix array for the INSTR.
Since the INSTR is small you can sort it with bubble sort.

Afterwards you can search for a specific CAND string in O(logN) time,
where N = length(suffix_array) = length(INSTR).

Polloch answered 19/11, 2009 at 19:16 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags