Longest Non-Overlapping Substring

P

4

6

I wonder if anyone knows the (optimal?) algorithm for longest recurring non-overlapping sub string.

For example, in the string

ABADZEDGBADEZ

the longest recurring would be "BAD". Incidentally if there is no such result, the algorithm should alert that such a thing has occurred. My guess is that this involves suffix trees.

I'm sure this must exist somewhere already. Thanks for the help!

Penetralia answered 24/8, 2009 at 17:39 Comment(2)

just curious- what's the business application for this? – Intracellular 24/8, 2009 at 17:42

It isn't a business application. It's related to finding the theme in a song and is more of a curio at the moment until I get a project going involving this. =] – Penetralia 24/8, 2009 at 17:44

F

4

Since you're applying this to music, you're probably not looking for 100% matches; there will be expected discrepencies from one instance of the theme to another. You might try looking up gene sequence analysis algorithms - there's lots of this variety of stuff going on there. Try BLAST or other alignment algorithms.

You could also try the WinEPI family of algorithms, as well as various prefix tree implementations of this specific result set (these results allow gaps in the substring; for instance, ABCID and AFBCD both contain ABCD). I actually have a paper on an algorithm that might be worth looking at if you would be interested; I would need to attain distribution authorization, so let me know.

Note that this is actually a VERY complicated subject (to do efficiently) for large datasets.

Source: Research for a year or two on a comparable (theme-detection) topic.

Fabianfabianism answered 24/8, 2009 at 17:46 Comment(7)

If you could grant me access, it would be appreciated! – Penetralia 24/8, 2009 at 18:4

I think he's applying this to the lyrics, so I think he is looking for 100% matches. – Kunz 24/8, 2009 at 18:6

@Brandon - I've requested permission, I'll see what I can do. @Kunz - Not really. For instance: I was a silly man I was silly, man Example theme: Past-tense silliness. Strings are not an exact match. Plus, a lot of lyrics have different punctuation and whatnot. So I'm not sure he is looking for exact match. – Fabianfabianism 24/8, 2009 at 18:8

Bah formatting. Example was "I was a silly man" versus "I was silly, man" – Fabianfabianism 24/8, 2009 at 18:9

@Brandon - Turns out there are no distribution clauses, so I'll post a link tonight :) – Fabianfabianism 24/8, 2009 at 18:11

Yeah, about the lyrics thing. I'm talking about literally classifying music by notes. This has nothing to do with lyrics. You can transcode musical notes into a string of symbols. Thanks for checking out the distribution rights! =] – Penetralia 24/8, 2009 at 22:2

Here you go - lamegame.no-ip.org/public/PrefixTrees_WWWKWDistributable.pdf Note that the algorithms section is section 4, and the pseudo code there's pretty explicit. Feel absolutely free to ask additional questions about the content or how you might tweak it to work best. Just as a heads up, you'd be operating with an occurrence of a word as a "feature" probably, and for your "time" component just use integers that denote the position in the sentence. Hopefully that makes sense after you look at the paper. – Fabianfabianism 25/8, 2009 at 1:57

A

5

Suffix array is a good data structure for solving that problem.

There is a solution to this problem in Programming Pearls by Jon Bentley.

Apollonius answered 24/8, 2009 at 17:54 Comment(2)

@Nick I don't think that same solution in Programing Pearls can be directly applied here. The example of "BANANA" gives "ANA" which occurs twice, and is thus overlapping, contrary to condition mentioned by OP. Some modification might be required for non overlapping condition. What do you say? – Tetchy 11/11, 2012 at 10:29

@AbhijeetKashnia, you're right. Perhaps we can fix this if we make sure that the comparison of adjacent elements stops when the character offsets overlap, instead of comparing the whole strings. – Apollonius 11/11, 2012 at 14:55

F

4

Since you're applying this to music, you're probably not looking for 100% matches; there will be expected discrepencies from one instance of the theme to another. You might try looking up gene sequence analysis algorithms - there's lots of this variety of stuff going on there. Try BLAST or other alignment algorithms.

You could also try the WinEPI family of algorithms, as well as various prefix tree implementations of this specific result set (these results allow gaps in the substring; for instance, ABCID and AFBCD both contain ABCD). I actually have a paper on an algorithm that might be worth looking at if you would be interested; I would need to attain distribution authorization, so let me know.

Note that this is actually a VERY complicated subject (to do efficiently) for large datasets.

Source: Research for a year or two on a comparable (theme-detection) topic.

Fabianfabianism answered 24/8, 2009 at 17:46 Comment(7)

If you could grant me access, it would be appreciated! – Penetralia 24/8, 2009 at 18:4

I think he's applying this to the lyrics, so I think he is looking for 100% matches. – Kunz 24/8, 2009 at 18:6

@Brandon - I've requested permission, I'll see what I can do. @Kunz - Not really. For instance: I was a silly man I was silly, man Example theme: Past-tense silliness. Strings are not an exact match. Plus, a lot of lyrics have different punctuation and whatnot. So I'm not sure he is looking for exact match. – Fabianfabianism 24/8, 2009 at 18:8

Bah formatting. Example was "I was a silly man" versus "I was silly, man" – Fabianfabianism 24/8, 2009 at 18:9

@Brandon - Turns out there are no distribution clauses, so I'll post a link tonight :) – Fabianfabianism 24/8, 2009 at 18:11

Yeah, about the lyrics thing. I'm talking about literally classifying music by notes. This has nothing to do with lyrics. You can transcode musical notes into a string of symbols. Thanks for checking out the distribution rights! =] – Penetralia 24/8, 2009 at 22:2

Here you go - lamegame.no-ip.org/public/PrefixTrees_WWWKWDistributable.pdf Note that the algorithms section is section 4, and the pseudo code there's pretty explicit. Feel absolutely free to ask additional questions about the content or how you might tweak it to work best. Just as a heads up, you'd be operating with an occurrence of a word as a "feature" probably, and for your "time" component just use integers that denote the position in the sentence. Hopefully that makes sense after you look at the paper. – Fabianfabianism 25/8, 2009 at 1:57

M

1

A simple algorithm is to do this:

Create a data structure representing slices of a string; implement comparison / sorting as appropriate for the language
Create a list of every slice of the string starting with [first-character, last-character], [second-character, last-character], up to [last-character, last-character]
Sort this list - O(n log n)
This brings all string slices with common prefixes together. You can then iterate through the list, comparing each pair to see how many characters they share at the start, and if it's larger than your maximum then take a note of it and set it as the new maximum

(As the other reply just posted indicates, I'm describing a suffix array here.)

Morula answered 24/8, 2009 at 17:55 Comment(2)

This is still rather brute force. I'm wondering if there's a slightly more elegant approach? I believe a tree-based approach would preserve the structural information as well as provide some sort of depth information that can be quickly traversed to provide longest length information. – Penetralia 24/8, 2009 at 18:0

With an appropriate sort - see the suffix array wikipedia article - the running time is O(n log n) in the worst case and usually better. The iteration is O(n), so is dominated by sorting cost. The obvious brute force is going to be O(n^2) at least (comparing every possible pair). Building trees is likely going to cost a lot more memory, which will have adverse real-world effects on performance (consider cache etc.). – Morula 24/8, 2009 at 18:6

F

0

Okay, here's one crazy idea.

Create a function that determines if there is a recurring substring of length k in O(n) (where n is the length of the text). This can be done by using a rolling hash (see wikipedia for Rabin-Karp String Matching Algorithm) to generate all n hashes in linear time and using a hashtable/BST (or a map or dict or whatever your language calls it) to store the corresponding substring positions. Before inserting the current hash to the data structure, we check if we have seen it before. If it has been seen before, we simply look up the indices where this hash was generated and see if the corresponding substring is a non-overlapping match.

Now that we can find a k length substring in O(n) time, we simply run a binary search to find the largest k for which we can find a non-overlapping substring match using our function.

Code in Python follows

A=23
MOD=10007 # use a random value in the real world

def hsh(text):
    return sum([ord(c)*A**(len(text)-i-1) for i,c in enumerate(text)])%MOD

def k_substring(text, k):
    substrings={}
    cur=hsh(text[:k])
    pwr=(A**(k-1))%MOD
    substrings[cur]=[0]
    for i in xrange(k,len(text)):
        to_remove=(ord(text[i-k])*pwr)%MOD
        to_add=ord(text[i])
        cur-=to_remove
        if cur<0:
            cur+=MOD
        cur=(cur*A)%MOD
        cur=(cur+to_add)%MOD
        if cur in substrings:
            lst=substrings[cur]
            for index in lst:
                if index+k<=i-k+1 and text[index:index+k]==text[i-k+1:i+1]:
                    return index
            lst.append(i-k+1)
        else:
            substrings[cur]=[i-k+1]
    return -1

def longest_substring(text):
    hi,lo=len(text),0
    while hi>lo:
        mid=(hi+lo+1)>>1
        if k_substring(text,mid)==-1:
            hi=mid-1
        else:
            lo=mid
    index=k_substring(text,lo)
    return text[index:index+lo]

(Sorry if this is unclear. It is 6:30am here and I really need to go back to bed :) )

Flavius answered 24/8, 2009 at 23:34 Comment(0)

Recommended topics

Hot tags