Shortest uncommon substring: shortest substring of one string, that is not a substring of another string
Asked Answered
D

1

11

We need to find shortest uncommon substring between two strings i.e. if we have two strings a and b so we need to find the length of shortest substring of a that is not a substring of b.

How to solve this problem using suffix array ?

To be solved with complexity of not more than n*lg(n)

Deutoplasm answered 26/9, 2012 at 17:49 Comment(1)
Possible same on CS (newer): cs.stackexchange.com/questions/12037/…Derbent
I
13

This may be solved in O(N) time with Generalized suffix tree.

After constructing the generalized suffix tree in O(N) time, you need to perform breadth-first search and find the first node not belonging to both strings. The path from the root to this node gives the shortest uncommon substring.


The same thing may be done using the generalized suffix array for two input strings, also in O(N) time.

Construct the generalized suffix array along with LCP array (or construct the LCP array later from the suffix array). Add a single zero element as a prefix of the LCP array; add another zero element as a suffix. Find a pair of minimal LCP entries in such a way that there are suffixes of only one string delimited by these entries. This means you need to perform a linear scan of the LCP array, extracting two minimal values, but reset both minimal values to infinity every time you see a suffix of a different string or if you see a suffix belonging to both strings. The larger element of the best of these pairs (having the least value for the larger element in the pair) gives the length of the shortest uncommon substring. This works because this pair of minimal values delimits all descendants of the first node (closest to the root), not belonging to both strings in the corresponding suffix tree.

Irby answered 26/9, 2012 at 18:0 Comment(4)
Why pad the LCP with zeroes? And if we consistently scan the LCP for two minimal values, is that still O(n)? Sorry for all my questions, but could you also post an example?Table
@ChrisZhang: padding with zeros is not strictly necessary. It just allows to avoid handling special cases of beginning/ending of the array. Yes, it is still O(n); compare it to the case where you find 2 minimal values in an array, inserting each array element into max-heap of constant size 2. Could not invent simple enough example, sorry. May be this paper will help you: "Replacing suffix trees with enhanced suffix arrays". It explains how to transform almost any suffix tree algorithm to SA algorithm. Didn't know about it when answering this question.Irby
The solution with the Generalized Suffix Tree is not correct since the edges can be of arbitrary length in a suffix tree (probably it would work with a suffix trie). When finding a non-shared node in breadth-first search (actually depth-first search would work equally, I think) a candidate solution is the path up to the last shared node plus the first character of the newly found edge. The length of this can be compared to the previously shortest result and replaced if it is a shorter solution. (If the current shortest substring has length 1 we can stop the search.)Colemancolemanite
@uli_1973: actually what is needed here is single source shortest path to any non-shared node. This definitely could be done by inspecting every tree node and choosing one with shortest distance from root. But more natural algorithm would be to use BFS and stop as soon as non-shared node is found. Of course this BFS should be generalized to work on suffix trees: plain queue should be replaced by priority queue. (Probably it would be better to name it not "generalized BFS" but "simplified Dijkstra's algorithm").Irby

© 2022 - 2024 — McMap. All rights reserved.