Longest common subsequence of 3+ strings
Asked Answered
J

5

25

I am trying to find the longest common subsequence of 3 or more strings. The Wikipedia article has a great description of how to do this for 2 strings, but I'm a little unsure of how to extend this to 3 or more strings.

There are plenty of libraries for finding the LCS of 2 strings, so I'd like to use one of them if possible. If I have 3 strings A, B and C, is it valid to find the LCS of A and B as X, and then find the LCS of X and C, or is this the wrong way to do it?

I've implemented it in Python as follows:

import difflib

def lcs(str1, str2):
    sm = difflib.SequenceMatcher()
    sm.set_seqs(str1, str2)
    matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()]
    return "".join(matching_blocks)

print reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa'])

This outputs "ba", however it should be "baa".

Jaye answered 20/2, 2011 at 13:16 Comment(6)
This is a very well known problem. Just google it. see http://www.springerlink.com/content/fu4t4442l7577712/ where two algorithms for this problem are presented. Also look at http://www.perlmonks.org/?node_id=548827 (many other links are available at google)Altogether
It's "baa". He wants longest common subsequence, not longest common substring.Tinea
The iterative approach you mention will not work. For example it will give wrong results for abc12, 12abc, 12xyz.Patronymic
Please note that there may be more than 1 lcs for any two strings. For example, for your first two strings ('abacbdab', 'bdcaba'), there are at least five lcs: "baa", "bab", "bca", "bcb", "dab". Does your code take that into account?Tinea
Oops, finding lcs by hand is not so trivial! There are three, "bcab", "bdab", "bcba" in the above example, which explains why your code can't find "baa". But the point is still valid, there can be more than one lcs for 2 strings.Tinea
@ypercube - Good point, my code doesn't take that in to account since difflib only gives you one LCS.Jaye
S
34

Just generalize the recurrence relation.

For three strings:

dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k]
              max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise

Should be easy to generalize to more strings from this.

Scourings answered 20/2, 2011 at 13:41 Comment(10)
I adapted the LCS code from here as per your suggestion: code.activestate.com/recipes/…. It works but is extremely slow and I run into maximum recursion depth errors on longer sequences. I was hoping there is a library that implements this already.Jaye
@Jaye - it's not going to get much faster, because the problem has no better exact solution.Scourings
@del, it can be programmed iteratively instead of recursively. The pseudocode on Wikipedia even gives the iterative solution. Just extend this to 3 dimensions, or k dimensions if you feel like it. The dynamic programming algorithm is O(Π_{i=1,k} n_k) in time where n_i is the length of the ith of k input strings (O(n^k) if all strings same length). So, yes, it's slow. Space complexity can be O([Π_{i=1,k} n_k]/[argmax_{i=1,k} n_k]) (if all strings same length, O(n^{k-1})).Saddlecloth
I have posted a solution which uses dynamic programming to save immediate results for a faster solution and less memory usage.Fetus
@KentMuntheCaspersen - this uses dynamic programming as well (or CAN use it, if you implement the recurrence right). Yours seems to use backtracking? It's hard to tell from your code what you're doing exactly, but I doubt it's faster.Scourings
Miy solution uses recursion and works for n strings. It uses a n-dimensional array, where the entry [i, j, k, ...] stores the solution to the problem of finding the LCS between string1(0,i), string2(0,j), string3(0,k) ...Fetus
I wrote it fast so it is not very clean, but I decided to post it anyways since I could not find an implementation anywhere else. Its complexity is O(n^m), where n is maximum string length and m is number of strings. Your solution is simple and powerful, it perfectly demonstrates the technique.Fetus
Someone directly copy-pasted your answer here, without even mentioning your name ;-)Misplay
Aww, I am flattered :).Scourings
Is there any performant solution that uses O(n^2) memory ? This one is painful in memory usage as O(n^k) if there are k stringsVerminous
S
9

I just had to do this for a homework, so here is my dynamic programming solution in python that's pretty efficient. It is O(nml) where n, m and l are the lengths of the three sequences.

The solution works by creating a 3D array and then enumerating all three sequences to calculate the path of the longest subsequence. Then you can backtrack through the array to reconstruct the actual subsequence from its path.

So, you initialize the array to all zeros, and then enumerate the three sequences. At each step of the enumeration, you either add one to the length of the longest subsequence (if there's a match) or just carry forward the longest subsequence from the previous step of the enumeration.

Once the enumeration is complete, you can now trace back through the array to reconstruct the subsequence from the steps you took. i.e. as you travel backwards from the last entry in the array, each time you encounter a match you look it up in any of the sequences (using the coordinate from the array) and add it to the subsequence.

def lcs3(a, b, c):
    m = len(a)
    l = len(b)
    n = len(c)
    subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)]

    for i, x in enumerate(a):
        for j, y in enumerate(b):
            for k, z in enumerate(c):
                if x == y and y == z:
                    subs[i+1][j+1][k+1] = subs[i][j][k] + 1
                else:
                    subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k], 
                                              subs[i][j+1][k+1], 
                                              subs[i+1][j][k+1])
    # return subs[-1][-1][-1] #if you only need the length of the lcs
    lcs = ""
    while m > 0 and l > 0 and n > 0:
        step = subs[m][l][n]
        if step == subs[m-1][l][n]:
            m -= 1
        elif step == subs[m][l-1][n]:
            l -= 1
        elif step == subs[m][l][n-1]:
            n -= 1
        else:
            lcs += str(a[m-1])
            m -= 1
            l -= 1
            n -= 1

    return lcs[::-1]
Saba answered 30/1, 2017 at 0:24 Comment(0)
F
4

To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally like shown in the Link you posted. Every element in the array corresponds to the problem of finding the LCS of the substrings A' and B' (A cut by its row number, B cut by its column number). This problem can be solved by calculating the value of all elements in the array. You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that given value has already been solved. That is why you traverse the 2-dimensional array diagonally.

This solution can be scaled to finding the longest common subsequence between N strings, but this requires a general way to iterate an array of N dimensions such that any element is reached only when all sub-problems the element requires a solution to has been solved.

Instead of iterating the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:

string lcs(string[] strings)
{
    if (strings.Length == 0)
        return "";
    if (strings.Length == 1)
        return strings[0];
    int max = -1;
    int cacheSize = 1;
    for (int i = 0; i < strings.Length; i++)
    {
        cacheSize *= strings[i].Length;
        if (strings[i].Length > max)
            max = strings[i].Length;
    }
    string[] cache = new string[cacheSize];
    int[] indexes = new int[strings.Length];
    for (int i = 0; i < indexes.Length; i++)
        indexes[i] = strings[i].Length - 1;
    return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
    for (int i = 0; i < indexes.Length; i++ )
        if (indexes[i] == -1)
            return "";
    bool match = true;
    for (int i = 1; i < indexes.Length; i++)
    {
        if (strings[0][indexes[0]] != strings[i][indexes[i]])
        {
            match = false;
            break;
        }
    }
    if (match)
    {
        int[] newIndexes = new int[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
            newIndexes[i] = indexes[i] - 1;
        string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
        cache[calcCachePos(indexes, strings)] = result;
        return result;
    }
    else
    {
        string[] subStrings = new string[strings.Length];
        for (int i = 0; i < strings.Length; i++)
        {
            if (indexes[i] <= 0)
                subStrings[i] = "";
            else
            {
                int[] newIndexes = new int[indexes.Length];
                for (int j = 0; j < indexes.Length; j++)
                    newIndexes[j] = indexes[j];
                newIndexes[i]--;
                int cachePos = calcCachePos(newIndexes, strings);
                if (cache[cachePos] == null)
                    subStrings[i] = lcsBack(strings, newIndexes, cache);
                else
                    subStrings[i] = cache[cachePos];
            }
        }
        string longestString = "";
        int longestLength = 0;
        for (int i = 0; i < subStrings.Length; i++)
        {
            if (subStrings[i].Length > longestLength)
            {
                longestString = subStrings[i];
                longestLength = longestString.Length;
            }
        }
        cache[calcCachePos(indexes, strings)] = longestString;
        return longestString;
    }
}
int calcCachePos(int[] indexes, string[] strings)
{
    int factor = 1;
    int pos = 0;
    for (int i = 0; i < indexes.Length; i++)
    {
        pos += indexes[i] * factor;
        factor *= strings[i].Length;
    }
    return pos;
}

My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.

On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"

The LCS returned is "54442"

Fetus answered 23/10, 2013 at 14:45 Comment(0)
Z
4

This below code can find the longest common subsequence in N strings. This uses itertools to generate required index combinations and then use these indexes for finding common substring.

Example Execution:
Input:
Enter the number of sequences: 3
Enter sequence 1 : 83217
Enter sequence 2 : 8213897
Enter sequence 3 : 683147

Output:
837

from itertools import product
import numpy as np
import pdb

def neighbors(index):
    N = len(index)
    for relative_index in product((0, -1), repeat=N):
        if not all(i == 0 for i in relative_index):
            yield tuple(i + i_rel for i, i_rel in zip(index, relative_index))

def longestCommonSubSequenceOfN(sqs):
    numberOfSequences = len(sqs);
    lengths = np.array([len(sequence) for sequence in sqs]);
    incrLengths = lengths + 1;
    lengths = tuple(lengths);
    inverseDistances = np.zeros(incrLengths);
    ranges = [tuple(range(1, length+1)) for length in lengths[::-1]];
    for tupleIndex in product(*ranges):
        tupleIndex = tupleIndex[::-1];
        neighborIndexes = list(neighbors(tupleIndex));
        operationsWithMisMatch = np.array([]);
        for neighborIndex in neighborIndexes:
            operationsWithMisMatch = np.append(operationsWithMisMatch, inverseDistances[neighborIndex]);
        operationsWithMatch = np.copy(operationsWithMisMatch);
        operationsWithMatch[-1] = operationsWithMatch[-1] + 1;
        chars = [sqs[i][neighborIndexes[-1][i]] for i in range(numberOfSequences)];
        if(all(elem == chars[0] for elem in chars)):
            inverseDistances[tupleIndex] = max(operationsWithMatch);
        else:
            inverseDistances[tupleIndex] = max(operationsWithMisMatch);
        # pdb.set_trace();

    subString = "";
    mainTupleIndex = lengths;
    while(all(ind > 0 for ind in mainTupleIndex)):
        neighborsIndexes = list(neighbors(mainTupleIndex));
        anyOperation = False;
        for tupleIndex in neighborsIndexes:
            current = inverseDistances[mainTupleIndex];
            if(current == inverseDistances[tupleIndex]):
                mainTupleIndex = tupleIndex;
                anyOperation = True;
                break;
        if(not anyOperation):
            subString += str(sqs[0][mainTupleIndex[0] - 1]);
            mainTupleIndex = neighborsIndexes[-1];
    return subString[::-1];

numberOfSequences = int(input("Enter the number of sequences: "));
sequences = [input("Enter sequence {} : ".format(i)) for i in range(1, numberOfSequences + 1)];
print(longestCommonSubSequenceOfN(sequences));
Zel answered 13/5, 2019 at 15:50 Comment(0)
R
2

Here is a link to the solution view explanation here output is Length of LCS is 2

 # Python program to find 
 # LCS of three strings 

 # Returns length of LCS 
 # for X[0..m-1], Y[0..n-1] 
 # and Z[0..o-1] 
def lcsOf3(X, Y, Z, m, n, o): 

    L = [[[0 for i in range(o+1)] for j in range(n+1)] 
        for k in range(m+1)] 

    ''' Following steps build L[m+1][n+1][o+1] in 
    bottom up fashion. Note that L[i][j][k] 
    contains length of LCS of X[0..i-1] and 
    Y[0..j-1] and Z[0.....k-1] '''
   for i in range(m+1): 
    for j in range(n+1): 
        for k in range(o+1): 
            if (i == 0 or j == 0 or k == 0): 
                L[i][j][k] = 0

            elif (X[i-1] == Y[j-1] and
                  X[i-1] == Z[k-1]): 
                L[i][j][k] = L[i-1][j-1][k-1] + 1

            else: 
                L[i][j][k] = max(max(L[i-1][j][k], 
                L[i][j-1][k]), 
                                L[i][j][k-1]) 

      # L[m][n][o] contains length of LCS for 
      # X[0..n-1] and Y[0..m-1] and Z[0..o-1] 
    return L[m][n][o] 

  # Driver program to test above function 

X = 'AGGT12'
Y = '12TXAYB'
Z = '12XBA'

m = len(X) 
n = len(Y) 
o = len(Z) 

print('Length of LCS is', lcsOf3(X, Y, Z, m, n, o)) 

# This code is contributed by Soumen Ghosh.      
Ravid answered 22/5, 2020 at 17:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.