How can I merge overlapping strings in python?
Asked Answered
M

6

5

I have some strings,

['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

These strings partially overlap each other. If you manually overlapped them you would get:

SGALWDVPSPV

I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.

Does anyone have any advice on a nice way to do this in python? Thanks!

Mei answered 16/11, 2017 at 15:45 Comment(2)
What would ABC, B, ABC be merged to? ABC or ABCBABC?Alicea
@tobias_k, since the OP specified partial overlaps, I'd say the result should be ['ABC', 'B', 'ABC'], i.e. no partial overlaps.Russon
D
3

Here is a quick sorting solution:

s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b

Output:

'SGALWDVPSPV'

This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.

Dystopia answered 16/11, 2017 at 15:53 Comment(4)
This assumes that a) the order in the list does not matter, 2) the overlap is always 1, and iii) the number of strings is smaller than the number of characters, right? None of those I can read/interpret from the question.Alicea
@Alicea good point, however, I believe that the OP needs to clarify, particularly regarding the additional input that you posted above.Dystopia
The order of the list doesn't matter, the string can be build up in either direction. The overlap is not necessarily always 1. There may be more strings then characters. I'm currently trying to implement this and see how it works, sounds like it may need a little tweaking.Mei
Although the proposed solution works for current proposed "very short peptide" sequences and its order and orientation in the list, it disrespects gap-penalties, amino-acid substitution and reversed order alignment. The current answer violates the dream of a unicorn being real.Formulate
S
3

My proposed solution with a more challenging test list:

#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']

for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]

print(strFrag)       
print(strFrag[0])

Basically the code compares every string/fragment to the first in list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. Code assumes that there are no unfillable gaps between strings/fragments (Otherwise answer may not result in longest possible assembly. Can be solved by randomizing the starting string/fragment). Also assumes that the reverse complement is not present (poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (changing bestMatch[0] value) to prevent false matches. Last assumption is that all matches are exact. To enable flexibility in permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.

Scaler answered 17/11, 2017 at 18:49 Comment(0)
A
2

To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

def overlap(a, b):
    return max(i for i in range(len(b)+1) if a.endswith(b[:i]))

res = lst[0]
for s in lst[1:]:
    o = overlap(res, s)
    res += s[o:]
print(res) # SGALWDVPSPV

Or using reduce:

from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))

This is probably not super-efficient, with complexity of about O(n k), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by only testing whether the last char of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:

def overlap(a, b):
    return max(i for i in range(len(b)) if b[i-1] == a[-1] and a.endswith(b[:i]))
Alicea answered 16/11, 2017 at 16:11 Comment(5)
Addendum: This answer was assuming that the strings overlap in the order they appear in in the list, but it seems like this is not the case. This makes the problem way more complicated...Alicea
I've sorted the initial list now so that I can be sure they do overlap in the order they appear in the list. The current problem I'm facing is if there are two identical strings that differ by one character then I want to throw one away. So if 'WDVPSPV' and 'WDVPSPT' are both there, I want to throw one away, or at least remove the differing character.Mei
@AdamPrice I'm curious: How did you order them? Did you know the order, and they were just not in that order in the list? Determining the optimum order (resulting in the shortest overall string) would mean trying n! permutations, wouldn't it?Alicea
About that "differ in one character" problem: You could find the edit-distance of all pairs of consecutive strings in the list and check whether it's 1.Alicea
This is actually based on some protein data. We have some crude position data that allows us to order them in a way we think is correct. I am grouping them into neighboring sets, and then i can relatively sort them based on previous filtering that shows they are neighboring sequences. So we know they are in the same neighborhood, and then I can sort that small set based on the relative position information from when we calculated neighbors.Mei
R
1

Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:

from collections import defaultdict
from random import choice, shuffle

def overlap(a, b):
    """ get the maximum overlap of a & b plus where the overlap starts """

    overlaps = []

    for i in range(len(b)):
        for j in range(len(a)):
            if a.endswith(b[:i + 1], j):
                overlaps.append((i, j))

    return max(overlaps) if overlaps else (0, -1)

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']

shuffle(lst)  # to verify order doesn't matter

overlaps = defaultdict(list)

while len(lst) > 1:
    overlaps.clear()

    for a in lst:
        for b in lst:
            if a == b:
                continue

            amount, start = overlap(a, b)
            overlaps[amount].append((start, a, b))

    maximum = max(overlaps)

    if maximum == 0:
        break

    start, a, b = choice(overlaps[maximum])  # pick one among equals

    lst.remove(a)
    lst.remove(b)
    lst.append(a[:start] + b)

print(*lst)

OUTPUT

% python3 test.py
NONSEQUITUR SGALWDVPSPV
%

Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.

The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.

Russon answered 16/11, 2017 at 18:49 Comment(0)
F
1

Once the peptides start to grow to 20 aminoacids cdlane's code chokes and spams (multiple) incorrect answer(s) with various amino acid lengths.

Try to add and use AA sequence 'VPSGALWDVPS' with or without 'D' and the code starts to fail its task because the N-and C-terminus grow and do not reflect what Adam Price is asking for. The output is: 'SGALWDVPSGALWDVPSPV' and thus 100% incorrect despite the effort.

Tbh imo there is only one 100% answer and that is to use BLAST and its protein search page or BLAST in the BioPython package. Or adapt cdlane's code to reflect AA gaps, substitutions and AA additions.

Formulate answered 17/11, 2017 at 13:58 Comment(4)
Provided code from cdlane works for this short range but beyond it fails epicly. AA replacement (A->V or S>T) and gaps between AA are not counterfeit in cdlane or other examples (lack of background) which should have taken into account.Formulate
The OP says, "I have some simple short strings and just want to properly merge them in a simple way". I've no pretensions that my solution is biologically valid. It attempts to solve the stated problem. If you and the OP want to change the parameters of the problem after the fact, then that's a different problem and you should post it as a different question.Russon
You are correct about his question and your method is a good coffee break teaser. For that I've got it marked and upvoted because its a nice play with code. But from my profession point of view I had to flag it and provide comments for adaptation of the question. As OP brings the question I believe you did Adams "homework". I recall such type of question from about 13 years ago.... (sights... I'm getting old). So.. what you suggest is a good one so I'll suggest the "flag" reviewer to elaborate and deepen this question... but then Adam should come with at least one solution of his own too..?Formulate
It appears to be enough to have commented the technicalities here.Formulate
S
-2

Dredging up an old thread, but had to solve this myself today.

For this specific case, where the fragments are already in order, and each overlap by the same amount (in this case 1), the following fairly simply concatenation works, though might not be the worlds most robust solution:

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True
Streit answered 3/12, 2019 at 12:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.