How to find inverted repeated pattern in a FASTA sequence?

Asked 12/1, 2013 at 21:27 Answered 13/1, 2018 at 23:4

Suppose my long sequence looks like:

5’-AGGGTTTCCC**TGACCT**TCACTGC**AGGTCA**TGCA-3

The two italics subsequences (here within the two stars) in this long sequence are together called as inverted repeat pattern. The length and the combination of the four letters such as A,T,G,C in those two subsequences will be varying. But there is a relation between these two subsequence. Notice that, when you consider the first subsequence then its complementary subsequence is ACTGGA (according to A combines with T and G combine with C) and when you invert this complementary subsequence (i.e. last letter comes first) then it matches with the second subsequence.

There are large number of such patterns present in a FASTA sequence (contains 10 million ATGC letters) and I want to find such patterns and their start and end positions.

Hypermeter answered 12/1, 2013 at 21:27 Comment(3)

Are there any length limitations? This looks like a very computationally intensive task. – Bismuthic 12/1, 2013 at 21:40

1) How long are the inverted repeats (at minimum)? 2) How far apart can they be (at maximum)? – Alleen 12/1, 2013 at 21:41

Always two subsequences can form an inverted repeat unit. Suppose they are separated by 100 bases. – Hypermeter 14/1, 2013 at 2:46

I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You do this with a suffix tree. A suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree) is the magical data structure that makes all things possible in bioinformatics. You quickly locate common substrings in multiple long sequences. Suffix trees only require linear time, so length 10,000,000 is feasible.

So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).

The code below uses this suffix tree implementation: http://www.daimi.au.dk/~mailund/suffix_tree.html. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However I can't say whether this will handle length 10,000,000.

from suffix_tree import GeneralisedSuffixTree

baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' }

def revc(seq):
    return "".join([baseComplement[base] for base in seq[::-1]])

data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
# revc  TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT
#       012345678901234567890123456789012
#                 1         2         3
minlength = 6

n = len(data)
tree = GeneralisedSuffixTree([data, revc(data)])
for shared in tree.sharedSubstrings(minlength):
    #print shared
    _, start, stop = shared[0]
    seq = data[start:stop]
    _, rstart, rstop = shared[1]
    rseq = data[n-rstop:n-rstart]
    print "Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart)

This produces output

Match: AGGTCA at [23:29] and TGACCT at [10:16]
Match: TGACCT at [10:16] and AGGTCA at [23:29]
Match: CTGCAG at [19:25] and CTGCAG at [19:25]

It finds each match twice, once from each end. And there's a reverse palindrome in there, too!

Tifanytiff answered 12/1, 2013 at 22:4 Comment(6)

Thank you very much. I have put a comment below (a small version of the algorithm). Can you implement it? – Hypermeter 13/1, 2013 at 23:7

@user1964587: If this answers the question, please accept the answer by clicking on the checkmark. As for implementing your idea, I'll leave that to you. :-) However, your proposed solution sounds like it will require at least O(n^2) running time, whereas a suffix tree is O(n), and so will be faster. But for 10,000,000-long sequences, you'll want lots of RAM. – Tifanytiff 14/1, 2013 at 0:43

Yes, I agree and your suffix tree method is good idea for it. But if you modify your program for a small section of DNA of the whole genome then it will be more efficient. Suppose first consider 20 letters and generate its reverse complement and using the suffix tree method find the shared sequence of it. If there two shared sequences are found then go for the next section and repeat the whole process(and not consider the previous section for the next calculation)otherwise increase the length of the previous sequence.I am trying but can't figure out how to do it.Can you please implement it. – Hypermeter 14/1, 2013 at 3:1

That's a big request! Why don't you start working on it, piece by piece, and when you have a specific problem, post a new question on it. – Tifanytiff 14/1, 2013 at 17:8

I can't print the variable 'tree'. I want to see the structure of the tree variable. Simple 'print tree' is not working here. – Hypermeter 14/1, 2013 at 20:28

The "tree" variable is an instance of a complex data structure. Look at the documentation for it on the web site linked in my reply for the properties and methods it has. You should be able to print tree.sequences, tree.sharedSubstrings(), and other properties of tree. – Tifanytiff 14/1, 2013 at 21:51

Here's a brute force implementation, although it's probably not very useful on super long sequences.

def substrings(s, lmin, lmax):
    for i in range(len(s)):
        for l in range(lmin, lmax+1):
            subst = s[i:i+l]
            if len(subst) == l:
                yield i, l, subst

def ivp(s, lmin, lmax):
    mapping = {'A': 'T', 'G': 'C', 'T': 'A', 'C': 'G'}
    for i, l, sub in substrings(s, lmin, lmax):
        try:
            from string import maketrans
        except ImportError: # we're on Python 3
            condition = sub.translate(
                       {ord(k): v for k, v in mapping.items()})[::-1] in s
        else: # Python 2
            condition = sub.translate(maketrans('ATGC', 'TACG'))[::-1] in s
        if condition:
            yield i, l, sub

Let's find "inverted repeat patterns" of length 6 (and their start positions and lengths):

>>> list(ivp('AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA', 6, 6))
[(10, 6, 'TGACCT'), (19, 6, 'CTGCAG'), (23, 6, 'AGGTCA')]

This doesn't check if the two patterns overlap, though. For instance, 'CTGCAG' matches itself.

Bismuthic answered 12/1, 2013 at 22:22 Comment(11)

actual sequence and other from reverse complement sequence). The cost will be giving an idea which alignment is best. Now if the cost of the best alignment >=threshold cost then select that section and find the common region. The two common regions of this particular section will be one inverted repeat unit. Once find the unit then repeat it for the next section other wise increase the length of the sections continuously. Can anyone implement this algorithm. May be it will be a useful solution. – Hypermeter 13/1, 2013 at 22:49

I am getting this error message when I was running this scriot. Where is my wrong? Traceback (most recent call last): File "irc.py", line 13, in <module> print list(ivp('AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA', 6, 6)) File "irc.py", line 11, in ivp if sub.translate(mapping)[::-1] in s: TypeError: expected a character buffer object – Hypermeter 14/1, 2013 at 4:49

@Hypermeter Nothing you did wrong, I edited the code to work on both Python 2 and 3. – Bismuthic 14/1, 2013 at 6:3

but why is this error appearing? any clue. I am using 3.2 but tried with 2. still the error is there – Hypermeter 14/1, 2013 at 6:23

@Hypermeter Please try the edited code. The traceback is from the old code (or please show the new traceback) – Bismuthic 14/1, 2013 at 6:34

File "irc1.py", line 22 print list(ivp(fasta, 6, 6)) ^ SyntaxError: invalid syntax. this error was coming when I ran the script using python3 or python3.2....why it appears – Hypermeter 14/1, 2013 at 7:53

@Hypermeter Look at the previous line, maybe you forgot to close a bracket there or something. – Bismuthic 14/1, 2013 at 8:42

@Hypermeter Wait a sec, you can't use print statement in Python3: print is a function. – Bismuthic 14/1, 2013 at 17:44

thanks it works. but when I varying lmin and lmax for a long sequence, the output is weird. – Hypermeter 14/1, 2013 at 20:39

@Hypermeter Can you perhaps be more specific? :) – Bismuthic 14/1, 2013 at 20:55

I have an idea for finding the inverted palindrome sequence of a long sequence. Consider a section of a DNA sequence of the whole sequence and generate its complement. Then reversing the section of this complement sequence. Then perform the dynamical alignment between these two sections and calculate the cost of it(one from actual sequence and other from reverse complement sequence). The cost will be giving an idea which alignment is best. Now if the cost of the best alignment >=threshold cost then select that section and find the common region. The two common regions of this particular section will be one inverted repeat unit. Once find the unit then repeat it for the next section other wise increase the length of the sections continuously. Can anyone implement this algorithm. May be it will be a useful solution.

Hypermeter answered 14/1, 2013 at 2:22 Comment(0)

I took a shot at it using list comprehensions. I'm still new to python, but have been a C# dev for the last 5 or so years. This produces the output you desire, although it's definitely not optimized to be able to efficiently handle a 10 million character string.

Note: because we convert the list into a set in frequent_words, in order to remove duplicate entries, the results are not ordered.

def pattern_matching(text, pattern):
    """ Returns the start and end positions of each instance of the pattern  """
    return [[x, str(len(pattern) + x)] for x in range(len(text) - len(pattern) + 1) if
            pattern in text[x:len(pattern) + x]]


def frequent_words(text, k):
    """ Finds the most common k-mers of k """
    counts = [len(pattern_matching(text, text[x:x + k])) for x in range(len(text) - k)]
    return set([text[x:x + k] for x in range(len(text) - k) if counts[x] == max(counts)])


def reverse_complement(pattern):
    """ Formed by taking the complement of each nucleotide in Pattern """
    complements = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
    return ''.join([complements.get(c, c) for c in pattern][::-1])


def find_invert_repeats(text, pattern_size):
    """ Returns the overlap for frequent words and its reverse """
    freqs = frequent_words(text, pattern_size)
    rev_freqs = frequent_words(reverse_complement(text), pattern_size)
    return [[x, pattern_matching(text, x)] for x in set(freqs).intersection(rev_freqs)]

print(find_invert_repeats("AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA", 6))

[['TGACCT', [[10, '16']]], ['AGGTCA', [[23, '29']]], ['CTGCAG', [[19, '25']]]]

Fatherland answered 13/1, 2018 at 23:4 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags