"invalid sequence" error in seqio.write() of biopython
This question is related to bioinformatics. I did not receive any suggestions in the corresponding forums, so I am asking it here.

I need to remove non-ACTG nucleotides from the sequences in a FASTA file and write the output to a new file using SeqIO from Biopython.

My code is

import re
import sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC


seq_list=[]
for seq_record in SeqIO.parse("test.fasta", "fasta",IUPAC.ambiguous_dna):
        sequence=seq_record.seq
        sequence=sequence.tomutable()
        seq_record.seq = re.sub('[^GATC]',"",str(sequence).upper())
        seq_list.append(seq_record)
SeqIO.write(seq_list,"test_out","fasta")

Running this code gives errors:

Traceback (most recent call last):
  File "remove.py", line 18, in <module>
    SeqIO.write(list,"test_out","fasta")
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 481, in write
    count = writer_class(fp).write_file(sequences)
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 209, in write_file
    count = self.write_records(records)
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 194, in write_records
    self.write_record(record)
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/FastaIO.py", line 202, in write_record
    data = self._get_seq_string(record)  # Catches sequence being None
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 100, in _get_seq_string
% record.id)
 TypeError: SeqRecord (id=CALB_TCONS_00001015) has an invalid sequence.

If I change this line

seq_record.seq = re.sub('[^GATC]',"",str(sequence).upper())

to, for example, seq_record.seq = sequence + "A", everything works fine. However, I would expect re.sub('[^GATC]',"",str(sequence).upper()) to work as well.

Thanks

Lugubrious answered 11/7, 2017 at 16:30

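Biopython's SeqIO expects the SeqRecord object's .seq to be a Seq object (or similar), not a plain string. re.sub() returns a plain Python str, so assigning its result directly to seq_record.seq is what triggers the "invalid sequence" TypeError. Your sequence + "A" variant works because adding a string to a MutableSeq still gives back a sequence object. Wrap the result in Seq instead: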

seq_record.seq = Seq(re.sub('[^GATC]',"",str(sequence).upper()))

For FASTA output there is no need to set the sequence's alphabet.
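
Putting it together, here is a minimal sketch of the whole script with the fix applied. The function names clean_sequence and filter_fasta are made up for illustration and are not part of Biopython. It also leaves out Bio.Alphabet and tomutable(): neither is needed for this task, and Bio.Alphabet was removed in Biopython 1.78, so the original imports would fail on newer installs.

```python
import re


def clean_sequence(text):
    """Upper-case the sequence text and drop every character other than A, C, G, T."""
    return re.sub("[^GATC]", "", text.upper())


def filter_fasta(in_path, out_path):
    """Write a copy of the FASTA file at in_path containing only A/C/G/T bases."""
    # Imported here so clean_sequence() stays usable without Biopython installed.
    from Bio import SeqIO
    from Bio.Seq import Seq

    cleaned = []
    for record in SeqIO.parse(in_path, "fasta"):
        # Wrap the cleaned string in Seq; a plain str is what triggered the TypeError.
        record.seq = Seq(clean_sequence(str(record.seq)))
        cleaned.append(record)
    # SeqIO.write returns the number of records written.
    return SeqIO.write(cleaned, out_path, "fasta")
```

With the file names from the question, this would be called as filter_fasta("test.fasta", "test_out"). The record IDs and descriptions are carried over unchanged because only record.seq is replaced.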

Inerrable answered 13/7, 2017 at 14:45
