Biopython parse from variable instead of file
import gzip
import io
from Bio import SeqIO

infile = "myinfile.fastq.gz"
fileout = open("myoutfile.fastq", "w+")
with io.TextIOWrapper(gzip.open(infile, "r")) as f:
    line = f.read()
fileout.write(line)
fileout.seek(0)

count = 0
for rec in SeqIO.parse(fileout, "fastq"): #parsing from file
    count += 1
print("%i reads" % count)

The above works when "line" is written to a file and that file is fed to the parser, but the code below does not work. Why can't "line" be read directly? Is there a way to feed "line" straight to the parser without having to write it to a file first?

infile = "myinfile.fastq.gz"
#fileout = "myoutfile.fastq"
with io.TextIOWrapper(gzip.open(infile, "r")) as f:
    line = f.read()
#myout.write(line)

count = 0
for rec in SeqIO.parse(line, "fastq"): #line used instead of writing from file
    count += 1
print("%i reads" % count)
Pipe answered 13/7, 2016 at 17:28 Comment(0)

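It's because SeqIO.parse only accepts a file handle or a filename as its first argument. A plain string such as "line" is treated as a filename, not as file contents, so the parser tries to open a file with that name instead of parsing the text.

If you want to read a gzipped file directly into SeqIO.parse, just pass it an open handle: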

import gzip
from Bio import SeqIO

count = 0
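# "rt" opens the gzip file in text mode; the default is binary, which the FASTQ parser rejects on Python 3
with gzip.open("myinfile.fastq.gz", "rt") as f: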
    for rec in SeqIO.parse(f, "fastq"):
        count += 1

print("{} reads".format(count))
Matronna answered 13/7, 2016 at 19:43 Comment(1)
This worked. I just needed to add io.TextIOWrapper, making the "with" line: with io.TextIOWrapper(gzip.open(infile, "rb")) as f: - Pipe
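
The fix in the comment comes down to text versus binary mode: gzip.open defaults to binary mode and returns bytes, while a text parser expects str. As a rough illustration that uses only the standard library (the one-record FASTQ content and the in-memory buffer are made up for the example, and no Biopython call is involved):

```python
import gzip
import io

# Hypothetical one-record FASTQ, made up for illustration.
fastq_text = "@read1\nACGT\n+\nIIII\n"

# Compress it into an in-memory buffer instead of a real .fastq.gz file.
buffer = io.BytesIO()
with gzip.open(buffer, "wt") as gz:
    gz.write(fastq_text)

# Default mode is binary ("rb"): lines come back as bytes.
buffer.seek(0)
with gzip.open(buffer) as f:
    binary_line = f.readline()

# Text mode ("rt"): lines come back as str.
buffer.seek(0)
with gzip.open(buffer, "rt") as f:
    text_line = f.readline()

# Wrapping the binary handle in io.TextIOWrapper gives the same str result.
buffer.seek(0)
with io.TextIOWrapper(gzip.open(buffer, "rb")) as f:
    wrapped_line = f.readline()

print(type(binary_line).__name__, type(text_line).__name__, type(wrapped_line).__name__)
```

Either gzip.open(..., "rt") or the io.TextIOWrapper wrapper should give SeqIO.parse a text handle; which one to use is mostly a matter of taste.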

Just to add to the other answer: if your input sequence comes from something other than a file (e.g. a web query), you can use io.StringIO to simulate a file-like object. A StringIO object behaves like a file handle, but reads from and writes to a memory buffer. The input to StringIO() should be a string, not another file or file handle.

import gzip
import io
from io import StringIO
from Bio import SeqIO

infile = "myinfile.fastq.gz"
with io.TextIOWrapper(gzip.open(infile, "r")) as f:
    line = f.read()

fastq_io = StringIO(line)
# SeqIO.parse is lazy, so read the records before closing the handle
records = list(SeqIO.parse(fastq_io, "fastq"))
fastq_io.close()
#Do something to sequence records here

It is worth noting that closing a StringIO object discards its memory buffer, so if you create a lot of them, .close() each one once you are done with it rather than letting them pile up in memory. With this in mind, it is probably best practice to use them within a with ... as ...: block:

with StringIO(line) as fastq_io:
    # Consume the lazy parser while the handle is still open
    records = list(SeqIO.parse(fastq_io, "fastq"))

#Do something to sequence records here
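
One thing to watch with the with block: SeqIO.parse returns a lazy iterator, so the records have to be consumed while the handle is still open. The sketch below shows the same pitfall using only the standard library; iter_lines is a hypothetical stand-in for a lazy parser, not a Biopython function, and the FASTQ text is made up.

```python
from io import StringIO


def iter_lines(handle):
    """Hypothetical stand-in for a lazy parser: yields lines only when iterated."""
    for line in handle:
        yield line.rstrip("\n")


text = "@read1\nACGT\n+\nIIII\n"

# Consuming inside the block works because the handle is still open.
with StringIO(text) as handle:
    inside = list(iter_lines(handle))

# Creating the generator inside the block but consuming it afterwards fails,
# because the handle has already been closed by the time it is read.
with StringIO(text) as handle:
    lazy = iter_lines(handle)

try:
    outside = list(lazy)
except ValueError as err:
    outside = None
    print("failed after close:", err)

print(inside)
```

Calling list(...) inside the block trades memory for safety; for very large inputs it may be preferable to do the per-record work inside the with block instead.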

I've used this technique a fair bit when getting sequence data from web services and I don't want to write it to a temporary file.
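
For the web-service case, a minimal sketch of the in-memory route might look like this. The payload is simulated (no real service is queried), the data is assumed to be gzip-compressed ASCII FASTQ with strict 4-line records, and the line count is only a dependency-free stand-in for passing the handle to SeqIO.parse:

```python
import gzip
from io import StringIO

# Simulated response body from a hypothetical web service:
# gzip-compressed FASTQ bytes (two made-up records).
payload = gzip.compress(b"@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")

# Decompress and decode entirely in memory; no temporary file is written.
fastq_text = gzip.decompress(payload).decode("ascii")

with StringIO(fastq_text) as handle:
    # With Biopython installed, handle could be passed to
    # SeqIO.parse(handle, "fastq") here. As a stand-in, count the lines,
    # assuming every record is exactly 4 lines long.
    n_lines = sum(1 for _ in handle)

n_records = n_lines // 4
print("%i reads" % n_records)
```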

Inviting answered 14/7, 2016 at 12:34 Comment(0)