Reading in a file block by block using a specified delimiter in Python

I have an input_file.fa file like this (FASTA format):

> header1 description
data data
data
>header2 description
more data
data
data

I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1:

> header1 description
data data
data

Of course I could just read in the file like this and split:

with open("input_file.fa") as f:
    for block in f.read().split(">"):
        pass

But I want to avoid reading the whole file into memory, because the files are often large.

I can read in the file line by line of course:

with open("input_file.fa") as f:
    for line in f:
        pass

But ideally what I want is something like this:

with open("input_file.fa", newline=">") as f:
    for block in f:
        pass

But I get an error:

ValueError: illegal newline value: >
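
For context, the newline argument of open() only accepts None, '', '\n', '\r' and '\r\n', which is why '>' is rejected. The sketch below reproduces that check on an in-memory stream; accepts_newline is a hypothetical helper name, and io.TextIOWrapper is used directly so that no file on disk is needed.

```python
import io


def accepts_newline(value):
    """Return True if the text layer accepts `value` as its newline argument."""
    try:
        # open() builds a TextIOWrapper for text-mode files; the newline
        # argument is validated when the wrapper is created.
        io.TextIOWrapper(io.BytesIO(b""), newline=value)
    except ValueError:
        return False
    return True


candidates = (None, "", "\n", "\r", "\r\n", ">")
results = {repr(value): accepts_newline(value) for value in candidates}
for name, ok in results.items():
    print(name, "accepted" if ok else "rejected")
```

So the newline parameter cannot act as a general record separator; splitting on another delimiter has to happen in user code.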

I've also tried using the csv module, but with no success.

I did find this post from 3 years ago, which provides a generator-based solution to this issue, but it doesn't seem that compact. Is this really the only/best solution? It would be neat if the generator could be created in a single line rather than with a separate function, something like this pseudocode:

with open("input_file.fa") as f:
    blocks = magic_generator_split_by_>
    for block in blocks:
        pass

If this is impossible, then I guess you could consider my question a duplicate of the other post, but if that is so, I hope people can explain to me why the other solution is the only one. Many thanks.

Drum answered 29/7, 2016 at 9:25 Comment(2)
Have you tried using biopython.org/wiki/Biopython?Spaceship
@AshwiniChaudhary Thank you, good idea, that should help for this case, but ideally I'd also like a generic solution that would work beyond biological sequence data formats.Drum

A general solution here is to write a generator function that yields one group at a time. This way you will be storing only one group at a time in memory.

def get_groups(seq, group_by):
    data = []
    for line in seq:
        # Here the `startswith()` logic can be replaced with other
        # condition(s) depending on the requirement.
        if line.startswith(group_by):
            if data:
                yield data
                data = []
        data.append(line)

    if data:
        yield data

with open('input.txt') as f:
    for i, group in enumerate(get_groups(f, ">"), start=1):
        print("Group #{}".format(i))
        print("".join(group))

Output:

Group #1
> header1 description
data data
data

Group #2
>header2 description
more data
data
data
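
A quick way to check the generator without a file on disk is to feed it an in-memory stream. The sketch below assumes the sample data from the question and uses io.StringIO as a stand-in for the open file; any iterable of lines should behave the same way.

```python
import io


def get_groups(seq, group_by):
    """Yield lists of lines, starting a new list at each line that begins with group_by."""
    data = []
    for line in seq:
        if line.startswith(group_by) and data:
            yield data
            data = []
        data.append(line)
    if data:
        yield data


# Sample input copied from the question; StringIO iterates line by line like a file.
sample = io.StringIO(
    "> header1 description\n"
    "data data\n"
    "data\n"
    ">header2 description\n"
    "more data\n"
    "data\n"
    "data\n"
)

groups = list(get_groups(sample, ">"))
for number, group in enumerate(groups, start=1):
    print("Group #{}: {} lines, header {!r}".format(number, len(group), group[0].strip()))
```

Calling list() here is only for the demonstration; when iterating over a large file, loop over the generator directly so that one group at a time is held in memory.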

For FASTA formats in general I would recommend using the Biopython package.

Spaceship answered 29/7, 2016 at 10:46 Comment(2)
Although this still isn't quite the idea I had in mind, I think it's a good practical solution to the issue (better than the one in the other post), so I'm going to mark as solved, thanks for your help.Drum
You can replace data = [line] with data = [] and move data.append(line) outside the outer if; that removes the else branches and avoids the duplicated call.Emlyn

One approach that I like is to use itertools.groupby together with a simple key function:

from itertools import groupby


def make_grouper():
    counter = 0
    def key(line):
        nonlocal counter
        if line.startswith('>'):
            counter += 1
        return counter
    return key

Use it as:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        fasta_section = ''.join(group)   # or list(group)

You need the join only if you have to handle the contents of a whole section as a single string. If you are only interested in reading the lines one by one you can simply do:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        # parse >header description
        header, description = next(group)[1:].split(maxsplit=1)
        for line in group:
            # handle the contents of the section line by line
            pass
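
The groupby pattern can also be wrapped in a small generator so that callers receive one (header, lines) pair per section. This is only a sketch: iter_sections is a hypothetical name, it assumes the input starts with a marker line, and it keeps the header as a single string so that headers without a description do not break the unpacking.

```python
import io
from itertools import groupby


def iter_sections(lines, marker=">"):
    """Yield (header, body_lines) pairs; `iter_sections` is a hypothetical helper."""
    counter = 0

    def key(line):
        # Bump the counter on every header line so groupby starts a new group.
        nonlocal counter
        if line.startswith(marker):
            counter += 1
        return counter

    for _, group in groupby(lines, key=key):
        # Assumes the first line of each group is a header line.
        header = next(group)[len(marker):].strip()
        # Materialise the body before groupby advances to the next group.
        yield header, [line.rstrip("\n") for line in group]


# In-memory stand-in for an open file; the second header has no description.
sample = io.StringIO(">h1 first\nAAA\nCCC\n>h2\nGGG\n")
sections = list(iter_sections(sample))
print(sections)
```

Each body is collected into a list because a groupby group becomes invalid once the outer iterator moves on.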
Emlyn answered 29/7, 2016 at 16:18 Comment(2)
Thank you for your thoughtful response. I will have to look more carefully to fully understand the code. Do you think there is a particular advantage of this approach over the other ones suggested?Drum
@Drum This approach is basically equivalent to Ashwini's answer, but uses itertools to avoid manually grouping the lines. It's a matter of taste.Emlyn
def read_blocks(file):
    block = ''
    for line in file:
        if line.startswith('>') and block:
            yield block
            block = ''
        block += line
    # Skip the final yield for an empty file, which would otherwise give ''.
    if block:
        yield block


with open('input_file.fa') as f:
    for block in read_blocks(f):
        print(block)

This reads the file line by line and yields each block as soon as it is complete. Because the generator is lazy, only one block is held in memory at a time, so large files are not a problem.
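
For a delimiter that does not have to sit at the start of a line, a similar lazy idea is to read fixed-size chunks and split on the delimiter. The following is a minimal sketch, not a tuned implementation: split_stream and chunk_size are names chosen for illustration, the delimiter is dropped from the pieces, and the result mirrors str.split (so a leading delimiter produces an empty first piece).

```python
import io


def split_stream(stream, delimiter, chunk_size=8192):
    """Lazily yield the pieces of a text stream separated by `delimiter`."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; keep the tail,
        # which may hold an unfinished piece or a partial delimiter.
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    # Whatever remains after the last delimiter is the final piece.
    yield buffer


# A tiny chunk size forces pieces to span several reads.
text = ">h1\nAA\n>h2\nCC\n"
pieces = list(split_stream(io.StringIO(text), ">", chunk_size=3))
print(pieces)
```

With a real file this would be used as `for block in split_stream(f, ">"):` inside the with block. Memory use is bounded by the largest single piece plus one chunk.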

Gudgeon answered 29/7, 2016 at 10:52 Comment(0)
