Split a multifasta file to files with the same number of accesion numbers

Asked 25/7, 2021 at 19:36 Answered 30/7, 2021 at 14:29

I have a file that has thousands of accession numbers:

and looks like this..

>NC_033829.1 Kallithea virus isolate DrosEU46_Kharkiv_2014, complete genome
AGTCAGCAACGTCGATGTGGCGTACAATTTCTTGATTACATTTTTGTTCCTAACAAAATGTTGATATACT

>NC_020414.2 Escherichia phage UAB_Phi78, complete genome
TAGGCGTGTGTCAGGTCTCTCGGCCTCGGCCTCGCCGGGATGTCCCCATAGGGTGCCTGTGGGCGCTAGG

If want to split this to multiple files with one accession number each then I can use the following code

awk -F '|' '/^>/ {F=sprintf("%s.fasta",$2); print > F;next;} {print >> F;}' < yourfile.fa

I have a file with thousands of accession numbers (aka >NC_*) and want to split it such as each files contains ~ 5000 accession numbers. Since I am new to awk/bash/python i struggle to find a neat solution

Any idea or comment are appreciated

Bays answered 25/7, 2021 at 19:36 Comment(0)

It wasn't clear from your question that an "accession number" is unique per input block (don't assume the people reading your question know anything about your domain - it's all just lines of text to us). It would have been clearer if you had phrased your question to just say you want 5000 new-line-separated blocks per output file rather than 5000 accession numbers.

Having seen the answer you posted, it's now clear that this is what you should be using:

awk -v RS= -v ORS='\n\n' '
    (NR%5000) == 1 { close(out); out="myseq"(++n_seq)".fa" }
    { print > out }
' my_sequences.fa

Deliberative answered 27/7, 2021 at 12:39 Comment(1)

Thank you @Ed! This is so helpful and well explained. I appreciate your time – Bays 27/7, 2021 at 12:51

Assumptions: sections are separated by empty lines.

Algorithm:

split file on sections
extract accession number from section
output section to a filename named with accession number.

Awk terms: a "record" will be our section - part of file separated by empty line (i.e. two newline characters one after another. A "field" is usually separated by spaces - by separating by space or > character second field will be accession number.

Just set record separator to two newlines and field separator to > or space and then output the line to a filenamed named with second field:

awk -v RS='' -v FS='[> ]' '{f=($2 ".txt"); print >> f; close(f)}'

@edit changed > to >> and RS='\n\n' to RS=''

@edit and also added close

Diphenyl answered 25/7, 2021 at 20:11 Comment(4)

still I think I have not understood how to maintain several accession numbers in the same file. Lets say I had four NC how can I split my files so as to have two NCs to the one and two NC to the other. Which parameter I change. I am sorry for my stupid question – Bays 25/7, 2021 at 20:54

@Bays are you saying that the answer you accepted doesn't actually do what you want? It will do what you asked for in your question so you should ask a new question if what you asked for isn't what you actually wanted. – Deliberative 26/7, 2021 at 13:13

Dear @EdMorton, yes exactly. I worked to add a solution. However thank you so muck for your time. I have learnt a lot and upvoted your answer – Bays 27/7, 2021 at 8:35

@Bays you're welcome but I hadn't posted an answer. I have now. – Deliberative 27/7, 2021 at 12:46

Best to use Biopython's Bio.SeqIO to handle the reading and writing of FASTA files. Then all you need is some way to group the records (SeqRecord objects) as desired. My preference is to have the grouping function yield iterators:

from itertools import chain, islice

from Bio import SeqIO


def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk_it = islice(it, n)
        try:
            first = next(chunk_it)
        except StopIteration:
            return
        yield chain((first,), chunk_it)


for idx, group in enumerate(grouper(5000, SeqIO.parse('input.fa', 'fasta')), 1):
    SeqIO.write(group, f'out-{idx}.fa', 'fasta')

Diannadianne answered 30/7, 2021 at 14:29 Comment(2)

Why do you need the grouper function here? can you use itertools.islice() directly? – Milt 4/8, 2021 at 18:53

@Milt Well itertools.islice() just returns an iterator over the selected elements. What's needed is some way to iterate over all elements, hence the function. – Diannadianne 5/8, 2021 at 11:29

Thank you a lot for your answers I have learnt a lot. What I really want to do is to split the multifasta file in files with the same number of accession numbers. Here is my answer after a long battle and the help of a colleague

awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%5000==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < my_sequences.fa

Here you create new fasta files where each file has 5000 accession numbers aka headers

thank you all

Bays answered 27/7, 2021 at 8:27 Comment(1)

You don't need the BEGIN section, you shouldn't have identical print statements with a next between them as that's redundant code, it will fail with "too many open files" in some awks if your input file is large and in gawk it'll slow down as you aren;'t closing the output files as you go and redirecting input from shell isn't necessary, awk is perfectly capable of opening an input file and then it can use FILENAME. – Deliberative 27/7, 2021 at 12:56

Recommended topics

Hot tags