extract sequences from multifasta file by ID in file using awk

Asked 9/4, 2018 at 11:1 Answered 13/6, 2024 at 16:18

Solved search awk bioinformatics multiline fasta

I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.

FASTA file seq.fasta:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT

IDs file id.txt:

7P58X:01332:11636
7P58X:01334:11613

I want to get the fasta file with only those sequences matching the IDs in the id.txt file:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC

I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:

(1)

awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta

this code works well for the multiline sequences but IDs have to be inserted separately to the code.

(2)

awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta

this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.

I guess that the good thing would be to modify the RS variable in the code (2) but all of my attempts failed so far. Can, please, anybody help me with that?

Kalin answered 9/4, 2018 at 11:1 Comment(2)

awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data https://mcmap.net/q/1011130/-why-is-grep-so-slow-and-memory-intensive-with-w-word-regexp-flag – Momently 9/4, 2018 at 11:33

I would use bioawk, but my approach "inserts the variable separately", which is probably not optimal:

for seq_id in $(cat id.txt); do bioawk -c fastx -v seq_id="${seq_id}" '$name == seq_id {print ">"$name"\n"$seq}' seq.fasta; done

– Tangerine 11/4, 2018 at 15:31

$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC

Sanborn answered 9/4, 2018 at 16:47 Comment(2)

How should this be modified to accept a piped seq.fasta file from a previous step? – Godunov 8/4, 2020 at 15:16

whatever | awk '...' id.txt - – Sanborn 8/4, 2020 at 15:18

Following awk may help you on same.

awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt  fasta_file

Spiller answered 9/4, 2018 at 11:6 Comment(0)

I'm facing a similar problem. The size of my multi-fasta file is ~ 25G. I use sed instead of awk, though my solution is an ugly hack.

First, I extracted the line number of the title of each sequence to a data file.

grep -n ">" multi-fasta.fa > multi-fasta.idx

What I got is something like this:

1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009

Then, I extracted the wanted sequence by its title, eg. DM_0000000004, using the scripts below.

seqnm=$1
idx0_idx1=`grep -n $seqnm multi-fasta.idx`

idx0=`echo $idx0_idx1 | cut -d ":" -f 1`
idx0plus1=`expr $idx0 + 1`

idx1=`echo $idx0_idx1 | cut -d ":" -f 2`

idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`
idx2minus1=`expr $idx2 - 1`

sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta

For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:

7507:42520:>DM_0000016115

7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.

42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.

idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).

At last, using sed, we can extract the lines between idx1 and idx2 minus 1, which are the title and the sequence, in which case you can use grep -A.

The advantage of this ugly-hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.

What bothers me is this process is slow. For my 25G multi-fasta file, such extraction takes tens of seconds. However, it's much faster than using samtools faidx .

Usurp answered 2/6, 2022 at 7:14 Comment(0)

Just a note here about the solution for those of you still having problems:

Be sure to first use dos2unix id.txt to get rid of the hidden ^M Windows inserts at the end of each line.

Also, this solution requires that the names on id.txt don't include > at the start of each line.

Handkerchief answered 13/6, 2024 at 16:18 Comment(0)

Recommended topics

Hot tags