Use AWK to search through fasta file, given a second file containing sequence names

Asked 20/7, 2016 at 21:45 Answered 20/7, 2016 at 22:27

I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).

seq.fasta

>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC

name.txt

Clone_23
Clone_27-1

I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.

awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta

The problem is that I can only extract the sequence of first candidate in name.txt, like this

>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA

Can anyone help to fix one-line awk command above?

Rachellrachelle answered 20/7, 2016 at 21:45 Comment(5)

You radically changed the question after answers have been posted. I've rolled that back. Please post a new question in that case. (And show what you've tried to adapt from the answers you got here) – Wrecker 20/7, 2016 at 22:58

I can only post 1 question in 90 mins. Can I post new example in Answer session? – Rachellrachelle 20/7, 2016 at 23:6

Well, it actually recommends using comments or re-edit question (since I need to use format to show the example) – Rachellrachelle 20/7, 2016 at 23:7

No that would be deleted by moderators. What about using the 90mins to think about a solution on your own? I guess they are meant for that. Actually it's just 30mins, since you asked this an hour ago. – Wrecker 20/7, 2016 at 23:8

I'd recommend using the time to come up with a truly representative example so we don't waste more time trying to help you solve a problem you don't have. – Elysium 20/7, 2016 at 23:20

If it is ok or even desired to print the name as well, you can simply use grep:

grep -Ff name.txt -A1 a.fasta

-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
A1 prints the matching line plus the subsequent line

If the names are not desired in output I would simply pipe to another grep:

above_command | grep -v '>'

An awk solution can look like this:

awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta

Better explained in a multiline version:

# True as long as we are reading the first file, name.txt
NR==FNR {
    # Store the names in the array 'n'
    n[$0]
    next
}

# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline

Wrecker answered 20/7, 2016 at 21:49 Comment(12)

Hmm, I wonder what that will do if/when the getline fails... ;-). If you MUST use getline (hint - you don't here) then at least protect yourself from failures with if ( (getline line) > 0 ) print line. See awk.freeshell.org/AllAboutGetline. – Elysium 20/7, 2016 at 22:22

Hey. I was thinking about that but I thought it is not an issue since if getline fails in this case, the fasta format is broken and it indeed does not contain the sequence. Isn't it? – Wrecker 20/7, 2016 at 22:31

Ok, I see. In that case it would print the name which is not desired.. Let's assume the fasta file is correct! ;) – Wrecker 20/7, 2016 at 22:34

getline can fail for more reasons than just reaching the end of the file, e.g. a simple one would be someone removes the input file in mid-processing. In a scenario like that you'd think the script would just stop processing and produce no more output but that's not what your script will do if you're using getline, it'll duplicate your previously read input. There's other weird stuff that makes it usually not the best choice. – Elysium 20/7, 2016 at 22:35

Wouldn't the kernel keep the file in memory as long as the file descriptor is open? When somebody overwrites the file it might be more problematic. But I'm not 100% certain about that. Need to refresh my knowledge there. But anyhow, your point is valid. Let me fix that. – Wrecker 20/7, 2016 at 22:36

Idk, maybe that's a bad example, I was just trying to think of something simple that could cause a failure. I know it CAN fail, I'm just not sure of the scenarios. You also now might have to deal with a case where the FASTA file has 2 consecutive >name lines (is that a "broken" file? idk..) - as written if the first name matches a target then your script will skip over the second name even if it matches one of the targets. You also have to consider what happens if you have a 2nd file to process and need an FNR==1 test - the getline can cause awk to jump right over it. Too much to consider... – Elysium 20/7, 2016 at 22:37

@EdMorton Updated it. Thanks for your help! I like to put the getline > 0 right into the condition. awk rocks! :) – Wrecker 20/7, 2016 at 22:43

Sounds good. My main aversion to getline is that when I see/write a script using it then I have to mentally go through all of the checklists of side-effects to see if any of them can happen with negative consequences in the current application. Usually I end up deciding "it'll probably be OK" and about 50% of the time I'm right about that :-) but mainly I'd rather just not have to think about it so find it easier not to use it except in a couple of applications where it's clearly the right solution (e.g. a recursive descent parser). – Elysium 20/7, 2016 at 22:46

Good point. If the programmer needs to think about too much things he might forget some. A solution with as few as possible side-effects is definitely often the better way. Especially if somebody who didn't wrote the initial code has to modify it later. Didn't had a strong opinion about getline so far. I'm using it rarely. But I see your points. – Wrecker 20/7, 2016 at 22:51

Looks like the OP just changed their requirements such that the getline solution would now be completely inappropriate - that's the other reason to avoid getline, the slightest requirements change and you're back at square 1. – Elysium 20/7, 2016 at 23:16

I see. Putting it in a loop, which would be my first intention here, would necessarily soak up the next >NAME line as well, meaning it doesn't really work in this form. – Wrecker 21/7, 2016 at 7:43

Exactly. You'd need to duplicate the test for > that's already in the main body after the call to getline. It's evil, that's all I'm sayin' ;-). – Elysium 21/7, 2016 at 12:1

$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC

Elysium answered 20/7, 2016 at 22:27 Comment(2)

If one sequence is broken into multiple lines, the command above can only read first line. That was why I changed default RS. See changed example above – Rachellrachelle 20/7, 2016 at 22:53

I don't see any changes in your example. It'd be trivial to tweak this solution to work for multi-line records but I'm not inclined to do so as I can't imagine why you wouldn't have included that info in your question and in your example in the first place though - I'd hope it's obvious that to parse a file we need to know the format of the contents of the file! Do you have any other surprises in store for use like multiple consecutive > lines or.... Post a new question with accurate input/output and we'll take it from there. – Elysium 20/7, 2016 at 23:15

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags