In short: how do I convert from FASTA to a "PHYLIP"-like format (without the sequence and residue counts at the top of the file) using sed?
A FASTA file looks like this:
>sequence1
AATCG
GG-AT
>sequence2
AGTCG
GGGAT
The number of lines per sequence may vary.
I want to convert it to this:
sequence1 AATCG GG-AT
sequence2 AGTCG GGGAT
My question seems simple, but I lack a real understanding of the advanced commands in sed: the multiline commands and the commands that use the hold buffer.
Here is the implementation idea I had: fill the pattern space with the sequence, and only print it when a new sequence label is encountered. To do this, I would:
- Search for lines matching ^>. If found:
  - print the previous pattern space
  - append the line to the pattern space
- If ^> is not found:
  - append the line to the pattern space
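For illustration, the idea above could be sketched with the pattern space alone, using a loop around N plus P and D. This is only a sketch: it assumes the first line of the input is a header, that every header is followed by at least one sequence line, and that the sed in use accepts \n in regular expressions. The variable name result is just for the demonstration.

```shell
# Sample input piped in with printf; same content as the FASTA example above.
result=$(printf '>sequence1\nAATCG\nGG-AT\n>sequence2\nAGTCG\nGGGAT\n' | sed '
:a
# Keep appending input lines until the appended line is a header
# or the end of the input is reached.
$!{
  N
  /\n>/!ba
}
# Strip the ">" from the label at the start of the record.
s/^>//
# The pattern space ends with the NEXT header: join every newline except
# the last one, print the finished record with P, then let D drop it and
# restart the cycle with the new header still in the pattern space.
/\n>/{
  :b
  s/\n\(.*\n\)/ \1/
  tb
  P
  D
}
# Last record: join all lines; the automatic print at end of cycle emits it.
s/\n/ /g
')
printf '%s\n' "$result"
```

Here P prints only the completed record (everything before the first remaining newline), and D removes it and restarts the script without reading a new line, so the loop serves the role of "print when a new label is encountered".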
I read this great manual, but I am still unsure about a few things, mostly the difference between the uppercase and lowercase commands:
- when you use P instead of p: does it print the first line of the pattern space (in file order)? I am confused by the use of "up to the next newline".
- do I have to use a loop to read lines until the next sequence name, or are the multiline commands sufficient?
- do I have to use the hold space in this example?
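A small experiment may help with the first of these points. The sketch below loads two lines into the pattern space with N and then compares p and P; the input words and variable names are made up for the demonstration.

```shell
# N appends the second input line, so the pattern space is "first\nsecond".
# p prints the whole pattern space, embedded newline included.
lower=$(printf 'first\nsecond\n' | sed -n 'N;p')
# P prints only the part before the first embedded newline.
upper=$(printf 'first\nsecond\n' | sed -n 'N;P')

printf 'p prints:\n%s\n' "$lower"
printf 'P prints:\n%s\n' "$upper"
```

In other words, "up to the next newline" means the earliest line still held in the pattern space, which is the first one in file order when lines were added with N.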
I know Python, Perl and awk, and I think they would be more "human-friendly" tools for this, but I want to learn some advanced sed.
Nothing I have tried has worked so far, but here are some pieces:
This script uses line numbers rather than pattern matching. It shows what I want to do; now I need to automate it using match addresses:
#!/bin/sed -nf
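# Line 1 is the first header: start the hold space with it.
1h
# Lines 2-3 are sequence lines: append them to the hold space.
2,3H
# Line 4 is the next header: swap it in, then clean up and print the finished record.
4{x; s/^>//; s/\n/ /g; p}
5H
# Line 6 is the last line: append it, then print the final record.
6{H;x; s/^>//; s/\n/ /g; p}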
sed -nf fa2phy.sed my.fasta
returns the expected output.
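One possible way to replace the line numbers with match addresses, keeping the same h/H/x logic, is sketched below. It assumes the input starts with a header and ends with a sequence line (a trailing header with no sequence would not be printed). The variable name result is only for the demonstration.

```shell
result=$(printf '>sequence1\nAATCG\nGG-AT\n>sequence2\nAGTCG\nGGGAT\n' | sed -n '
# Sequence line: append it to the hold space and, unless it is the
# last line of the input, end the cycle here.
/^>/!{
  H
  $!d
}
# Reached on a header line or on the last line: swap, so the finished
# record moves to the pattern space. On a header line the new header
# becomes the start of the hold space.
x
# On line 1 the swapped-in pattern space is empty, so nothing is printed.
/^>/{
  s/^>//
  s/\n/ /g
  p
}
')
printf '%s\n' "$result"
```

The address /^>/! plays the role of lines 2, 3 and 5 in the numbered script, and the $ test covers the special handling of line 6.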