sed convert multiline bloc to single line (ex: fasta to phylip format)

Asked 24/10, 2017 at 10:40 Answered 24/10, 2017 at 12:38

In short:

how to convert from fasta to "phylip"-like format (without the sequence and residu counts at the top of the file) using sed ?

A fasta format is like this:

>sequence1
AATCG
GG-AT
>sequence2
AGTCG
GGGAT

The number of lines of a sequence may vary.

I want to convert it to this:

sequence1 AATCG GG-AT
sequence2 AGTCG GGGAT

My question seems simple, but I am lacking a real understanding of the advanced commands in sed, the multiline commands and the commands using the hold buffer.

Here is the implementation idea I had: fill the pattern space with sequence, and only print it when a new sequence label is encountered. To do this, I would:

Search lines matching ^>. If found:
- print the previous pattern space
- append line to pattern space
if ^> not found:
- append line to pattern space

I read this great manual, but I am still unsure about a few things, mostly the difference between the capitalized and little letters:

when you use P instead of p: does it print the first line of the pattern space (in file order)? I am confused by the use of "up to the next newline".
do I have to use a loop to read lines until the next sequence name, or are the multiline commands sufficient?
do I have to use the hold space in this example?

I know python, perl and awk and I think they would be more "human-friendly" tools to achieve this, but I want to learn some advanced sed.

Nothing I tried worked now, but here are some pieces:

This script uses the line numbers, not trying to do pattern matching. It shoes what I want to do, and now I need to automate it using match addresses:

#!/bin/sed -nf
1h
2,3H
4{x; s/\n/ /g; p}
5H
6{H;x; s/\n/ /g; p}

sed -nf fa2phy.sed my.fasta returns the expected output.

Tacy answered 24/10, 2017 at 10:40 Comment(2)

it's easier and more robust to do it with Awk ... good luck with sed – Ammoniate 24/10, 2017 at 10:42

also, you should probably use bio-related tools for such problems - bioperl/biopython/bioawk/etc... for ex: bioinformatics.stackexchange.com/questions/91/… and #15398172 – Mong 24/10, 2017 at 10:47

With sed

sed '/>/N;:A;/\n>/!{s/\n/ /;N;bA};h;s/\(.*\)\n.*/\1/p;x;s/.*\n//;bA' infile

Lancewood answered 24/10, 2017 at 12:38 Comment(3)

Thanks, great codegolf answer ;) However, as I begin to understand, s/$.*$\n.*/\1/p would be equivalent to P and s/.*\n// to D, so we could even write sed '/>/N;:A;/\n>/!{s/\n/ /;N;bA};h;P;x;D;bA' infile. – Tacy 26/10, 2017 at 9:30

And actually, from my tests, the h and x are not needed, as well as the starting />/N, so it seems that the command can be stripped down to: sed ':A;/\n>/!{s/\n/ /;N;bA};P;D;bA' infile – Tacy 26/10, 2017 at 10:9

Actually, I was wrong... P and D are not equivalent to the patterns I said, because of greedy matching. It works here because there are always exactly 2 lines in the pattern space. – Tacy 27/10, 2017 at 16:58

Following simple awk could help you in same.

Solution 1st:

awk '/^>/{sub(/>/,"");if(val){print val, val2};val=$0;val2="";next} {val2=val2?val2 FS $0:$0} END{print val, val2}'  Input_file

Solution 2nd:

awk -v RS=">" -v FS="\n" '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":" ")}}'   Input_file

Solution 3rd:

awk -v RS=">" '{gsub(/\n/," ");} NF'   Input_file

Musgrave answered 24/10, 2017 at 10:47 Comment(4)

you'll do it much easily by manipulating RS and re-aligning fields – Mong 24/10, 2017 at 10:48

@Sundeep, thanks sir, I added 2 more solutions too now(related to setting RS in code), cheers :) – Musgrave 24/10, 2017 at 10:57

no need gsub, default FS is white-space including newline and default OFS is single space... awk -v RS='>' 'NR>1{$1=$1; print}' – Mong 24/10, 2017 at 11:4

These are interesting, useful, and quite challenging lines of awk, thanks! I like the challenge of using sed though ;) – Tacy 24/10, 2017 at 12:2

Alright, I believe I managed to answer my own question.

Here is the script I made: fa2phy.sed:

#!/bin/sed -nf

:readseq
${H;b out}              # if last line, append to hold, and goto 'out'
1{h;n;b readseq}        # if first, overwrite hold, and start again at 'readseq'
/^>/!{H; n; b readseq}  # if not a sequence label, append to hold, read next line, start again at 'readseq'. Else, it continues to 'out'

:out
x;         # exchange hold content with pattern content
s/^>//;    # substitute the starting '>'
s/\n/  /g; # substitute each newline with 2 spaces
p;         # print pattern buffer

Although it works, if someone has a shorter or clearer solution, enlighten me! :)

Tacy answered 24/10, 2017 at 11:56 Comment(0)

Recommended topics

Hot tags