sed convert multiline bloc to single line (ex: fasta to phylip format)
Asked Answered
T

3

1

In short:

how to convert from fasta to "phylip"-like format (without the sequence and residu counts at the top of the file) using sed ?

A fasta format is like this:

>sequence1
AATCG
GG-AT
>sequence2
AGTCG
GGGAT

The number of lines of a sequence may vary.

I want to convert it to this:

sequence1 AATCG GG-AT
sequence2 AGTCG GGGAT

My question seems simple, but I am lacking a real understanding of the advanced commands in sed, the multiline commands and the commands using the hold buffer.

Here is the implementation idea I had: fill the pattern space with sequence, and only print it when a new sequence label is encountered. To do this, I would:

  1. Search lines matching ^>. If found:
    • print the previous pattern space
    • append line to pattern space
  2. if ^> not found:
    • append line to pattern space

I read this great manual, but I am still unsure about a few things, mostly the difference between the capitalized and little letters:

  • when you use P instead of p: does it print the first line of the pattern space (in file order)? I am confused by the use of "up to the next newline".
  • do I have to use a loop to read lines until the next sequence name, or are the multiline commands sufficient?
  • do I have to use the hold space in this example?

I know python, perl and awk and I think they would be more "human-friendly" tools to achieve this, but I want to learn some advanced sed.


Nothing I tried worked now, but here are some pieces:

This script uses the line numbers, not trying to do pattern matching. It shoes what I want to do, and now I need to automate it using match addresses:

#!/bin/sed -nf
1h
2,3H
4{x; s/\n/ /g; p}
5H
6{H;x; s/\n/ /g; p}

sed -nf fa2phy.sed my.fasta returns the expected output.

Tacy answered 24/10, 2017 at 10:40 Comment(2)
it's easier and more robust to do it with Awk ... good luck with sedAmmoniate
also, you should probably use bio-related tools for such problems - bioperl/biopython/bioawk/etc... for ex: bioinformatics.stackexchange.com/questions/91/… and #15398172Mong
L
1

With sed

sed '/>/N;:A;/\n>/!{s/\n/ /;N;bA};h;s/\(.*\)\n.*/\1/p;x;s/.*\n//;bA' infile
Lancewood answered 24/10, 2017 at 12:38 Comment(3)
Thanks, great codegolf answer ;) However, as I begin to understand, s/\(.*\)\n.*/\1/p would be equivalent to P and s/.*\n// to D, so we could even write sed '/>/N;:A;/\n>/!{s/\n/ /;N;bA};h;P;x;D;bA' infile.Tacy
And actually, from my tests, the h and x are not needed, as well as the starting />/N, so it seems that the command can be stripped down to: sed ':A;/\n>/!{s/\n/ /;N;bA};P;D;bA' infileTacy
Actually, I was wrong... P and D are not equivalent to the patterns I said, because of greedy matching. It works here because there are always exactly 2 lines in the pattern space.Tacy
M
0

Following simple awk could help you in same.

Solution 1st:

awk '/^>/{sub(/>/,"");if(val){print val, val2};val=$0;val2="";next} {val2=val2?val2 FS $0:$0} END{print val, val2}'  Input_file

Solution 2nd:

awk -v RS=">" -v FS="\n" '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":" ")}}'   Input_file

Solution 3rd:

awk -v RS=">" '{gsub(/\n/," ");} NF'   Input_file
Musgrave answered 24/10, 2017 at 10:47 Comment(4)
you'll do it much easily by manipulating RS and re-aligning fieldsMong
@Sundeep, thanks sir, I added 2 more solutions too now(related to setting RS in code), cheers :)Musgrave
no need gsub, default FS is white-space including newline and default OFS is single space... awk -v RS='>' 'NR>1{$1=$1; print}'Mong
These are interesting, useful, and quite challenging lines of awk, thanks! I like the challenge of using sed though ;)Tacy
T
0

Alright, I believe I managed to answer my own question.

Here is the script I made: fa2phy.sed:

#!/bin/sed -nf

:readseq
${H;b out}              # if last line, append to hold, and goto 'out'
1{h;n;b readseq}        # if first, overwrite hold, and start again at 'readseq'
/^>/!{H; n; b readseq}  # if not a sequence label, append to hold, read next line, start again at 'readseq'. Else, it continues to 'out'

:out
x;         # exchange hold content with pattern content
s/^>//;    # substitute the starting '>'
s/\n/  /g; # substitute each newline with 2 spaces
p;         # print pattern buffer

Although it works, if someone has a shorter or clearer solution, enlighten me! :)

Tacy answered 24/10, 2017 at 11:56 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.