I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:
>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA
I'd like to convert it into this:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
I found a potential solution on this site, which looks like this:
cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta
However, this places an extra line break between each entry, so the file looks like this:
>accession1
ATGGCCCATGGGATCCTAGC
{empty line}
>accession2
GATATCCATGAAACGGCTTA
I'm an awk noob, but I took a shot at modifying the command. My guess was that if (p){print "\n";} was the culprit...potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline...this is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:
awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta
However, this adds an empty line at the beginning of the file because it's always printing a new line before it prints the first accession number:
{empty line}
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Anyone have a solution to get my file in the correct format? Thanks!
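For reference, a minimal sketch of one way to get a single newline in the right places. In awk, print appends the output record separator (a newline by default) to whatever it prints, so print "\n" writes two newlines, while printf "\n" writes exactly one. The function name join_fasta and the flag variable seq are illustrative only, and the sketch assumes Unix line endings:

```shell
# Hypothetical helper: join wrapped FASTA sequence lines into one line per record.
join_fasta() {
  awk '
    /^>/ {                        # header line
      if (seq) printf "\n"        # close the previous sequence line, if one is open
      print                       # header followed by exactly one newline
      seq = 0
      next
    }
    { printf "%s", $0; seq = 1 }  # sequence line: append without a newline
    END { if (seq) printf "\n" }  # terminate the final sequence line
  ' "$@"
}
```

It could be called as join_fasta input.fasta > joinedoutput.fasta. Using printf "%s", $0 rather than printf $0 keeps any % characters in the data from being read as format specifiers.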
>foobar, the script will print an extra empty line (it may not be the case, however), and printf $0 is not safe if the line contains a printf format string. – Cybernetics