Remove line breaks in a FASTA file

Asked 6/4, 2013 at 23:14 Answered 9/10, 2020 at 23:47

Solved unix awk newline bioinformatics fasta

I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA

I'd like to convert it into this:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

I found a potential solution on this site, which looks like this:

cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta

However, this places an extra line break between each entry, so the file looks like this:

>accession1
ATGGCCCATGGGATCCTAGC

>accession2
GATATCCATGAAACGGCTTA

I'm an awk noob, but I took a shot at modifying the command. My guess was the if (p){print "\n";} was the culprit...potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline...this is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:

awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta

However, this adds an empty line at the beginning of the file because it's always printing a new line before it prints the first accession number:

{empty line} 
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Anyone have a solution to get my file in the correct format? Thanks!

Arianna asked 6/4, 2013 at 23:14 Comment(0)

This awk program:

% awk '!/^>/ { printf "%s", $0; n = "\n" } 
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta

Will yield:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.

End with a newline, if required.
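The deferred-newline idea described above can also be sketched in Python. This is only an illustration of the same logic, not part of the awk answer: the function name unwrap_fasta and the variable pending (standing in for awk's n) are made up for this example.

```python
import io

def unwrap_fasta(lines):
    """Yield output chunks, deferring the line break after each sequence."""
    pending = ""  # hypothetical name; plays the role of awk's variable n
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Close the previous sequence (if any), then emit the header.
            yield pending + line + "\n"
            pending = ""
        else:
            # Sequence chunk: emit without a line break, remember we owe one.
            yield line
            pending = "\n"
    # Final newline only if the input ended inside a sequence.
    yield pending

wrapped = (
    ">accession1\nATGGCCCATG\nGGATCCTAGC\n"
    ">accession2\nGATATCCATG\nAAACGGCTTA\n"
)
unwrapped = "".join(unwrap_fasta(io.StringIO(wrapped)))
print(unwrapped, end="")
```

Because pending starts out empty, nothing extra is written before the first header, which is the same reason the awk version does not need to initialize n.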

Note:

By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in c and in most other traditional languages.

--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide

Homer answered 6/4, 2013 at 23:22 Comment(6)

if the input ends with line >foobar, the script will print an extra empty line. (it may not be the case, however). and printf $0 is not safe if the line contains printf format string. – Cybernetics 6/4, 2013 at 23:47

awesome, totally works. thanks!! just curious -- for the very first line of the fasta file, i would have expected it to throw an error when it reads /^>/ { print n $0 }, because n doesn't exist yet. however, it doesn't seem to care that n doesn't exist. why is this? – Arianna 6/4, 2013 at 23:53

@chimeric: I've added a note to address the "uninitialised" variable, I hope that helps. – Homer 7/4, 2013 at 3:16

@Homer that works just fine, but what if my file is not only FASTA? I have a file with about 8 lines, then a FASTA sequence, then another 8 lines, and so on. Do you have any suggestion on how to linearize only the FASTA sequences? The 8 lines all start with the same expression, "my results". Thanks in advance – Namara 27/11, 2023 at 14:1

@Najoua: I suspect you'll need to tighten up those regular expressions to match the FASTA lines more precisely. But any issues with that are likely to be addressed with a new question rather than a comment on a ten-year-old answer. – Homer 28/11, 2023 at 20:40

@Homer Thank you. Here is the link to my question – Namara 29/11, 2023 at 10:11

The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:

 awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file

Explanation:

For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.

Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:

awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file

Sporogenesis answered 14/1, 2015 at 10:36 Comment(1)

There's nothing more satisfying than idiomatic AWK – Carbone 10/4, 2019 at 23:4

Do not reinvent the wheel. If the goal is simply to remove newlines in a multi-line FASTA file (unwrapping the FASTA file), use one of the specialized bioinformatics tools, for example seqtk, like so:

seqtk seq -l 0 input_file

Example:

# Create the input for testing:

cat > test_unwrap_in.fa <<EOF

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT

ACGT

ACGT

>seq3 without blanks or newlines
ACGTACGTACGT

EOF

# Unwrap lines:

seqtk seq -l 0 test_unwrap_in.fa > test_unwrap_out.fa

cat test_unwrap_out.fa

Output:

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT

SEE ALSO:

seqtk usage:

seqtk seq

Usage:   seqtk seq [options] <in.fq>|<in.fa>

Options: ...
         -l INT    number of residues per line; 0 for 2^32-1 [0]

To install this tool, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

REFERENCES:

seqtk: https://github.com/lh3/seqtk
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
conda create: https://docs.conda.io/projects/conda/en/latest/commands/create.html

Rogozen answered 8/10, 2020 at 16:12 Comment(0)

I would use sed for this. Using GNU sed:

sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file

Results:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

Create a label, a. If the line is not the last line in the file, append the next line to pattern space. If pattern space doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/. If the substitution was successful since the last input line was read, then branch to label a. Print up to the first embedded newline of the current pattern space. If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.
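For comparison, the condition the sed script tests (drop a newline only when it sits between two non-header lines) can be written as a single regular expression in Python. This is a rough sketch that reads the whole text into memory instead of using sed's N/P/D loop; the names JOIN and unwrap_text are invented for the example, and blank lines are left untouched.

```python
import re

# Match a sequence line (one that does not start with '>') together with its
# newline, but only when the next line is a non-empty, non-header line.
JOIN = re.compile(r"^([^>\n][^\n]*)\n(?=[^>\n])", re.MULTILINE)

def unwrap_text(text):
    """Join wrapped sequence lines; header lines keep their line breaks."""
    return JOIN.sub(r"\1", text)

wrapped = (
    ">accession1\nATGGCCCATG\nGGATCCTAGC\n"
    ">accession2\nGATATCCATG\nAAACGGCTTA\n"
)
print(unwrap_text(wrapped), end="")
```

The lookahead keeps the first character of the following line out of the match, so consecutive sequence lines are joined one after another in a single pass.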

Sporogenesis answered 7/4, 2013 at 1:34 Comment(2)

@M.P.: I have added a quick explanation. HTH. – Sporogenesis 13/1, 2015 at 20:39

@M.P.: I have also added an AWK solution because this sed answer hasn't received much love. HTH. – Sporogenesis 14/1, 2015 at 10:40

Here is another awk one-liner that should work for your case:

awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file

Cybernetics answered 6/4, 2013 at 23:45 Comment(0)

You might be interested in bioawk, an adapted version of awk that is tuned to process FASTA files:

bioawk -c fastx '{ gsub(/\n/,"",seq); print ">"$name; print $seq }' file.fasta

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

Mesopause answered 19/10, 2018 at 5:1 Comment(0)

There have been great responses so far.

Here is an efficient way to do this in Python:

def read_fasta(fasta):
    with open(fasta, 'r') as fast:
        headers, sequences = [], []
        for line in fast:
            if line.startswith('>'):
                # Drop only the leading '>' so any '>' inside the header survives.
                head = line[1:].strip()
                headers.append(head)
                sequences.append('')
            else:
                seq = line.strip()
                if seq:
                    sequences[-1] += seq
    return (headers, sequences)


def write_fasta(headers, sequences, fasta):
    with open(fasta, 'w') as fast:
        for i in range(len(headers)):
            fast.write('>' + headers[i] + '\n' + sequences[i] + '\n')

You can use the above functions to retrieve sequences/headers from a fasta file without line breaks, manipulate them, and write back to a fasta file.

headers, sequences = read_fasta('input.fasta')
new_headers = do_something(headers)
new_sequences = do_something(sequences)
write_fasta(new_headers, new_sequences, 'input.fasta')

Julianejuliann answered 9/10, 2020 at 23:47 Comment(0)

Another variation :-)

awk '!/>/{printf( "%s", $0);next}
     NR>1{printf( "\n")} 
     END {printf"\n"}
     7' YourFile

Knavery answered 21/12, 2016 at 12:38 Comment(0)

Use this Perl one-liner, which does all of the common reformatting that is necessary in this and similar cases: removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. Note that unlike some of the other answers, this properly handles leading and trailing whitespace/newlines in the file:

# Create the input for testing:

cat > test_unwrap_in.fa <<EOF

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT

ACGT

ACGT

>seq3 without blanks or newlines
ACGTACGTACGT

EOF

# Reformat with Perl:

perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' test_unwrap_in.fa > test_unwrap_out.fa

Output:

>seq1 with blanks
ACGTACGTACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.

chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
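The same clean-up can be sketched in Python by grouping consecutive lines into header runs and sequence runs. This is an illustrative sketch rather than a line-for-line port of the Perl one-liner: the function name unwrap_and_clean is hypothetical, and unlike the Perl version it emits nothing at all for a header that has no sequence after it.

```python
from itertools import groupby

def unwrap_and_clean(lines):
    """Unwrap sequences and strip all whitespace inside them; keep headers as is."""
    out = []
    stripped = (line.rstrip("\r\n") for line in lines)
    # groupby collects consecutive lines that share the same "is a header" flag.
    for is_header, group in groupby(stripped, key=lambda s: s.startswith(">")):
        if is_header:
            out.extend(group)  # header lines are passed through unchanged
        else:
            # str.split() with no arguments discards every run of whitespace.
            seq = "".join("".join(s.split()) for s in group)
            if seq:            # skip runs made only of blank lines
                out.append(seq)
    return "\n".join(out) + "\n" if out else ""

sample = (
    "\n>seq1 with blanks\nACGT ACGT ACGT\n"
    ">seq2 with newlines\nACGT\n\nACGT\n\nACGT\n\n"
    ">seq3 without blanks or newlines\nACGTACGTACGT\n\n"
)
cleaned = unwrap_and_clean(sample.splitlines())
print(cleaned, end="")
```

On the test input from this answer, the sketch should produce the same three header and sequence pairs shown in the output above.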

Rogozen answered 19/9, 2020 at 19:54 Comment(0)
