Sequence length of FASTA file

Asked 2/6, 2014 at 10:44 Answered 12/12, 2018 at 11:7

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.

Nancynandor answered 2/6, 2014 at 10:44 Comment(0)

An awk / gawk solution can be composed by three stages:

Every time header is found these actions should be performed:
- Print previous seqlen if exists.
- Print tag.
- Initialize seqlen.
For the sequence lines we just need to accumulate totals.
Finally at the END stage we print the remnant seqlen.

Commented code:

awk '/^>/ { # header pattern detected
        if (seqlen){
         # print previous seqlen if exists 
         print seqlen
         }

         # pring the tag 
         print

         # initialize sequence
         seqlen = 0

         # skip further processing
         next
      }

# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa

A oneliner:

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

For the totals:

awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }     
    END{print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa

Bibulous answered 2/6, 2014 at 10:51 Comment(6)

Yeah, this works. But also I need a "final line" with the total of sequences and the total length, following the example: 3 sequences, total length 127. (Sorry, in the question this is behind a # ) – Nancynandor 2/6, 2014 at 10:57

Just minor formatting change. awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' – Codee 2/6, 2014 at 10:57

A very old topic, but it is just what I need! Is it also possible to adjust the one liner awk command that it prints the number of nucleotides on the same line as the header, and not on a new line, like: >header1 60 – Adamandeve 17/5, 2017 at 10:37

Sure @Adamandeve just place ; printf $0" " ; instead of ; print ; – Bibulous 22/5, 2017 at 14:29

this will include newline characters in the count – Kirkman 8/9, 2020 at 6:45

sorry maybe its counting carriage return. just insert BEGIN{RS="\r\n"} at the beginning – Kirkman 8/9, 2020 at 7:8

I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line, It's no longer a one-liner, so the downside is you'll have to save it as a script file.

It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.

BEGIN {
  OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
  if (seqlen) {
    # Only print info for this sequence if no target was given
    # or its id matches the target.
    if (! target || id == target) {
      print id, seqlen;
    }
  }
  # Get sequence id:
  # 1. Split header on whitespace (fields[1] is now ">id")
  split($0, fields);
  # 2. Get portion of first field after the starting ">"
  id = substr(fields[1], 2);
  seqlen = 0;
  next;
}
{
  seqlen = seqlen + length($0);
}
END {
  if (! target || id == target) {
    print id, seqlen;
  }
}

Woodworking answered 16/2, 2015 at 18:32 Comment(0)

A quick way with any awk, would be this:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

You might be also interested in BioAwk, it is an adapted version of awk which is tuned to process FASTA files

bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

Knowle answered 12/12, 2018 at 11:7 Comment(0)

Recommended topics

Hot tags