Word count for sequence length is wrong

E

5

7

I have a fasta file that looks like this :

>0011 my.header
CAAGTTTATCCACATAATGCGAATAACCAATAATCCTTTTCATAAGTCTATTCTTCATAATCTAAATCGT
TTTCAAGTACATAATTATCCTTTGCCTGTTCGTTAGTTTTATTAAAATTATACTGATCTTTCTTTTTCAT
CCCACGGGTTAAAATCTTCCTCAATCGGTGGGTTTTCTTCATGAAATTGTTTCATTTATTTGCTGTTTTT
AGTTCTCCGATTGTATAACACTTAGTTGTATTAGTGCCGGGTAGTCTATAATTAGCCTCTTTTATATACC
CACGCTTTAATAATCTGTTTACAGAATTATATAATTTGCTCTTAGACATAAAAGGAATAATTTCTCTAAG
TTTAGAAATCGTAATAAAAACGGTATTAGGTTCTTTCTTTACCCTACATCCCTTAAACTTATCCTTATAT
GTATCAGTACAAAGTATAAGAAACATAACTGAATATACTACTGAATCATCTAAACCGATTTCTTTTGCTA
AATCTTCATTTATAACCATAATTATAACGCTTTTAATTGAATTGACTCTTTAACATTTGATGTTTTAACG
AACTGATCGTATATTTCCGGATATTGTTCTTTCAGTGCTTTAGAATCAAGTGATTCACGGCTATACGCTT
TCTTCCTTGTGACTGAAATAAGTTCCCCTTTTATATTATCAGCTTTCGCCTCAGACATCAGACCTAACAA
CTGTTCTTTGAACTTGCCTAAATGTTCGTCTATCTTCTTTTGCATTTCAAGAAGTTCGTAAACGCCTTCT
TCGATATGTGCAACCTTTGCAGGCAACGACTCCAATTTAGCTACATAACTGTCTTTGCTTGCATTGTCTG
CATATCGAACTCCATTCTTACAGCAATTAAGGAATAATTCTATTTCGCTGTCCGGTATGCGTTCAACAGA
GAAAATTCCGTCCTTATCCTTGTCACCTCTTAGCCAAATTGCGATAAGTCCCTCTACTTTCAAATTTGGG
TTTTGTCTCTCGAAAAGATAGGCGTATATTGATAGCTGCCAAGACAAATAAAGCAAATCAAGTTTGTAGG
TAGTTTTAATGTCACCTAAAACGACTGATTTATCAGAGCTGCCCAAATATACTTTATCGGTCGGTGATGC
GATAAGCTCGTTATCAGTTAGAATATACTCAGATGCGATATGAATTAAACCGCTTCCGGCTTTTAAATTC
AAATAGTTCTCTCCGTAGACCGTTTCCGGTTCAATACCTTCTTTGTCGATCCTCTCAACTTCATCATGAA
CCGCTTTCCCTCTCTCAGTTGCCGATCTCAAAATATTATCCGGTATATTGTCAAGTTTGCCTGGAAATAA

and I want the length of the sequence (without the header). I tried this:

tail -n +2 my.file | wc -c

which gives me this output:

which is wrong, the real size is 1330.

I'm not sure what's going on. I'm thinking there's probably some sort of hidden characters but I don't know how to explore this.

Exposition answered 3/4, 2018 at 14:34 Comment(2)

The difference between 1330 and 1349 is 19. Your output is 19 lines long. Coincidence? – Pleo 3/4, 2018 at 16:10

Not to be pedantic, but Bash isn't doing any of the counting here. Shouldn't this be titled "wc word count for sequence is wrong"? – Paleolith 3/4, 2018 at 20:32

S

19

It is because wc is counting all the line breaks as well.

You may use awk to get this done:

awk 'NR>1{s+=length()} END{print s}' my.file

You may also use tail | tr | wc:

tail -n +2 my.file | tr -d '\n' | wc -c
1330

Stalk answered 3/4, 2018 at 14:36 Comment(0)

R

3

EDIT: Adding 1 more solution of awk here too.

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");gsub(/ /,"");print length($0)}'  Input_file

OR

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");print length($0)}' OFS=""  Input_file

OR

awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}'  Input_file

Following awk may help you on same.

awk '!/^>/{sum+=length($0)} END{print "Length is:" sum}'  Input_file

Ruy answered 3/4, 2018 at 14:36 Comment(2)

You can shorten it to awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}' file – Stalk 3/4, 2018 at 14:56

@anubhava, sure thanks sir, added now into solution, cheers :) – Ruy 3/4, 2018 at 14:59

L

2

perl:

perl -0777 -nE 's/^>.*$//m; say tr/A-Z/A-Z/' file

That reads the file into a single string, removes the first line, and counts the letters.

Leaseback answered 3/4, 2018 at 14:52 Comment(1)

This is really excessively "clever". I voted for anubhava's answer. – Leaseback 3/4, 2018 at 14:58

V

2

bash only, in a script, we have to talk about programming ;o)

tk="$(<my.file)"      # file in variable
tk="${tk#>*$'\n'}"    # suppression header '>...first\n'
tk="${tk//$'\n'}"     # suppression all \n

echo ": ${#tk}"       # 1330  \o/

Viator answered 3/4, 2018 at 15:30 Comment(4)

Double quotes are useless! – Wimbush 3/4, 2018 at 15:35

double quotes are useful because, apart from the given example, how to predict the contents of the input file? – Viator 3/4, 2018 at 17:10

No: when you are assigning something, curly brace will be enough! – Wimbush 3/4, 2018 at 20:50

yes, with a simple assignment (with or without curly braces), there is no IFS splitting (in posix, I imagine the same in bash) (but with local, export and some shells, it can be tricky, I believe). So I prefer always show with quotes to learn and to obtain a syntax highlighting more homogeneous. It does not damage performance? – Viator 4/4, 2018 at 5:22

D

1

Subtract the line count from the chars after removing the header:

tail -n +2  fasta.file | wc -lc | awk '{print $2-$1}'

Dissolvent answered 3/4, 2018 at 23:52 Comment(0)

Recommended topics

Hot tags