Word count for sequence length is wrong
Asked Answered
E

5

7

I have a fasta file that looks like this :

>0011 my.header
CAAGTTTATCCACATAATGCGAATAACCAATAATCCTTTTCATAAGTCTATTCTTCATAATCTAAATCGT
TTTCAAGTACATAATTATCCTTTGCCTGTTCGTTAGTTTTATTAAAATTATACTGATCTTTCTTTTTCAT
CCCACGGGTTAAAATCTTCCTCAATCGGTGGGTTTTCTTCATGAAATTGTTTCATTTATTTGCTGTTTTT
AGTTCTCCGATTGTATAACACTTAGTTGTATTAGTGCCGGGTAGTCTATAATTAGCCTCTTTTATATACC
CACGCTTTAATAATCTGTTTACAGAATTATATAATTTGCTCTTAGACATAAAAGGAATAATTTCTCTAAG
TTTAGAAATCGTAATAAAAACGGTATTAGGTTCTTTCTTTACCCTACATCCCTTAAACTTATCCTTATAT
GTATCAGTACAAAGTATAAGAAACATAACTGAATATACTACTGAATCATCTAAACCGATTTCTTTTGCTA
AATCTTCATTTATAACCATAATTATAACGCTTTTAATTGAATTGACTCTTTAACATTTGATGTTTTAACG
AACTGATCGTATATTTCCGGATATTGTTCTTTCAGTGCTTTAGAATCAAGTGATTCACGGCTATACGCTT
TCTTCCTTGTGACTGAAATAAGTTCCCCTTTTATATTATCAGCTTTCGCCTCAGACATCAGACCTAACAA
CTGTTCTTTGAACTTGCCTAAATGTTCGTCTATCTTCTTTTGCATTTCAAGAAGTTCGTAAACGCCTTCT
TCGATATGTGCAACCTTTGCAGGCAACGACTCCAATTTAGCTACATAACTGTCTTTGCTTGCATTGTCTG
CATATCGAACTCCATTCTTACAGCAATTAAGGAATAATTCTATTTCGCTGTCCGGTATGCGTTCAACAGA
GAAAATTCCGTCCTTATCCTTGTCACCTCTTAGCCAAATTGCGATAAGTCCCTCTACTTTCAAATTTGGG
TTTTGTCTCTCGAAAAGATAGGCGTATATTGATAGCTGCCAAGACAAATAAAGCAAATCAAGTTTGTAGG
TAGTTTTAATGTCACCTAAAACGACTGATTTATCAGAGCTGCCCAAATATACTTTATCGGTCGGTGATGC
GATAAGCTCGTTATCAGTTAGAATATACTCAGATGCGATATGAATTAAACCGCTTCCGGCTTTTAAATTC
AAATAGTTCTCTCCGTAGACCGTTTCCGGTTCAATACCTTCTTTGTCGATCCTCTCAACTTCATCATGAA
CCGCTTTCCCTCTCTCAGTTGCCGATCTCAAAATATTATCCGGTATATTGTCAAGTTTGCCTGGAAATAA

and I want the length of the sequence (without the header). I tried this:

tail -n +2 my.file | wc -c

which gives me this output:

1349

which is wrong, the real size is 1330.

I'm not sure what's going on. I'm thinking there's probably some sort of hidden characters but I don't know how to explore this.

Exposition answered 3/4, 2018 at 14:34 Comment(2)
The difference between 1330 and 1349 is 19. Your output is 19 lines long. Coincidence?Pleo
Not to be pedantic, but Bash isn't doing any of the counting here. Shouldn't this be titled "wc word count for sequence is wrong"?Paleolith
S
19

It is because wc is counting all the line breaks as well.

You may use awk to get this done:

awk 'NR>1{s+=length()} END{print s}' my.file

1330

You may also use tail | tr | wc:

tail -n +2 my.file | tr -d '\n' | wc -c
1330
Stalk answered 3/4, 2018 at 14:36 Comment(0)
R
3

EDIT: Adding 1 more solution of awk here too.

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");gsub(/ /,"");print length($0)}'  Input_file

OR

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");print length($0)}' OFS=""  Input_file

OR

awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}'  Input_file

Following awk may help you on same.

awk '!/^>/{sum+=length($0)} END{print "Length is:" sum}'  Input_file
Ruy answered 3/4, 2018 at 14:36 Comment(2)
You can shorten it to awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}' fileStalk
@anubhava, sure thanks sir, added now into solution, cheers :)Ruy
L
2

perl:

perl -0777 -nE 's/^>.*$//m; say tr/A-Z/A-Z/' file

That reads the file into a single string, removes the first line, and counts the letters.

Leaseback answered 3/4, 2018 at 14:52 Comment(1)
This is really excessively "clever". I voted for anubhava's answer.Leaseback
V
2

bash only, in a script, we have to talk about programming ;o)

tk="$(<my.file)"      # file in variable
tk="${tk#>*$'\n'}"    # suppression header '>...first\n'
tk="${tk//$'\n'}"     # suppression all \n

echo ": ${#tk}"       # 1330  \o/
Viator answered 3/4, 2018 at 15:30 Comment(4)
Double quotes are useless!Wimbush
double quotes are useful because, apart from the given example, how to predict the contents of the input file?Viator
No: when you are assigning something, curly brace will be enough!Wimbush
yes, with a simple assignment (with or without curly braces), there is no IFS splitting (in posix, I imagine the same in bash) (but with local, export and some shells, it can be tricky, I believe). So I prefer always show with quotes to learn and to obtain a syntax highlighting more homogeneous. It does not damage performance?Viator
D
1

Subtract the line count from the chars after removing the header:

tail -n +2  fasta.file | wc -lc | awk '{print $2-$1}'
Dissolvent answered 3/4, 2018 at 23:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.