Bash while read loop extremely slow compared to cat, why?

A simple test script here:

while read LINE; do
        LINECOUNT=$(($LINECOUNT+1))
        if [[ $(($LINECOUNT % 1000)) -eq 0 ]]; then echo $LINECOUNT; fi
done

When I run cat my450klinefile.txt | myscript, the CPU locks up at 100% and it can process about 1000 lines a second. That is about 5 minutes to process what cat my450klinefile.txt >/dev/null does in half a second.

Is there a more efficient way to do essentially this? I just need to read a line from stdin, count the bytes, and write it out to a named pipe. But even this example is impossibly slow.

Every 1 GB of input lines I need to do a few more complex scripting actions (close and reopen some pipes that the data is being fed to).

Terpsichore answered 7/12, 2012 at 11:56 Comment(5)
In addition to the differences between a bash script and a compiled tool (see paxdiablo's answer), your comparison is not fair: cat just reads, while your script does some computation (line counting)Countryandwestern
replace LINECOUNT=$(($LINECOUNT+1)) with ((LINECOUNT++))Bagnio
Also, for a real comparison you need to remove the condition from your script. Right now your question sounds like: why does my truck use so much fuel when I transport 20 tonnes of wood, when without the trailer it uses ten times less!Bagnio
Your example counts lines, not bytesBagnio
Why did you not use wc (word count)? wc -l counts lines.Clausewitz

The reason while read is so slow is that the shell is required to make a read() system call for every byte. It cannot read a large buffer from the pipe, because the shell must not consume more than one line from the input stream, and therefore has to compare each character against a newline. If you run strace on a while read loop, you can see this behavior. The behavior is desirable, because it makes it possible to reliably do things like:

while read size; do
    test "$size" -gt 0 || break
    dd bs="$size" count=1 of=file$(( i++ ))
done

in which the commands inside the loop are reading from the same stream that the shell reads from. If the shell consumed a big chunk of data by reading large buffers, the inner commands would not have access to that data. An unfortunate side-effect is that read is absurdly slow.
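
To watch this happen, trace the read calls; a quick demonstration (output abridged, startup reads omitted, and exact traces vary by system):

printf 'foo\nbar\n' | strace -e trace=read bash -c 'read line'

The trace ends with one-byte reads on stdin, one per character up to and including the first newline:

read(0, "f", 1) = 1
read(0, "o", 1) = 1
read(0, "o", 1) = 1
read(0, "\n", 1) = 1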

Wantage answered 7/12, 2012 at 13:38 Comment(0)

It's because the bash script is interpreted and not really optimised for speed in this case. You're usually better off using one of the external tools such as:

awk 'NR%1000==0{print}' inputFile

which matches your "print every 1000 lines" sample.

If you wanted to output, for each line, its length in characters followed by the line itself, and pipe it through another process, you could also do that:

awk '{print length($0)" "$0}' inputFile | someOtherProcess
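
Note that awk's length() counts characters, which in a multibyte locale such as UTF-8 can be fewer than the bytes the question asks about; if byte counts matter, forcing the C locale makes length() count bytes in most implementations:

LC_ALL=C awk '{print length($0)" "$0}' inputFile | someOtherProcess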

Tools like awk, sed, grep, cut and the more powerful perl are far more suited to these tasks than an interpreted shell script.
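
As for the "every 1 GB" bookkeeping mentioned in the question, awk can close and reopen an output pipe with close(). A rough, untested sketch of that idea, where /tmp/out.fifo stands in for whatever named pipe the data is fed to:

LC_ALL=C awk '{
    bytes += length($0) + 1        # bytes in this line, +1 for the newline
    print $0 > "/tmp/out.fifo"     # write the line to the named pipe
    if (bytes >= 1073741824) {     # roughly every 1 GB of input...
        close("/tmp/out.fifo")     # ...close it; the next print reopens it
        bytes = 0
    }
}' inputFile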

Ashcan answered 7/12, 2012 at 12:1 Comment(3)
After every 1 GB of input lines I need to do some more complex actions: close a couple of pipes and reopen them. Does awk allow me these more complex scripting actions?Terpsichore
awk, possibly not, but there are plenty of other tools; that's why you should ask your actual question rather than some example question :-)Ashcan
try using perl for that taskBagnio

A perl solution for counting the bytes of each line:

perl -p -e '
    use Encode;
    print length(Encode::encode_utf8($_)) . "\n";
    $_ = ""'

For example:

dd if=/dev/urandom bs=1M count=100 |
   perl -p -e 'use Encode;print length(Encode::encode_utf8($_))."\n";$_=""' |
   tail

This runs at about 7.7 MB/s for me.

To compare, here is how fast the stream is without the script:

dd if=/dev/urandom bs=1M count=100 >/dev/null

This runs at 9.1 MB/s.

So the script is not that slow :)
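
Incidentally, since nothing decodes the input in this one-liner, $_ already holds raw bytes, so a plain length($_) reports the byte count directly (the Encode round-trip is only needed once input has been decoded to characters, and on raw high-bit bytes it will actually overcount, because each gets re-encoded as two bytes). A simpler sketch, counting the trailing newline as above:

perl -ne 'print length($_), "\n"'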

Stationmaster answered 7/12, 2012 at 12:34 Comment(0)

I'm not really sure what your script is supposed to do, so this might not be an answer to your question, but more of a generic tip.

Don't cat your file and pipe it to your script; instead, when reading from a file in a bash script, do it like this:

while read line
do
    echo $line
done <file.txt

Retaliate answered 7/12, 2012 at 12:3 Comment(2)
I'm taking input from curl on stdin via a pipeTerpsichore
Not using read -r is a problem and not quoting the variable in echo "$line" is doubly so. Don't use this. It's an extremely poor reimplementation of cat.Photothermic
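
For reference, the safe form of this pattern that the last comment asks for reads with IFS= read -r and quotes the expansion:

while IFS= read -r line
do
    printf '%s\n' "$line"
done <file.txt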
