Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?
I posted the following code and got scolded. Why is this not acceptable?

numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; ++i)); do
  lineN=$(sed -n "$i!d;p;q" "$1")
  # ... do things with "$lineN"
done

We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.

The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?

Is there a better way to do this? Why would I want to avoid this particular approach?

Melanosis answered 2/1, 2021 at 12:11 Comment(1)
What is an anti-pattern? – Tobacco
The shell (like basically every programming language above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Strikingly, in your example, sed already does this, so if the shell couldn't, you could simply loop over the output from sed instead.

The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications: commonly, you set IFS to the empty string so that read does not trim leading and trailing whitespace from each line, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which has been retained for backward compatibility.

while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"

Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, and then read the same file again and again, once per loop iteration. On a typical modern OS, some of the repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so reading it again does not actually fetch it from the disk again), but the basic fact remains that reading from disk is on the order of 1000x slower than not reading at all when you can avoid it. Especially with a large file, the cache will eventually fill up, so you end up reading in and discarding the same bytes over and over, adding significant CPU overhead, and an even more significant amount of time spent with the CPU simply waiting for the disk to deliver the same bytes, again and again.

In a shell script, you also want to avoid the overhead of an external process whenever you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i" | tail -n 1) thousands of times in a tight loop adds significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be performed in, say, sed or Awk instead, that is going to be a lot more efficient than a native shell while read loop, because of the way read is implemented; this is why while read is itself frequently regarded as an anti-pattern. Make sure you are reasonably familiar with the standard palette of Unix text-processing tools (cut, paste, nl, pr, etc.); in many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception: when the body of the loop itself significantly uses built-in shell commands.
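As a sketch of "do the loop body in the external tool instead": suppose the loop body merely numbers each line and uppercases it (a hypothetical task, not from the question). A shell loop would fork an external process per line, whereas Awk does the whole job in a single process:

```shell
# Hypothetical loop body: number and uppercase each line.
# One awk process replaces thousands of per-line invocations.
printf 'alpha\nbeta\n' |
awk '{ print NR ": " toupper($0) }'
```

The same approach scales to most per-line transformations; the shell loop only wins when each iteration genuinely needs shell builtins.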

The q in the sed script is a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.

With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.

If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
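A minimal sketch of the projection idea, assuming the file name and the two-line sample size are made up for the illustration (wc -c is used for the byte count, since stat's flags differ between platforms):

```shell
# Estimate the line count from the file size and the average length
# of a small sample of lines, instead of reading the whole file.
file=$(mktemp)
printf 'one\ntwo\nthree\nfour\n' > "$file"
size=$(wc -c < "$file")               # total bytes in the file
sample=$(head -n 2 "$file" | wc -c)   # bytes in the first 2 lines
avg=$(( sample / 2 ))                 # rough average bytes per line
echo "approximately $(( size / avg )) lines"
rm -f "$file"
```

The estimate is only as good as the sample, but it costs a couple of reads instead of a full pass over the file.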

Tangentially, there is a vaguely related anti-pattern you want to avoid, too: you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that with a for loop also has some additional gotchas, so don't do that either; see https://mywiki.wooledge.org/DontReadLinesWithFor
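To illustrate one of those gotchas with a made-up input: for token in $(cat file) word-splits the data and glob-expands every token, so a line containing a space becomes two iterations and a * becomes a list of file names, whereas while read delivers each line verbatim:

```shell
# for token in $(cat "$tmp") would iterate over "two", "words", the
# expansion of "*", and "glob"; while read gives one line per iteration.
tmp=$(mktemp)
printf 'two words\n* glob\n' > "$tmp"
while IFS='' read -r line; do
    printf '<%s>\n' "$line"
done < "$tmp"
rm -f "$tmp"
```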

Another common variation is to find the line you want to modify with grep, only so you can find it again with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)

# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file

The correct way to do this would be to simply change the sed script to contain a search condition:

sed -i '/foo/s/.*/thing/' file

This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)
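If you really do need to splice a literal string into a sed pattern, you have to escape its regex metacharacters first. A hedged sketch, where the sample string and the exact character class are illustrative rather than exhaustive:

```shell
# Escape BRE metacharacters so sed matches the string literally.
literal='foo\bar*'
escaped=$(printf '%s\n' "$literal" | sed 's,[][\.*^$/],\\&,g')
printf 'foo\\bar* here\n' | sed "s/$escaped/thing/"
```

This turns foo\bar* into foo\\bar\*, which does match the literal text; skipping this step is exactly the bug in the grep-then-sed snippet above.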

Also, if there is more than one matching line, the result from grep, which you save in line, will contain multiple lines with line breaks between them, which will break your sed script (or at the very least require additional complications to cope with correctly).

sed -i by itself is also often an example of "have hammer, will find nails" syndrome. A common anti-pattern is this:

while read -r phrase replacement; do
    sed -i "s/$phrase/$replacement/g" file.txt
done <substitutions.txt

This will read and rewrite file.txt as many times as there are lines in substitutions.txt. But sed already knows how to perform multiple substitutions; this is completely unnecessary and wasteful.

sed 's%\([^ ]*\) \(.*\)%s/\1/\2/g%' substitutions.txt

will create a single sed script like

s/primero/first/g
s/segundo/second/g
# ...and so on, one s command per input line

which you can then feed to ... another instance of sed.

sed 's%\([^ ]*\) \(.*\)%s/\1/\2/g%' substitutions.txt |
sed -i -f - file.txt

This is much more efficient; it reads and processes each file only once. However, if substitutions.txt is huge, the second sed process could end up using a lot of memory to hold the generated script; perhaps then chop it up into smaller pieces.

(If your sed does not accept -f -, try -f /dev/stdin, or perhaps save the output of the first script to a temporary file, and use -f /tmp/path/tempfile.sed instead of a pipe.)
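Putting the pieces together with the temporary-file variant (the file names here are made up, and -i is omitted so the result is printed instead of written back, which also sidesteps -i's portability quirks):

```shell
# Generate the sed script from the substitutions file, then apply it
# to the data file in a single pass.
subs=$(mktemp); data=$(mktemp); script=$(mktemp)
printf 'primero first\nsegundo second\n' > "$subs"
printf 'primero y segundo\n' > "$data"
sed 's%\([^ ]*\) \(.*\)%s/\1/\2/g%' "$subs" > "$script"
sed -f "$script" "$data"
rm -f "$subs" "$data" "$script"
```

Each file is opened and read exactly once, no matter how many substitutions there are.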

Melanosis answered 2/1, 2021 at 12:11 Comment(1)
Expanding a bit on the significant overhead: by doing the operations as presented in the OP, you perform various actions over and over; you open the file, read it up to the line of interest, and close it. This makes the complexity of the original program O(N^2), while in this answer it is just O(N) (you open, read, and close the file only once). The original approach will become very slow for large files, as well as for files stored on a network-based file system. – Tobacco

© 2022 - 2024 — McMap. All rights reserved.