The shell (and basically every programming language above assembly) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Strikingly, in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications: commonly, you set IFS to an empty string to keep the shell from needlessly splitting the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which has been retained for backwards compatibility.
while IFS='' read -r lineN; do
# do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once just to obtain the line count, and then read the same file again and again, once per loop iteration. On a typical modern OS, some of the repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so reading it again does not actually require fetching it from the disk again), but the basic fact remains that reading information from disk is on the order of 1000x slower than not doing it at all when you can avoid it. Especially with a large file, the cache will eventually fill up, so you end up reading in and discarding the same bytes over and over, adding significant CPU overhead and, worse, a lot of wall-clock time spent simply waiting for the disk to deliver the bytes you are reading, again and again.
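For reference, the kind of loop this replaces typically looks something like the following reconstruction (the exact script from the question may differ):
lines=$(wc -l <"$1")              # one full pass over the file, just to count lines
i=1
while [ "$i" -le "$lines" ]; do
    line=$(sed -n "${i}p" "$1")   # and another full pass on every single iteration
    # do things with "$line"
    i=$((i + 1))
done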
In a shell script, you also want to avoid the overhead of an external process whenever you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i" | tail -n 1) thousands of times in a tight loop adds significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that is going to be a lot more efficient than a native shell while read loop, because of the way read has to be implemented. This is why while read is also frequently regarded as an antipattern.
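As an illustration of that trade-off, here is a made-up task (print lines starting with "ERROR: ", with the prefix stripped), first as a shell loop and then as a single Awk process; the task and file layout are purely hypothetical:
# shell version: the read built-in runs once per input line
while IFS='' read -r line; do
    case $line in
        "ERROR: "*) printf '%s\n' "${line#"ERROR: "}";;
    esac
done <"$1"

# the same job in one Awk process, typically much faster on large files
awk '/^ERROR: / { sub(/^ERROR: /, ""); print }' "$1"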
And make sure you are reasonably familiar with the standard palette of Unix text processing tools - cut, paste, nl, pr, and so on.
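A quick taste of what a few of these do (the file names are placeholders):
nl notes.txt                  # number the lines of a file
cut -d: -f1 /etc/passwd       # extract the first colon-separated field
paste names.txt scores.txt    # glue two files together line by line
pr -2 report.txt              # paginate text into two columns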
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception to this: when the body of the loop itself consists largely of built-in shell commands.
The q in the sed script is only a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script reads the entire input file through to the end each time, even when it only wants to fetch one of the very first lines out of the file.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
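To spell the difference out (i is a shell variable holding the number of the line you want):
sed -n "${i}p" file         # reads the whole file even though it only prints line i
sed -n "${i}{p;q;}" file    # quits right after line i, but still rereads lines 1..i on every call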
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
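A rough sketch of that projection idea (the stat options differ between GNU and BSD, and the averaging scheme itself is just one possible approach):
size=$(stat -c %s "$1" 2>/dev/null || stat -f %z "$1")   # file size in bytes
seen=0 n=0
while IFS='' read -r line; do
    n=$((n + 1))
    seen=$((seen + ${#line} + 1))                        # +1 for the newline
    printf 'line %d/approximately %d\r' "$n" "$((size * n / seen))" >&2
    # do things with "$line"
done <"$1"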
Tangentially, there is a vaguely related antipattern you want to avoid, too: you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that either; see https://mywiki.wooledge.org/DontReadLinesWithFor
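The short version of why the for variant is broken (the linked page has the full story):
# broken: the unquoted command substitution is split on any whitespace, not on newlines,
# and every word that looks like a glob (*, ?, [...]) gets expanded against the directory
for line in $(cat file.txt); do
    printf '%s\n' "$line"
done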
Another common variation is to find the line you want to modify with grep, only so you can then find it with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
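(If the goal was only to replace the foo part itself rather than the whole line, a plain sed -i 's/foo/thing/' file does that directly, with no grep and no captured variable at all.)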
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match that literal text.)
Also, if there is more than one matching line, the result from grep which you save in line will contain multiple lines, with line breaks between them, which will break your sed script (or at the very least require additional complications to cope with correctly).
sed -i by itself is also often an example of "have hammer, will find nails" syndrome. A common antipattern is this:
while read -r phrase replacement; do
sed -i "s/$phrase/$replacement/g" file.txt
done <substitutions.txt
This will read and rewrite file.txt as many times as there are lines in substitutions.txt. But sed already knows how to perform multiple substitutions in a single pass; doing one substitution per pass is completely unnecessary and wasteful.
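Suppose substitutions.txt contains one search word and its replacement per line (the exact contents here are made up to match the generated script below):
primero first
segundo second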
sed 's%\([^ ]*\) \(.*\)%s/\1/\2/g%' substitutions.txt
will create a single sed script like
s/primero/first/g
s/segundo/second/g
...
which you can then feed to ... another instance of sed.
sed 's%\([^ ]*\) \(.*\)%s/\1/\2/g%' substitutions.txt |
sed -i -f - file.txt
This is much more efficient; it reads and processes each file only once. However, if substitutions.txt is huge, you could end up using a lot of memory in the second sed process - perhaps then chop it up into smaller pieces.
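One way to do that chopping, as a rough sketch (the 10000-line chunk size is arbitrary):
sed 's%\([^ ]*\) \(.*\)%s/\1/\2/g%' substitutions.txt >all.sed
split -l 10000 all.sed chunk.
for f in chunk.*; do
    sed -i -f "$f" file.txt    # one rewrite of file.txt per chunk, not per substitution
done
rm -f all.sed chunk.*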
(If your sed does not accept -f -, try -f /dev/stdin, or perhaps save the output of the first script to a temporary file and use -f /tmp/path/tempfile.sed instead of a pipe.)
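For example, a temporary-file version could look like this (a sketch with minimal error handling; note that on BSD/macOS sed, -i additionally wants an explicit backup suffix argument, e.g. sed -i ''):
script=$(mktemp) || exit 1
sed 's%\([^ ]*\) \(.*\)%s/\1/\2/g%' substitutions.txt >"$script"
sed -i -f "$script" file.txt
rm -f "$script"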