"grep -c" versus "wc -l"

I am processing a number of large text files, i.e. converting them all from one format to another. There are some small differences in the original formats of the files, but, with a bit of pre-processing in a few cases, they are mostly being converted successfully by a Bash shell script I have written.

So far so good, but one thing is puzzling me. At one point the script sets a variable called $iterations, so that it knows how many times to perform a particular for-loop. This value is determined by the number of empty lines in a temporary file that is created by the script.

Thus, the original version of my script contained the line:

    iterations=$(cat tempfile | grep '^$' | wc -l)

This has worked fine with all but one of the text files. For that file, $iterations was set to 1 even though tempfile appeared to contain more than 20,000 empty lines.

However, having discovered grep -c, I changed the line to:

    iterations=$(cat tempfile | grep -c '^$')

and the script suddenly worked, i.e. $iterations was set correctly.

Can anyone explain why the two versions produce different results? And why the first version would work on some files and not others? Is there some upper limit value above which wc -l defaults to 1? The file which wouldn't work with the first version is one of the largest, but not the largest in the set (which converted correctly the first time).

Beekeeping answered 18/4, 2017 at 16:39 Comment(5)
Eleneeleni: Can you replicate this? That is, do you have a file for which grep -c '^$' produces output different than grep '^$' | wc -l?
Patricio: I wonder if the file contains something funny that confuses wc. Would cat tempfile | grep '^$' | hexdump -C | head produce anything interesting?
Eleneeleni: Dmitri's got it. With a null character, wc produces 1, while grep -c counts 4:

    printf 'foo\nbar\n\x00\n\n\n\n' | { cat > /tmp/file; grep -c '^$' < /tmp/file; grep '^$' < /tmp/file | wc -l; }

Eleneeleni: Of course, the problem is that grep is printing Binary file (standard input) matches, and wc is counting that line!
Patricio: Another reason could be that grep 2.13 wrongly treats some files as binary, e.g. large files stored on filesystems that implement deduplication. This was corrected in 2.14 (git log) and later versions.

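If grep decides that the input is not a text file, it prints the single line Binary file (standard input) matches instead of the matching lines, and wc -l counts that one line! That is why the result is 1 no matter how many lines actually match. grep -c, on the other hand, will happily count the number of matches in the file.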
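To see the effect on a small scale, here is a minimal sketch. The sample file and the variable names are made up for illustration, and the exact numbers depend on the grep version: on the versions discussed here the piped form reports 1, while newer GNU grep releases, as far as I know, send the notice to standard error, so the piped count can differ. The -a (--text) option asks grep to treat the input as text, which should make both styles agree.

```shell
# Hypothetical sample file: two text lines, one line holding only a NUL byte,
# then three empty lines.
sample=$(mktemp)
printf 'foo\nbar\n\000\n\n\n\n' > "$sample"

# Original approach: wc -l counts whatever grep writes to stdout, which may be
# a single "Binary file ... matches" notice rather than the matching lines.
piped=$(grep '^$' < "$sample" | wc -l)

# grep -c does the counting itself, so no notice line gets in the way.
counted=$(grep -c '^$' < "$sample")

# With -a (--text) grep treats the input as text, so both styles should agree.
piped_text=$(grep -a '^$' < "$sample" | wc -l)
counted_text=$(grep -a -c '^$' < "$sample")

echo "grep | wc -l:     $piped"
echo "grep -c:          $counted"
echo "grep -a | wc -l:  $piped_text"
echo "grep -a -c:       $counted_text"

rm -f "$sample"
```

In the script from the question, that would mean either keeping grep -c or adding -a to the original pipeline, assuming the file really is meant to be read as text.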

Eleneeleni answered 18/4, 2017 at 16:58 Comment(2)
Beekeeping: @dmitri: I see (I think)... somewhere in that large text file, there must be a fortuitous character sequence which grep (without -c) interprets as a null character? I'd never have thought of that. I've never come across the null character; I guess it must have its uses. :-)
Eleneeleni: Not necessarily a nul. Could be any character that causes grep to treat the file as a binary file.
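One rough way to check whether a file contains such characters is to count the bytes that are neither printable nor whitespace. This is only a heuristic sketch: it assumes a mostly ASCII file (in the C locale every byte above 0x7f is flagged too), it is not the test grep itself applies, and the sample file and variable names are invented for the example.

```shell
# Hypothetical sample file with a single NUL byte on its own line.
sample=$(mktemp)
printf 'foo\nbar\n\000\n\n\n\n' > "$sample"

# Delete everything printable or whitespace (C locale) and count what is left.
suspicious=$(LC_ALL=C tr -d '[:print:][:space:]' < "$sample" | wc -c)
echo "suspicious bytes: $suspicious"

# Dump the leftover bytes to see what they are (\0 here).
LC_ALL=C tr -d '[:print:][:space:]' < "$sample" | od -c | head

# If NUL bytes turn out to be the culprit, stripping them before counting
# sidesteps the binary check. Note that the NUL-only line then becomes an
# empty line, so it is included in the count.
iterations=$(tr -d '\000' < "$sample" | grep -c '^$')
echo "iterations: $iterations"

rm -f "$sample"
```

Whether stripping the bytes is acceptable depends on what they are doing in the file in the first place, so it is worth looking at the od output before relying on the count.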
