I was wondering which of the following solutions was the "fastest" for "larger" files:
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2          # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)                       # comm
join -v 2 <(sort file1) <(sort file2)                      # join
grep -v -F -x -f file1 file2                               # grep
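All five print the lines of file2 that are missing from file1. Something along these lines reproduces the timing comparison (a minimal sketch, assuming bash; output goes to /dev/null so printing to the terminal does not skew the numbers, and a few repetitions let you pick the fastest run):

export LC_ALL=C                # byte-wise collation, as in the setup below
for run in 1 2 3; do
    time awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 > /dev/null
    time comm -13 <(sort file1) <(sort file2) > /dev/null
    time join -v 2 <(sort file1) <(sort file2) > /dev/null
    time grep -v -F -x -f file1 file2 > /dev/null
done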
Results of my benchmarks in short:
- Do not use grep -Fxf; it's much slower (2-4 times in my tests).
- comm is slightly faster than join.
- If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (The awk solutions, of course, do not require sorted input; see the sketch after this list.)
- awk1 + awk2 supposedly use more RAM and less CPU. Real run times are lower for comm, probably because the sorts in the process substitutions run as separate processes in parallel (note that user+sys exceeds real in the results below). CPU times are lower for awk1 + awk2.
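To expand on the sorted-input point: once you pay the sort cost up front, comm and join can reuse the sorted copies directly (a sketch; file1.sorted and file2.sorted are just illustrative names):

sort file1 > file1.sorted             # sort once
sort file2 > file2.sorted
comm -13 file1.sorted file2.sorted    # lines only in file2
join -v 2 file1.sorted file2.sorted   # same result via join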
For the sake of brevity I omit the full details; anyone interested can contact me or simply repeat the tests. Roughly, the setup was
# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
321599 321599 8098710 file1
321603 321603 8098794 file2
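If you want to repeat the tests without the original data, input of the same shape is easy to synthesize (a sketch, assuming GNU coreutils; the four extra values are arbitrary, chosen outside the seq range so they only appear in file2):

seq 321599 | shuf > file1                          # 321599 unique, unsorted lines
{ cat file1; seq 900001 900004; } | shuf > file2   # file1 plus 4 extra lines, reshuffled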
Typical results of the fastest runs:
awk2: real 0m1.145s user 0m1.088s sys 0m0.056s user+sys 1.144
awk1: real 0m1.369s user 0m1.324s sys 0m0.044s user+sys 1.368
comm: real 0m0.980s user 0m1.608s sys 0m0.184s user+sys 1.792
join: real 0m1.080s user 0m1.756s sys 0m0.140s user+sys 1.896
grep: real 0m4.005s user 0m3.844s sys 0m0.160s user+sys 4.004
BTW, for the awkies: it seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
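A quick sanity check on toy input, showing that it prints the lines of file2 absent from file1, in file2's original order and without any sorting:

$ printf 'a\nb\nc\n' > file1
$ printf 'b\nc\nd\n' > file2
$ awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
d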