Finding a uniq -c substitute for big files

I have a large file (50 GB) and I would like to count the number of occurrences of different lines in it. Normally I'd use

sort bigfile | uniq -c

but the file is large enough that sorting takes a prohibitive amount of time and memory. I could do

grep -cFx 'one possible line' bigfile

for each unique line in the file, but this would mean one full pass over the file per possible line, n passes in total, which (although much more memory-friendly) takes even longer than the original.
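
Spelled out, that per-line approach would be something like the loop below (assuming the known lines live in a file, hypothetically named known_lines.txt):

while IFS= read -r pattern; do
  # one full scan of bigfile per known line
  printf '%s %s\n' "$(grep -cFx -e "$pattern" bigfile)" "$pattern"
done < known_lines.txt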

Any ideas?


A related question asks about a way to find unique lines in a big file, but I'm looking for a way to count the number of instances of each -- I already know what the possible lines are.

Ploughshare answered 2/9, 2015 at 22:22 Comment(1)
Arguably this is a degenerate case of #3502677; the https://mcmap.net/q/378800/-how-to-count-number-of-unique-values-of-a-field-in-a-tab-delimited-text-file answer is pretty much exactly what we already have, only picking out a column rather than using the whole line.Heth

Use awk

awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt

This is O(n) in time, and O(unique lines) in space.
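
Since the question says the possible lines are already known, one hedged variant counts only those lines, reading them from a hypothetical known_lines.txt first:

awk 'NR==FNR {want[$0]; next} $0 in want {c[$0]++} END {for (line in c) print c[line], line}' known_lines.txt bigfile.txt

The first file populates a membership set; only lines of bigfile.txt found in that set are counted.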

Forfend answered 2/9, 2015 at 22:31 Comment(2)
Arguably, this fills the "port this logic to awk" request in my bash version. :)Heth
bigfile.txt can be an awk command argument; no input redirection is required.Redtop

Here is a solution using jq 1.5. It is essentially the same as the awk solution, both in approach and performance characteristics, but the output is a JSON object representing the hash. (The program can be trivially modified to produce output in an alternative format.)

Invocation:

$ jq -nR 'reduce inputs as $line ({}; .[$line] += 1)' bigfile.txt

If bigfile.txt consisted of these lines:

a
a
b
a
c

then the output would be:

{
  "a": 3,
  "b": 1,
  "c": 1
}
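
As one sketch of such a modification, piping the object through to_entries yields uniq -c-style "count line" output (-r emits raw strings rather than JSON):

$ jq -nRr 'reduce inputs as $line ({}; .[$line] += 1) | to_entries[] | "\(.value) \(.key)"' bigfile.txt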
Brainbrainard answered 3/9, 2015 at 16:04 Comment(0)
#!/bin/bash
# port this logic to awk or ksh93 to make it fast

declare -A counts=( )
while IFS= read -r line; do
  counts[$line]=$(( counts[$line] + 1 )) # increment counter
done # the loop reads from stdin

# print results
for key in "${!counts[@]}"; do
  count=${counts[$key]}
  echo "Found $count instances of $key"
done
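
A possible invocation (script name assumed; since the loop reads stdin, redirect the file in):

$ bash count_lines.sh < bigfile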
Heth answered 2/9, 2015 at 22:24 Comment(2)
How does grep help? It will match all the lines.Forfend
@Barmar, it matches only the lines in the known set. As I read the question, those are interspersed in other lines the OP doesn't care about.Heth
