How to aggregate counts in a bash one-liner

I often use sort | uniq -c to produce count statistics. Now, given two files with such count statistics, I would like to combine them and add up the counts. (I know I could concatenate the original files and count there, but let's assume only the count files are accessible.)

For example given:

a.cnt:

   1 a
   2 c

b.cnt:

   2 b
   1 c

I would like to concatenate and get the following output:

   1 a
   2 b
   3 c

What's the shortest way to do this in the shell?

Edit:

Thanks for the answers so far!

Some possible side-aspects one might want to consider additionally:

  • what if a, b, c are arbitrary strings containing arbitrary whitespace?
  • what if the files are too big to fit in memory? Is there some sort | uniq -c-style command-line option for this case that only looks at two lines at a time?
Misdirection answered 13/3, 2014 at 15:52 Comment(0)

This can work for any given number of files:

$ cat a.cnt b.cnt | awk '{a[$2]+=$1} END{for (i in a) print a[i],i}'
1 a
2 b
3 c

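So if you have, say, 10 files, you just run cat f1 f2 ... and pipe the result into this awk command. Note that awk does not guarantee the iteration order of for (i in a), so pipe the output through sort if you need a particular order.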

If the file names happen to share a pattern, you can also do (thanks Adrian Frühwirth!):

awk '{a[$2]+=$1} END{for (i in a) print a[i],i}' *cnt

For example, this processes every file in the current directory whose name ends in cnt.
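
Since the iteration order of for (i in a) is unspecified, the tidy a, b, c ordering shown above is not guaranteed. A minimal sketch of a wrapper that sorts the result by key follows; the function name sum_counts is made up for illustration, and it assumes single-word keys as in the question's example.

```shell
# sum_counts (hypothetical name): add up "count key" lines from all given
# files and print the totals ordered by key.
# Assumes each key is a single word, as in the a.cnt / b.cnt example.
sum_counts() {
    awk '{ total[$2] += $1 }                        # accumulate per key
         END { for (k in total) print total[k], k }' "$@" |
    LC_ALL=C sort -k2                               # deterministic order by key
}

# Usage: sum_counts a.cnt b.cnt
```

With the a.cnt and b.cnt files from the question, this prints 1 a, 2 b, 3 c in that order regardless of the awk implementation.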


Some possible side-aspects one might want to consider additionally:

  • what if a, b, c are arbitrary strings containing arbitrary whitespace?
  • what if the files are too big to fit in memory? Is there some sort | uniq -c-style command-line option for this case that only looks at two lines at a time?

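For keys that contain whitespace, you can use the rest of the line as the index for the counter. Note that assigning to $1 makes awk rebuild $0 with single spaces between fields, so runs of whitespace inside a key are collapsed and every key keeps a leading space: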

awk '{count=$1; $1=""; a[$0]+=count} END{for (i in a) print a[i],i}' *cnt

Note that you don't actually need to run sort | uniq -c, redirect to a cnt file, and then re-count. If the original files are available, you can count directly with something like this:

awk '{a[$0]++} END{for (i in a) print a[i], i}' file

Example

$ cat a.cnt
   1 and some
   2 text here

$ cat b.cnt
   4 and some
   4 and other things
   2 text here
   9 blabla

$ cat *cnt | awk '{count=$1; $1=""; a[$0]+=count} END{for (i in a) print a[i],i}'
4  text here
9  blabla
5  and some
4  and other things
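
If keys must be preserved exactly (including runs of spaces inside them), one option is to strip the count with sub() instead of assigning to $1. The sketch below assumes the uniq -c layout, i.e. optional padding, the count, one space, then the key; the function name sum_counts_verbatim is made up for illustration.

```shell
# sum_counts_verbatim (hypothetical name): sum counts while keeping each key
# exactly as it appears after the count.
# Assumes uniq -c style lines: optional padding, count, one space, key.
sum_counts_verbatim() {
    awk '{
        count = $1
        key = $0
        sub(/^[ \t]*[0-9]+ /, "", key)   # drop padding, count and one space
        total[key] += count
    }
    END { for (k in total) print total[k], k }' "$@" |
    LC_ALL=C sort -k2                    # deterministic order by key
}

# Usage: sum_counts_verbatim *cnt
```

Because $0 is never rebuilt, "a  b" (two spaces) and "a b" stay distinct keys, and the output has a single space between count and key.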

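Regarding the second point, here is the direct count on a raw file. This keeps one array entry per distinct line in memory, not the whole file: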

$ cat b
and some
text here
and some
and other things
text here
blabla

$ awk '{a[$0]++} END{for (i in a) print a[i], i}' b
2 and some
2 text here
1 and other things
1 blabla
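
For the case where even the set of distinct keys is too large for an awk array, a possible sketch in the spirit of sort | uniq -c is to sort the count files by key first and then sum adjacent lines, so awk only ever holds the current key and its running total. This assumes the uniq -c layout (optional padding, count, one space, key) and relies on sort to handle large inputs, which GNU sort does by spilling to temporary files. The function name merge_sorted_counts is made up for illustration.

```shell
# merge_sorted_counts (hypothetical name): streaming merge of count files.
# sort brings equal keys next to each other; awk then sums each run of
# identical keys, remembering only the previous key and the running sum.
merge_sorted_counts() {
    LC_ALL=C sort -k2 "$@" |
    awk '{
        count = $1
        key = $0
        sub(/^[ \t]*[0-9]+ /, "", key)      # key = line without the count
        if (NR > 1 && key != prev) {        # key changed: flush finished run
            print sum, prev
            sum = 0
        }
        sum += count
        prev = key
    }
    END { if (NR > 0) print sum, prev }'    # flush the last run
}

# Usage: merge_sorted_counts a.cnt b.cnt
```

The output comes out ordered by key as a side effect of the initial sort.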
Switchblade answered 13/3, 2014 at 15:57 Comment(2)
Or, you know, just skip the pipe :-) - Forbore
@dhokas absolutely. Updated the post with the change, thanks for reporting! - Switchblade

Using awk:

awk 'FNR==NR{a[$2]=$1;next} {a[$2]+=$1} END{for (k in a) print a[k], k}' a.cnt b.cnt
1 a
2 b
3 c
Stevenage answered 13/3, 2014 at 15:56 Comment(0)
$ awk '{a[$2]+=$1}END{for(i in a){print a[i], i}}' a.cnt b.cnt
1 a
2 b
3 c
Slier answered 13/3, 2014 at 15:57 Comment(0)
