Why does "uniq" count identical words as different?

4

8

I want to calculate the frequency of the words in a file that has one word per line. The file is really big, so this might be the problem (it has about 300k lines in this example).

I run this command:

cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt

and the problem is that the output is slightly wrong: it treats identical words as different.

For example, the first entries are:

306 continua 
278 apertura 
211 eventi 
189 murah 
182 giochi 
167 giochi 

with giochi repeated twice as you can see.

At the bottom of the file it becomes even worse and it looks like this:

  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 winchester 
  1 wind 
  1 wind 

for all the words.

What am I doing wrong?

Sailing answered 8/8, 2012 at 8:20 Comment(0)
14

Try sorting first, since uniq only collapses duplicate lines that are adjacent:

cat .temp_occ | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt
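As a rough illustration of the fixed pipeline, here is a minimal sketch run on made-up data. The temporary file and its six words are assumptions standing in for .temp_occ; the pipeline itself is the one above, with sort reading the file directly instead of going through cat.

```shell
# Hypothetical sample data standing in for .temp_occ (one word per line).
tmp=$(mktemp)
printf '%s\n' giochi win giochi wind win giochi > "$tmp"

# Sort first so identical words become adjacent, count each run with uniq -c,
# then order by descending count, breaking ties on the rest of the line.
distribution=$(sort "$tmp" | uniq -c | sort -k1,1nr -k2)
printf '%s\n' "$distribution"

rm -f "$tmp"
```

Each word should now appear exactly once, with its total count in the first column.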
Neoprene answered 8/8, 2012 at 8:24 Comment(1)
I feel stupid, thanks a lot, and sorry again for the noob question. - Sailing
6

Or use "sort -u" which also eliminates duplicates. See here.
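One caveat worth checking before switching: sort -u removes duplicates but does not report how often each line occurred, so it may not be enough for a frequency table. A small sketch on made-up words (the sample list is an assumption) shows the difference:

```shell
# Hypothetical sample words, one per line.
words=$(printf 'wind\nwin\nwind\nwin\nwin\n')

# sort -u keeps one copy of each distinct line, with no counts.
unique_only=$(printf '%s\n' "$words" | sort -u)

# sort | uniq -c keeps one copy of each distinct line plus its count.
# awk only normalizes the padding that uniq -c puts in front of the count.
with_counts=$(printf '%s\n' "$words" | sort | uniq -c | awk '{print $1, $2}')

printf 'sort -u:\n%s\n\nsort | uniq -c:\n%s\n' "$unique_only" "$with_counts"
```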

Leupold answered 8/8, 2012 at 8:26 Comment(0)
4

The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

So running uniq on

a
b
a

will return:

a
b
a
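The same three-line input can be used to see the effect with counts. This is only a sketch; the awk step is an addition that strips the leading padding from uniq -c so the two results are easy to compare.

```shell
# Unsorted: the two "a" lines are separated by "b", so uniq sees three runs.
unsorted=$(printf 'a\nb\na\n' | uniq -c | awk '{print $1, $2}')

# Sorted: both "a" lines end up adjacent and are counted as one run of two.
sorted=$(printf 'a\nb\na\n' | sort | uniq -c | awk '{print $1, $2}')

printf 'without sort:\n%s\n\nwith sort:\n%s\n' "$unsorted" "$sorted"
```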
Achievement answered 13/5, 2015 at 13:30 Comment(0)
1

Is it possible that some of the words have whitespace characters after them? If so you should remove them using something like this:

cat .temp_occ | tr -d ' ' | uniq -c | sort -k1,1nr -k2 > distribution.txt
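If trailing whitespace is a suspect, one way to check is to count the affected lines before changing anything. The sketch below uses made-up input with a trailing space and a trailing tab; it swaps in grep and sed with the POSIX [[:space:]] class rather than tr -d ' ', so that tabs are covered too, and it sorts before counting.

```shell
# Hypothetical input: "win", "win" plus a trailing space, "wind" plus a trailing tab.
sample=$(printf 'win\nwin \nwind\t\n')

# Count the lines that end in a whitespace character.
trailing=$(printf '%s\n' "$sample" | grep -c '[[:space:]]$')

# Strip trailing whitespace, then sort and count as usual.
# awk only normalizes the padding that uniq -c puts in front of the count.
cleaned=$(printf '%s\n' "$sample" | sed 's/[[:space:]]*$//' | sort | uniq -c | awk '{print $1, $2}')

printf 'lines with trailing whitespace: %s\n%s\n' "$trailing" "$cleaned"
```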
Gemination answered 8/8, 2012 at 8:26 Comment(1)
No, I already checked that before posting. This was what I thought too, but the whitespace is the same for all the words. The solution of running sort before uniq worked like a charm. Thanks for the help :) - Sailing
