Suppose I have a file similar to the following:
123
123
234
234
123
345
I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. So ideally, the output would be like:
123 3
234 2
345 1
Assuming there is one number per line:
sort <file> | uniq -c
You can use the more verbose --count
flag too with the GNU version, e.g., on Linux:
sort <file> | uniq --count
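Note that uniq only counts adjacent repeats, which is why the file is sorted first. A quick sketch of the difference, assuming the sample above is saved as numbers.txt (a file name chosen here just for illustration):
uniq -c numbers.txt          # unsorted: only adjacent duplicates are grouped
      2 123
      2 234
      1 123
      1 345
sort numbers.txt | uniq -c   # sorted: identical lines become adjacent, so each value is counted once
      3 123
      2 234
      1 345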
You can sort again like: sort <file> | uniq -c | sort -n – Andersonandert
Instead of -d, I would have taken …| uniq -c | grep -v '^\s*1' (-v means inverse regexp, i.e. it rejects matches; not verbose, not version :) – Cloakroom
Use -c -d to print duplicate lines and the count – Morry
This will print duplicate lines only, with counts:
sort FILE | uniq -cd
or, with GNU long options (on Linux):
sort FILE | uniq --count --repeated
On BSD and OSX, you have to use grep to filter out the unique lines:
sort FILE | uniq -c | grep -v '^ *1 '
For the given example, the result would be:
3 123
2 234
If you want to print counts for all lines including those that appear only once:
sort FILE | uniq -c
or, with GNU long options (on Linux):
sort FILE | uniq --count
For the given input, the output is:
3 123
2 234
1 345
In order to sort the output with the most frequent lines on top, you can do the following (to get all results):
sort FILE | uniq -c | sort -nr
or, to get only duplicate lines, most frequent first:
sort FILE | uniq -cd | sort -nr
On OSX and BSD, the final one becomes:
sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr
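If you also want ties in the count broken by the value itself, here is a small sketch using sort's key options (an extra refinement, not part of the answer above):
sort FILE | uniq -c | sort -k1,1nr -k2,2n   # count descending, then value ascending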
Appending | sort -n or | sort -nr to the pipe will sort the output by repetition count (ascending or descending, respectively). This is not what you're asking, but I thought it might help. – Willaims
| awk '$1>100' – Willaims
sort FILE | uniq -cd should work on OSX too – Willaims
sort FILE | uniq -c | grep -v '^ *1 ' – Willaims
awk '$1>1' seems a lot better than grep -v '^ *1 ' to me. It allows us to change the minimum duplicate count with ease and works flawlessly even on macOS. :) – Haeres
sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr is beautiful! – Standley
uniq -d if you don't care about the count – Branton
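Building on the awk suggestion in the comments above, a minimal sketch with an adjustable minimum duplicate count (the min variable and the threshold 2 are illustrative choices, not from the comments):
sort FILE | uniq -c | awk -v min=2 '$1 >= min'   # keep only lines that occur at least min times, with their counts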
To find and count duplicate lines in multiple files, you can try the following command:
sort <files> | uniq -c | sort -nr
or:
cat <files> | sort | uniq -c | sort -nr
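A usage sketch with hypothetical file names; sort accepts multiple file operands directly, so the cat variant is optional:
sort file1.txt file2.txt file3.txt | uniq -c | sort -nr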
Via awk:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' data
In the dups[$1]++ part, the variable $1 holds the entire contents of column 1 and the square brackets are array access. So, for the first column of each line in the data file, the element of the array named dups is incremented.
At the end, we loop over the dups array with num as the variable and print the saved numbers first, then their number of occurrences from dups[num].
Note that your input file has trailing spaces at the end of some lines; if you clean those up, you can use $0 in place of $1 in the command above :)
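If you also want the most frequent values first, the awk output can be piped to sort; a minimal sketch, reusing the data file name from the answer above (the sort step is an addition, not part of the original command):
awk '{dups[$1]++} END{for (num in dups) print num, dups[num]}' data | sort -k2,2nr   # order by the count column (field 2), highest first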
uniq? – Salian
sort | uniq and the awk solution have quite different performance & resource trade-offs: if the files are large and the number of different lines is small, the awk solution is a lot more efficient. It is linear in the number of lines, and the space usage is linear in the number of different lines. OTOH, the awk solution needs to keep all the different lines in memory, while (GNU) sort can resort to temp files. – Garnettgarnette
In Windows, using "Windows PowerShell", I used the command below to achieve this:
Get-Content .\file.txt | Group-Object | Select Name, Count
Also, we can use the Where-Object cmdlet to filter the result:
Get-Content .\file.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count
You can also sort, using ...| Sort -Top 15 -Descending Count | Select Name – Test
To find duplicate counts, use this command:
sort filename | uniq -c | awk '{print $2, $1}'
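For the sample input from the question, the awk step swaps the columns so the value comes first and the count second, matching the desired output:
123 3
234 2
345 1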
Assuming you've got access to a standard Unix shell and/or Cygwin environment:
tr -s ' ' '\n' < yourfile | sort | uniq -d -c
^--space char
Basically: convert all space characters to linebreaks, then sort the translated output and feed that to uniq and count the duplicate lines.
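A quick sketch, assuming yourfile holds the question's numbers separated by spaces on a single line (hypothetical content created here just to demonstrate):
printf '123 123 234 234 123 345\n' > yourfile
tr -s ' ' '\n' < yourfile | sort | uniq -d -c
#       3 123
#       2 234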