Count unique values in one column based on a duplicate in another column

I have a file with an identifier and a value:

ABC123 111111
ABC123 111111
ABCDEF 333333
ABCDEF 111111
CCCCCC 333333
ABC123 222222
DEF123 444444
DEF123 444444

Both columns contain duplicate values, but I need to count lines which have the same ID (first column) and a unique value (second column). This would make the output from the above input:

ABCDEF 2
ABC123 2
DEF123 1
CCCCCC 1

...where the first column is the ID and the second column is the count of unique values in the second column. In other words, I need to find out how many unique values exist for a given ID.

The closest I've come is this but all it does is count the first column's unique values:

cut -d " " -f1 "file.txt" | uniq -cd | sort -nr | head

How would I do something like this in Bash?

Irenairene answered 20/8 at 19:36 Comment(6)
Start with sorting: sort PHA-DC.txt | cut -d " " -f1 | uniq …Aaberg
how many unique values exist for a given ID would also print 2 for "ABC123". Do you really want to only count IDs where no duplicate occurs at all?Unclose
@AndreWildberg Good catch, you're right that ABC123 from the examples should also produce 2.Irenairene
Since we're sanitizing the test data, DEF123 does not have 2 distinct values, it occurs with 444444 twice.Fragmental
Why is DEF123 2 not in output now?Confluence
Thanks to you both, turns out writing good, exhaustive examples is hard or I'm just really tired.Irenairene

Is this close enough?

$ sort -u file.txt | cut -d' ' -f1 | uniq -c
   2 ABC123
   2 ABCDEF
   1 CCCCCC
   1 DEF123

You can further filter it with a | grep -vw '1' to mimic the HAVING COUNT(DISTINCT value) > 1 semantics and eliminate the last two rows from the output in this example (assuming that 1 is not a legal value for an identifier!).
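
Applied to the sample data (under that same assumption about identifiers), the filter leaves only the IDs that have more than one distinct value:

$ sort -u file.txt | cut -d' ' -f1 | uniq -c | grep -vw '1'
   2 ABC123
   2 ABCDEF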

And you can of course reverse the column order in several ways. E.g.

$ sort -u file.txt  |        # sort and eliminate multiple occurrences of the same '<identifier> <value>' pair
    cut -d' ' -f1   |        # keep only the identifier
    uniq -c         |        # collapse and count occurrences of the same identifier
    grep -vw '1'    |        # eliminate rows containing the word '1', assuming this can only be a count value, never an identifier!
    awk '{print $2 " " $1}'  # reverse column order to show '<identifier> <count>'
ABC123 2
ABCDEF 2
Fragmental answered 20/8 at 20:19 Comment(2)
Just tested and it works perfectly, thanks. Being able to chain grep -v is especially useful when I only want to look at lines with more than 1 or even 2 duplicates, although it's worth noting I also needed a sort -rn to emulate the sorting.Irenairene
uniq already has a -d switch to report only duplicate lines; there is no need for hacking!Runkel
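
A sketch of that suggestion (assuming a uniq implementation, such as GNU coreutils or BSD, where -c and -d can be combined): after de-duplicating the pairs, uniq -cd keeps only the IDs that still occur more than once, so the separate grep is not needed:

$ sort -u file.txt | cut -d' ' -f1 | uniq -cd
   2 ABC123
   2 ABCDEF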

With any awk:

awk '
    !seen[$0]++ {++uniqs[$1]}
    END {for(id in uniqs) print id, uniqs[id]}
' file.txt

or if the white space between fields can vary then still using any awk:

awk '
    !seen[$1,$2]++ {++uniqs[$1]}
    END {for(id in uniqs) print id, uniqs[id]}
' file.txt

or with GNU awk (for multi-dim arrays):

awk '
    !seen[$1][$2]++ {++uniqs[$1]}
    END {for(id in uniqs) print id, uniqs[id]}
' file.txt

Any of these prints the following (output order is not guaranteed):

DEF123 1
ABC123 2
ABCDEF 2
CCCCCC 1
Lagerkvist answered 20/8 at 20:50 Comment(3)
With standard awk you can use seen[$1,$2] to emulate a 2-d array.Unctuous
You're right; now that you mention it even seen[$0] could be usedLagerkvist
@Lagerkvist : I haven't tested this idea, but since you're already using gawk-specific features, wouldn't for (id in seen) { print id, length(seen[id]) } achieve the same effect as counting uniques for each $1? If so, you can consolidate it to just one array. Additionally, since you're just counting uniques, maybe you could just do { seen[$1][$2] } END { ... } and skip the incrementing altogether.Friable
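
A minimal sketch of that consolidated idea, assuming GNU awk (true multidimensional arrays and length() on a subarray are gawk extensions); the output is piped through sort here only to make the displayed order deterministic:

awk '{seen[$1][$2]} END {for (id in seen) print id, length(seen[id])}' file.txt | sort
ABC123 2
ABCDEF 2
CCCCCC 1
DEF123 1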

Here is a Ruby to do that:

ruby -lane 'BEGIN{ cnt=Hash.new{|h,k| h[k]=[]} }
cnt[$F[0]]<<$F[1]
END{
    cnt.select{|k,v| v.length>1 }.
        each{|k,v| puts "#{k} #{v.uniq.length}"} 
}
' file.txt

Prints:

ABC123 2
ABCDEF 2
DEF123 1

It is unclear if CCCCCC 1 is supposed to be in the output. If so, no need to filter:

ruby -lane 'BEGIN{ cnt=Hash.new{|h,k| h[k]=[]} }
cnt[$F[0]]<<$F[1]
END{ cnt.each{|k,v| puts "#{k} #{v.uniq.length}"} }
' file.txt

Prints:

ABC123 2
ABCDEF 2
CCCCCC 1
DEF123 1

You can also use this POSIX pipe:

sort -u file.txt | awk '{cnt[$1]++} END{for (e in cnt) print e, cnt[e]}'

Or use awk only, eliminating the need for uniquifying with sort -u:

awk '!seen[$0]++ {cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file.txt

Either of these prints (perhaps in different order):

CCCCCC 1
ABCDEF 2
DEF123 1
ABC123 2
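
If a fixed order is wanted (the question's expected output has counts in descending order), either command can additionally be piped through sort; a sketch, sorting by count descending and then by ID:

awk '!seen[$0]++ {cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file.txt | sort -k2,2nr -k1,1

This prints:

ABC123 2
ABCDEF 2
CCCCCC 1
DEF123 1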
Keslie answered 20/8 at 21:25 Comment(0)

This awk should work for you:

awk '{
   uq[$0]                 # record each full line as an array key (deduplicates; no count needed)
}
END {
   for (i in uq) {        # store frequency of uniques in fq
      sub(/ .*/, "", i)
      ++fq[i]
   }
   for (i in fq)          # print output from fq
      print i, fq[i]
}' file

CCCCCC 1
ABCDEF 2
DEF123 1
ABC123 2
Confluence answered 20/8 at 20:5 Comment(2)
Looks like question was edited to show a different expected output after I posted this answerConfluence
I can see multiple edits to the question.Colonist

I would harness GNU AWK for this task in the following way. Let file.txt content be

ABC123 111111
ABC123 111111
ABCDEF 333333
ABCDEF 111111
CCCCCC 333333
ABC123 222222
DEF123 444444
DEF123 444444

then

awk '{arr[$1][$2]}END{for(i in arr){print i,length(arr[i])}}' file.txt

gives output

DEF123 1
ABC123 2
ABCDEF 2
CCCCCC 1

Explanation: I use a 2D array arr without storing any values (only keys), then I iterate over the upper level and, for each key, determine the number of subkeys using the length function. If you need a certain order in the output, set PROCINFO["sorted_in"] in a BEGIN block to one of the Predefined Array Scanning Orders, for example

awk 'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"}{arr[$1][$2]}END{for(i in arr){print i,length(arr[i])}}' file.txt

will give output

ABC123 2
ABCDEF 2
CCCCCC 1
DEF123 1

i.e. lexicographically ascending order

(tested in GNU Awk 5.1.0)

Salicin answered 21/8 at 7:36 Comment(0)

Assumptions:

  • each line has 2 space-delimited strings
  • duplicate lines are truly duplicate (eg, they have the same amount of leading, embedded and trailing white space)

Another awk approach:

awk '
    { lines[$0] }                            # capture unique lines
END { for (line in lines) {                  # loop through list of unique lines
          split(line,a)                      # split line on white space
          counts[a[1]]++                     # count number of times we see the first field (aka "id")
      }
      for (id in counts)                     # loop through list of id
          print id, counts[id]               # print id and count
    }
' file.txt

This generates:

ABC123 2
DEF123 1
ABCDEF 2
CCCCCC 1

If the output needs to be ordered then pipe the results to the appropriate sort command, eg:

$ awk '<see script from above>' file.txt | sort -k2,2nr -k1,1r
ABCDEF 2
ABC123 2
CCCCCC 1
DEF123 1
Belfry answered 20/8 at 21:16 Comment(0)

How would I do something like this in Bash?

With bash, using an associative array, which is (in my opinion) a poor man's version of markp-fuso's awk approach above.


declare -A id_value id_count

# Read each line from the file, splitting into id and value
while read -r id value; do
  id_value["$id $value"]=1  # Store the unique ID-value pair
done < file.txt

# Count unique values for each ID
for key in "${!id_value[@]}"; do
  id="${key%% *}"            # Extract the ID from the key
  ((id_count["$id"]++))      # Increment the count for this ID
done

##: declare -p id_count

# Print the results
for id in "${!id_count[@]}"; do
  printf '%s %s\n' "$id" "${id_count["$id"]}"
done

  • Will be very, very slow on large data/file sizes.

• See Issue with associative array arithmetic context
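
For reference, saving the script above as count_unique.sh (a hypothetical name) and running it against the sample file should give the expected counts; since the iteration order of a bash associative array is unspecified, the output is piped through sort here:

$ bash count_unique.sh | sort
ABC123 2
ABCDEF 2
CCCCCC 1
DEF123 1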

Lori answered 21/8 at 4:59 Comment(0)
