Output whole line once for each unique value of a column (Bash)


This must surely be a trivial task with awk or otherwise, but it's left me scratching my head this morning. I have a file with a format similar to this:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print one line (the first occurrence) for each distinct peptide in column 2, meaning the above input would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I've tried so far, but clearly neither does what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file |sort | uniq -u -f 4
# Altogether omits peptides which are not unique...

One last thing: it needs to treat peptides that are substrings of other peptides as distinct values (e.g. VSSILED and VSSILEDKILSR). Thanks :)

Adorne answered 21/8, 2012 at 10:9 Comment(0)

A

17

One way using awk:

awk '!array[$2]++' file.txt

Results:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

Alainaalaine answered 21/8, 2012 at 10:22 Comment(2)

Could you explain your approach, please? (like why array and ++) – Georginageorgine 13/2, 2023 at 13:48

@Mauri1313 Basically, this uses an associative array (called 'array'), keyed on the second field, to print only the first line seen for each key. On the first occurrence of a key, array[$2] is unset and evaluates to 0, so !array[$2] is true and the line is printed; the post-increment (++) then sets the stored count to 1. On any later occurrence the count is non-zero, the negation is false, and the line is not printed. – Alainaalaine 13/2, 2023 at 15:47
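The explanation above can be spelled out in long form. This is only a sketch for illustration: the function name dedupe_by_field2, the variable names seen and count, and the temporary sample file are assumptions and not part of the original answer. It should behave the same as awk '!array[$2]++'.

```shell
# Hypothetical sample file holding the eight lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Long form of the one-liner: keep the first line seen for each value of field 2.
dedupe_by_field2() {
    awk '{
        count = seen[$2]        # unset (treated as 0) the first time a peptide appears
        seen[$2] = count + 1    # remember that this peptide has now been seen
        if (count == 0)         # only the first occurrence is printed
            print
    }' "$1"
}

dedupe_by_field2 "$sample"
```

Because the array key is the whole field, a peptide that is a substring of another one (VSSILED versus VSSILEDKILSR) gets its own entry and is treated as distinct.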

P

21

Just use sort:

sort -k 2,2 -u file

The -u option keeps only one line per distinct key (as you wanted), and -k 2,2 restricts the sort key to field 2, so the rest of the line is ignored when checking for duplicates.
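One thing worth checking on your own data: the result comes out ordered by the peptide column rather than in the original input order. A small sketch of the difference, assuming the question's sample is saved in a temporary file; the LC_ALL=C setting is an addition here to make the ordering byte-wise and reproducible, and the awk line is only used for comparison.

```shell
# Hypothetical sample file holding the eight lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# One line per distinct peptide, ordered by the peptide column.
LC_ALL=C sort -k 2,2 -u "$sample"

# Peptide order after sorting: VSSILEDKILSR now comes before VSSILEDKTT.
LC_ALL=C sort -k 2,2 -u "$sample" | awk '{print $2}'

# Peptide order when the input order is preserved, for comparison.
awk '!seen[$2]++ {print $2}' "$sample"
```

If the original line order matters, an order-preserving filter is probably the safer choice; if it does not, sort alone is enough.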

Perceptive answered 21/8, 2012 at 10:23 Comment(1)

Awesome... and if you want the top X unique entries rather than just one, then once you've sorted the file with 'sort' you can use a little app I created here: github.com/danieliversen/MiscStuff/blob/master/scripts/… – Wasp 25/2, 2016 at 10:30

J

2

I would use Perl for this:

perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt

The -n switch processes the input line by line, and the -a switch splits each line on whitespace into the @F array, so $F[1] is the second field.

Jazminejazz answered 21/8, 2012 at 10:20 Comment(2)

Same thing in awk: awk '{ if(!($2 in peptides)) { peptides[$2] = 1; print $0 } } ' > fp – Chapbook 21/8, 2012 at 10:24

I can see that this is where Perl really excels. Great answer, thank you. – Adorne 21/8, 2012 at 11:5

C

2

awk '{if($2==temp){next;}else{print}temp=$2}' your_file

Tested below:

> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK         2       genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK            1       genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR      5       genes ADUm.367
pep> VSSILEDKTT         9       genes ADUm.1192,ADUm.2731
pep> AIQLTGK            10      genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR       3       genes ADUm.2146,ADUm.5750

Covin answered 21/8, 2012 at 10:35 Comment(2)

More verbose but very easy to understand. Thanks :) – Adorne 21/8, 2012 at 11:7

This returns AIQLTGK twice. – Rothberg 21/8, 2012 at 11:17
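As that last comment points out, comparing each line only with the previous one removes duplicates only when they sit next to each other. One possible workaround, sketched here under assumptions, is to sort on column 2 first so equal peptides become neighbours and then apply the same previous-line comparison. The -s (stable) flag is a GNU/BSD sort extension rather than POSIX, the variable name prev and the temporary file are illustrative, and the output ends up in sorted order, not input order.

```shell
# Hypothetical sample file holding the eight lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Group equal peptides together, keeping input order inside each group (-s),
# then print a line only when its peptide differs from the previous line.
LC_ALL=C sort -s -k 2,2 "$sample" |
    awk '$2 != prev { print } { prev = $2 }'
```

With the sample above this should leave one AIQLTGK line instead of two.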
