Eliminate partially duplicate lines by column and keep the last one

I have a file that looks like this:

2011-03-21 name001 line1
2011-03-21 name002 line2
2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

for each name, I only want its last appearance. So, I expect the result to be:

2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

Could someone give me a solution with bash/awk/sed?

Hwu answered 25/3, 2011 at 7:52 Comment(0)

This gets the unique lines by the second field, but scanning from the end of the file so the last occurrence wins (as in your example result):

tac temp.txt | sort -k2,2 -r -u
Bevins answered 25/3, 2011 at 8:8 Comment(4)
Make sure that the last line of your input file ends with a \n, otherwise tac will concatenate it with the last-but-one line. – Aires
To specify another separator, use -t: tac temp.txt | sort -k1,1 -r -u -t@ – Fuegian
Would you mind explaining the sort parameters -k2,2? :) – Promotion
@Promotion There is a good description in the wiki here and here. – Bevins
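As a quick check of this pipeline on the example data (the file name temp.txt is an assumption; note this relies on GNU sort keeping the first of equal-keyed lines when -u is given):

```shell
# Recreate the example input.
printf '%s\n' \
  '2011-03-21 name001 line1' \
  '2011-03-21 name002 line2' \
  '2011-03-21 name003 line3' \
  '2011-03-22 name002 line4' \
  '2011-03-22 name001 line5' > temp.txt

# tac prints the file last line first, so each name's final line is seen
# before its earlier ones; sort -u then keeps the first line it meets for
# each key. -k2,2 sorts on the second field only; -r reverses the order.
tac temp.txt | sort -k2,2 -r -u
```

The output is the last line for each name, sorted by name in reverse.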
awk '{a[$2]=$0} END {for (i in a) print a[i]}' file

If order of appearance is important:

  • Based on first appearance:

    awk '!a[$2] {b[++i]=$2} {a[$2]=$0} END {for (j=1; j<=i; j++) print a[b[j]]}' file
    
  • Based on last appearance:

    tac file | awk '!a[$2] {a[$2]=$0; b[++i]=$2} END {for (j=i; j>=1; j--) print a[b[j]]}'
    
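A quick check of the two order-preserving variants on the example data (the file name is an assumption; the END loops use an explicit index because `for (i in b)` does not guarantee numeric order in every awk):

```shell
# Recreate the example input.
printf '%s\n' \
  '2011-03-21 name001 line1' \
  '2011-03-21 name002 line2' \
  '2011-03-21 name003 line3' \
  '2011-03-22 name002 line4' \
  '2011-03-22 name001 line5' > file

# Order of first appearance, content of last appearance:
awk '!a[$2] {b[++i]=$2} {a[$2]=$0} END {for (j=1; j<=i; j++) print a[b[j]]}' file

# Order and content of last appearance (tac reverses, so the first line
# seen per name is its last one; printing b backwards restores file order):
tac file | awk '!a[$2] {a[$2]=$0; b[++i]=$2} END {for (j=i; j>=1; j--) print a[b[j]]}'
```

The second command produces exactly the output the question asks for: line3, line4, line5.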
Stryker answered 25/3, 2011 at 8:4 Comment(5)
This is good – simple and robust. The order of the output does not match the order of the input, though, if that is important. Is there an easy way to fix that? – Evenings
@Evenings Yes, but this will result in a much more complex awk program. I'll edit my answer. – Stryker
Actually, I meant just reversing the printing of the array rather than which entry is selected, so that the output would be in time order: line 3, line 4, line 5 rather than line 5, line 4, line 3. +1 from me for the first simple answer. Oh wait, yeah – I see that is what you were doing; it does get stupidly complex. – Evenings
@Evenings Oh, I misunderstood :) ... well, you can always pipe its output to sort. That would be much simpler than trying to cram everything into awk. – Stryker
I used the simplest one and added a sort on the timestamp field after that. Really a good solution, thanks! – Hwu
sort < bar > foo
uniq  < foo > bar

bar now has no duplicated lines
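The two steps can also be collapsed into one with sort's -u flag; note that, like the two-step version, this deduplicates whole lines rather than keying on the second field:

```shell
# Sample input (file names taken from the answer above).
printf 'b\na\nb\n' > bar

# Sort and drop duplicate lines in a single step.
sort -u bar > foo
cat foo
```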

Risible answered 23/6, 2011 at 6:28 Comment(2)
Given the OP's example, all the lines would be counted as unique. He only wants the second field to be used to determine uniqueness. – Superincumbent
+1 ...but this answers the title ('bash eliminate duplicate lines' at the moment), which is what Google seemed to use to send me here! – Laboy

EDIT: Here's a version that actually answers the question.

sort -k 2,2 -k 1,1r filename | while read f1 f2 f3; do if [ "$f2" != "$lf2" ]; then echo "$f1 $f2 $f3"; lf2="$f2"; fi; done
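A sketch of this approach checked against the example data, with the date key sorted in reverse within each name so the newest line survives (the file name `filename` and the assumption that the date field orders the same as line order are both from the question's example):

```shell
# Recreate the example input.
printf '%s\n' \
  '2011-03-21 name001 line1' \
  '2011-03-21 name002 line2' \
  '2011-03-21 name003 line3' \
  '2011-03-22 name002 line4' \
  '2011-03-22 name001 line5' > filename

# Sort by name (field 2), then by date (field 1) descending, so each
# name's newest line comes first; the loop prints only the first line
# per name group.
sort -k 2,2 -k 1,1r filename | while read f1 f2 f3; do
  if [ "$f2" != "$lf2" ]; then echo "$f1 $f2 $f3"; lf2="$f2"; fi
done
```

The output comes out grouped by name rather than in file order.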
Chrysolite answered 25/3, 2011 at 7:54 Comment(1)
I believe an awk script implementing the same logic would be more efficient. – Shepley

© 2022 - 2024 — McMap. All rights reserved.