Eliminate partially duplicate lines by column and keep the last one

I have a file that looks like this:

2011-03-21 name001 line1
2011-03-21 name002 line2
2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

for each name, I only want its last appearance. So, I expect the result to be:

2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

Could someone give me a solution with bash/awk/sed?

Hwu answered 25/3, 2011 at 7:52 Comment(0)

This gets the unique lines by the second field, but scanning from the end of the file so the last occurrence wins (as in your example result):

tac temp.txt | sort -k2,2 -r -u
Bevins answered 25/3, 2011 at 8:8 Comment(4)
Make sure that the last line of your input file ends with a \n, otherwise tac will concatenate it with the last-but-one line. – Aires
To specify another separator, use -t: tac temp.txt | sort -k1,1 -r -u -t@ – Fuegian
Would you mind explaining the sort parameters -k2,2? :) – Promotion
@Promotion There is a good description in the wiki here and here. – Bevins
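As a quick check of this pipeline on the example data (the file name temp.txt is an assumption; note this relies on GNU sort keeping the first of equal-keyed lines when -u is given):

```shell
# Recreate the example input.
printf '%s\n' \
  '2011-03-21 name001 line1' \
  '2011-03-21 name002 line2' \
  '2011-03-21 name003 line3' \
  '2011-03-22 name002 line4' \
  '2011-03-22 name001 line5' > temp.txt

# tac prints the file last line first, so each name's final line is seen
# before its earlier ones; sort -u then keeps the first line it meets for
# each key. -k2,2 sorts on the second field only; -r reverses the order.
tac temp.txt | sort -k2,2 -r -u
```

The output is the last line for each name, sorted by name in reverse.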
awk '{a[$2]=$0} END {for (i in a) print a[i]}' file

If order of appearance is important:

  • Based on first appearance:

    awk '!a[$2] {b[++i]=$2} {a[$2]=$0} END {for (j=1; j<=i; j++) print a[b[j]]}' file
    
  • Based on last appearance:

    tac file | awk '!a[$2] {a[$2]=$0; b[++i]=$2} END {for (j=i; j>=1; j--) print a[b[j]]}'
    
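A quick check of the two order-preserving variants on the example data (the file name is an assumption; the END loops use an explicit index because `for (i in b)` does not guarantee numeric order in every awk):

```shell
# Recreate the example input.
printf '%s\n' \
  '2011-03-21 name001 line1' \
  '2011-03-21 name002 line2' \
  '2011-03-21 name003 line3' \
  '2011-03-22 name002 line4' \
  '2011-03-22 name001 line5' > file

# Order of first appearance, content of last appearance:
awk '!a[$2] {b[++i]=$2} {a[$2]=$0} END {for (j=1; j<=i; j++) print a[b[j]]}' file

# Order and content of last appearance (tac reverses, so the first line
# seen per name is its last one; printing b backwards restores file order):
tac file | awk '!a[$2] {a[$2]=$0; b[++i]=$2} END {for (j=i; j>=1; j--) print a[b[j]]}'
```

The second command produces exactly the output the question asks for: line3, line4, line5.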
Stryker answered 25/3, 2011 at 8:4 Comment(5)
This is good – simple and robust. The order of the output does not match the order of the input, though, if that is important. Is there an easy way to fix that? – Evenings
@Evenings Yes, but this will result in a much more complex awk program. I'll edit my answer. – Stryker
Actually, I meant just reversing the printing of the array rather than which entry is selected, so that the output would be in time order: line 3, line 4, line 5 rather than line 5, line 4, line 3. +1 from me for the first simple answer. Oh wait, yeah – I see that is what you were doing; it does get stupidly complex. – Evenings
@Evenings Oh, I misunderstood :) ... well, you can always pipe its output to sort. That would be much simpler than trying to cram everything into awk. – Stryker
I used the simplest one and added a sort on the timestamp field after that. Really a good solution, thanks! – Hwu
sort < bar > foo
uniq  < foo > bar

bar now has no duplicated lines
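The two steps can also be collapsed into one with sort's -u flag; note that, like the two-step version, this deduplicates whole lines rather than keying on the second field:

```shell
# Sample input (file names taken from the answer above).
printf 'b\na\nb\n' > bar

# Sort and drop duplicate lines in a single step.
sort -u bar > foo
cat foo
```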

Risible answered 23/6, 2011 at 6:28 Comment(2)
Given the OP's example, all the lines would be counted as unique. He only wants the second field to be used to determine uniqueness. – Superincumbent
+1 ...but this answers the title ('bash eliminate duplicate lines' at the moment), which is what Google seemed to use to send me here! – Laboy

EDIT: Here's a version that actually answers the question.

sort -k 2,2 -k 1,1r filename | while read f1 f2 f3; do if [ "$f2" != "$lf2" ]; then echo "$f1 $f2 $f3"; lf2="$f2"; fi; done
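A sketch of this approach checked against the example data, with the date key sorted in reverse within each name so the newest line survives (the file name `filename` and the assumption that the date field orders the same as line order are both from the question's example):

```shell
# Recreate the example input.
printf '%s\n' \
  '2011-03-21 name001 line1' \
  '2011-03-21 name002 line2' \
  '2011-03-21 name003 line3' \
  '2011-03-22 name002 line4' \
  '2011-03-22 name001 line5' > filename

# Sort by name (field 2), then by date (field 1) descending, so each
# name's newest line comes first; the loop prints only the first line
# per name group.
sort -k 2,2 -k 1,1r filename | while read f1 f2 f3; do
  if [ "$f2" != "$lf2" ]; then echo "$f1 $f2 $f3"; lf2="$f2"; fi
done
```

The output comes out grouped by name rather than in file order.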
Chrysolite answered 25/3, 2011 at 7:54 Comment(1)
I believe an awk script implementing the same logic would be more efficient. – Shepley

© 2022 - 2024 — McMap. All rights reserved.