Bash: is it possible to use uniq on only one column of a line?

    1.gui  Qxx  16
    2.gu   Qxy  23
    3.guT  QWS  18
    4.gui  Qxr  21

I want to sort a file by the value in the 3rd column, so I use:

sort -rnk3 myfile

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18
1.gui  Qxx  16

Now I need the output to be as follows (the line starting with 1.gui is dropped because the line with 4.gui has a greater value):

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18

I cannot use head because I have millions of rows and do not know where to cut. I could not figure out a way to use uniq either: it treats each line as a whole, and since I cannot tell uniq to look only at the first column, every line counts as unique and is output, which is normal. I know uniq can skip a number of characters, but as you can see from the example, the first column can vary in length.

Please advise.

Helban answered 27/11, 2012 at 11:35 Comment(1)
possible duplicate of "Is there a way to 'uniq' by column?" - Roughneck

Try this:

sort -rnk3 myfile | awk -F"[. ]" '!a[$2]++'

awk removes the duplicates based on the 2nd field. This is a well-known awk idiom for removing duplicates. An array a is kept, indexed by the 2nd field. Before a record is printed, its 2nd field is looked up in the array: if it is not present, the record is printed; otherwise it is discarded as a duplicate. This is achieved with the ++ operator. The first time a key is encountered, a[$2]++ evaluates to 0 because the increment is postfix, and negating 0 gives true, so the line is printed. Subsequent occurrences evaluate to a nonzero count, which becomes false when negated.
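To make the idiom above easier to follow, here is a long-hand sketch of the same logic. The function name dedupe_by_name, the array name seen, and the temporary file are made up for illustration; the sample rows are the ones from the question.

```shell
# Sample data from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '1.gui  Qxx  16' '2.gu   Qxy  23' '3.guT  QWS  18' '4.gui  Qxr  21' > "$tmp"

# Long-hand equivalent of '!a[$2]++' (hypothetical helper name):
# print a line only the first time its key is seen.
dedupe_by_name() {
  sort -rnk3 "$1" | awk -F'[. ]' '{
    key = $2                 # text between the first "." and the first space
    if (!(key in seen)) {    # first occurrence of this key
      seen[key] = 1
      print
    }
  }'
}

dedupe_by_name "$tmp"
rm -f "$tmp"
```

Because the input is already sorted in descending order on the 3rd column, the first line seen for each key is the one with the highest value, which is why keeping only the first occurrence is enough here.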

Ion answered 27/11, 2012 at 11:43 Comment(1)
The 2nd column, because we are splitting each line with . and space as delimiters, so the 2nd field gives us gui, etc. - Ion

Here you go:

sort -rnk3 file | awk -F'[. ]' '{ if (a[$2]++ == 0) print }' 

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18

This uses awk to check for duplicate values in the second field, where the field separator is either a space or a period. So this is what it treats as the second field:

$ awk -F'[. ]' '{ print $2 }' file

gui
gu
guT
gui

In awk the variable $0 represents the whole line, $1 represents the first field, and so on.

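In awk -F'[. ]' '{ if (a[$2]++ == 0) print }', the -F option lets you specify the field separator; in this case it is either a space or a period. The condition a[$2]++ == 0 is true only the first time a given second field is seen, so only the first line for each value is printed.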
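To see exactly how the separator splits a line, a small sketch like the following can help. The helper name show_fields is made up for illustration; it prints every field awk sees for a single input line.

```shell
# Print each field awk produces when splitting on a period or a single space.
# show_fields is a hypothetical helper name, not part of the answer above.
show_fields() {
  printf '%s\n' "$1" | awk -F'[. ]' '{
    for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i
  }'
}

show_fields '4.gui  Qxr  21'
```

Note that every single space counts as a separator, so the two spaces between columns produce empty fields. With this separator the value 21 ends up in $6 rather than $3, which is fine here because only $2 is used as the key.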

Simony answered 27/11, 2012 at 11:51 Comment(1)
Hey @sudo_O, thanks again. Can you please explain the awk command a little? - Helban

I found this via the all-powerful and amazing Google. My little script builds on @sudo_O's answer, but it shows you all the duplicate lines found rather than a file without duplicates.

The text in which I was looking for duplicates in the 3rd column (port) was in a file called master.txt.

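# Emit the second and later lines for each repeated port, then list every line with that port.
awk '{if (a[$3]++ > 0) print}' master.txt | while read -r site thread port
do
  # Quote the variable; -w matches the port as a whole word only.
  grep -w -- "$port" master.txt
done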
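
As an alternative sketch of the same idea, awk can read the file twice and print each line of a duplicate group exactly once, in the original order. The function name show_duplicates and the three sample rows are made up for illustration; only the column layout (site, thread, port) comes from the loop above.

```shell
# Print every line whose 3rd column value occurs more than once in the file.
# The file is passed twice: pass 1 counts each value, pass 2 prints matching lines.
show_duplicates() {
  awk 'NR == FNR { count[$3]++; next }   # first pass: count 3rd-column values
       count[$3] > 1                     # second pass: print lines in duplicate groups
  ' "$1" "$1"
}

# Made-up sample rows in the site/thread/port layout.
tmp=$(mktemp)
printf '%s\n' 'siteA t1 8080' 'siteB t2 9090' 'siteC t3 8080' > "$tmp"
show_duplicates "$tmp"
rm -f "$tmp"
```

This compares the 3rd column exactly instead of searching the whole line, so a port such as 8080 would not match a line that merely contains those digits elsewhere.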
Planer answered 21/6, 2013 at 18:29 Comment(0)
