calling uniq and sort in different orders in shell
S

3

3

Is there a difference in the order of uniq and sort when calling them in a shell script? I’m talking here about time- and space-wise.

grep 'somePattern' | uniq | sort

vs.

grep 'somePattern' | sort | uniq

A quick test on a 140k-line text file showed a slight speed improvement (5.0 s vs. 5.5 s) for the first method (get unique values first, then sort).

I don’t know how to measure memory usage though …

The question now is: does the order make a difference? Or does it depend on the lines returned by grep (many or few duplicates)?

Scutate answered 9/9, 2009 at 21:34 Comment(2)
I would humbly recommend accepting a different answer: sort -u is a more correct way of doing this than either of your alternatives.Cardamom
Sure, but the accepted answer explains the why better.Scutate
T
9

The only correct order is to call uniq after sort, since the man page for uniq says:

Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).

Therefore it should be

grep 'somePattern' | sort | uniq
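
A minimal sketch of why the order matters, using a made-up five-line input (the sample lines and variable names are assumptions for illustration, not from the question). Because uniq only collapses adjacent repeats, running it before sort leaves duplicates behind:

```shell
# Hypothetical sample input: duplicates exist, but not all of them are adjacent.
input=$(printf '%s\n' b a b a a)

# uniq alone only collapses the trailing "a a"; the separated repeats survive,
# so duplicates are still present after sorting.
uniq_then_sort=$(printf '%s\n' "$input" | uniq | sort)

# Sorting first makes all identical lines adjacent, so uniq can remove them.
sort_then_uniq=$(printf '%s\n' "$input" | sort | uniq)

printf 'uniq | sort:\n%s\n' "$uniq_then_sort"
printf 'sort | uniq:\n%s\n' "$sort_then_uniq"
```

With this input, the first pipeline still prints repeated lines, while the second prints each distinct line once.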
Ternion answered 9/9, 2009 at 21:38 Comment(1)
I've used | uniq | sort | uniq when grepping gigabytes worth of stuff out of sorted files just to try to keep the sort from having to sort an excessive amount of data.Portiaportico
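
A rough sketch of the idea in this comment, assuming a small, mostly sorted sample (the input and variable names are hypothetical). The uniq pre-pass shrinks what sort has to process, and the trailing uniq still guarantees a correct result:

```shell
# Hypothetical, mostly sorted input with long runs of repeated lines.
input=$(printf '%s\n' a a a b b b a)

# Count how many lines reach sort with and without the uniq pre-pass.
lines_without_prepass=$(printf '%s\n' "$input" | wc -l | tr -d ' ')
lines_with_prepass=$(printf '%s\n' "$input" | uniq | wc -l | tr -d ' ')

# The final result is the same either way.
result_plain=$(printf '%s\n' "$input" | sort | uniq)
result_prepass=$(printf '%s\n' "$input" | uniq | sort | uniq)

echo "lines into sort: $lines_without_prepass without pre-pass, $lines_with_prepass with it"
```

Whether the pre-pass pays off on real data depends on how many adjacent duplicates the input contains; this sketch only shows the reduction in line count, not a timing.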
A
10

I believe that sort -u is suited to this exact scenario, and will both sort and uniquify things. Obviously, that'll be more efficient than calling sort and uniq individually in either order.
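
As a small illustration, assuming plain input like the made-up fruit names below, sort -u and sort | uniq should produce the same lines (the sample data and variable names are hypothetical):

```shell
# Hypothetical sample input with non-adjacent duplicates.
input=$(printf '%s\n' pear apple pear fig apple)

# One process: sort and drop duplicates in a single step.
with_u=$(printf '%s\n' "$input" | sort -u)

# Two processes: sort, then let uniq collapse the now-adjacent duplicates.
with_pipe=$(printf '%s\n' "$input" | sort | uniq)

if [ "$with_u" = "$with_pipe" ]; then
    echo "same output"
fi
printf '%s\n' "$with_u"
```

The two forms stop being interchangeable once uniq-only options such as -c (count occurrences) are needed.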

Allegory answered 9/9, 2009 at 21:37 Comment(2)
sort -u is a great hint, and no doubt, it’s more efficient than calling the two in either order. BUT, the order makes a difference (uniq | sort not working)Scutate
In a quick test, I found that sort -u is about 7% faster than sort|uniq.Worcestershire
E
3

uniq only removes duplicates when the items are already sorted, since it compares each item only with the previous one. That is why sort is always run before uniq. Try it and see.

Excretion answered 9/9, 2009 at 21:39 Comment(0)
