How to print only the unique lines in BASH?

How can I print only those lines that appear exactly once in a file? E.g., given this file:

mountain
forest
mountain
eagle

The output would be this, because the line mountain appears twice:

forest
eagle
  • The lines can be sorted, if necessary.
Convergence answered 19/5, 2014 at 14:37 Comment(2)
I think you can use a dictionary. You can have a look at this link: #1494678 – Coliseum
Does this answer your question? Find unique lines – Bradski
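For what it's worth, here is a sketch of the dictionary idea from the comment above in pure bash (it requires bash 4+ for associative arrays; empty input lines are an edge case, and awk will be much faster on large files):

#!/usr/bin/env bash
declare -A count
while IFS= read -r line; do
    count[$line]=$(( ${count[$line]:-0} + 1 ))   # tally each distinct line
done < file
for line in "${!count[@]}"; do                   # iteration order is unspecified
    [[ ${count[$line]} -eq 1 ]] && printf '%s\n' "$line"
done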

Using awk:

awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' file
eagle
forest
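For readability, the one-liner can also be written as a standalone awk script. The version below is a functionally equivalent sketch with comments (the file name is just illustrative; note that the leading ! in the one-liner's action block has no effect on the result, since the statement simply increments the counter):

# uniq_once.awk: print lines that occur exactly once in the input
{
    seen[$0]++               # count occurrences of each whole line
}
END {
    for (line in seen)       # awk's for-in iteration order is unspecified,
        if (seen[line] == 1) # so output order may differ from input order
            print line
}

Run it with awk -f uniq_once.awk file.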
Emmieemmit answered 19/5, 2014 at 14:41 Comment(14)
No need to go so complex; a simple uniq command will do the job as well. – Treed
1. It's not complex, and 2. it avoids an expensive sort for larger files. – Emmieemmit
@Emmieemmit Nice awk, +1. But it really is simpler to use uniq. And with larger files kept in memory, who knows which is more expensive: swapping or sorting. :) – Metrify
@Emmieemmit Just tested on 300k lines. This awk solution is 8 times faster than sort|uniq. – Metrify
@jm666: Thanks so much for running the test and verifying that the awk command is faster than sort|uniq. – Emmieemmit
Since we are iterating anyway, we can check and print only those lines which are seen just once: awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' file. But +1 nonetheless. – Mervinmerwin
Yes, sure, that can also be done; I just chose delete to free up some memory, though I'm not sure how much that helps. :) – Emmieemmit
@Emmieemmit That's a valid point, but as the solution is right now, it will get confused when the number of duplicates is odd. For example, if you add another mountain row, it will print that as well. – Mervinmerwin
@jaypal: Ah, that's a very important point. I updated as you suggested, many thanks! – Emmieemmit
@Emmieemmit Thanks for the edit, and you're always welcome. :) – Mervinmerwin
@jm666 I tried with my .xsession-errors.old file (129,315 lines), and the sort | uniq solution is 5 times faster than this awk solution... – Beret
@Beret sort also has the added benefit of spilling to temporary files on disk if memory is not available; awk does not have that benefit. – Mervinmerwin
I created an 803,200-line text file. My awk command took 1.946s, whereas sort|uniq took 3.188s on my OS X machine. – Emmieemmit
My OS X is probably slow on IO. I ran gsort -uR /usr/share/dict/* > words.txt (gsort is the GNU version of sort; this produces a de-duplicated, randomly ordered file) and got 312,123 lines. Then I timed both commands: time sort words.txt | uniq -u >/dev/null took 8.4 secs, and time awk ... words.txt >/dev/null took 1.3 secs. So for me (repeated a few times) the awk is nearly 8 times faster than sort. – Metrify
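For anyone wanting to rerun that comparison, here is a minimal sketch of the benchmark described in the comment above (it assumes GNU sort is installed as gsort, as with coreutils on macOS; timings will vary by machine and input):

gsort -uR /usr/share/dict/* > words.txt    # unique dictionary words, randomly ordered
time sort words.txt | uniq -u > /dev/null
time awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' words.txt > /dev/null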

Use sort and uniq:

sort inputfile | uniq -u

The -u option causes uniq to print only unique lines. Quoting from man uniq:

   -u, --unique
          only print unique lines

For your input, it'd produce:

eagle
forest

Note: Remember to sort before uniq -u, because uniq only compares adjacent lines. What uniq -u actually does is print lines that have no identical neighboring lines, which by itself does not mean they are unique in the file. Sorting groups all identical lines together, so after sorting, only the lines that are truly unique in the file survive uniq -u.
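To see why the sort matters, run both pipelines on the question's sample file (saved here as file):

$ uniq -u file            # unsorted: no two identical lines are adjacent, so all four lines print
mountain
forest
mountain
eagle
$ sort file | uniq -u     # sorted: the two mountain lines become adjacent and are dropped
eagle
forest

Also note that sort -u is not a substitute here: it keeps one copy of every line, whereas sort | uniq -u removes all copies of any repeated line.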

Richelieu answered 19/5, 2014 at 14:42 Comment(5)
@jordan Don't know. Somebody didn't like it, perhaps. – Richelieu
@anubhava Did you try it? – Richelieu
Apologies, I missed -u in copy/paste. – Emmieemmit
I like a simple answer. +1 for that simplicity. – Treed
Just a note: if someone is here trying to get unique lines across many columns, please refer to this question: #30895896 – Buckshot

You almost had the answer in your question:

sort filename | uniq -u

Jittery answered 19/5, 2014 at 14:42 Comment(0)
