Remove duplicate lines without sorting [duplicate]

Asked 17/7, 2012 at 23:14 Answered 30/4, 2018 at 8:45

163

I have a utility script in Python:

#!/usr/bin/env python
import sys
unique_lines = []
duplicate_lines = []
for line in sys.stdin:
  if line in unique_lines:
    duplicate_lines.append(line)
  else:
    unique_lines.append(line)
    sys.stdout.write(line)
# optionally do something with duplicate_lines

This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?

Reason for asking: I need this functionality on a system where I cannot execute Python from anywhere.

Theomania answered 17/7, 2012 at 23:14 Comment(3)

Unrelated: you should really use a set rather than a list in that Python script; checking for membership in a list is a linear-time operation. – Liebfraumilch 17/7, 2012 at 23:18

I removed "Python" from your tags and title since this really has nothing to do with Python. – Bydgoszcz 17/7, 2012 at 23:20

If this had to be done in Python, a better approach would be the unique_everseen itertools recipe: docs.python.org/library/itertools.html#recipes – Felisha 23/7, 2012 at 17:2

392

The UNIX Bash Scripting blog suggests:

awk '!x[$0]++'

This command is telling awk which lines to print. The variable $0 holds the entire contents of a line and square brackets are array access. So, for each line of the file, the node of the array x is incremented and the line printed if the content of that node was not (!) previously set.
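For readers who find the one-liner cryptic, the same logic can be spelled out in Python. This is only an illustrative sketch and not part of the awk answer: the function name first_seen is made up here, and a set stands in for the awk array x.

```python
def first_seen(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()              # stands in for the awk array x
    for line in lines:
        if line not in seen:  # like !x[$0]: true only on first sight
            seen.add(line)    # like the ++ that marks the line as seen
            yield line


# Small in-memory demonstration.
sample = ["b\n", "a\n", "b\n", "c\n", "a\n"]
print("".join(first_seen(sample)), end="")
```

Used as a filter, this would be sys.stdout.writelines(first_seen(sys.stdin)) after an import sys. Like the awk version, it remembers every distinct line it has seen.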

Bydgoszcz answered 17/7, 2012 at 23:17 Comment(20)

For a short awk statement like this (no curly brackets involved), the command is simply telling awk which lines to print. The variable $0 holds the entire contents of a line and square brackets are array access. So, for each line of the file, we are incrementing a node of the array named x and printing the line if the content of that node was not (!) previously set. – Dolph 17/12, 2012 at 14:59

I did a loop of 1000 runs with sort -u and that awk one, and both run in about 3s (awk took 0.15s more in avg). So I think it will work perfectly, thx! – Dahlberg 4/6, 2014 at 11:24

@AquariusPower But doesn't awk become faster than sort if you increase the size of the unordered input file? – Isolated 12/5, 2015 at 18:1

Perhaps this command would be easier to understand: awk '!($0 in x){x[$0]++; print $0}' – Isolated 12/5, 2015 at 20:2

!x[$0] does not test whether x[$0] is unset; it tests whether x[$0] is zero or the empty string. ($0 in x) tests whether x[$0] is set. However, unset variables evaluate to zero (or the empty string) in awk, so the test works. Besides, the postfix ++ hands the old value of x[$0] to the logical not operator (!) and only then increments it, which is crucial in the script. – Isolated 12/5, 2015 at 21:22

Most compact and finest scripts I tumbled across. Kudos! – Effeminize 21/12, 2015 at 5:46

Surely it would be less obfuscated to name that array e.g. seen instead of x, to avoid giving newbies the impression that awk syntax is line noise – Saxena 21/12, 2015 at 10:43

Keep in mind that this will load the entire file into memory, so don't try this on a 3GB text file without lots of RAM to spare. – Bladdernose 2/6, 2017 at 15:39

How to keep all the empty lines? – Siusan 21/5, 2018 at 12:18

@Bladdernose This won't necessarily load the whole file into memory, only the unique lines. This of course could end up being the whole file though if all the lines are unique. – Flour 11/7, 2018 at 17:33

Thank You! This is the finest and smartest solution to find unique elements within an array when I'm parsing tags in a delimited file. – Bridal 21/8, 2018 at 13:0

getting error as x[: event not found – Roede 1/9, 2018 at 16:40

@ChandanChoudhury The quotation marks are not optional. – Bydgoszcz 1/9, 2018 at 18:26

I had to use awk '!mem[$0]++ { print $0; fflush() }', because the buffering otherwise broke the point of the script I was developing. – Mary 8/9, 2018 at 16:53

https://mcmap.net/q/13366/-how-to-delete-duplicate-lines-in-a-file-without-sorting-it-in-unix with a detailed description of how it works. – Ebbie 26/10, 2018 at 14:58

Maybe a quick way to do this inline? – Rolanda 4/3, 2020 at 13:29

This will work if you also want to retain empty lines: awk 'length==0 || !x[$0]++'. – Drice 2/4, 2020 at 20:6

The Stackoverflow school of Bashcraft and AWKary! The trio would be so proud!!! – Kennedy 7/8, 2020 at 5:33

Worth putting in your .bash_aliases if you find yourself using it often. alias unique='awk "!seen[\$0]++"' Then you can just echo "$values" | unique – Wilmoth 15/1, 2021 at 20:2

this is pure magic! – Monotonous 3/4, 2022 at 1:13

105

A late answer - I just ran into a duplicate of this - but perhaps worth adding...

The principle behind @1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:

cat -n file_name | sort -uk2 | sort -n | cut -f2-

Use cat -n to prepend line numbers
Use sort -u to remove duplicate data (-k2 says 'start at field 2 for sort key')
Use sort -n to sort by prepended number
Use cut to remove the line numbering (-f2- says 'select field 2 till end')
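The four steps above amount to decorate, sort, de-duplicate, and undecorate. A rough Python sketch of that idea follows. It is not a replacement for the pipeline, the name dedupe_via_sort is invented for illustration, and ties are broken by line number on the assumption that the first copy should survive.

```python
def dedupe_via_sort(lines):
    """Mimic the idea of: cat -n | sort -uk2 | sort -n | cut -f2-"""
    # cat -n: pair every line with its 1-based line number.
    numbered = list(enumerate(lines, start=1))
    # sort -k2: order by content; line number breaks ties so the first copy leads.
    numbered.sort(key=lambda pair: (pair[1], pair[0]))
    # -u: keep one entry per run of identical content.
    kept = []
    for number, text in numbered:
        if not kept or kept[-1][1] != text:
            kept.append((number, text))
    # sort -n: restore the original order using the line numbers.
    kept.sort()
    # cut -f2-: drop the line numbers again.
    return [text for _number, text in kept]
```

Unlike the hash-based approach, this needs the whole input before it can emit anything, so it does not suit a continuous stream.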

Rashad answered 17/12, 2013 at 16:39 Comment(8)

Easy to understand, and this is often valuable. Any idea how this performs on big files compared with Michael Hoffman's shorter solution above? – Strongbox 1/1, 2015 at 2:50

More readable/maintainable. Needed the same but with a reverse sort to keep only the last occurrence of each unique value. Using both --reverse and --unique in the same sort command doesn't return the results one might expect. Apparently, sort does a premature optimization by 1st applying --unique on the input (in order to reduce processing in subsequent steps). This removes data needed for the --reverse step too early. To fix this, insert a sort --reverse -k2 as the 1st sort in the pipeline: cat -n file_name | sort -rk2 | sort -uk2 | sort -nk1 | cut -f2- – Phenomenology 24/4, 2017 at 9:36

Took just 60 seconds for a 900MB+ text file with so many (randomly placed) duplicate lines that the result is only 39KB. Sufficiently fast. – Tremml 24/7, 2019 at 14:9

"Pipe" version: cat file_name | cat -n | sort -uk2 | sort -nk1 | cut -f2-. – Jd 15/1, 2020 at 18:1

Redirecting the output of this answer's command to a file is blank. @victor-yarema 's version was able to redirect to a file as expected. (macOS zsh) – Downwash 30/1, 2023 at 17:0

sort -uk2 as used here preserves the first copy of a line. If you want to preserve the last copy of a line, reverse the line order with tac then unreverse at the end: tac file_name | cat -n | sort -uk2 | sort -n | cut -f2- | tac. – Arlenarlena 3/2, 2023 at 9:30
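The keep-the-last-copy variant discussed in these comments can also be sketched in Python, mirroring the tac, de-duplicate, tac idea. The function name last_seen is hypothetical, and the sketch assumes the input fits in memory.

```python
def last_seen(lines):
    """Keep only the final occurrence of each line, in original order."""
    seen = set()
    kept_reversed = []
    for line in reversed(list(lines)):  # first tac: walk the input backwards
        if line not in seen:            # first sight from the end = last copy
            seen.add(line)
            kept_reversed.append(line)
    kept_reversed.reverse()             # second tac: restore forward order
    return kept_reversed
```

For the input b, a, b, c, a this keeps b, c, a, because those are the positions of the last copies.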

To remove duplicates from two files:

awk '!a[$0]++' file1.csv file2.csv
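The point of passing both files to one awk process is that the array a is shared across them, so a line already seen in file1.csv is suppressed in file2.csv too. A hedged Python sketch of that behaviour, with an invented function name unique_across, could look like this:

```python
from itertools import chain


def unique_across(*sources):
    """Yield lines from several line iterables, skipping any line already
    seen in this or an earlier source."""
    seen = set()  # shared across all sources, like the awk array a
    for line in chain.from_iterable(sources):
        text = line.rstrip("\n")  # compare without the newline, as awk does
        if text not in seen:
            seen.add(text)
            yield text
```

Open file handles can be passed directly as the sources, since iterating a text file yields its lines.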

Cozen answered 22/8, 2017 at 3:32 Comment(0)

`uq`

uq is a small tool written in Rust. It performs uniqueness filtering without having to sort the input first, so it can be applied to a continuous stream.

There are two advantages of this tool over the top-voted awk solution and other shell-based solutions:

uq remembers the occurrence of lines using their hash values, so it doesn't use as much memory when the lines are long.
uq can keep memory usage constant by setting a limit on the number of entries to store (when the limit is reached, a flag controls whether to override or to die), while the awk solution could run out of memory when there are too many lines.
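To make the two ideas concrete, here is a minimal Python sketch of hash-based, bounded-memory filtering. It is not uq's actual implementation: the function name, the capacity and on_full parameters, and the choice of BLAKE2 digests are all assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict


def unique_by_digest(lines, capacity=None, on_full="evict"):
    """Yield first occurrences, remembering 16-byte digests rather than lines.

    capacity and on_full are hypothetical knobs: when capacity digests are
    stored, "evict" forgets the oldest one and "raise" aborts.
    """
    seen = OrderedDict()  # digest -> None, oldest entry first
    for line in lines:
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest in seen:
            continue  # already emitted a line with this digest
        if capacity is not None and len(seen) >= capacity:
            if on_full == "raise":
                raise RuntimeError("digest store is full")
            seen.popitem(last=False)  # forget the oldest digest
        seen[digest] = None
        yield line
```

The trade-offs are visible in the sketch: memory per entry no longer depends on line length, but once old digests are evicted a repeat of a forgotten line slips through.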

Rattler answered 30/4, 2018 at 8:45 Comment(1)

Quite inconvenient and less portable, given awk already does this. – Grouch 9/2, 2020 at 22:27

Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach involving the addition of an index field using awk followed by multiple rounds of sort and uniq involves less memory overhead. The following snippet works in bash:

awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2 -k1,1n | uniq --skip-fields 1 | sort -t$'\t' -k1,1n | cut -f2- -d$'\t'

Felisha answered 23/7, 2012 at 16:43 Comment(1)

this seems to be rather slow, though – Abrasion 24/8, 2015 at 6:43

Thanks 1_CR! I needed a "uniq -u" (remove duplicates entirely) rather than uniq (leave 1 copy of duplicates). The awk and perl solutions can't really be modified to do this, yours can! I may have also needed the lower memory use since I will be uniq'ing like 100,000,000 lines 8-). Just in case anyone else needs it, I just put a "-u" in the uniq portion of the command:

awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2 -k1,1n | uniq -u --skip-fields 1 | sort -t$'\t' -k1,1n | cut -f2- -d$'\t'
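For comparison, the "uniq -u" behaviour (drop every line that occurs more than once, keep the rest in order) can be sketched in Python with two passes. The name only_singletons is invented here, and note that this sketch holds the whole input in memory, so it does not share the low-memory property of the sort-based pipeline.

```python
from collections import Counter


def only_singletons(lines):
    """Return the lines that occur exactly once, in their original order."""
    lines = list(lines)       # two passes are needed, so keep the input
    counts = Counter(lines)   # first pass: count every distinct line
    return [line for line in lines if counts[line] == 1]  # second pass
```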

Smiga answered 23/10, 2013 at 18:26 Comment(0)

-1

I just wanted to remove duplicates on consecutive lines, not everywhere in the file. So I used:

awk '{
  if (NR == 1 || $0 != PREVLINE) print $0;  # NR == 1: always print the first line, even if empty
  PREVLINE=$0;
}'
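The same adjacent-only filtering can be sketched in Python with itertools.groupby, which groups consecutive equal items. The function name squeeze_repeats is just an illustrative choice.

```python
from itertools import groupby


def squeeze_repeats(lines):
    """Collapse each run of identical consecutive lines into a single line."""
    # groupby yields one (key, run) pair per run of equal neighbours.
    return [line for line, _run in groupby(lines)]
```

A line that reappears later, after something different in between, is kept again, which is the difference from the whole-file approaches above.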

Encumbrance answered 5/2, 2016 at 10:8 Comment(1)

doesn't uniq do just that... – Strunk 9/11, 2016 at 11:22

-3

The uniq command even works in an alias: http://man7.org/linux/man-pages/man1/uniq.1.html

Oleviaolfaction answered 6/10, 2017 at 11:3 Comment(0)
