How to get unique lines from a very large file in Linux?

I have a very large data file (255G; 3,192,563,934 lines). Unfortunately I only have 204G of free space on the device (and no other devices I can use). I did a random sample and found that in a given, say, 100K lines, there are about 10K unique lines... but the file isn't sorted.

Normally I would use, say:

pv myfile.data | sort | uniq > myfile.data.uniq

and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.

I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?
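
Roughly, I was picturing something like the following, although I realize the chunk copies produced by split would themselves need disk space I don't have (chunk size and file names are just placeholders, and this is untested):

# rough sketch only -- split needs room for a second copy of the data
split -l 500000 myfile.data chunk.
for f in chunk.*; do sort -u -o "$f" "$f"; done   # de-duplicate each chunk in place
sort -m -u chunk.* > myfile.data.uniq             # merge the sorted chunks, dropping duplicates
rm chunk.*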

I thought I might be able to do something like

tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data

but I couldn't figure out a way to truncate the file properly.
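
Something along these lines is what I'm after, if truncate -s can really shrink the file in place the way I hope (completely untested; the chunk size and chunk.tmp name are just placeholders):

# rough sketch -- each chunk is only de-duplicated against itself,
# so myfile.uniq would still need a final sort -u pass
while [ -s myfile.data ]; do
    tail -n 100000 myfile.data > chunk.tmp            # grab the last chunk of lines
    sort -u chunk.tmp >> myfile.uniq                  # append this chunk's unique lines
    truncate -s "-$(wc -c < chunk.tmp)" myfile.data   # shrink the file by exactly the bytes just consumed
done
rm -f chunk.tmp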

Kokaras asked 27/7, 2017 at 17:30 Comment(5)
Use sort -u, it's probably smart about it and will only use your estimated 10% of temporary space. – Panay
@thatotherguy oooh... I didn't know about that option. I'll give it a whirl. – Kokaras
I think the problem is with the sort command, because you need roughly the file's size in free space to sort it... – Newkirk
@DaniloFavato Yeah, I think that's the issue too, but I have to sort it for uniq to work... – Kokaras
@thatotherguy So far so good... I'm 16GB through the file (with 3:37:50 left to go) and it's used less than 1GB of space on the device for the sorting. It may be doing it all in RAM, which is what I was hoping for. If you post this as an answer, I'll mark it. – Kokaras

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
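
With the pipeline from the question that's simply the following (the -S flag is an optional GNU sort knob for a bigger in-memory buffer, which means fewer and smaller temporary files; 8G is only an example, size it to your available RAM or drop it):

pv myfile.data | sort -u -S 8G > myfile.data.uniq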

Panay answered 27/7, 2017 at 18:04 Comment(3)
This worked very well for me. It ended up using negligible disk space. – Kokaras
In my experiments, sort (GNU coreutils) 8.31 did not seem to remove sequential duplicates before sorting; to do that you can always use uniq | sort -u. – Edette
GNU sort -u currently removes duplicates (sequential or not) between sorting and merging, so sort -u will use less temporary disk space. If there are a lot of sequential duplicates in the input, a first pass with uniq is a good idea. – Panay
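
Combining the two suggestions with the command from the question would look like this (untested at this scale):

pv myfile.data | uniq | sort -u > myfile.data.uniq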
