How to get unique lines from a very large file in Linux?

I have a very large data file (255G; 3,192,563,934 lines). Unfortunately I only have 204G of free space on the device (and no other devices I can use). I did a random sample and found that in a given, say, 100K lines, there are about 10K unique lines... but the file isn't sorted.

Normally I would use, say:

pv myfile.data | sort | uniq > myfile.data.uniq

and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.

I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?
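
Roughly, I was picturing something like the following, although I realize the chunk copies produced by split would themselves need disk space I don't have (chunk size and file names are just placeholders, and this is untested):

# rough sketch only -- split needs room for a second copy of the data
split -l 500000 myfile.data chunk.
for f in chunk.*; do sort -u -o "$f" "$f"; done   # de-duplicate each chunk in place
sort -m -u chunk.* > myfile.data.uniq             # merge the sorted chunks, dropping duplicates
rm chunk.*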

I thought I might be able to do something like

tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data

but I couldn't figure out a way to truncate the file properly.
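
Something along these lines is what I'm after, if truncate -s can really shrink the file in place the way I hope (completely untested; the chunk size and chunk.tmp name are just placeholders):

# rough sketch -- each chunk is only de-duplicated against itself,
# so myfile.uniq would still need a final sort -u pass
while [ -s myfile.data ]; do
    tail -n 100000 myfile.data > chunk.tmp            # grab the last chunk of lines
    sort -u chunk.tmp >> myfile.uniq                  # append this chunk's unique lines
    truncate -s "-$(wc -c < chunk.tmp)" myfile.data   # shrink the file by exactly the bytes just consumed
done
rm -f chunk.tmp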

Kokaras asked 27/7, 2017 at 17:30 Comment(5)
Use sort -u, it's probably smart about it and will only use your estimated 10% of temporary space. – Panay
@thatotherguy oooh... I didn't know about that option. I'll give it a whirl. – Kokaras
I think the problem is with the sort command, because you need roughly the file's size in free space to sort it... – Newkirk
@DaniloFavato Yeah, I think that's the issue too, but I have to sort it for uniq to work... – Kokaras
@thatotherguy So far so good... I'm 16GB through the file (with 3:37:50 left to go) and it's used less than 1GB of space on the device for the sorting. It may be doing it all in RAM, which is what I was hoping for. If you post this as an answer, I'll mark it. – Kokaras

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
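
With the pipeline from the question that's simply the following (the -S flag is an optional GNU sort knob for a bigger in-memory buffer, which means fewer and smaller temporary files; 8G is only an example, size it to your available RAM or drop it):

pv myfile.data | sort -u -S 8G > myfile.data.uniq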

Panay answered 27/7, 2017 at 18:04 Comment(3)
This worked very well for me. It ended up using negligible disk space. – Kokaras
In my experiments, sort (GNU coreutils) 8.31 did not seem to remove sequential duplicates before sorting; to do that you can always use uniq | sort -u. – Edette
GNU sort -u currently removes duplicates (sequential or not) between sorting and merging, so sort -u will use less temporary disk space. If there are a lot of sequential duplicates in the input, a first pass with uniq is a good idea. – Panay
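
Combining the two suggestions with the command from the question would look like this (untested at this scale):

pv myfile.data | uniq | sort -u > myfile.data.uniq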
