How to sort a large file on two levels efficiently?

I have a very large file, over 100GB (many billions of lines), and I would like to conduct a two-level sort as quickly as possible on a Unix system with limited memory. This will be one step in a large Perl script, so I'd like to use Perl if possible.

So, how can I do this?

My data looks like this:

A    129
B    192
A    388
D    148
D    911
A    117

... but for billions of lines. I need to sort first by letter and then by number. Would it be easier to use a Unix sort, like

sort -k1,2 myfile

or can I do this all in Perl somehow? My system will have something like 16GB RAM, but the file is about 100GB.
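
So for the sample above, the order I am after is letters first (as text) and then the numbers (numerically) within each letter:

A    117
A    129
A    388
B    192
D    148
D    911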

Scepter answered 12/8, 2013 at 17:17 Comment(2)
Are all the numbers 3 digits? If not, are they right-aligned? If both these conditions hold (all 3 digits or right-aligned) then you only need a single-level textual sort.Warner
@Jim, thanks for the comment. No, the numbers range from 1 to 100,000,000, and these are just two non-adjacent columns of a larger spreadsheet (genome sequencing data)Scepter

The UNIX sort utility can handle data larger than your working 16GB of RAM: it sorts the input in chunks that fit in memory, spills them to temporary working files on disk, and then merges those files.

So I'd recommend simply using UNIX sort for this, as you've suggested, invoking the option -T tmp_dir and making sure that tmp_dir has enough free disk space to hold all of the temporary working files that will be created there.

By the way, this is discussed in a previous SO question.
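
Since this will end up inside a Perl script anyway, here is a minimal sketch of how such a call might look from Perl (the temporary directory and the output filename below are placeholders, not anything taken from your setup):

# Minimal sketch: shell out to UNIX sort from Perl, list form so no shell quoting is needed.
my @cmd = ('sort',
           '-T', '/big/disk/tmp',    # placeholder: a directory with plenty of free space
           '-k1,1', '-k2,2n',        # sort by the letter, then by the number (numerically)
           '-o', 'myfile.sorted',    # placeholder output file
           'myfile');
system(@cmd) == 0 or die "sort failed: $?";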

Custommade answered 12/8, 2013 at 17:20 Comment(0)

The UNIX sort is the best option for sorting data of this scale. I would also recommend compressing the temporary files with the fast LZO algorithm, which is usually distributed as lzop, and setting a big sort buffer with the -S option. If you have a disk that is faster than the one holding the default /tmp, point -T at it as well. Also, since you want to sort by a number, you have to define the second field as a numeric sort key. So you should use a line like this for best performance:

LC_ALL=C sort -S 90% --compress-program=lzop -k1,1 -k2n
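
If you drive this from the Perl script rather than the shell, the same invocation can be built the way the earlier sketch does it, with LC_ALL set through %ENV (the output name is just a placeholder):

# Sketch: the same GNU sort call launched from Perl.
$ENV{LC_ALL} = 'C';                    # plain byte comparison, much faster than locale-aware sorting
my @cmd = ('sort',
           '-S', '90%',                # big in-memory sort buffer
           '--compress-program=lzop',  # compress the temporary spill files with lzop
           '-k1,1', '-k2n',            # letter first, then the number, numerically
           '-o', 'myfile.sorted',      # placeholder output file
           'myfile');
system(@cmd) == 0 or die "sort failed: $?";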
Soneson answered 12/8, 2013 at 21:20 Comment(0)

I had the exact same issue! After searching a lot, and since I did not want any dependency on the shell (UNIX) so that it stays portable to Windows, I came up with the solution below:

#!/usr/bin/perl
use strict;
use warnings;
use File::Sort qw(sort_file);
# The options mirror sort(1) switches: I = input file, o = output file, k = key, t = field separator.
my $src_dic_name = 'C:\STORAGE\PERSONAL\PROJECTS\perl\test.txt';
sort_file({k => 1, t => "    ", I => $src_dic_name, o => $src_dic_name . ".sorted"});
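
To get the two-level ordering from the question (letter first, then the number numerically), the key definitions follow sort(1) syntax; assuming the k option also accepts an array reference of key definitions (worth double-checking in the File::Sort documentation), the call might look like this:

use File::Sort qw(sort_file);
# Hedged sketch: the File::Sort equivalent of sort -k1,1 -k2,2n.
# Assumption: k accepts an array reference of sort(1)-style key definitions.
sort_file({
    I => 'myfile',            # input file from the question
    o => 'myfile.sorted',     # placeholder output name
    k => ['1,1', '2,2n'],     # letter first, then the number, numerically
});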

I know this is an old post, but I am updating it with the solution so that it is easy to find.

Documentation: the File::Sort page on CPAN.

Baseman answered 13/1, 2016 at 13:43 Comment(0)

As Hynek already stated, GNU sort is incredibly good at handling huge amounts of data; it even uses several different algorithms to presort smaller chunks and then recombines them with another algorithm.

But you can make it even better by using the -S option to give it more memory. By default, GNU sort uses only around 4-8 MByte of memory at most - a nice greeting from the 1990s.

On my 64 GByte machine I had to sort a 300 GByte file. Using -S 16G increased the speed 20 times and reduced write accesses by around 60-80%, which is nice if you want to reduce the write load on your flash memory.

There is only one drawback: at least on my 64 GByte Cygwin system, anything above 16 GByte gave wonky results - suddenly running out of memory, hanging, becoming very slow and so on. This could be a Windows thing, as even some Windows software wasn't able to use the whole memory (I tried xz with a 48 GByte dictionary, but everything above 32 GByte was also unreliable).

Dropforge answered 24/7, 2024 at 7:24 Comment(0)
