Sorting 20GB of data

In the past I had to work with big files, somewhere in the 0.1-3GB range. Not all the 'columns' were needed, so it was OK to fit the remaining data in RAM. Now I have to work with files in the 1-20GB range, and they will probably keep growing as time passes. That is totally different, because you cannot fit the data in RAM anymore.

My file contains several million 'entries' (I have found one with 30 million entries). One entry consists of about 10 'columns': one string (50-1000 Unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low-quality data.

So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB, because databases are hard to install and configure for non-computer-savvy persons. I like to deliver a monolithic program.

Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.


I was looking a bit into MMF (memory-mapped files), but this article from Danny Thorpe shows that they may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/

So, I was thinking about loading into RAM only the data from the column that has to be sorted, AND a pointer to the address (in the disk file) of the 'entry'. I sort the 'column', then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk, so no additional RAM will be required.
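
To make the idea concrete, here is a rough Delphi/FPC sketch of what I have in mind; the record layout and field names are just placeholders, and I am assuming the sort key is a single numeric column:

    uses Classes, SysUtils;

    type
      // One element per entry: the value of the sort column plus the location
      // of the full entry in the source file (field names are illustrative).
      TKeyRef = record
        Key: Double;      // numeric sort column (assumption)
        Offset: Int64;    // byte position of the entry in the source file
        Size: Integer;    // length of the entry on disk, for copying it back
      end;

    var
      Keys: array of TKeyRef;

    // Plain in-place quicksort on the key; works in both Delphi and FPC.
    procedure SortKeys(L, R: Integer);
    var
      i, j: Integer;
      p: Double;
      t: TKeyRef;
    begin
      i := L; j := R; p := Keys[(L + R) div 2].Key;
      repeat
        while Keys[i].Key < p do Inc(i);
        while Keys[j].Key > p do Dec(j);
        if i <= j then
        begin
          t := Keys[i]; Keys[i] := Keys[j]; Keys[j] := t;
          Inc(i); Dec(j);
        end;
      until i > j;
      if L < j then SortKeys(L, j);
      if i < R then SortKeys(i, R);
    end;

    // 'Restoration': stream each entry from the old file into the new one in
    // key order. Only one entry is in RAM at a time, but the reads hit the
    // source file in random order, which is slow on a spinning disk.
    procedure RewriteSorted(const Src, Dst: string);
    var
      InF, OutF: TFileStream;
      k: Integer;
    begin
      InF := TFileStream.Create(Src, fmOpenRead or fmShareDenyWrite);
      OutF := TFileStream.Create(Dst, fmCreate);
      try
        for k := 0 to High(Keys) do
        begin
          InF.Position := Keys[k].Offset;
          OutF.CopyFrom(InF, Keys[k].Size);
        end;
      finally
        OutF.Free;
        InF.Free;
      end;
    end;

Keys would be filled during one sequential scan of the file (recording each entry's offset and length), then sorted with SortKeys(0, High(Keys)), and RewriteSorted does the rest.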

PS: I am looking for a solution that will work in both Lazarus and Delphi, because Lazarus (actually FPC) has 64-bit support for Mac. 64-bit means more RAM available = faster sorting.

Unmerciful answered 3/4, 2014 at 19:36 Comment(5)
While it's not Delphi-specific, there are dozens of posts here related to sorting large text files. Search for sort large files and look through the results; there are links to various techniques for doing so, such as the ones in the first result. As stated, your question is very broad in scope, and I'm not sure it can be specifically answered here without sample data.Preselector
If you want to write your own I can dig up some old mergesort code from a backup CD and upload it somewhere. It's DOS command line stuff though.Bullshit
@JanDoggen-It doesn't matter it is command line. I need it without GUI anyway. Already-made code is always welcome. Thanks a lot.Unmerciful
@KenWhite-The problem may be general (probably many of the Delphi questions on StackOverflow could be categorized as 'general'), but as always the Delphi solutions may be different than the Java solutions. Just take a look at manlio's answer and you will understand what I mean.Unmerciful
The poor tags are keeping non-Delphi eyeballs from seeing your question. Use tags like "sorting", "algorithm", etc. You don't mention your key "column" size, but if it's only 10 bytes + you only need the top 30% of 30m entries, that's only 10mb of keys to sort; any machine nowadays can handle that. I often sort 5GB files on small cheap PC's using Kernighan+Plauger's 1976 mergesort + those jobs usually take minutes, not hours. Note that "fewer or faster read/writes" is what really makes sorting faster, so a SSD instead of a hard drive is fast even when you have little RAM or a poor algorithm.Emit

I think the way to go is Mergesort: it's a great algorithm for sorting a large number of fixed-size records with limited memory.

General idea (a rough sketch of the splitting phase follows the list):

  • read N lines from the input file (a value that allows you to keep the lines in memory)

  • sort these lines and write the sorted lines to file 1

  • repeat with the next N lines to obtain file 2

    ...

  • you reach the end of the input file and you now have M files (each of which is sorted)

  • merge these files into a single file (you'll have to do this in steps as well)
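
A minimal Delphi/FPC sketch of the splitting phase, assuming line-based text data and whole-line lexicographic ordering (the run_<n>.txt naming and the ChunkLines parameter are just placeholders; a real implementation would compare on the extracted key column instead):

    uses Classes, SysUtils;

    // Phase 1 of an external merge sort: read the big file N lines at a time,
    // sort each chunk in RAM and write it out as a "run" file.
    // Returns the number of run files produced.
    function SplitIntoSortedRuns(const SrcFile: string; ChunkLines: Integer): Integer;
    var
      Src: TextFile;
      Chunk: TStringList;
      Line: string;
      RunCount: Integer;

      procedure FlushChunk;
      begin
        if Chunk.Count = 0 then Exit;
        Chunk.Sort;                                    // in-memory sort of one chunk
        Chunk.SaveToFile(Format('run_%d.txt', [RunCount]));
        Chunk.Clear;
        Inc(RunCount);
      end;

    begin
      RunCount := 0;
      Chunk := TStringList.Create;
      try
        AssignFile(Src, SrcFile);
        Reset(Src);
        try
          while not Eof(Src) do
          begin
            ReadLn(Src, Line);
            Chunk.Add(Line);
            if Chunk.Count >= ChunkLines then
              FlushChunk;
          end;
          FlushChunk;                                  // last, partially filled chunk
        finally
          CloseFile(Src);
        end;
      finally
        Chunk.Free;
      end;
      Result := RunCount;
    end;

The merge step is sketched further down the thread, in the answer about external sorting.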


You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to ship a few DLLs in your program folder (I'm not sure about Lazarus/OSX).
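
Not Firebird, but to illustrate how small the embedded-database route can be: a sketch using SQLite (suggested in the comments below) through the sqldb units that ship with FPC/Lazarus. The database file, table and column names here are made up, and you still need sqlite3.dll (or libsqlite3 on OS X) next to the executable:

    program SortedView;

    uses sqldb, sqlite3conn;

    var
      Conn: TSQLite3Connection;
      Trans: TSQLTransaction;
      Query: TSQLQuery;
    begin
      Conn := TSQLite3Connection.Create(nil);
      Trans := TSQLTransaction.Create(nil);
      Query := TSQLQuery.Create(nil);
      try
        // Wire the three components together; no server install, just one file.
        Conn.DatabaseName := 'entries.db';        // hypothetical database file
        Conn.Transaction := Trans;
        Trans.Database := Conn;
        Query.Database := Conn;
        Query.Transaction := Trans;
        Conn.Open;
        Trans.StartTransaction;

        // Let the database do the sorting and hand back only the top slice.
        Query.SQL.Text := 'SELECT * FROM entries ORDER BY score DESC LIMIT 100';
        Query.Open;
        while not Query.EOF do
        begin
          WriteLn(Query.FieldByName('name').AsString);
          Query.Next;
        end;
        Query.Close;
      finally
        Query.Free;
        Trans.Free;
        Conn.Free;
      end;
    end.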

Boddie answered 3/4, 2014 at 20:5 Comment(3)
+1 for the database recommendation. At this scope (single GBs to single TBs) the relational model offers many, many advantages over any flat file scheme you're going to find or manage to invent. Much larger than that and things start getting tricky, but for the moment a database is easily the best way to go.Gagliano
forum.lazarus.freepascal.org/index.php?topic=4159.0 discusses some embedded DBs for use with Lazarus, suggesting SQLite as Firebird has a requirement for .NET. According to sqlite.org/limits.html sqlite has size restrictions of 2^64 rows and several terabytes of data, so should cope OK with your limits. From memory, installation is restricted to a DLL to provide with your exe and the biggest restrictions are linked with threading and sharing your connection, neither of which you highlighted as concernsFanjet
I didn't like the idea of a DB, but since Firebird is embedded, I'm starting to think about using it! Thanks!Unmerciful

If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. F.i., let's say you need only 300 entries from 1 million. Scan the first 300 entries in the file and sort them in memory. Then, for each remaining entry, check if it is lower than the lowest in memory and skip it if so. If it is higher than the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until end of file.
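
A small sketch of that scan-and-keep-the-best idea in Delphi/FPC; TopCount and the key type are example assumptions, and parsing the key out of an entry is left to whatever the file format requires:

    const
      TopCount = 300;                  // how many "best" entries to keep (example value)

    type
      TTopEntry = record
        Key: Double;                   // value of the ranking column (assumption)
        Offset: Int64;                 // where the full entry lives in the source file
      end;

    var
      Top: array of TTopEntry;         // kept sorted descending: Top[Used-1] is the worst
      Used: Integer;

    // Insert a candidate at its place in the descending-sorted array,
    // pushing the current worst entry out when the array is already full.
    procedure InsertSorted(const Cand: TTopEntry);
    var
      i: Integer;
    begin
      i := Used;
      if Used = TopCount then
        Dec(i);                        // the worst element will be overwritten
      while (i > 0) and (Top[i - 1].Key < Cand.Key) do
      begin
        Top[i] := Top[i - 1];          // shift smaller keys one slot down
        Dec(i);
      end;
      Top[i] := Cand;
      if Used < TopCount then
        Inc(Used);
    end;

    // Called once per entry during the sequential scan of the file.
    procedure Offer(const Cand: TTopEntry);
    begin
      if (Used = TopCount) and (Cand.Key <= Top[Used - 1].Key) then
        Exit;                          // not better than the worst entry kept: skip
      InsertSorted(Cand);
    end;

Initialise with SetLength(Top, TopCount); Used := 0; then read the file line by line, fill a TTopEntry for each line and call Offer.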

Kin answered 3/4, 2014 at 21:44 Comment(3)
Nice suggestion, but up to 30% of 20GB is still a big chunk of data to keep hold of, let alone blowing your 32-bit mem limit (acknowledging the mention of 64-bit in the OP)Fanjet
@MattAllwood, of course it depends on the data actually needed, but as the data is presented to the user, I doubt that 30% of 1 million entries are actually visible. I definitely don't want to look at that.Kin
@UweRaabe-About the GUI part: the data has to be presented to the user but this is not a problem at all. I can use a stringgrid and show like 100 lines (or whatever fits into the screen) and load the data in memory ONLY for those 100 lines. As the user scrolls, I unload the old data and load new data. So for visualization the mem footprint is close to nothing.Unmerciful

Really, there are no sorting algorithms that can make moving 30 GB of randomly sorted data fast.

If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.

I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
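
A very small sketch of what such a per-column index can look like in Delphi/FPC: just a file of byte offsets into the data file, stored in that column's sort order (the file-name scheme below is made up). Building and sorting the offsets is the same key-plus-offset exercise as in the question; the point here is that the data file itself never moves:

    uses Classes, SysUtils;

    // Save one column's index: an array of byte offsets into the data file,
    // already arranged in that column's sort order.
    procedure SaveIndex(const FileName: string; const Offsets: array of Int64);
    var
      F: TFileStream;
    begin
      F := TFileStream.Create(FileName, fmCreate);
      try
        if Length(Offsets) > 0 then
          F.WriteBuffer(Offsets[0], Length(Offsets) * SizeOf(Int64));
      finally
        F.Free;
      end;
    end;

    // Where does the i-th row in sorted order live in the data file?
    // Seek straight to slot i of the index; the grid that shows the data only
    // ever asks for the handful of rows currently on screen.
    function NthOffset(const FileName: string; i: Integer): Int64;
    var
      F: TFileStream;
    begin
      F := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
      try
        F.Position := Int64(i) * SizeOf(Int64);
        F.ReadBuffer(Result, SizeOf(Int64));
      finally
        F.Free;
      end;
    end;

One such index file per sortable column ('name.idx', 'score.idx', ...) lets the viewer switch sort orders without touching the data itself; in practice you would keep the index file open instead of reopening it for every lookup.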

Woodpecker answered 4/4, 2014 at 6:7 Comment(2)
Any examples of such an implementation? How do you obtain an index that would represent the data to be sorted? Say I want to sort alphabetically or sort by length of the string in a column; then there will be 2 different sort values: 1. alphabetical, and 2. length. TIATheall
The simplest form is an array with the same number of items that you have in your dataset. The array is a list of (value, index) pairs, and this index itself is sorted. Based on this index, you can find the data that matches it easily. When you need all items that start with a B, you search your sorted index, and from there you find where the whole tuples are on disk. If you do a lot of inserts and deletes, you might want to use a linked list instead of a plain array. Alternatively, instead of implementing all this yourself, you could use a database. 20GB is peanuts for the average db.Woodpecker

Please find here a class which sorts a file using a slightly optimized merge sort. I wrote it a couple of years ago for fun. It uses a skip list for sorting files in-memory.

Edit: The forum is German and you have to register (for free). It's safe, but requires a bit of German knowledge.

Ikeikebana answered 4/4, 2014 at 8:9 Comment(4)
You have to register for free. It's the biggest German Delphi website; you don't get bothered with mail or such. Worth a visit, and English questions are answered as well. You would be surprised to meet both English-speaking Germans there ;-)Ikeikebana
Hmm. I wrote that class. Any chance to hand it out to you?Ikeikebana
Was the code written for Delphi 7 (or another non-Unicode version of Delphi)? The program (as it is) crashes in function TcsStringSkipList.CompareKeys. Tonight it is late; I will try tomorrow to figure out how it works and why it crashes.Unmerciful
That explains it. I will try to upgrade it.Unmerciful

If you cannot fit the data into main memory then you are in the realm of external sorting. Typically this involves an external merge sort: sort smaller chunks of the data in memory, one by one, and write them back to disk, then merge these chunks.
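
To complement the splitting-phase sketch in the Mergesort answer above, here is a minimal Delphi/FPC sketch of the merge step. It merges all runs in one pass, which is fine while the number of runs stays modest (merging in several stages is the fix when it does not); the run_<n>.txt naming and the MaxRuns limit are just placeholders:

    uses SysUtils;

    const
      MaxRuns = 64;                    // arbitrary upper bound for this sketch

    // Merge phase of an external merge sort: every run file is already sorted,
    // so only one current line per run has to be in memory at any time.
    procedure MergeRuns(RunCount: Integer; const DstFile: string);
    var
      Runs: array[0..MaxRuns - 1] of TextFile;
      Current: array[0..MaxRuns - 1] of string;
      Live: array[0..MaxRuns - 1] of Boolean;
      Dst: TextFile;
      i, Best: Integer;
    begin
      // Open every run and prime it with its first line.
      for i := 0 to RunCount - 1 do
      begin
        AssignFile(Runs[i], Format('run_%d.txt', [i]));
        Reset(Runs[i]);
        Live[i] := not Eof(Runs[i]);
        if Live[i] then ReadLn(Runs[i], Current[i]);
      end;
      AssignFile(Dst, DstFile);
      Rewrite(Dst);
      repeat
        // Pick the smallest current line among the runs that still have data.
        Best := -1;
        for i := 0 to RunCount - 1 do
          if Live[i] and ((Best < 0) or (Current[i] < Current[Best])) then
            Best := i;
        if Best < 0 then Break;        // every run is exhausted: done
        WriteLn(Dst, Current[Best]);
        // Advance the run the winning line was taken from.
        if Eof(Runs[Best]) then
          Live[Best] := False
        else
          ReadLn(Runs[Best], Current[Best]);
      until False;
      CloseFile(Dst);
      for i := 0 to RunCount - 1 do
        CloseFile(Runs[i]);
    end;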

Antecede answered 3/4, 2014 at 20:1 Comment(1)
I was sure there had to be a technical name (and algorithm) for that. Thanks a lot, I will take a look into that.Unmerciful
