How could the UNIX sort command sort a very large file?

The UNIX sort command can sort a very large file like this:

sort large_file

How is the sort algorithm implemented?

Why doesn't it cause excessive memory consumption?

Soundless answered 30/5, 2009 at 16:18 Comment(8)
This is interesting. I don't really know how it works, but I have a guess. It probably puts the first character of each key into a binary tree, and when there is a collision, it uses the next character of the key also, so it doesn't save more of the key than it needs to. It may then save an offset into the file with each key so it can seek back and print each line in order.Fie
Actually, @ayaz it's more interesting if you aren't sorting a file on disk but rather in a pipe since it makes it obvious that you can't simply do multiple passes over the input data.Jit
Why does everyone on SO feel so impelled to guess all the time?Idiolect
You can do multiple passes on the input - you just need to read all the input, write it to disk, and then sort the disk file.Idiolect
@Neil - from the context it seemed obvious that he was trying to sort the contents of the file not the file name (which for one name is meaningless). I just wanted to improve the question without changing the context too much so that it would get answers instead of downvotes because of a simple mistake.Jit
@Neil -- my point was that using the pipe makes it obvious that you don't have access to the original file and naive implementations that make multiple passes over the input data won't work. That makes the question (and implementation) more interesting.Jit
@Jit this indeed is a mistake, I'm very sorry for this mistakeSoundless
unix.stackexchange.com/questions/120096/how-to-sort-big-filesMogilev

The Algorithmic Details of the UNIX Sort Command says that UNIX sort uses an external R-way merge sorting algorithm. The link goes into more detail, but in essence it divides the input into smaller portions that fit into memory, sorts each portion, and then merges the sorted portions together at the end.
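A minimal sketch of that split/sort/merge idea using standard tools (file names and the chunk size are just illustrative; sort does the equivalent internally with temporary files):

split -l 1000000 large_file chunk.             # break the input into pieces that fit in memory
for f in chunk.*; do sort "$f" -o "$f"; done   # sort each piece independently
sort -m chunk.* > large_file.sorted            # merge the already-sorted pieces
rm chunk.*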

Stroganoff answered 30/5, 2009 at 16:26 Comment(0)

The sort command stores working data in temporary disk files (usually in /tmp).
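For example (the directory name below is just an illustration), you can point those temporary files at a different disk with -T and watch the intermediate chunks appear while a big sort runs:

mkdir -p /bigdisk/sort_tmp
sort -T /bigdisk/sort_tmp large_file > large_file.sorted
ls /bigdisk/sort_tmp    # temporary sort* files show up here instead of in /tmp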

Alexandro answered 30/5, 2009 at 16:26 Comment(1)
use -T to specify the temp dirQuibbling

WARNING: This script starts one background sort process per chunk; for really large files, that could be hundreds of processes.


Here is a script I wrote for this purpose. On a 4-processor machine it improved sort performance by 100%!

#! /bin/ksh

MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted

usage ()
{
     echo Parallel sort
     echo usage: psort file1 file2
     echo Sorts text file file1 and stores the output in file2
     echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
     echo  and each chunk will be sorted in parallel
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE

#Splitting $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX

for file in $CHUNK_FILE_PREFIX*
do
    sort $file > $file.sorted &
done
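# Wait for all of the background chunk sorts (one per chunk file) to finish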
wait

#Merging chunks to $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE

# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null

See also: "Sorting large files faster with a shell script"

Tripody answered 2/3, 2010 at 11:31 Comment(8)
You can just use sort --parallel N as of GNU sort version 8.11Dibble
GNU coreutils 8.6 actuallyLoculus
This one did the trick for me. I have sort version 8.4. Using sort directly on the file (190 million lines) was going nowhere. This program did it in just under 4 minutes.Mikelmikell
again, this answer has nothing to do with the questionLycia
This script is dangerous. My Linux machine lost response after launching hundreds of sort processes…Olericulture
@YongweiWu, that's what I was just looking at. If the input file is broken up in 100 files, then it will start 100 sort -u in the for loop!Exoteric
I use sort all the time, memory use / time taken has never been an issue. Completely mixed up list of 5,509,041 url's with parameter strings sorted uniquely in 0m10.539sBlemish
@Lycia It's called a subtle flex.Carson
The same split/sort/merge idea, using GNU parallel to sort the chunks concurrently and then merge the results:
#!/bin/bash

usage ()
{
    echo Parallel sort
    echo usage: psort file1 file2
    echo Sorts text file file1 and stores the output in file2
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi
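# A rough reading of the pipeline below: pv shows a progress bar while feeding $1 in;
# the first parallel (--pipe --files) splits stdin into chunks, sorts each chunk with a
# 512 MB buffer, and prints the names of the sorted temporary files; the second parallel
# (-Xj1) merges all of those files in one job with sort -m and then removes them.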

pv $1 | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > $2
Sisyphus answered 23/10, 2012 at 7:46 Comment(2)
This is excellent. Wasn't aware that there was a parallel package! Sort time improved by more than 50% after using the above. Thanks.Newfashioned
I tried to use comm for diff on the files generated by this and it's giving me a warning that the files are not sorted.Lorollas

I'm not familiar with the program, but I guess it is done by means of external sorting (most of the data is held in temporary files while a relatively small part is held in memory at a time). See Donald Knuth's The Art of Computer Programming, Vol. 3: Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.

Columella answered 30/5, 2009 at 16:29 Comment(0)

Look carefully at the options of sort to speed up performance and understand their impact on your machine and problem. Key parameters on Ubuntu are

  • Location of temporary files: -T directory_name
  • Amount of memory to use: -S N% (N% of all memory to use; the more the better, but avoid over-subscription that causes swapping to disk. You can use it like "-S 80%" to use 80% of available RAM, or "-S 2G" for 2 GB of RAM.)

The questioner asks "Why no high memory usage?" The answer comes from history: older UNIX machines were small, so the default memory size is set small. Make this as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 × the size of the file being sorted.
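For example (paths and sizes here are only illustrative), combining both options on a machine with plenty of RAM and a fast scratch disk:

sort -S 80% -T /mnt/fast_scratch -o large_file.sorted large_file
# or give sort a fixed memory budget instead of a percentage:
sort -S 2G -T /mnt/fast_scratch -o large_file.sorted large_file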

Cormick answered 4/6, 2013 at 21:18 Comment(2)
trying this out on a 2.5GB file, on a box with 64GB of RAM with -S 80%, it is actually using that full percentage, even though the entire file is smaller than that. why is that? even if it doesn't use an in-place sort that seems gratuitousKruter
Probably sort -S pre-allocates the memory for the sort process before even reading the contents of file.Cormick

How to use the -T option when sorting a large file

I had to sort a large file on its 7th column.

I was using:

grep vdd  "file name" | sort -nk 7 |

I got the error below:

sort: write failed: /tmp/sort1hc37c: No space left on device

Then I used the -T option as below and it worked:

grep vdda  "file name" | sort -nk 7  -T /dev/null/ |
Alberto answered 13/10, 2020 at 5:26 Comment(1)
Please use another example directory than /dev/null.Noctilucent

Memory should not be a problem - sort already takes care of that. If you want to make optimal use of your multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).

#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before getting merged.
# 
# psort largefile.txt 20m 4    
#
# by h.p.
split -b $2 $1 $1.part
suffix=sorttemp.`date +%s`
nthreads=$3
i=0
for fname in `ls *$1.part*`
do
    let i++
    sort $fname > $fname.$suffix &
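    # Throttle: after starting $nthreads background sorts, wait for that batch to finish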
    mres=$(($i % $nthreads))
    test "$mres" -eq 0 && wait
done
wait
sort -m *.$suffix 
rm $1.part*
Halfdan answered 21/6, 2011 at 22:27 Comment(2)
Interesting script, but it does nothing to answer this question.Caren
split -b will split by bytes, thus truncating the lines at an arbitrary positionGounod
