The UNIX sort
command can sort a very large file like this:
sort large_file
How is the sort algorithm implemented?
How come it does not cause excessive consumption of memory?
The UNIX sort
command can sort a very large file like this:
sort large_file
How is the sort algorithm implemented?
How come it does not cause excessive consumption of memory?
The Algorithmic details of UNIX Sort command says Unix Sort uses an External R-Way merge sorting algorithm. The link goes into more details, but in essence it divides the input up into smaller portions (that fit into memory) and then merges each portion together at the end.
The sort
command stores working data in temporary disk files (usually in /tmp
).
-T
to specify the temp dir –
Quibbling WARNING: This script starts one shell per chunk, for really large files, this could be hundreds.
Here is a script I wrote for this purpose. On a 4 processor machine it improved the sort performance by 100% !
#! /bin/ksh
MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted
usage ()
{
echo Parallel sort
echo usage: psort file1 file2
echo Sorts text file file1 and stores the output in file2
echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
echo and each chunk will be sorted in parallel
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
usage
exit
fi
#Cleanup any lefover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE
#Splitting $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX
for file in $CHUNK_FILE_PREFIX*
do
sort $file > $file.sorted &
done
wait
#Merging chunks to $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE
#Cleanup any lefover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
See also: "Sorting large files faster with a shell script"
sort -u
in the for loop! –
Exoteric #!/bin/bash
usage ()
{
echo Parallel sort
echo usage: psort file1 file2
echo Sorts text file file1 and stores the output in file2
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
usage
exit
fi
pv $1 | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > $2
I'm not familiar with the program but I guess it is done by means of external sorting (most of the problem is held in temporary files while relatively small part of the problem is held in memory at a time). See Donald Knuth's The Art of Computer Programming, Vol. 3 Sorting and Searching, Section 5.4 for very in-depth discussion of the subject.
Look carefully at the options of sort to speed performance and understand it's impact on your machine and problem. Key parameters on Ubuntu are
The questioner asks "Why no high memory usage?" The answer to that comes from history, older unix machines were small and the default memory size is set small. Adjust this as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 * the size of the file being sorted.
How to use -T option for sorting large file
I have to sort a large file's 7th column.
I was using:
grep vdd "file name" | sort -nk 7 |
I faced below error:
******sort: write failed: /tmp/sort1hc37c: No space left on device******
then I used -T option as below it worked:
grep vdda "file name" | sort -nk 7 -T /dev/null/ |
Memory should not be a problem - sort already takes care of that. If you want make optimal usage of your multi-core CPU I have implementend this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).
#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example a the file largefile is split into chunks of 20 MB.
# The part are sorted in 4 simultaneous threads before getting merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -b $2 $1 $1.part
suffix=sorttemp.`date +%s`
nthreads=$3
i=0
for fname in `ls *$1.part*`
do
let i++
sort $fname > $fname.$suffix &
mres=$(($i % $nthreads))
test "$mres" -eq 0 && wait
done
wait
sort -m *.$suffix
rm $1.part*
© 2022 - 2024 — McMap. All rights reserved.