How do I sort very large files

I have some files that need to be sorted according to the id at the beginning of each line. The files are about 2–3 GB each.

I tried reading all the data into an ArrayList and sorting it, but there is not enough memory to hold it all, so that does not work.

Lines look like

0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013

How can I sort these files?

Glass answered 27/10, 2011 at 15:10 Comment(3)
If you use a recent version of Java 6 you will need about 4 GB of memory. I assume you don't have that much? – Trifid
What if you read just the ids into the ArrayList and sort them? – Urbannal
The authors of all the sort packages mentioned in these answers don't seem to have ever heard of replacement selection or minimum-path-length merging. Ergo they are all suboptimal, some of them severely so. See Knuth, The Art of Computer Programming, vol. 3, section 5.4, "External Sorting". All this stuff was worked out over 60 years ago. A lost art apparently. – Bodine

That isn't exactly a Java problem. You need to look into an efficient algorithm for sorting data that isn't completely read into memory. A few adaptations to Merge-Sort can achieve this.

Take a look at this: http://en.wikipedia.org/wiki/Merge_sort

and: http://en.wikipedia.org/wiki/External_sorting

Basically the idea here is to break the file into smaller pieces, sort them (either with merge sort or another method), and then use the Merge from merge-sort to create the new, sorted file.
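A minimal sketch of that split-then-merge idea in Java (class and method names are illustrative, not from any library; since the ids in the question are zero-padded to a fixed width, plain string comparison already sorts them numerically):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {

    // One sorted chunk file together with its smallest unread line.
    private static final class ChunkReader {
        final BufferedReader reader;
        String current;
        ChunkReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p);
            current = reader.readLine();
        }
        void advance() throws IOException { current = reader.readLine(); }
    }

    // Sort one chunk in memory and write it to a temp file.
    private static Path writeChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, lines);
        lines.clear();
        return chunk;
    }

    public static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
        // Phase 1: split the input into sorted chunks small enough to fit in memory.
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesInMemory) chunks.add(writeChunk(buffer));
            }
            if (!buffer.isEmpty()) chunks.add(writeChunk(buffer));
        }
        // Phase 2: k-way merge; the heap holds one pending line per chunk,
        // so memory use is O(number of chunks), not O(file size).
        PriorityQueue<ChunkReader> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkReader c) -> c.current));
        for (Path p : chunks) {
            ChunkReader c = new ChunkReader(p);
            if (c.current != null) heap.add(c);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                ChunkReader c = heap.poll();
                out.write(c.current);
                out.newLine();
                c.advance();
                if (c.current != null) heap.add(c); else c.reader.close();
            }
        }
        for (Path p : chunks) Files.deleteIfExists(p);
    }
}
```

For 2–3 GB files, a chunk size of a few hundred thousand lines keeps each in-memory sort comfortably small.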

Serviceable answered 27/10, 2011 at 15:15 Comment(3)
1) Are you saying that we sort each of the smaller files, write the sorted data back into each of them, and then later read from each of these files again while writing into a new, bigger final file? 2) Will this scale for a 16 GB flat file on 8 GB of RAM, or would we run into memory issues in any scenario with this solution? – Anuria
Yes, that's the basic idea. It'll work, since you never load all the files into memory; you only need to look at the start of each file (similar to a merge sort). – Serviceable
@Serviceable Suppose I have a file which is 1 GB in size and I want to split it into MB-sized pieces. How can I do that in Java? – Paganini

Since your records are already in flat file text format, you can pipe them into UNIX sort(1) e.g. sort -n -t' ' -k1,1 < input > output. It will automatically chunk the data and perform merge sort using available memory and /tmp. If you need more space than you have memory available, add -T /tmpdir to the command.

It's quite funny that everyone is telling you to download huge C# or Java libraries or implement merge-sort yourself when you can use a tool that is available on every platform and has been around for decades.
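If the rest of your pipeline is in Java, you can still delegate to the same tool rather than reimplementing it. A sketch (assuming a Unix-like system with sort on the PATH; since the ids are zero-padded to a fixed width, a lexical sort on the first field is already numeric):

```java
import java.io.File;
import java.io.IOException;

public class SortViaUnix {
    // Equivalent to running: sort -t' ' -k1,1 input > output
    public static void sortFile(File input, File output)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("sort", "-t", " ", "-k1,1", input.getPath())
                .redirectOutput(output)                        // shell-free "> output"
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        int code = p.waitFor();
        if (code != 0) {
            throw new IOException("sort exited with code " + code);
        }
    }
}
```

Note that the redirection is done with ProcessBuilder.redirectOutput rather than a ">" inside the command string, because Runtime.exec/ProcessBuilder do not go through a shell.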

Disembodied answered 17/8, 2016 at 17:53 Comment(2)
Sorry for downvoting, but the original question has the java tag and there is no information about using *nix. – Omaromara
I think this is the best answer here, even considering the Java tag. OP mentioned that he needs to sort some files, not that he is required to do it with Java. Even if OP is on Windows, he can still easily get the sort executable. – Ditter

You need an external merge sort to do that. Here is a Java implementation of it that sorts very large files.

Hydrocellulose answered 27/10, 2011 at 15:15 Comment(2)
It seems to be the only available Java library that has this functionality. Have you used it in a production environment? – Talley
I have just used this library to sort a 24 GB csv file (roughly 850 million lines of text data), and it worked very well. It is straightforward to use, with a custom comparator to specify how I want it to sort. So I can definitely recommend this implementation. – Bristol

Instead of loading all the data into memory at once, you could read just the keys and an index to where the line starts (and possibly the length as well) e.g.

class Line {
   int key, length;
   long start;
}

This would use about 40 bytes per line.

Once you have sorted this array, you can use RandomAccessFile to read the lines in the order they appear.

Note: since you will be hitting the disk randomly rather than reading from memory, this could be very slow. A typical disk takes 8 ms per random access, so with 10 million lines this would take about a day (that is the absolute worst case). In memory it would take about 10 seconds.
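A sketch of this index-and-seek approach (class and method names are illustrative; it assumes the key is the leading digits of each line and that no line exceeds the read buffer):

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class IndexSort {
    // The key plus the byte offset/length of the full line in the file.
    static final class Line {
        final int key;
        final long start;
        final int length;
        Line(int key, long start, int length) {
            this.key = key; this.start = start; this.length = length;
        }
    }

    public static void sort(File input, File output) throws IOException {
        // Pass 1: read only the keys and remember where each line starts.
        List<Line> index = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(input, "r")) {
            long start = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                long end = raf.getFilePointer();
                int key = Integer.parseInt(line.substring(0, line.indexOf(' ')));
                index.add(new Line(key, start, (int) (end - start)));
                start = end;
            }
        }
        index.sort(Comparator.comparingInt(l -> l.key));

        // Pass 2: seek to each line in key order and copy it out.
        try (RandomAccessFile raf = new RandomAccessFile(input, "r");
             BufferedWriter out = new BufferedWriter(new FileWriter(output))) {
            byte[] buf = new byte[1024];   // assumes no line is longer than 1 KB
            for (Line l : index) {
                raf.seek(l.start);
                raf.readFully(buf, 0, l.length);
                // The stored length includes the line terminator; trim it off.
                out.write(new String(buf, 0, l.length, "ISO-8859-1").trim());
                out.newLine();
            }
        }
    }
}
```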

Trifid answered 27/10, 2011 at 15:31 Comment(0)

You need to perform an external sort. It's kind of the driving idea behind Hadoop/MapReduce, except that an external sort doesn't take a distributed cluster into account and runs on a single node.

For better performance, you should use Hadoop/Spark.

Change these lines according to your system: fPath is your one big input file (tested with 20 GB), shared is where the execution log is stored, and fdir is where the intermediate files will be stored and merged.

    public static final String fdir = "/tmp/";
    public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
    public static final String fPath = "/input/data-20GB.in";
    public static final String opLog = shared + "Mysort20GB.log";

Then run the following program. Your final sorted file will be created with the name op401 in the fdir path. The last line, the Runtime.getRuntime().exec(...) call to valsort, checks whether the output is sorted. Remove that line if you don't have valsort installed or if the input file was not generated using gensort (http://www.ordinal.com/gensort.html).

Also don't forget to change int totalLines = 200000000; to the total number of lines in your file. The thread count (int threadCount = 16) should always be a power of 2, and large enough that (total size * 2 / thread count) of data can reside in memory. Changing the thread count changes the name of the final output file: for 16 it is op401, for 32 it is op501, for 8 it is op301, etc.

Enjoy.

    import java.io.*;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.stream.Stream;


    class SplitFile extends Thread {
        String fileName;
        int startLine, endLine;

        SplitFile(String fileName, int startLine, int endLine) {
            this.fileName = fileName;
            this.startLine = startLine;
            this.endLine = endLine;
        }

        public static void writeToFile(BufferedWriter writer, String line) {
            try {
                writer.write(line + "\r\n");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void run() {
            try {
                BufferedWriter writer = Files.newBufferedWriter(Paths.get(fileName));
                int totalLines = endLine + 1 - startLine;
                Stream<String> chunks =
                        Files.lines(Paths.get(Mysort20GB.fPath))
                                .skip(startLine - 1)
                                .limit(totalLines)
                                .sorted(Comparator.naturalOrder());

                chunks.forEach(line -> {
                    writeToFile(writer, line);
                });
                System.out.println(" Done Writing " + Thread.currentThread().getName());
                writer.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

    class MergeFiles extends Thread {
        String file1, file2, file3;
        MergeFiles(String file1, String file2, String file3) {
            this.file1 = file1;
            this.file2 = file2;
            this.file3 = file3;
        }

        public void run() {
            try {
                System.out.println(file1 + " Started Merging " + file2 );
                FileReader fileReader1 = new FileReader(file1);
                FileReader fileReader2 = new FileReader(file2);
                FileWriter writer = new FileWriter(file3);
                BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
                BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
                String line1 = bufferedReader1.readLine();
                String line2 = bufferedReader2.readLine();
                //Merge 2 files based on which string is greater.
                while (line1 != null || line2 != null) {
                    if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                        writer.write(line2 + "\r\n");
                        line2 = bufferedReader2.readLine();
                    } else {
                        writer.write(line1 + "\r\n");
                        line1 = bufferedReader1.readLine();
                    }
                }
                System.out.println(file1 + " Done Merging " + file2 );
                new File(file1).delete();
                new File(file2).delete();
                writer.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

    public class Mysort20GB {
        //public static final String fdir = "/Users/diesel/Desktop/";
        public static final String fdir = "/tmp/";
        public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
        public static final String fPath = "/input/data-20GB.in";
        public static final String opLog = shared+"Mysort20GB.log";

        public static void main(String[] args) throws Exception{
            long startTime = System.nanoTime();
            int threadCount = 16; // Number of threads
            int totalLines = 200000000;
            int linesPerFile = totalLines / threadCount;
            ArrayList<Thread> activeThreads = new ArrayList<Thread>();

            for (int i = 1; i <= threadCount; i++) {
                int startLine = i == 1 ? i : (i - 1) * linesPerFile + 1;
                int endLine = i * linesPerFile;
                SplitFile mapThreads = new SplitFile(fdir + "op" + i, startLine, endLine);
                activeThreads.add(mapThreads);
                mapThreads.start();
            }
            activeThreads.stream().forEach(t -> {
                try {
                    t.join();
                } catch (Exception e) {
                }
            });

            int treeHeight = (int) (Math.log(threadCount) / Math.log(2));

            for (int i = 0; i < treeHeight; i++) {
                ArrayList<Thread> actvThreads = new ArrayList<Thread>();

                // the number of chunk files at level i is threadCount / 2^i,
                // so j (the file index within the level) runs up to that count
                for (int j = 1, itr = 1; j <= threadCount / (1 << i); j += 2, itr++) {
                    int offset = i * 100;
                    String tempFile1 = fdir + "op" + (j + offset);
                    String tempFile2 = fdir + "op" + ((j + 1) + offset);
                    String opFile = fdir + "op" + (itr + ((i + 1) * 100));

                    MergeFiles reduceThreads =
                            new MergeFiles(tempFile1,tempFile2,opFile);
                    actvThreads.add(reduceThreads);
                    reduceThreads.start();
                }
                actvThreads.stream().forEach(t -> {
                    try {
                        t.join();
                    } catch (Exception e) {
                    }
                });
            }
            long endTime = System.nanoTime();
            double timeTaken = (endTime - startTime)/1e9;
            System.out.println(timeTaken);
            BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true));
            logFile.write("Time Taken in seconds:" + timeTaken);
            // run through a shell so the "> opLog" redirection works (exec does not
            // use a shell); (treeHeight * 100 + 1) must be parenthesized so the file
            // name is "op401" rather than "op4001"
            Runtime.getRuntime().exec(new String[] {"bash", "-c",
                    "valsort " + fdir + "op" + (treeHeight * 100 + 1) + " > " + opLog});
            logFile.close();
        }
    }
Gershwin answered 15/4, 2018 at 8:5 Comment(0)

Use the Java library big-sorter, which can sort very large text or binary files.

Here's how your exact problem would be implemented:

// write the input to a file
String s = "0052304 0000004000000000000000000000000000000041   John Teddy   000023\n"
        + "0022024 0000004000000000000000000000000000000041   George Clan 00013";
File input = new File("target/input");
Files.write(input.toPath(), s.getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE, StandardOpenOption.WRITE);

File output = new File("target/output");


//sort the input
Sorter
    .serializerLinesUtf8()
    .comparator((a,b) -> {
        String ida = a.substring(0, a.indexOf(' '));
        String idb = b.substring(0, b.indexOf(' '));
        return ida.compareTo(idb);
    }) 
    .input(input) 
    .output(output) 
    .sort();

// display the output
Files.readAllLines(output.toPath()).forEach(System.out::println);

output:

0022024 0000004000000000000000000000000000000041   George Clan 00013
0052304 0000004000000000000000000000000000000041   John Teddy   000023
Crumby answered 29/5, 2019 at 2:9 Comment(0)

You can use a SQLite file database: load the data into the database and then let it sort and return the results for you.

Advantage: no need to worry about writing the best sorting algorithm.

Disadvantages: you will need disk space, and processing is slower.

https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files

Blanco answered 14/2, 2013 at 11:0 Comment(0)

What you need to do is read the file in chunks via a stream and sort each chunk separately. Then you can merge the chunks together, since they will already be sorted; this is similar to how merge sort works.

The answer from this SO question will be of value: Stream large files

Danner answered 27/10, 2011 at 15:13 Comment(0)

Operating systems come with a powerful file-sorting utility. A simple function that calls a bash script should help.

public static void runScript(final Logger log, final String scriptFile) throws IOException, InterruptedException {
    final String command = scriptFile;
    if (!new File (command).exists() || !new File(command).canRead() || !new File(command).canExecute()) {
        log.log(Level.SEVERE, "Cannot find or read " + command);
        log.log(Level.WARNING, "Make sure the file is executable and you have permissions to execute it. Hint: use \"chmod +x filename\" to make it executable");
        throw new IOException("Cannot find or read " + command);
    }
    final int returncode = Runtime.getRuntime().exec(new String[] {"bash", "-c", command}).waitFor();
    if (returncode!=0) {
        log.log(Level.SEVERE, "The script returned an Error with exit code: " + returncode);
        throw new IOException();
    }

}
Appendicitis answered 19/7, 2014 at 5:50 Comment(1)
The main problems with external sort are: (1) lack of OS portability (2) it is hard to sort java objects, even using serialization.Scopoline

I used my own logic to sort a big JSON file whose lines have the format

{"name":"hoge.finance","address":"0xfAd45E47083e4607302aa43c65fB3106F1cd7607"}

Full source code is available at https://github.com/sitetester/token-sorter along with a test case. The code is well documented, so it is easy to understand.

It splits the input file into multiple smaller sorted files (configurable) and then compares the data.

Pasting some comments here...

// at this point, we have sorted data sets in respective files
// next, we will take first token from first file and compare it with tokens of all other files
// during comparison, if some token from other file is in sorted order, then we make it default/initial sorted token
// & jump to next file, since all remaining tokens in THAT file are already in sorted form
// at end of comparisons with all files, we remove it from specific file (so it's not compared next time) and put/append in final sorted file
// this process continues, until all entries are matched
// if some file has no entries, then we simply delete it (so it's not compared next time)
Subdual answered 2/11, 2021 at 18:7 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.