split very large text file by max rows

I want to split a huge file containing strings into a set of new (smaller) files and tried to use NIO.2.

I do not want to load the whole file into memory, so I tried it with BufferedReader.

The smaller text files should be limited by the number of text rows.

The solution works; however, I want to ask if someone knows a solution with better performance using Java 8 (maybe lambdas with the stream() API?) and NIO.2:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    int i = 1;
    try (BufferedReader reader = Files.newBufferedReader(bigFile)) {
        String line = null;
        int lineNum = 1;

        Path splitFile = Paths.get(i + "split.txt");
        BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);

        while ((line = reader.readLine()) != null) {
            // start a new output file once the current one holds maxRows lines
            if (lineNum > maxRows) {
                writer.close();
                lineNum = 1;
                i++;
                splitFile = Paths.get(i + "split.txt");
                writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
            }
            writer.append(line);
            writer.newLine();
            lineNum++;
        }
        writer.close();
    }
}
Rubeola answered 28/8, 2014 at 16:28 Comment(3)
Since you're reading the file only once and sequentially, I don't think any API is likely to give you considerably better performance. Lambdas can make the code look better, but since your process is massively I/O-bound, they won't affect performance at all. – Alter
Thanks. In #25547250, NIO.2 was used with FileChannel, which performs better than a char-based reader; however, I guess there is no way to use FileChannel for this case, as I need access to the actual rows of the file. – Rubeola
Good point, yes, that's part of it too. If you wanted fixed-size chunks (every file is exactly 1 MB, for example), you could definitely save the cost of converting bytes into characters. – Alter
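For illustration, here is a minimal Java 8 sketch of the stream() variant the question mentions (not code from the thread; the method name is made up). Files.lines streams the file lazily, but the rollover state has to live in mutable holder arrays and the checked IOException must be wrapped, so it mainly changes how the code reads, not how fast it runs:

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public void splitTextFilesStream(Path bigFile, int maxRows) throws IOException {
    try (Stream<String> lines = Files.lines(bigFile)) {
        BufferedWriter[] writer = { null }; // mutable holder so the lambda can roll files over
        int[] state = { 0, 0 };             // state[0] = lines written, state[1] = file index
        try {
            lines.forEach(line -> {
                try {
                    if (state[0]++ % maxRows == 0) { // current file is full (or first line)
                        if (writer[0] != null) writer[0].close();
                        writer[0] = Files.newBufferedWriter(Paths.get(++state[1] + "split.txt"));
                    }
                    writer[0].append(line);
                    writer[0].newLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e); // streams can't throw checked exceptions
                }
            });
        } finally {
            if (writer[0] != null) writer[0].close();
        }
    }
}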

Beware of the difference between the direct use of InputStreamReader/OutputStreamWriter and their subclasses and the Reader/Writer factory methods of Files. While in the former case the system’s default encoding is used when no explicit charset is given, the latter always default to UTF-8. So I strongly recommend always specifying the desired charset, even if it’s either Charset.defaultCharset() or StandardCharsets.UTF_8, to document your intention and to avoid surprises if you switch between the various ways to create a Reader or Writer.
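For illustration, a short sketch making both defaults explicit ("input.txt" is a placeholder name):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Files factory: defaults to UTF-8 when no charset is given, so say it explicitly
BufferedReader utf8Reader =
    Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);

// InputStreamReader: defaults to the platform charset, so make that explicit too
BufferedReader platformReader = new BufferedReader(new InputStreamReader(
    new FileInputStream("input.txt"), Charset.defaultCharset()));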


If you want to split at line boundaries, there is no way around looking into the file’s contents. So you can’t optimize it the way you could when merging files.

If you are willing to sacrifice portability, you could try some optimizations. If you know that the charset encoding unambiguously maps '\n' to (byte)'\n', as is the case for most single-byte encodings as well as for UTF-8, you can scan for line breaks on the byte level to get the file positions for the split and avoid any data transfer from your application to the I/O system.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.*;

public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    MappedByteBuffer bb;
    try(FileChannel in = FileChannel.open(bigFile, READ)) {
        // map the entire file; the mapping stays valid after the channel is closed
        bb=in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
    }
    for(int start=0, pos=0, end=bb.remaining(), i=1, lineNum=1; pos<end; lineNum++) {
        // advance pos to just past the next '\n' (or to the end of the file)
        while(pos<end && bb.get(pos++)!='\n');
        if(lineNum < maxRows && pos<end) continue;
        Path splitFile = Paths.get(i++ + "split.txt");
        // if you want to overwrite existing files use CREATE, TRUNCATE_EXISTING
        try(FileChannel out = FileChannel.open(splitFile, CREATE_NEW, WRITE)) {
            // write the current chunk [start, pos) straight from the mapped buffer
            bb.position(start).limit(pos);
            while(bb.hasRemaining()) out.write(bb);
            bb.clear();
            start=pos;
            lineNum = 0;
        }
    }
}

The drawbacks are that it doesn’t work with encodings like UTF-16 or EBCDIC and that, unlike BufferedReader.readLine(), it won’t support a lone '\r' as line terminator, as used in old Mac OS 9.

Further, it only supports files smaller than 2 GB; the limit is likely even smaller on 32-bit JVMs due to the limited virtual address space. For files larger than the limit, it would be necessary to iterate over chunks of the source file and map them one after another (see the sketch below).
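A rough sketch of that chunked iteration, purely for illustration: writeCompleteLines is a hypothetical helper that would apply the same byte-level '\n' scan as the method above to one window and return how many bytes of complete chunks it consumed. It uses the same imports as the method above.

// sketch only: map the file in windows below the 2 GB limit
static final long WINDOW = 1L << 30; // 1 GiB per mapping, an arbitrary choice

public void splitHugeTextFile(Path bigFile, int maxRows) throws IOException {
    try (FileChannel in = FileChannel.open(bigFile, READ)) {
        long size = in.size();
        for (long offset = 0; offset < size; ) {
            long len = Math.min(WINDOW, size - offset);
            MappedByteBuffer bb = in.map(FileChannel.MapMode.READ_ONLY, offset, len);
            // hypothetical helper: scans for '\n' as above, writes complete
            // chunks, and returns the number of bytes it fully consumed
            long consumed = writeCompleteLines(bb, maxRows);
            if (consumed == 0) // a single line longer than the whole window
                throw new IOException("line exceeds mapping window");
            offset += consumed; // remap starting at the first unconsumed byte
        }
    }
}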

These issues could be fixed, but they would raise the complexity of this approach. Given that the speed improvement is only about 15% on my machine (I didn’t expect much more, as the I/O dominates here) and would shrink further as the complexity rises, I don’t think it’s worth it.


The bottom line is that for this task the Reader/Writer approach is sufficient, but you should pay attention to the Charset used for the operation.

Webbing answered 29/8, 2014 at 10:55 Comment(0)

I made a slight modification to @nimo23's code, adding the option of a header and a footer for each of the split files. It also outputs the files into a directory with the same name as the original file, with _split appended to it. The code is below:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public static void splitTextFiles(String fileName, int maxRows, String header, String footer) throws IOException
    {
        File bigFile = new File(fileName);
        int i = 1;
        String ext = fileName.substring(fileName.lastIndexOf(".")); // assumes the name has an extension

        String fileNoExt = bigFile.getName().replace(ext, "");
        File newDir = new File(bigFile.getParent() + File.separator + fileNoExt + "_split");
        newDir.mkdirs();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(fileName)))
        {
            String line = null;
            int lineNum = 1;
            Path splitFile = Paths.get(newDir.getPath() + File.separator + fileNoExt + "_" + String.format("%03d", i) + ext);
            BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
            while ((line = reader.readLine()) != null)
            {
                if (lineNum == 1)
                {
                    writer.append(header);
                    writer.newLine();
                }
                writer.append(line);
                writer.newLine();
                lineNum++;
                if (lineNum > maxRows)
                {
                    writer.append(footer);
                    writer.close();
                    lineNum = 1;
                    i++;
                    splitFile = Paths.get(newDir.getPath() + File.separator + fileNoExt + "_" + String.format("%03d", i) + ext);
                    writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
                }
            }
            if (lineNum > 1) // the last file received some lines; finish it
            {
                writer.append(footer);
                writer.close();
            }
            else // the last rollover opened a file that never received a line; drop it
            {
                writer.close();
                Files.delete(splitFile);
                i--;
            }
        }

        System.out.println("file '" + bigFile.getName() + "' split into " + i + " files");
    }
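For illustration, a hypothetical call (the path, row limit, and marker strings are made up):

// hypothetical usage: 1000 rows per file, each wrapped in header/footer lines
splitTextFiles("C:\\data\\records.txt", 1000, "<records>", "</records>");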
Eucalyptol answered 16/4, 2017 at 12:53 Comment(0)
