Merge huge files without loading whole file into memory?
Asked Answered
I want to merge huge files containing strings into one file and tried to use NIO.2. I do not want to load the whole files into memory, so I tried it with a BufferedReader:

public void mergeFiles(List<Path> filesToBeMerged) throws IOException {

    Path mergedFile = Paths.get("mergedFile");
    Files.createFile(mergedFile);

    List<Path> _filesToBeMerged = filesToBeMerged;

    try (BufferedWriter writer = Files.newBufferedWriter(mergedFile, StandardOpenOption.APPEND)) {
        for (Path file : _filesToBeMerged) {
            // this does not work, as append() does not accept a BufferedReader
            writer.append(Files.newBufferedReader(file));
        }
    } catch (IOException e) {
        System.err.println(e);
    }
}

I then tried the following, which works; however, the formatting of the strings (e.g. new lines etc.) is not copied to the merged file:

...
try (BufferedWriter writer = Files.newBufferedWriter(mergedFile, StandardOpenOption.APPEND)) {
    for (Path file : _filesToBeMerged) {
        String line = null;
        BufferedReader reader = Files.newBufferedReader(file);
        while ((line = reader.readLine()) != null) {
            writer.append(line);
            writer.append(System.lineSeparator());
        }
        reader.close();
    }
} catch (IOException e) {
    System.err.println(e);
}
...

How can I merge huge Files with NIO2 without loading the whole file into memory?

Rozina answered 28/8, 2014 at 10:38 Comment(0)
If you want to merge two or more files efficiently, you should ask yourself why on earth you are using char-based Readers and Writers to perform that task.

By using these classes you convert the files' bytes from the system's default encoding to Unicode and back from Unicode to the system's default encoding. This means the program has to perform two data conversions on the entire files.

And, by the way, BufferedReader and BufferedWriter are by no means NIO2 artifacts. These classes have existed since the very first version of Java.

When you use byte-wise copying via real NIO functions, the files can be transferred without the bytes ever being touched by the Java application; in the best case the transfer is performed directly in the file system's buffer:

import static java.nio.file.StandardOpenOption.*;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MergeFiles
{
  public static void main(String[] arg) throws IOException {
    if(arg.length<2) {
      System.err.println("Syntax: infiles... outfile");
      System.exit(1);
    }
    Path outFile=Paths.get(arg[arg.length-1]);
    System.out.println("TO "+outFile);
    try(FileChannel out=FileChannel.open(outFile, CREATE, WRITE)) {
      for(int ix=0, n=arg.length-1; ix<n; ix++) {
        Path inFile=Paths.get(arg[ix]);
        System.out.println(inFile+"...");
        try(FileChannel in=FileChannel.open(inFile, READ)) {
          // transferTo may transfer fewer bytes than requested, so loop until done
          for(long p=0, l=in.size(); p<l; )
            p+=in.transferTo(p, l-p, out);
        }
      }
    }
    System.out.println("DONE.");
  }
}
Underling answered 28/8, 2014 at 12:13 Comment(4)
Wow, this solution is really great - and the source code is so short. Thanks! Do you know a solution based on NIO2 for splitting a large file into a set of smaller files? Actually, I am using something like that todayguesswhat.blogspot.de/2014/05/….Rozina
@nimo23: well, I think when you try to understand the code of my answer, especially what FileChannel.transferTo does, you will realize what a solution for splitting can look like (read: very similar). If you have difficulties implementing it, you can open a new question.Underling
Okay, I will try it on my own and will provide a solution here!Rozina
Okay, I have posted a solution: #25554173. I could not find a solution with NIO2, as with NIO2 the split files can only be sized by byte count. However, I want to split the text files by row numbers. Do you see a (better) solution for the splitTextFiles()-method with NIO2?Rozina
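For reference, a byte-based split along the lines hinted at above can be sketched with FileChannel.transferFrom (the class name, method name, and part-file naming scheme are illustrative, not from the thread):

```java
import static java.nio.file.StandardOpenOption.*;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SplitFile {
  /** Splits inFile into consecutive parts of at most partSize bytes each. */
  public static void split(Path inFile, long partSize) throws IOException {
    try(FileChannel in=FileChannel.open(inFile, READ)) {
      long size=in.size();
      for(long pos=0, part=0; pos<size; part++) {
        long chunk=Math.min(partSize, size-pos);
        Path outFile=Paths.get(inFile+".part"+part);
        try(FileChannel out=FileChannel.open(outFile, CREATE, WRITE)) {
          // transferFrom may transfer fewer bytes than requested, so loop until done
          for(long done=0; done<chunk; )
            done+=out.transferFrom(in, done, chunk-done);
        }
        pos+=chunk;
      }
    }
  }
}
```

Note that this splits at byte boundaries, not line boundaries, so a line may be cut in half between two parts; splitting by row numbers would require inspecting the bytes for line separators.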

With

Files.newBufferedReader(file).readLine()

you create a new BufferedReader every time, so reading always restarts at the first line.

Replace with

BufferedReader reader = Files.newBufferedReader(file);
while ((line = reader.readLine()) != null) {
  writer.write(line);
}

and .close() the reader when done.
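Putting the fix together, with try-with-resources closing both the reader and the writer automatically (a sketch assuming the Java 8 charset-defaulting overloads; the class and method names are illustrative):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MergeLines {
  /** Merges the given text files line by line into mergedFile. */
  public static void merge(List<Path> filesToBeMerged, Path mergedFile) throws IOException {
    try (BufferedWriter writer = Files.newBufferedWriter(mergedFile)) {
      for (Path file : filesToBeMerged) {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
          String line;
          while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine(); // readLine() strips the line ending, so add one back
          }
        }
      }
    }
  }
}
```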

Paramagnetic answered 28/8, 2014 at 10:45 Comment(4)
Thanks, I made the changes in the source code. Do you know how I can retain the formatting of the merged files in the "mergedFile" file? For example, the merged files contain carriage returns or blank lines; when using the method above, none of this is copied into the "mergedFile".Rozina
Not sure what you mean, but you can manually append a new line using writer.write(System.lineSeparator());Paramagnetic
I am wondering which is more performant: the above solution, or the solution at programcreek.com/2012/09/merge-files-in-java. Do you know which one is more performant?Rozina
@Rozina write a test for it. You've got a large file, so execute the copying a couple of times and check how much time one method took versus the other.Niela
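A minimal timing harness along the lines of that suggestion (the merge implementation under test is passed in; all names are illustrative):

```java
import java.nio.file.Path;
import java.util.List;
import java.util.function.BiConsumer;

public class MergeBenchmark {
  /** Runs merge several times against the same inputs and returns the best wall-clock time in ms. */
  public static long bestOfMillis(int runs, BiConsumer<List<Path>, Path> merge,
                                  List<Path> sources, Path destination) {
    long best = Long.MAX_VALUE;
    for (int i = 0; i < runs; i++) {
      long start = System.nanoTime();
      merge.accept(sources, destination);
      best = Math.min(best, (System.nanoTime() - start) / 1_000_000);
    }
    return best;
  }
}
```

Taking the best of several runs reduces the impact of cold file-system caches and JIT warm-up on the comparison.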

readLine() does not yield the line ending ("\n" or "\r\n"). That was the error.

while ((line = reader.readLine()) != null) {
    writer.write(line);
    writer.write("\r\n"); // Windows
}

You might also disregard this filtering of (possibly different) line endings, and use

try (OutputStream out = new FileOutputStream(file)) {
    for (Path source : filesToBeMerged) {
        Files.copy(source, out);
        out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
    }
}

This writes a newline explicitly, to cover the case where the last line of a file does not end with a line break.

There might still be a problem with the optional, ugly Unicode BOM character to mark the text as UTF-8/UTF-16LE/UTF-16BE at the beginning of the file.
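For UTF-8 input, one way to deal with that is to copy bytes but skip a leading BOM on every file after the first (a sketch; the class name and helpers are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class BomAwareMerge {
  /** Concatenates the files, keeping at most the first file's UTF-8 BOM (EF BB BF). */
  public static void merge(List<Path> sources, Path destination) throws IOException {
    try (OutputStream out = Files.newOutputStream(destination)) {
      boolean first = true;
      for (Path source : sources) {
        try (InputStream in = Files.newInputStream(source)) {
          byte[] head = new byte[3];
          int n = readFully(in, head);
          // write the first three bytes unless they are a BOM on a non-first file
          if (first || n < 3 || !isBom(head)) out.write(head, 0, n);
          byte[] buf = new byte[8192];
          for (int r; (r = in.read(buf)) > 0; ) out.write(buf, 0, r);
        }
        first = false;
      }
    }
  }

  private static boolean isBom(byte[] b) {
    return b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF;
  }

  private static int readFully(InputStream in, byte[] b) throws IOException {
    int off = 0;
    for (int r; off < b.length && (r = in.read(b, off, b.length - off)) > 0; ) off += r;
    return off;
  }
}
```

UTF-16LE/UTF-16BE files would additionally need a byte-order-aware transcoding step, which is beyond this sketch.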

Beshore answered 28/8, 2014 at 11:9 Comment(0)

I tried merging files into one file in three ways. I tested them, but I still do not know which one is best. I expected FileChannel to be faster than the others, but it was not in my tests. Please let me know if you have any concerns.

  1. BufferedReader & BufferedWriter
    private static void mergeFiles(List<Path> sources, Path destination) {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(destination.toFile(), true))) {
            for (Path path : sources) {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(path.toFile())))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        writer.write(line);
                        writer.newLine();
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
  2. InputStream & Files.copy
    private static void mergeFiles2(List<Path> sources, Path destination) {
        try {
            BinaryOperator<InputStream> sequenceInputStream = SequenceInputStream::new;
            List<InputStream> inputStreams = new ArrayList<>();

            for (Path path : sources) {
                InputStream is = Files.newInputStream(path, StandardOpenOption.READ);
                inputStreams.add(is);
            }

            InputStream streams = inputStreams.parallelStream().reduce(sequenceInputStream).orElseThrow(() -> new IllegalStateException("inputStreams reduce exception"));
            Files.copy(streams, destination, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
  3. FileChannel
    private static void mergeFiles3(List<Path> sources, Path destination) {
        try (FileChannel desChannel = FileChannel.open(destination, StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
            for (Path path : sources) {
                try (FileChannel srcChannel = FileChannel.open(path, StandardOpenOption.READ)) {
                    for (long position = 0, size = srcChannel.size(); position < size; ) {
                        position += srcChannel.transferTo(position, size - position, desChannel);
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
Remediless answered 15/11, 2023 at 11:35 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.