How to copy large data files line by line?

I have a 35GB CSV file. I want to read each line, and write the line out to a new CSV if it matches a condition.

try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
    try (BufferedReader br = Files.newBufferedReader(Paths.get("source.csv"))) {
        br.lines().parallel()
            .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
            .forEach(line -> {
                try {
                    writer.write(line + "\n");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    }
}

This takes approx. 7 minutes. Is it possible to speed up that process even more?

Documentary asked 22/10, 2019 at 9:47 Comment(15)
Yes, you could try not doing this from Java but rather do it directly from your Linux/Windows/etc. operating system. Java is interpreted, and there will always be an overhead in using it. Besides this, no, I don't see any obvious way to speed it up, and 7 minutes for 35GB seems reasonable to me.Jenicejeniece
Maybe removing the parallel makes it faster? And doesn't that shuffle the lines around?Krystin
Removing parallel() adds about a minute on top. I don't care about shuffled lines in a CSV.Documentary
Create the BufferedWriter yourself, using the constructor that lets you set the buffer size. Maybe a bigger (or smaller) buffer size will make a difference. I would try to match the BufferedWriter buffer size to the host operating system buffer size.Heterozygous
How can I know which buffer size is suitable? Default is 8192.Documentary
By trial and error.Sholapur
@TimBiegeleisen: "Java is interpreted" is misleading at best and almost always wrong as well. Yes, for some optimizations you might need to leave the JVM world, but doing this quicker in Java is definitely doable.Compeer
@JoachimSauer Then why didn't you post an answer? ^ ^Jenicejeniece
You should profile the application to see if there are any hotspots that you can do something about. You won't be able to do much about the raw IO (the default 8192 byte buffer isn't that bad, since there are sector sizes etc. involved), but there might be things happening (internally) that you might be able to work with.Oruro
On the non-functional side: how large is the output after the filtering logic you're performing? And how about splitting the file into chunks, performing the operation, and merging the results?Krawczyk
The resulting file is about 30GB.Documentary
Try java.util.Scanner. It allows pattern matching right within the mutable buffer, rather than creating immutable String instances. Take care to extract the intended portions without superfluous intermediate substring operations. This could be improved even more by a custom implementation that allows passing the input buffer directly to the output writer (the fragment specified by offsets). Don't use BufferedReader/BufferedWriter.Uhl
This sounds promising, could you give an example on scanner pattern matching? I mean: scanner.nextLine() still returns a String, so conversion already took place, even if I apply scanner.skipPattern() beforehand....Documentary
@Documentary Could you shed more light on the filtering? Is it something like "# some text # some more text", and you want to read up to the delimiter # and then take the substring, say, from # to the end of the line?Imparadise
In one of my cases (there are many), I want to skip anything that is contained within two separators, like #.Documentary
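
For a concrete picture of the Scanner idea from the comments above, here is a minimal sketch, assuming the filter is "drop whatever sits between two # separators"; the pattern, file names and charset are illustrative placeholders, not taken from the thread:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class ScannerFilter {

    // keep what comes before the first '#' and after the closing '#';
    // the optional middle group drops a "#...#" section if it is present
    private static final Pattern LINE =
            Pattern.compile("([^#\\r\\n]*)(?:#[^\\r\\n]*#)?([^\\r\\n]*)");

    public static void main(String[] args) throws IOException {
        try (Scanner scanner = new Scanner(Paths.get("source.csv"), StandardCharsets.UTF_8.name());
             BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
            while (scanner.hasNextLine()) {
                if (scanner.findInLine(LINE) != null) {   // matches against the scanner's buffer, current line only
                    MatchResult m = scanner.match();
                    if (m.end() > m.start()) {            // skip empty lines
                        writer.write(m.group(1));
                        writer.write(m.group(2));
                        writer.newLine();
                    }
                }
                if (scanner.hasNextLine()) {
                    scanner.nextLine();                   // consume the line terminator
                }
            }
        }
    }
}

Note that findInLine still materializes the matched text as a String; the gain here is mainly that the wanted pieces come straight out of the match groups, with no extra substring bookkeeping.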

If it is an option, you could use GZipInputStream/GZipOutputStream to minimize disk I/O.
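
A minimal sketch of that idea, assuming both files are kept gzip-compressed on disk; the .gz file names and the isEmpty() check are placeholders for the real paths and filter:

// uses java.util.zip.GZIPInputStream / GZIPOutputStream
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
         new GZIPInputStream(Files.newInputStream(Paths.get("source.csv.gz"))), StandardCharsets.UTF_8));
     BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
         new GZIPOutputStream(Files.newOutputStream(Paths.get("target.csv.gz"))), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (!line.isEmpty()) {            // stand-in for the real filter
            writer.write(line);
            writer.newLine();
        }
    }
}

Whether this helps depends on whether the job is disk-bound or CPU-bound; as the comments below note, on a fast disk or a ramdisk the extra compression work can make it slower.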

Files.newBufferedReader/Writer use a default buffer size, 8 KB I believe. You might try a larger buffer.

Converting to String (Unicode) slows things down and uses twice the memory. And the UTF-8 in use is not as simple to decode as StandardCharsets.ISO_8859_1.

Best would be if you can work with bytes for the most part and convert only the specific CSV fields you need to String.

A memory-mapped file might be the most appropriate. Parallelism might be used per file range, splitting up the file.

try (FileChannel sourceChannel = new RandomAccessFile("source.csv","r").getChannel(); ...
MappedByteBuffer buf = sourceChannel.map(...);

This will become a fair bit of code, getting the lines right on (byte) '\n', but it is not overly complex.
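
A minimal sketch of that approach, assuming lines end with (byte) '\n' and no single line is longer than the mapping window; the 256 MB window size and the isBlank() filter are placeholder choices, and mapping window by window also keeps each MappedByteBuffer below the 2 GB limit:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedCopy {

    private static final long WINDOW = 256L * 1024 * 1024;     // 256 MB per mapping, an arbitrary choice

    public static void main(String[] args) throws IOException {
        try (FileChannel source = FileChannel.open(Paths.get("source.csv"), StandardOpenOption.READ);
             FileChannel target = FileChannel.open(Paths.get("target.csv"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {

            long size = source.size();
            long windowStart = 0;

            while (windowStart < size) {
                long windowSize = Math.min(WINDOW, size - windowStart);
                MappedByteBuffer buf = source.map(FileChannel.MapMode.READ_ONLY, windowStart, windowSize);

                int lineStart = 0;                              // offset of the current line within this window
                for (int i = 0; i < buf.limit(); i++) {
                    if (buf.get(i) == '\n') {
                        writeIfMatching(buf, lineStart, i + 1, target);   // keep the '\n'
                        lineStart = i + 1;
                    }
                }

                if (windowStart + windowSize >= size) {         // last window
                    if (lineStart < buf.limit()) {              // final line without a trailing '\n'
                        writeIfMatching(buf, lineStart, buf.limit(), target);
                    }
                    break;
                }
                if (lineStart == 0) {
                    throw new IllegalStateException("a single line is longer than the mapping window");
                }
                // next window starts at the first incomplete line, so a line split
                // across two mappings is re-read in full
                windowStart += lineStart;
            }
        }
    }

    // copy bytes [from, to) of the mapped window to the target if the line passes the filter
    private static void writeIfMatching(MappedByteBuffer buf, int from, int to, FileChannel target) throws IOException {
        if (isBlank(buf, from, to)) {                           // stand-in for the real, more complex condition
            return;
        }
        ByteBuffer slice = buf.duplicate();
        slice.limit(to);
        slice.position(from);
        while (slice.hasRemaining()) {
            target.write(slice);
        }
    }

    // placeholder filter: a line is "blank" if it holds only whitespace bytes
    private static boolean isBlank(ByteBuffer buf, int from, int to) {
        for (int i = from; i < to; i++) {
            byte b = buf.get(i);
            if (b != '\n' && b != '\r' && b != ' ' && b != '\t') {
                return false;
            }
        }
        return true;
    }
}

Everything stays at the byte level; for more throughput the per-line channel writes could be batched, and the windows could be handed to several threads as suggested above.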

Anna answered 22/10, 2019 at 10:17 Comment(8)
The problem with reading bytes is that in the real world I have to evaluate the beginning of the line, substring on a specific character and only write the remaining part of the line into the outfile. So I probably cannot read the lines as bytes only?Documentary
I just tested GZipInputStream + GZipOutputStream fully in memory on a ramdisk. Performance was much worse...Documentary
On Gzip: then it is not a slow disk. Yes, bytes are an option: newlines, comma, tab, semicolon can all be handled as bytes, and will be considerably faster than as String. Going through String means: bytes (UTF-8) to UTF-16 chars to String and back to UTF-8 bytes.Anna
While this sounds promising, how could I use the MappedByteBuffer beyond the 2GB filesize limit?Documentary
Just map different parts of the file over time. When you reach the limit, just create a new MappedByteBuffer from the last known-good position (FileChannel.map takes longs).Compeer
@JoachimSauer is right, with the problem of the last line broken over 2 buffers. Thanks for the tip.Anna
So there is no example on the net how to read/write files > 2GB with MappedByteBuffer?Documentary
In 2019, there is no need to use new RandomAccessFile(…).getChannel(). Just use FileChannel.open(…).Uhl

You can try this:

try (BufferedWriter writer = new BufferedWriter(new FileWriter(targetFile), 1024 * 1024 * 64)) {
  try (BufferedReader br = new BufferedReader(new FileReader(sourceFile), 1024 * 1024 * 64)) {

I think it will save you one or two minutes. On my machine the test finishes in about 4 minutes when the buffer size is specified.

Could it be faster? Try this:

final char[] cbuf = new char[1024 * 1024 * 128];

try (Writer writer = new FileWriter(targetFile)) {
  try (Reader br = new FileReader(sourceFile)) {
    int cnt = 0;
    while ((cnt = br.read(cbuf)) > 0) {
      // add your code to process/split the buffer into lines.
      writer.write(cbuf, 0, cnt);
    }
  }
}

This should save you three or four minutes.

If that's still not enough (I guess you ask because you need to execute the task repeatedly): if you want to get it done in one minute or even a couple of seconds, you should process the data, load it into a database, and distribute the task across multiple servers.

Confection answered 22/10, 2019 at 22:3 Comment(1)
To your last example: how can I then evaluate the cbuf content, and only write portions out? And would I have to reset the buffer once full? (how can I know the buffer is full?)Documentary
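
One possible shape for that, sketched with a hypothetical matches() standing in for the real filter: scan each chunk for '\n', process only complete lines, and carry the unfinished tail over to the next read. The value of cnt tells you how many chars of cbuf are valid, so the buffer never needs an explicit reset:

final char[] cbuf = new char[1024 * 1024 * 128];

try (Writer writer = new BufferedWriter(new FileWriter(targetFile), 1024 * 1024)) {
  try (Reader br = new FileReader(sourceFile)) {
    StringBuilder carry = new StringBuilder();            // unfinished line from the previous chunk
    int cnt;
    while ((cnt = br.read(cbuf)) > 0) {                   // cnt = number of valid chars in cbuf for this round
      int lineStart = 0;
      for (int i = 0; i < cnt; i++) {
        if (cbuf[i] == '\n') {
          carry.append(cbuf, lineStart, i - lineStart);   // complete line = carry-over + current slice
          String line = carry.toString();
          carry.setLength(0);
          if (matches(line)) {                            // hypothetical stand-in for the real filter/substring logic
            writer.write(line);
            writer.write('\n');
          }
          lineStart = i + 1;
        }
      }
      carry.append(cbuf, lineStart, cnt - lineStart);     // tail that has no '\n' yet
    }
    if (carry.length() > 0 && matches(carry.toString())) {
      writer.write(carry.toString());                     // last line without a trailing newline
    }
  }
}

No reset is needed because each read() simply overwrites cbuf from index 0 and reports how many chars it filled.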

Thanks to all your suggestions, the fastest I came up with was exchanging the writer for a BufferedOutputStream, which gave approx. a 25% improvement:

try (BufferedReader reader = Files.newBufferedReader(Paths.get("sample.csv"))) {
    try (BufferedOutputStream writer = new BufferedOutputStream(Files.newOutputStream(Paths.get("target.csv")), 1024 * 16)) {
        reader.lines().parallel()
                .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
                .forEach(line -> {
                    try {
                        // note: getBytes() uses the platform default charset
                        writer.write((line + "\n").getBytes());
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
    }
}

Still, the BufferedReader performs better than a BufferedInputStream in my case.

Documentary answered 24/10, 2019 at 14:12 Comment(0)
