I have a 35GB CSV file. I want to read each line, and write the line out to a new CSV if it matches a condition.
    try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"));
         BufferedReader br = Files.newBufferedReader(Paths.get("source.csv"))) {
        br.lines().parallel() // line order is not preserved in the output
          .filter(line -> StringUtils.isNotBlank(line)) // bit more complex in real world
          .forEach(line -> {
              try {
                  writer.write(line + "\n");
              } catch (IOException e) { // write() throws a checked exception; rethrow unchecked inside the lambda
                  throw new UncheckedIOException(e);
              }
          });
    }
This takes approx. 7 minutes. Is it possible to speed up that process even more?
parallel makes it faster? And doesn't that shuffle the lines around? – Krystin

parallel() gives +1 min longer on top. I don't care about shuffled lines in a csv. – Documentary
Construct the BufferedWriter yourself, using the constructor that lets you set the buffer size. Maybe a bigger (or smaller) buffer size will make a difference. I would try to match the BufferedWriter buffer size to the host operating system buffer size. – Heterozygous

8192 – Documentary
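A minimal sketch of that suggestion, assuming Java 11+ (for String.isBlank), the file names from the question, and a simple non-blank check standing in for the real condition; the 64 KiB buffer size is only an illustrative starting point to tune:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BufferedCopy {
        public static void main(String[] args) throws IOException {
            int bufferSize = 64 * 1024; // illustrative; try sizes near the OS I/O buffer size
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     Files.newInputStream(Paths.get("source.csv")), StandardCharsets.UTF_8), bufferSize);
                 BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                     Files.newOutputStream(Paths.get("target.csv")), StandardCharsets.UTF_8), bufferSize)) {
                for (String line; (line = br.readLine()) != null; ) {
                    if (!line.isBlank()) { // stand-in for the real, more complex condition
                        writer.write(line);
                        writer.newLine();
                    }
                }
            }
        }
    }

Note that Files.newBufferedReader and Files.newBufferedWriter use a fixed default buffer of 8192 chars (presumably what the "8192" reply refers to), which is why choosing a custom size means wrapping the streams yourself as above.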
Use java.util.Scanner. It allows pattern matching right within the mutable buffer, rather than creating immutable String instances. Take care to extract only the intended portions, without obsolete intermediate substring operations. This could be improved even more by a custom implementation that allows passing the input buffer directly to the output writer (the fragment specified by offsets). Don't use BufferedReader/BufferedWriter. – Uhl
scanner.nextLine() still returns a String, so the conversion already took place, even if I apply scanner.skip(pattern) beforehand... – Documentary
"#> some text # some more text" and you want to read the delimiter # and then substring, say, from # to the end of the line? – Imparadise

#. – Documentary
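A rough sketch of the Scanner idea under the same assumptions (Java 10+ for the Scanner(Path, Charset) constructor): a pattern that matches whole non-blank lines replaces the readLine-plus-filter pass. The refinement Uhl describes, handing buffer fragments straight to the output writer by offsets, would need a custom implementation and is not shown here:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    public class ScannerFilter {
        // '.' does not match line terminators, so each match is one whole line
        // containing at least one non-whitespace character; blank lines never match.
        private static final Pattern NON_BLANK_LINE = Pattern.compile(".*\\S.*");

        public static void main(String[] args) throws IOException {
            try (Scanner scanner = new Scanner(Paths.get("source.csv"), StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
                for (String match; (match = scanner.findWithinHorizon(NON_BLANK_LINE, 0)) != null; ) {
                    writer.write(match);
                    writer.write('\n');
                }
            }
        }
    }

As Documentary's objection suggests, findWithinHorizon still materializes a String per match, so this form mainly merges reading and filtering into a single pattern pass; avoiding the String allocation entirely is exactly where the custom buffer-to-writer implementation would come in.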