The performance problem has nothing to do with the way the data is read. It is already buffered. Nothing happens until you actually iterate through the lines:
// measures wall-clock time taken by the enclosed block
def timed[A](block: => A): A = {
  val t0 = System.currentTimeMillis
  val result = block
  println("took " + (System.currentTimeMillis - t0) + "ms")
  result
}
val source = timed(scala.io.Source.fromFile("test.txt")) // 200mb, 500 lines
// took 0ms
val lines = timed(source.getLines)
// took 0ms
timed(lines.next) // read first line
// took 1ms
// ... reset source ...
var x = 0
timed(lines.foreach(ln => x += ln.length)) // "use" every line
// took 421ms
// ... reset source ...
timed(lines.toArray)
// took 915ms
Assuming a read speed of about 500mb per second for my hard drive, the optimal time for the 200mb would be roughly 400ms, so the only real improvement left is to avoid converting the iterator to an array.
Depending on your application, consider using the iterator directly instead of an Array; holding such a huge array in memory is going to be a performance problem anyway.
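As a minimal sketch of that streaming style (reusing the length sum from the measurements above; the file name is the same placeholder), each line is processed as it is read and no intermediate Array is built:
// process each line as it is read; no intermediate Array is built
val src = scala.io.Source.fromFile("test.txt")
try {
  var total = 0L
  src.getLines.foreach(ln => total += ln.length)
  println("total characters: " + total)
} finally {
  src.close()
}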
Edit: From your comments I assume that you want to transform the data further (e.g., split each line into columns, since you said you are reading a numeric array). In that case I recommend doing the transformation while reading. For example:
source.getLines.map(_.split(",").map(_.trim.toInt)).toArray
is considerably faster than
source.getLines.toArray.map(_.split(",").map(_.trim.toInt))
(For me it is 1.9s instead of 2.5s)
because you don't transform one entire giant array into another; each line is transformed individually, and you end up with a single array (which uses only about half the heap space). Also, since reading the file is the bottleneck, transforming while reading results in better CPU utilization.
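If you want to reproduce the comparison yourself, here is a sketch using the timed helper from above (same placeholder file name, and it assumes the file contains comma-separated integers; the exact numbers will differ per machine and file):
// transform while reading: each line becomes an Array[Int] as it is read
val parsedWhileReading = timed {
  val src = scala.io.Source.fromFile("test.txt")
  try src.getLines.map(_.split(",").map(_.trim.toInt)).toArray
  finally src.close()
}

// read everything into an Array[String] first, then transform:
// builds a second giant array and is measurably slower
val parsedAfterReading = timed {
  val src = scala.io.Source.fromFile("test.txt")
  try src.getLines.toArray.map(_.split(",").map(_.trim.toInt))
  finally src.close()
}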
Comments:

Source uses BufferedSource, which in turn uses Java's BufferedReader, so it already reads blocks of data into memory - it doesn't read byte-by-byte. – Alarum

fromFile() has an overloaded form that takes a bufferSize arg. Have you tried increasing that > 2048? – Steric

A version based on MappedByteBuffer runs considerably slower, spending 3/4 of its time converting Byte arrays to Strings. Since Source already makes efficient use of the old java.io classes, I suspect that there aren't any considerably faster solutions. – Hausmann
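For completeness, a minimal sketch of trying the bufferSize overload mentioned in the comments (this overload takes a java.io.File rather than a file name; the 1 MB value is an arbitrary example, the default buffer size is 2048 bytes):
// open the same placeholder file with a larger read buffer
import java.io.File
val bigBufferSource = scala.io.Source.fromFile(new File("test.txt"), 1024 * 1024)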