Scala fast text file read and upload to memory

Asked 11/4, 2014 at 8:43 Answered 28/4, 2014 at 16:7

Solved file scala io scalaz scalaz-stream

In Scala, a common way to read a text file and load it into an array is

scala.io.Source.fromFile("file.txt").getLines.toArray

For very large files in particular, is there a faster approach, perhaps reading blocks of bytes into memory first and then splitting them on newline characters? (See Read entire file in Scala for commonly used approaches.)

Many Thanks.

Crudity answered 11/4, 2014 at 8:43 Comment(10)

Note that Source uses BufferedSource, which in turn uses Java's BufferedReader. So it already reads blocks of data into memory - it doesn't read byte-by-byte. – Alarum 11/4, 2014 at 8:56

@Alarum many thanks for the observation, wondering if there are (even) faster approaches, perhaps with java.nio ... – Crudity 11/4, 2014 at 9:3

please, define very large files and what you're going to do with that data (after splitting it in lines) – Hoedown 11/4, 2014 at 10:30

@Hoedown numerical arrays 20.000 x 500 ~ 200MB – Crudity 11/4, 2014 at 11:38

Next, obvious question is: how fast is your current approach and how fast would be fast enough? – Brundage 11/4, 2014 at 13:30

@patryk-wiek faster than the version above, on the same machine and for the same (very large) input file; for small files the version above is likely the fastest, given the possible overhead of more sophisticated approaches... – Crudity 13/4, 2014 at 16:40

fromFile() has an overloaded form that takes a bufferSize arg. Have you tried increasing that > 2048? – Steric 23/4, 2014 at 5:33

Quick approach, you could change the buffer size as @Steric says. If that is not enough, see https://mcmap.net/q/94591/-faster-way-to-read-file and nadeausoftware.com/articles/2008/02/… and use the code there to create something similar in Scala. Then, please, report back :D – Crista 23/4, 2014 at 14:11

I attempted a solution to this with NIO's MappedByteBuffer and it runs considerably slower, spending 3/4 of its time converting Byte arrays to Strings. Source already makes efficient use of the old java.io classes, so I suspect that there aren't any considerably faster solutions. – Hausmann 23/4, 2014 at 14:58

Have you tried profiling your current code (VisualVM's Sampler is pretty good for this kind of thing), to see what it's spending most of its time doing? Once you know that, you know what you need to target for optimisation. – Telepathist 23/4, 2014 at 15:26

The performance problem has nothing to do with the way the data is read. It is already buffered. Nothing happens until you actually iterate through the lines:

// measures time taken by enclosed code
def timed[A](block: => A) = {
  val t0 = System.currentTimeMillis
  val result = block
  println("took " + (System.currentTimeMillis - t0) + "ms")
  result
}

val source = timed(scala.io.Source.fromFile("test.txt")) // 200mb, 500 lines
// took 0ms

val lines = timed(source.getLines)
// took 0ms

timed(lines.next) // read first line
// took 1ms

// ... reset source ...

var x = 0
timed(lines.foreach(ln => x += ln.length)) // "use" every line
// took 421ms

// ... reset source ...

timed(lines.toArray)
// took 915ms
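
The "reset source" steps are elided above. A self-contained sketch of the same comparison might look like the following; it re-opens the file for every measurement instead of resetting the source. The names `ReadTimings`, `withLines` and `writeSample` are hypothetical helpers introduced for illustration, the generated test file is only an assumption about the data shape (comma-separated integers, 20,000 x 500 as mentioned in the comments), and the timings it prints will differ from the numbers quoted here.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

object ReadTimings {
  // Runs the block once and returns its result together with the elapsed milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val t0 = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - t0) / 1000000L)
  }

  // Opens a fresh Source for each call and always closes it afterwards.
  // `f` must consume the iterator strictly, because the file is closed on return.
  def withLines[A](path: Path)(f: Iterator[String] => A): A = {
    val source = Source.fromFile(path.toFile, "UTF-8")
    try f(source.getLines()) finally source.close()
  }

  // Hypothetical test data: `rows` lines of `cols` comma-separated integers.
  def writeSample(rows: Int, cols: Int): Path = {
    val path = Files.createTempFile("sample", ".txt")
    val writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)
    try {
      for (r <- 0 until rows) {
        writer.write((0 until cols).map(c => r + c).mkString(","))
        writer.newLine()
      }
    } finally writer.close()
    path
  }

  def main(args: Array[String]): Unit = {
    val path = writeSample(rows = 20000, cols = 500)
    try {
      // Touch every line without keeping it.
      val (chars, iterMs) =
        timed(withLines(path)(lines => lines.foldLeft(0L)((sum, ln) => sum + ln.length)))
      // Keep every line in an array.
      val (all, arrayMs) =
        timed(withLines(path)(lines => lines.toArray[String]))
      println(s"iterate only: $chars chars in ${iterMs}ms")
      println(s"toArray: ${all.length} lines in ${arrayMs}ms")
    } finally Files.delete(path)
  }
}
```

Running each measurement several times and discarding the first runs would give steadier numbers, since the JIT compiler and the operating system's file cache both affect a single run.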

Assuming a read speed of 500 MB per second for my hard drive, the best possible time for the 200 MB file is about 400 ms. Iterating over the lines already takes 421 ms, so there is no room for improvement other than not converting the iterator to an array.

Depending on your application, you could consider using the iterator directly instead of an Array, because working with such a huge array in memory will be a performance issue anyway.
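
As an illustration of using the iterator directly, here is a minimal sketch that computes per-column sums while holding only one line in memory at a time. `StreamingStats` and `columnSums` are hypothetical names, and the sketch assumes a comma-separated file of integers in which every row has the same number of columns.

```scala
import java.nio.file.Path
import scala.io.Source

object StreamingStats {
  // Sums every column of a comma-separated integer file, one line at a time.
  // Assumes all non-empty rows have as many fields as the first one.
  def columnSums(path: Path): Array[Long] = {
    val source = Source.fromFile(path.toFile, "UTF-8")
    try {
      var sums = Array.empty[Long]
      for (line <- source.getLines() if line.nonEmpty) {
        val fields = line.split(",")
        // Size the accumulator from the first row.
        if (sums.isEmpty) sums = new Array[Long](fields.length)
        var i = 0
        while (i < fields.length) {
          sums(i) += fields(i).trim.toLong
          i += 1
        }
      }
      sums
    } finally source.close()
  }
}
```

Whether this is possible depends on the computation: it works for anything that can be expressed as a single pass over the lines, but not for algorithms that need random access to all rows.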

Edit: From your comments I assume that you want to transform the array further (maybe split the lines into columns, since you said you are reading a numeric array). In that case I recommend doing the transformation while reading. For example:

source.getLines.map(_.split(",").map(_.trim.toInt)).toArray

is considerably faster than

source.getLines.toArray.map(_.split(",").map(_.trim.toInt))

(for me it takes 1.9 s instead of 2.5 s), because you don't transform one entire giant array into another; you transform each line individually and end up with a single array, which uses only half the heap space. Also, since reading the file is the bottleneck, transforming while reading results in better CPU utilization.
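
Wrapped into a function that also closes the file, the parse-while-reading variant could be sketched as follows. `ParseWhileReading`, `parseRow` and `readMatrix` are hypothetical names. The optional `bufferSize` argument uses the `Source.fromFile(file, bufferSize)` overload raised in the comments; whether a larger buffer helps at all is an assumption to be measured, not a claim.

```scala
import java.io.File
import scala.io.{Codec, Source}

object ParseWhileReading {
  // Parses one comma-separated line into integers.
  // Throws NumberFormatException if a field is not a valid integer.
  def parseRow(line: String): Array[Int] =
    line.split(",").map(_.trim.toInt)

  // Reads the whole file into a matrix, parsing each line as soon as it is read,
  // so the raw lines are never all held in memory at once.
  def readMatrix(file: File, bufferSize: Int = Source.DefaultBufSize): Array[Array[Int]] = {
    val source = Source.fromFile(file, bufferSize)(Codec.UTF8)
    try source.getLines().map(parseRow).toArray
    finally source.close()
  }
}
```

A call such as `ParseWhileReading.readMatrix(new File("file.txt"), bufferSize = 64 * 1024)` would try a 64 KB buffer; the `toArray` call forces the iterator before the `finally` block closes the file.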

Psychogenic answered 28/4, 2014 at 16:7 Comment(0)
