Ruby: start reading at arbitrary point in large file

4

3

I have some log files I would like to sift through. The content is exactly what you would expect in a log file: many single lines of comma-separated text. The files are about 4 gigs each. File.each_line or foreach takes about 20 minutes for one of them.

Since a simple foreach seems... simple (and slow), I was thinking that two separate threads might be able to work on the same file if I could only tell them where to start. But based on my (limited) knowledge, I can't decide if this is even possible.

Is there a way to start reading the file at an arbitrary line?

Fannyfanon answered 5/11, 2010 at 2:55 Comment(0)
1

For lines, it might be a bit difficult, but you can seek within a file to a certain byte.

IO#seek and IO#pos will both allow you to seek to a given byte within the file.
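
A minimal sketch of that approach (the filename and byte offset here are made-up examples): seek to a byte offset, discard the rest of the partial line you land in, then read whole lines from there.

offset = 2_000_000_000            # e.g. roughly the middle of a ~4 GB file

File.open('some.log', 'r') do |f|
  f.seek(offset, IO::SEEK_SET)    # jump to an arbitrary byte
  f.gets                          # discard the (probably partial) current line; returns nil at EOF
  f.each_line do |line|           # every line from here on is complete
    # process line
  end
end

The same idea lets you split the file into N byte ranges and hand each range to its own reader.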

Ferocious answered 5/11, 2010 at 3:5 Comment(1)
Follow the seek or pos with a readline to read to the end of the current line, and the file will be poised to continue reading complete lines from that point on. Be sure to trap for EOF in case you position close to the end of the file and don't encounter a line-end character before EOF occurs. – Putdown
3

To see what sort of difference slurping the entire file at once vs. reading it line by line makes, I tested against a file that is about 99 MB, with over 1,000,000 lines.

greg-mbp-wireless:Desktop greg$ wc filelist.txt 
 1003002 1657573 99392863 filelist.txt

I put the following loop into a Ruby file and ran it from the command line with the time command:

# slurp the whole file into one string, then iterate over its lines
# (String#lines took a block in older Rubies; use each_line on current versions)
IO.read(ARGV.first).lines { |l|
}

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 

real    0m1.411s
user    0m0.653s
sys     0m0.169s

Then I changed it to use IO.readlines and timed that too:

# IO.readlines reads every line into an array up front; the block is ignored.
# For true line-at-a-time reading, use IO.foreach.
IO.readlines(ARGV.first) { |l|
}

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 

real    0m1.053s
user    0m0.741s
sys     0m0.278s

I'm not sure why, but the readlines version is faster. That might be tied to memory allocation, since Ruby tries to load the entire file into RAM in the first example, or maybe it was an anomaly since I only did the test once for each file. Using a read with an explicit filesize might be faster, as Ruby will know how much it's going to need to allocate in advance.

And that was all I needed to test this:

fcontent = ''
File.open(ARGV.first, 'r') do |fi|
  fsize = fi.size              # total size of the file in bytes
  fcontent = fi.read(fsize)    # read it all in a single call of known size
end
puts fcontent.size

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 
99392863

real    0m0.168s
user    0m0.010s
sys     0m0.156s

Looks like knowing how much needs to be read makes quite a difference.

Adding back in the loop over the string buffer results in this:

File.open(ARGV.first, 'r') do |fi|
  fsize = fi.size
  fi.read(fsize).lines { |l| 
  }
end

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 

real    0m0.732s
user    0m0.572s
sys     0m0.158s

That's still an improvement.

If you used a Queue and fed it from a thread that was responsible for reading the file, then consumed the queue from whatever processes the incoming text, you might see a higher overall throughput.
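
A rough sketch of that reader-thread-plus-Queue arrangement (the queue size and worker count are arbitrary choices, not from the original answer):

require 'thread'

queue   = SizedQueue.new(10_000)      # bounded queue so the reader can't run far ahead of the workers
workers = 4.times.map do
  Thread.new do
    while (line = queue.pop)          # nil is the stop signal
      # parse the comma-separated line here
    end
  end
end

File.foreach(ARGV.first) { |line| queue << line }
workers.size.times { queue << nil }   # send one stop signal per worker
workers.each(&:join)

Note that on MRI the global interpreter lock limits CPU parallelism, so the gain mostly comes from overlapping file I/O with parsing.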

Putdown answered 5/11, 2010 at 4:15 Comment(4)
Tin Man, I found that read(fsize).lines is indeed much faster than readlines, but it appears that read can only read as many bytes at a time as the maximum value of long. If the size of a long is only 32 bits, then this method is only capable of reading roughly 1 GB at a time, and multiple iterations of read are needed to read larger files efficiently. – Corrody
I suspect the underlying OS is unable to read in 1 GB chunks anyway. That seems like a huge value for a single "slurp". Try varying the size of the blocks being read and look at how long it takes to read. Odds are good it won't scale smoothly, and a smaller value might end up being faster. Without the source or thorough documentation on the underlying OS's IO drivers, it'll take experimenting. – Putdown
Tin Man, I suspect you're right about the performance, but what I mean is that, on certain platforms, the above implementation isn't capable of reading 4 GB files at all, fast or slow. It reports an error that it can't convert Bignum (the type of fsize if the file is sufficiently large) to long (which is apparently what read requires). It needs to read 1 GB max at a time over several iterations due to this conversion issue. But yes, as you said, the block size probably isn't 1 GB anyway, and experimenting with the number of bytes read at a time might lead to further performance gains. – Corrody
If you read a file, then read it again, it will be much faster even with the same technique, as Linux will cache the file's content on the first read. Subsequent reads will not touch the disk but read from RAM. For a proper test you have to clear any cache between tests; use # sync; echo 3 > /proc/sys/vm/drop_caches. – Catharinecatharsis
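
For completeness, here is one way to do the chunked reading discussed in these comments. It is only a sketch: the 64 MB chunk size is an arbitrary value to experiment with, and it assumes ASCII-compatible log data.

CHUNK = 64 * 1024 * 1024                    # arbitrary; vary this and measure

File.open(ARGV.first, 'r') do |f|
  until f.eof?
    chunk = f.read(CHUNK)                   # read one block of known size
    chunk << (f.gets || '') unless f.eof?   # append the rest of the partial last line
    chunk.each_line do |line|
      # process line
    end
  end
end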
2

If you want to start at a specific line in the file, I would recommend just shelling out to tail.

excerpt = `tail -m +5000 filename.log`

This would give you the contents of filename.log from line 5000 to the end of the file.
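
If the excerpt is too big to hold comfortably in one string, the same idea can be streamed; a rough sketch using IO.popen (with the -n form of the flag that a comment below mentions, and the same placeholder line number):

IO.popen(['tail', '-n', '+5000', 'filename.log']) do |io|
  io.each_line do |line|
    # handle each line as it arrives instead of building one large string
  end
end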

Sprout answered 9/11, 2010 at 17:32 Comment(2)
Obviously, that is a very performant way of grabbing a specific portion of the file as well. :) – Sprout
For me (Ubuntu) it was tail -n +5000 filename.log (-n, not -m). – Soph
0

Try faster_csv if you haven't already, and if that's still too slow, use something that has native C extensions, like this one: http://github.com/wwood/excelsior

Aliphatic answered 5/11, 2010 at 5:41 Comment(0)
