To see what sort of difference slurping the entire file at once makes versus reading it line-by-line, I tested against a file that is about 99MB, with over 1,000,000 lines.
greg-mbp-wireless:Desktop greg$ wc filelist.txt
1003002 1657573 99392863 filelist.txt
I put the following loop into a Ruby file and ran it from the command line with the time command:
IO.read(ARGV.first).each_line { |l|
}
greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt
real 0m1.411s
user 0m0.653s
sys 0m0.169s
Then I changed it to split the file into lines as it is read, using IO.readlines, and timed that too:
IO.readlines(ARGV.first) { |l|
}
greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt
real 0m1.053s
user 0m0.741s
sys 0m0.278s
I'm not sure why, but reading line by line was faster in wall-clock time, although its user and sys times were slightly higher. That might be tied to memory allocation as Ruby tries to load the entire file into RAM in the first example, or maybe it was an anomaly since I only ran the test once for each version. Using a read with an explicit filesize might be faster as Ruby will know how much it's going to need to allocate in advance.
And that was all I needed to test this:
fcontent = ''
File.open(ARGV.first, 'r') do |fi|
  fsize = fi.size
  fcontent = fi.read(fsize)
end
puts fcontent.size
greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt
99392863
real 0m0.168s
user 0m0.010s
sys 0m0.156s
Looks like knowing how much needs to be read makes quite a difference.
Adding back in the loop over the string buffer results in this:
File.open(ARGV.first, 'r') do |fi|
  fsize = fi.size
  fi.read(fsize).each_line { |l|
  }
end
greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt
real 0m0.732s
user 0m0.572s
sys 0m0.158s
That's still an improvement.
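Since each variant above was timed only once, repeating the measurement may give a steadier picture. Here is a rough sketch of how that could be done inside Ruby itself. The helper names best_time and compare_reads are made up for illustration, and File.foreach is included as an assumption about what a streaming, one-line-at-a-time read would look like; it was not part of the timings above.

```ruby
# Hypothetical helper: runs the given block `runs` times and returns the
# fastest wall-clock time in seconds.
def best_time(runs)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end.min
end

# Hypothetical helper: times three ways of walking the lines of a file.
def compare_reads(path, runs: 3)
  {
    # Slurp the whole file, then iterate over the string.
    'File.read + each_line' => best_time(runs) { File.read(path).each_line { |l| } },
    # Stream the file one line at a time (not measured in the text above).
    'File.foreach' => best_time(runs) { File.foreach(path) { |l| } },
    # Slurp with an explicit size, then iterate over the string.
    'read(size) + each_line' => best_time(runs) do
      File.open(path, 'r') { |fi| fi.read(fi.size).each_line { |l| } }
    end
  }
end
```

Calling compare_reads(ARGV.first).each { |name, secs| puts "#{name}: #{secs.round(3)}s" } would print one line per strategy. Results will vary with the file, the Ruby version, and whether the OS already has the file cached.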
If you fed a Queue from a thread that was responsible for reading the file, then consumed the queue from whatever processes the incoming text, you might see a higher overall throughput.
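A minimal sketch of that reader-thread idea, under a few assumptions of mine: the method name process_with_queue, the :done sentinel, and the use of SizedQueue (so the reader can't race far ahead of the consumer) are all illustrative choices, not something tested above.

```ruby
# Hypothetical helper: a reader thread pushes lines onto a bounded queue
# while the calling thread consumes them. Returns the number of lines seen.
def process_with_queue(path, max_pending: 1_000)
  queue = SizedQueue.new(max_pending) # bounded, so memory use stays limited

  reader = Thread.new do
    begin
      File.foreach(path) { |line| queue << line }
    ensure
      queue << :done # sentinel: tells the consumer there is nothing more
    end
  end

  count = 0
  while (line = queue.pop) != :done
    yield line # whatever processes the incoming text
    count += 1
  end

  reader.join # re-raises any error the reader thread hit
  count
end
```

Usage would look like process_with_queue(ARGV.first) { |line| do_something(line) }, where do_something stands in for your own processing. Whether this actually beats a plain loop depends on how much time the consumer spends per line, so it is worth timing both.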
Use seek or pos with a readline to read to the end of the current line, and the file will be poised to continue reading complete lines from that point on. Be sure to trap for EOF in case you position close to the end of the file and don't encounter a line-end character before EOF occurs. – Putdown
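The seek-then-readline approach could be sketched roughly like this. The method name each_line_from is hypothetical, and the sketch assumes that any offset greater than zero lands in the middle of a line, so the first readline result is thrown away.

```ruby
# Hypothetical helper: jumps to a byte offset, discards the rest of the line
# the offset landed in, then yields every complete line after that point.
def each_line_from(path, offset)
  File.open(path, 'r') do |fi|
    fi.seek(offset)
    begin
      # Read to the end of the current (partial) line and ignore it.
      # Assumption: an offset that happens to sit exactly at the start of
      # a line causes that whole line to be skipped as well.
      fi.readline if offset > 0
    rescue EOFError
      return # positioned at or past the end of the file: nothing to read
    end
    fi.each_line { |line| yield line }
  end
end
```

For example, each_line_from(ARGV.first, 50_000_000) { |l| puts l } would start printing at the first full line after byte 50,000,000.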