I have a problem where I need to download, unzip, and then process, line by line, a very large CSV file. To give you an idea of how large the file is:
- big_file.zip ~700 MB
- big_file.csv ~23 GB
Here are some things I'd like to happen:
- Don't have to download the whole file before unzipping
- Don't have to unzip whole file before parsing csv lines
- Don't use up very much memory/disk while doing all this
I don't know if that's possible or not. Here's what I was thinking:
require 'open-uri'
require 'zip' # the rubyzip gem is loaded with require 'zip'
require 'csv'

open('http://foo.bar/big_file.zip') do |zipped|
  Zip::InputStream.open(zipped) do |unzipped|
    # parentheses so the assignment happens before the && check
    sleep 10 until (entry = unzipped.get_next_entry) && entry.name == 'big_file.csv'
    CSV.foreach(unzipped) do |row|
      # process the row, maybe write out to STDOUT or some file
    end
  end
end
Here are the problems I know about:
- open-uri reads the whole response and saves it into a Tempfile, which is no good with a file this size. I'd probably need to use Net::HTTP directly, but I'm not sure how to do that and still get an IO.
- I don't know how fast the download is going to be or if Zip::InputStream works the way I've shown it working. Can it unzip some of the file when it's not all there yet?
- Will CSV.foreach work with rubyzip's InputStream? Does it behave enough like File that it will be able to parse out the rows? Will it freak out if it wants to read but the buffer is empty?
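For the first problem, this is roughly the shape I have in mind: a minimal sketch, assuming an IO.pipe fed from a background thread is an acceptable stand-in for the IO that open-uri would have handed me. The helper names stream_to_io and http_body_as_io are made up for illustration, and I have not confirmed that Zip::InputStream is happy reading from a non-seekable pipe.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: runs the given block on a background thread with the
# write end of a pipe and returns the read end, which is a plain IO.
def stream_to_io(&producer)
  reader, writer = IO.pipe
  reader.binmode
  writer.binmode
  Thread.new do
    begin
      producer.call(writer)
    ensure
      writer.close # signals EOF to the reader, even if the producer raised
    end
  end
  reader
end

# Hypothetical helper: passes the HTTP response body to the pipe chunk by
# chunk instead of collecting it in memory or in a Tempfile.
def http_body_as_io(url)
  uri = URI(url)
  stream_to_io do |writer|
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request_get(uri.request_uri) do |response|
        response.read_body { |chunk| writer.write(chunk) }
      end
    end
  end
end

# Usage with fake chunks standing in for the download, so no network is needed.
io = stream_to_io do |writer|
  ['first chunk, ', 'second chunk'].each { |chunk| writer.write(chunk) }
end
puts io.read
```

If I understand pipes correctly, writer.write blocks once the pipe's buffer is full, so the download should only run as far ahead as the reader allows, which is what would keep memory use small. The real call would be something like http_body_as_io('http://foo.bar/big_file.zip'), with the result passed on to the unzip step.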
I don't know if any of this is the right approach. Maybe some EventMachine solution would be better (although I've never used EventMachine before, but if it works better for something like this, I'm all for it).
funzip if there was only one file in the zip (or the one I wanted was first) but that's not the case. – Scholem