How can I tell when my dataset in R is going to be too large?

I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a key-value store, maybe?). So I am wondering how to tell ahead of time how much room my data is going to take up in RAM, and whether I will have enough. I know how much RAM I have (not a huge amount: 3 GB under XP), and I know how many rows and columns my logfile will end up as and what data types the column entries ought to be (which presumably I need to check as it reads them in).

How do I put this together into a go/no-go decision for undertaking the analysis in R? (Presumably R needs some free RAM to do operations, as well as to hold the data!) My immediate required output is a bunch of simple summary stats, frequencies, contingencies, etc., so I could probably write some kind of parser/tabulator that will give me the output I need in the short term, but I also want to play around with lots of different approaches to this data as a next step, so I am looking at the feasibility of using R.
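
To make this concrete, the kind of back-of-the-envelope check I have in mind looks something like this (the file path, column classes and row counts below are just placeholders):

    # Read a small sample of the logfile, measure it, and scale up.
    sample_rows <- 10000
    total_rows  <- 5e6                      # I know roughly how many rows to expect

    sample_df <- read.table("logfile.txt", nrows = sample_rows,
                            colClasses = c("character", "factor", "numeric"))

    bytes_per_row <- as.numeric(object.size(sample_df)) / sample_rows
    estimated_gb  <- bytes_per_row * total_rows / 1024^3
    estimated_gb                            # compare against the ~3 GB I have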

I have seen lots of useful advice about large datasets in R here, which I have read and will reread, but for now I would like to understand better how to figure out whether I should (a) go there at all, (b) go there but expect to have to do some extra stuff to make it manageable, or (c) run away before it's too late and do something in some other language/environment (suggestions welcome...!). thanks!

Gizzard asked 7/10, 2012 at 8:57

R is well suited to big datasets, either using out-of-the-box solutions like bigmemory or the ff package (especially read.csv.ffdf), or by processing your stuff in chunks using your own scripts. In almost all cases a little programming makes processing large datasets (much larger than memory, say 100 GB) very possible. Doing this kind of programming yourself takes some time to learn (I don't know your level), but it makes you really flexible. Whether this is your cup of tea, or whether you need to run away, depends on the time you want to invest in learning these skills. But once you have them, they will make your life as a data analyst much easier.
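
For example, a minimal sketch of the ff route (the file name and the 'status' column are placeholders for whatever your logfile contains):

    library(ff)

    # read.csv.ffdf reads the file in chunks (next.rows rows at a time) and keeps
    # the result on disk, so the dataset does not have to fit in RAM
    logs <- read.csv.ffdf(file = "logfile.csv", header = TRUE, next.rows = 100000)

    dim(logs)             # rows and columns of the on-disk data frame
    # individual columns can be pulled into RAM one at a time for simple summaries
    table(logs$status[])  # e.g. a frequency table of a 'status' column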

In regard to analyzing logfiles, I know that the stats pages generated from Call of Duty 4 (a multiplayer computer game) work by parsing the log file iteratively into a database and then retrieving the statistics per user from the database. See here for an example of the interface. The iterative (in chunks) approach means that logfile size is (almost) unlimited. However, getting good performance is not trivial.
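
A minimal base-R sketch of that chunked idea (not the actual Call of Duty pipeline; it assumes a whitespace-delimited logfile whose first field is a user id):

    con    <- file("logfile.txt", open = "r")   # placeholder path
    counts <- integer(0)                        # running per-user line counts

    repeat {
      lines <- readLines(con, n = 100000)       # one chunk of lines
      if (length(lines) == 0) break
      users <- sapply(strsplit(lines, "[[:space:]]+"), `[`, 1)
      chunk_counts <- table(users)
      old <- counts[names(chunk_counts)]        # existing totals (NA if unseen)
      old[is.na(old)] <- 0
      counts[names(chunk_counts)] <- old + as.integer(chunk_counts)
    }
    close(con)

    head(sort(counts, decreasing = TRUE))       # most active users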

A lot of the stuff you can do in R you can also do in Python or Matlab, even C++ or Fortran. But only if that tool has out-of-the-box support for what you want would I see a distinct advantage of it over R. For processing large data, see the HPC Task View. See also an earlier answer of mine about reading a very large text file in chunks. Other related links that might be interesting for you:

In regard to choosing R or some other tool, I'd say if it's good enough for Google it is good enough for me ;).

Merrile answered 7/10, 2012 at 9:20 (5 comments)
Very useful advice around the issues involved, thanks Paul. Re the job-sizing question, I got a very specific reply on Quora, which is the rule of thumb that the memory needed = dataset size * 4 or 5 (a quick worked version appears after these comments): link - Gizzard
In addition, if this answers your question it is customary to tick the green checkmark as a sign that the question has been answered. - Merrile
Paul, re cross-posting: do you think there is overlap between Quora and StackOverflow readers? I don't, or I wouldn't have cross-posted it. But I could be wrong. Re the green tick, your answer was really useful but it didn't actually directly address my question, which was to do with job sizing. The Quora reply did address my question, with a rule of thumb, which is why I posted a ref to it, so that people with the same question could find an answer. I will tick your answer to signify 'case closed', and thank you for sharing your expertise; I found your answer valuable. - Gizzard
@HeatherStark The guy who answered your question is active on SO (stackoverflow.com/users/608489/patrick-burns) and last visited the site yesterday. I think there is overlap, just as there is overlap between R-help and SO. - Merrile
@HeatherStark Good to hear you found my answer valuable, thanks for the compliment. In the title, your question relates only to the RAM size needed for a particular problem. However, in the post itself it seemed to me that your question was a bit broader, more about whether R is useful for big data and whether there are other tools. In addition, you asked when your dataset would be too big (in the title). My answer was that, with a bit of programming, there is no hard limit. - Merrile
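
Putting the rule of thumb from the first comment into numbers, a quick worked check (the on-disk dataset size is a made-up figure):

    dataset_gb   <- 1.5                  # size of the parsed dataset, made up
    available_gb <- 3                    # RAM on the XP machine

    needed_gb <- dataset_gb * c(4, 5)    # rule of thumb: 4 to 5 times the data size
    needed_gb                            # 6.0 7.5
    any(needed_gb <= available_gb)       # FALSE: too big to hold comfortably in RAM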
