I have a large amount of data (a few terabytes) and it is still accumulating... The data are contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging, plus additional transformations) over observations/rows based on a series of predicate statements, then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit too large for it. Some candidate solutions are to:
- write the whole thing in C (or Fortran)
- import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions; see the sketch after this list)
- write the whole thing in Python
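To make (3) concrete, here is roughly the shape I have in mind: bulk-load each tab-delimited file with Python's standard `sqlite3` and `csv` modules, let the database do the grouping, and iterate over the result for the transformations SQL can't express. The schema, file name, and predicate below are invented placeholders, so this is only a sketch:

```python
import csv
import sqlite3

conn = sqlite3.connect("observations.db")
conn.execute("CREATE TABLE IF NOT EXISTS obs (grp TEXT, x REAL, y REAL)")

# Bulk-load one ~30MB tab-delimited file; executemany streams the rows in.
with open("chunk_0001.txt", newline="") as f:
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?)",
                     csv.reader(f, delimiter="\t"))
conn.commit()

# Let SQLite evaluate the predicate and aggregation; iterate over the
# cursor to apply whatever transformations don't fit in SQL.
query = "SELECT grp, SUM(x), AVG(y) FROM obs WHERE x > 0 GROUP BY grp"
for grp, total_x, mean_y in conn.execute(query):
    pass  # non-SQL transformations would go here
conn.close()
```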
Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case, since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I suspect I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations on further considerations or suggestions? Thanks
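To illustrate why I expect this to be I/O-bound, (4) would essentially be one streaming pass per file with running sums kept in a dict, so memory stays flat regardless of the total data size. The column positions and the predicate here are made up:

```python
import csv
import glob
from collections import defaultdict

sums = defaultdict(float)
counts = defaultdict(int)

# Stream every file row by row; nothing is ever held in memory beyond
# the running aggregates, so disk reads dominate the runtime.
for path in glob.glob("data/*.txt"):
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            grp, x = row[0], float(row[1])
            if x > 0:  # stand-in for the real predicate
                sums[grp] += x
                counts[grp] += 1

# Save the aggregates back out as tab-delimited text.
with open("summary.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for grp in sorted(sums):
        writer.writerow([grp, sums[grp], sums[grp] / counts[grp]])
```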
**Edit:** Thanks for your responses. There seem to be conflicting opinions about Hadoop, but in any case I don't have access to a cluster (though I can use several unnetworked machines)...
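Since the files are independent of one another, splitting the work by file seems feasible without Hadoop: give each machine a fixed slice of the file list by hand, use a process pool on each one, and merge the per-file partial sums at the end. A rough sketch (the every-third-file split and the predicate are just illustrations):

```python
import csv
import glob
from collections import Counter
from multiprocessing import Pool

def summarize(path):
    """Per-file partial sums/counts; partials merge trivially afterwards."""
    sums, counts = Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            grp, x = row[0], float(row[1])
            if x > 0:  # stand-in for the real predicate
                sums[grp] += x
                counts[grp] += 1
    return sums, counts

if __name__ == "__main__":
    # Say this is machine 1 of 3: it takes every third file, and a
    # process pool keeps all of its cores busy while files are read.
    files = sorted(glob.glob("data/*.txt"))[0::3]
    with Pool() as pool:
        partials = pool.map(summarize, files)
    total_sums, total_counts = Counter(), Counter()
    for s, c in partials:
        total_sums.update(s)
        total_counts.update(c)
```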