Prototype: that's the most important thing when working with big data. Carve the data up sensibly so that you can load it into memory and work with it in an interpreter (e.g., Python or R). That's the best way to create and refine your analytics process flow at scale.
In other words, trim your multi-GB-sized data files so that they are small enough to perform command-line analytics.
Here's the workflow I use to do that. It is surely not the best way to do it, but it is one way, and it works:
I. Use the lazy-loading methods (hopefully) available in your language of choice to read in large data files, particularly those exceeding about 1 GB. I would then recommend processing this data stream according to the techniques I discuss below, and finally storing the fully pre-processed data in a Data Mart or intermediate staging container.
One example using Python to lazy load a large data file:
# 'some_filename' is the full path to a data file whose size
# exceeds the memory of the machine it resides on.
data_reader = open(some_filename, 'r')
# a file object is a lazy iterator, so only one line is read at a time
line = next(data_reader) # returns a single line from the large data file
data_reader.close()
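To go from reading one line to processing the whole file, the same lazy iteration can be wrapped in a generator that yields fixed-size batches of rows. The sketch below is one way to do that, not the only one; `read_in_batches`, `batch_size`, and the comma-delimited file layout are assumptions made for illustration.

```python
import csv
from itertools import islice

def read_in_batches(path, batch_size=10000):
    """Lazily yield lists of at most batch_size parsed rows from a CSV file."""
    with open(path, 'r', newline='') as fh:
        reader = csv.reader(fh)  # parses one line at a time, on demand
        while True:
            # islice pulls only the next batch_size rows from the reader
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch
```

Each batch can then be recast, whitened, and written to the staging layer before the next one is read, so memory use is bounded by `batch_size` rather than by the file size.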
II. Whiten and Recast:
Recast the columns that store categorical variables (e.g., Male/Female) as integers (e.g., -1, 1). Maintain a look-up table (the same hash you used for this conversion, with the keys and values swapped) to convert these integers back to human-readable string labels as the last step in your analytic workflow.
Whiten your data, i.e., "normalize" the columns that hold continuous data. Both of these steps can substantially reduce the size of your data set without introducing any noise. A concomitant benefit of whitening is that it prevents analytics errors caused by over-weighting (for example, a column with a much larger numeric range dominating a distance calculation).
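As a rough illustration of step II, the sketch below recasts a categorical column as integers, builds the reverse look-up table, and rescales a continuous column to zero mean and unit variance. The sample values and the names `encode`, `decode`, and `whiten_column` are hypothetical, and "whitening" is taken here in the simple per-column sense used above.

```python
import numpy as NP

sex = ['Male', 'Female', 'Female', 'Male']      # hypothetical categorical column
page_views = NP.array([17.0, 3.0, 8.0, 12.0])   # hypothetical continuous column

# recast: map each label to an integer, and keep the inverse mapping
encode = {'Male': -1, 'Female': 1}
decode = {v: k for k, v in encode.items()}
sex_int = NP.array([encode[s] for s in sex], dtype=NP.int8)  # one byte per value

def whiten_column(x):
    """Rescale a 1-D array to zero mean and unit variance.

    Assumes the column is not constant (its standard deviation is non-zero).
    """
    return (x - x.mean()) / x.std()

page_views_w = whiten_column(page_views)

# last step of the workflow: convert the integers back to string labels
labels = [decode[int(i)] for i in sex_int]
```

Storing the labels as `int8` instead of strings is where the size reduction comes from; the `decode` table is all that is needed to restore them.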
III. Sampling: Trim your data length-wise, i.e., keep only a representative subset of the rows.
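When the file is too large to load, a uniform random sample of rows can still be drawn in a single pass with reservoir sampling (Algorithm R). This is a swapped-in technique, not something the workflow here prescribes; `reservoir_sample` and its parameters are names chosen for this sketch.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for n, row in enumerate(rows):
        if n < k:
            sample.append(row)       # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, n)    # uniform over 0..n, inclusive on both ends
            if j < k:
                sample[j] = row      # keep this row with probability k / (n + 1)
    return sample
```

Because the function only iterates, passing it an open file object samples lines without ever holding the whole file in memory.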
IV. Dimension Reduction: the orthogonal analogue to sampling. Identify the variables (columns/fields/features) that have no influence or de minimis influence on the dependent variable (a.k.a., the 'outcomes' or response variable) and eliminate them from your working data cube.
Principal Component Analysis (PCA) is a simple and reliable technique to do this:
import numpy as NP
from scipy import linalg as LA
D = NP.random.randn(8, 5) # a simulated data set: 8 observations, 5 variables
# calculate the correlation matrix (the covariance matrix of the
# standardized columns); rowvar=False treats each column as a variable
R = NP.corrcoef(D, rowvar=False)
# calculate the eigenvalues and eigenvectors of this symmetric matrix
eigval, eigvec = LA.eigh(R)
# sort the eigenvalues in descending order
egval = NP.sort(eigval)[::-1]
# make a value-proportion table
cs = NP.cumsum(egval)/NP.sum(egval)
print("{0}\t{1}".format('eigenvalue', 'var proportion'))
for i in range(len(egval)):
    print("{0:.2f}\t\t{1:.2f}".format(egval[i], cs[i]))
eigenvalue var proportion
2.22 0.44
1.81 0.81
0.67 0.94
0.23 0.99
0.06 1.00
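So as you can see, the first three eigenvalues account for 94% of the variance observed in the original data. Note that each eigenvalue belongs to a principal component (a linear combination of the original columns), not to a single column of D. Depending on your purpose, you can therefore often trim the data matrix from five columns to three by projecting the standardized data onto the first three principal components:
top3 = eigvec[:, NP.argsort(eigval)[::-1][:3]] # eigenvectors of the three largest eigenvalues
D = NP.dot((D - D.mean(axis=0)) / D.std(axis=0), top3) # 8 x 3 matrix of component scores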
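The cut-off of three components was read off the table by eye. A small helper can make the same choice from a variance threshold; `n_components_for` and the 0.90 threshold are assumptions for illustration, and the input is the list of rounded eigenvalues printed in the table.

```python
import numpy as NP

def n_components_for(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose cumulative variance share reaches threshold."""
    ev = NP.sort(NP.asarray(eigenvalues, dtype=float))[::-1]
    cum_share = NP.cumsum(ev) / ev.sum()
    # index of the first cumulative share >= threshold, converted to a count
    count = int(NP.searchsorted(cum_share, threshold)) + 1
    return min(count, ev.size)  # guard against floating-point shortfall at threshold 1.0

k = n_components_for([2.22, 1.81, 0.67, 0.23, 0.06])  # rounded eigenvalues from the table
```

With these eigenvalues and a 0.90 threshold, `k` is 3, which matches the 94% figure in the table.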
V. Data Mart Storage: insert a layer between your permanent storage (Data Warehouse) and your analytics process flow. In other words, rely heavily on data marts/data cubes: a 'staging area' that sits between your Data Warehouse and your analytics app layer. This data mart is a much better IO layer for your analytics apps. R's 'data frame' or 'data.table' (from the CRAN package of the same name) are good candidates. I also strongly recommend Redis: blazing-fast reads, terse semantics, and zero configuration make it an excellent choice for this use case. Redis will easily handle datasets of the size you mentioned in your Question. Using the hash data structure in Redis, for instance, you can have the same record structure and much of the flexibility of MySQL or SQLite without the tedious configuration. Another advantage: unlike SQLite, Redis is in fact a database server. I am actually a big fan of SQLite, but I believe Redis just works better here for the reasons I just gave.
from redis import Redis
r0 = Redis(db=0)
# store one user's record as a Redis hash, keyed by the user ID
r0.hset("user_id:100143321", mapping={"sex": "M", "status": "registered_user",
    "traffic_source": "affiliate", "page_views_per_session": 17,
    "total_purchases": 28.15})