General techniques to work with huge amounts of data on a non-super computer

I'm taking some AI classes and have learned about some basic algorithms that I want to experiment with. I have gotten access to several data sets containing lots of great real-world data through Kaggle, which hosts data analysis competitions.

I have tried entering several competitions to improve my machine learning skills, but have not been able to find a good way to access the data in my code. Kaggle provides one large data file per competition, 50-200 MB, in CSV format.

What is the best way to load and use these tables in my code? My first instinct was to use databases, so I tried loading the CSV into a single SQLite database, but this put a tremendous load on my computer and it was common for my computer to crash during the commits. Next, I tried a MySQL server on a shared host, but queries against it took forever and made my analysis code really slow. Plus, I am afraid I will exceed my bandwidth.

In my classes so far, my instructors usually clean up the data and give us manageable datasets that can be loaded completely into RAM. Obviously this is not possible for my current interests. Please suggest how I should proceed. I am currently using a four-year-old MacBook with 4 GB of RAM and a dual-core 2.1 GHz CPU.

By the way, I am hoping to do the bulk of my analysis in Python, as I know this language the best. I'd like a solution that allows me to do all or nearly all coding in this language.

Permanganate answered 18/8, 2011 at 22:20 Comment(3)
SQLite should easily handle a 200MB database on a 4GB machine. If your machine is crashing there is something wrong. Fix that first.Lammond
A lot of the Kaggle competitors are using R. Python is a wonderful language for many purposes, but R has a significant edge in the statistical and modeling tools available for it. This isn't a plug for R, per se, as much as it is a plug for adding more languages to your skill set in order to get more things done.Aiken
I think I'm missing something. How is it "not possible" to load a 200 MB text file into your 4 GB of RAM?Trulatrull
18

Prototype--that's the most important thing when working with big data. Sensibly carve the data up so that you can load it into memory and access it with an interpreter--e.g., Python or R. That's the best way to create and refine your analytics process flow at scale.

In other words, trim your multi-GB data files down until they are small enough to run command-line analytics on.

Here's the workflow I use to do that--surely not the best way to do it, but it is one way, and it works:

I. Use the lazy loading methods (hopefully) available in your language of choice to read in large data files, particularly those exceeding about 1 GB. I would then recommend processing this data stream according to the techniques I discuss below, and finally storing the fully pre-processed data in a Data Mart, or intermediate staging container.

One example using Python to lazy load a large data file:

# 'some_filename' is the full path of a data file whose size
# exceeds the memory on the box it resides on.

import tokenize

data_reader = open(some_filename, 'r')
# generate_tokens takes a readline callable and yields tokens lazily,
# so the file is never pulled into memory all at once
tokens = tokenize.generate_tokens(data_reader.readline)
next(tokens)            # returns the next token, along with its position in the file
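
If the file is already clean CSV, the standard-library csv module gives the same lazy, row-at-a-time behaviour; a minimal sketch (the file name is just a placeholder):

import csv

# stream a large CSV one row at a time; only the current row is ever in memory
with open('train.csv') as f:
    reader = csv.reader(f)
    header = next(reader)               # grab the column names
    n_rows = sum(1 for row in reader)   # e.g., just count the data rows
print(n_rows)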


II. Whiten and Recast:

  • Recast your columns storing categorical variables (e.g., Male/Female) as integers (e.g., -1, 1). Maintain a look-up table (the same hash you used for the conversion, but with keys and values swapped) to convert these integers back to human-readable string labels as the last step in your analytic workflow;

  • Whiten your data--i.e., "normalize" the columns that hold continuous data. Both of these steps will substantially reduce the size of your data set--without introducing any noise. A concomitant benefit of whitening is that it prevents analytics errors caused by over-weighting one feature relative to another. A minimal sketch of both steps appears just after this list.
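
Here is that sketch, on a toy column (the column names and values are just illustrative):

import numpy as NP

# recast a categorical column (e.g., Male/Female) to -1/1 and keep a reverse
# look-up table for converting back at the end of the workflow
sex = NP.array(['M', 'F', 'F', 'M', 'F'])
encoding = {'M': -1, 'F': 1}
decoding = dict((v, k) for k, v in encoding.items())   # integers back to labels
sex_recast = NP.array([encoding[s] for s in sex])

# whiten a continuous column: zero mean, unit variance
age = NP.array([23., 41., 35., 29., 52.])
age_whitened = (age - age.mean()) / age.std()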

III. Sampling: trim your data length-wise--i.e., work with a random sample of the rows rather than every row.
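
For example, a minimal sketch (the file names and the 10% fraction are just placeholders) that keeps roughly 10% of the rows of a large CSV while streaming it, so the full file never has to fit in memory:

import random

sample_fraction = 0.10
with open('train.csv') as src, open('train_sample.csv', 'w') as dst:
    dst.write(next(src))                      # copy the header row
    for line in src:
        if random.random() < sample_fraction:
            dst.write(line)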


IV. Dimension Reduction: the orthogonal analogue to sampling. Identify the variables (columns/fields/features) that have no influence or de minimis influence on the dependent variable (a.k.a., the 'outcomes' or response variable) and eliminate them from your working data cube.

Principal Component Analysis (PCA) is a simple and reliable technique to do this:

import numpy as NP
from scipy import linalg as LA

D = NP.random.randn(8, 5)       # a simulated data set: 8 observations, 5 variables
# calculate the correlation matrix of the 5 variables (columns):
R = NP.corrcoef(D, rowvar=False)
# calculate the eigenvalues/eigenvectors of that symmetric matrix:
eigval, eigvec = LA.eigh(R)
# sort the eigenvalues in descending order:
eigval = NP.sort(eigval)[::-1]
# make a value-proportion table:
cs = NP.cumsum(eigval)/NP.sum(eigval)
print("{0}\t{1}".format('eigenvalue', 'var proportion'))
for i in range(len(eigval)):
    print("{0:.2f}\t\t{1:.2f}".format(eigval[i], cs[i]))

  eigenvalue    var proportion
    2.22        0.44
    1.81        0.81
    0.67        0.94
    0.23        0.99
    0.06        1.00

So as you can see, the first three eigenvalues account for 94% of the variance observed in the original data. Depending on your purpose, you can often trim the original data matrix, D, by removing the last two columns:

D = D[:,:-2]


V. Data Mart Storage: insert a layer between your permanent storage (Data Warehouse) and your analytics process flow. In other words, rely heavily on data marts/data cubes--a 'staging area' that sits between your Data Warehouse and your analytics app layer. This data mart is a much better IO layer for your analytics apps. R's 'data frame' or 'data table' (from the CRAN package of the same name) are good candidates. I also strongly recommend redis--blazing-fast reads, terse semantics, and zero configuration make it an excellent choice for this use case. redis will easily handle datasets of the size you mentioned in your question. Using the hash data structure in redis, for instance, you can have the same structure and the same relational flexibility as MySQL or SQLite without the tedious configuration. Another advantage: unlike SQLite, redis is in fact a database server. I am actually a big fan of SQLite, but I believe redis just works better here for the reasons I just gave.

from redis import Redis

r0 = Redis(db=0)
# store one user record as a redis hash, keyed by user id
r0.hmset("user_id:100143321", {'sex': 'M', 'status': 'registered_user',
                               'traffic_source': 'affiliate',
                               'page_views_per_session': 17,
                               'total_purchases': 28.15})
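
Reading that record back is just as terse (a sketch, assuming the same key as above):

from redis import Redis

r0 = Redis(db=0)
user = r0.hgetall("user_id:100143321")   # the whole hash as a field -> value mapping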
Simmer answered 23/8, 2011 at 0:22 Comment(2)
I was under the impression that tokenize parses Python source code. How are you using it to parse arbitrary data? Have you found it to be flexible enough for this?Retroactive
Well, I use tokenize in, e.g., a pre-processing step--to parse data that might not be clean enough for a csv parser. In particular, calling 'next' on the tokenize generator returns not just the next piece of data but also its position w/r/t the document, which I find very useful for diagnosing irregularly formatted data files. No question about tokenize's primary use case though--source-code parsing is the only use case in the docs, so I do wonder whether my use of it in my answer is misleading.Simmer
2

200 megabytes isn't a particularly large file for loading into a database. You might want to try to split the input file into smaller files.

split -l 50000 your-input-filename

The split utility will split your input file into multiple files of whatever size you like. I used 50000 lines per file above. It's a standard Unix/Linux command-line utility, and it ships with macOS as well.

Local installs of PostgreSQL or even MySQL might be a better choice than SQLite for what you're doing.

If you don't want to load the data into a database, you can take subsets of it using command-line utilities like grep, awk, and sed. (Or scripting languages like Python, Ruby, and Perl.) Pipe the subsets into your program.
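
For example, a minimal sketch of the Python end of such a pipe (the grep pattern and script name are just illustrative) -- the shell does the subsetting and the script only ever sees the piped rows:

# run as:  grep 'some-pattern' your-input-filename | python count_subset.py
import csv
import sys

n_rows = 0
for row in csv.reader(sys.stdin):
    n_rows += 1          # replace with your own per-row analysis
print(n_rows)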

Lexicologist answered 18/8, 2011 at 22:33 Comment(0)
1

You need pandas for this. You have probably figured that out by now, but anyone else facing the same issue may still benefit from this answer: you don't have to load the data into any RDBMS. Keep it in a file and load it with a simple pandas call into a DataFrame. Here is the link for the pandas library --> http://pandas.pydata.org/
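
A minimal sketch (the file name is assumed) -- read the whole CSV if it fits in RAM, or stream it in chunks if it doesn't:

import pandas as pd

df = pd.read_csv('train.csv')                       # fine for a 50-200 MB file

# if the file is too big for RAM, iterate over it in chunks instead
total_rows = 0
for chunk in pd.read_csv('train.csv', chunksize=100000):
    total_rows += len(chunk)                        # swap in your own per-chunk analysis
print(total_rows)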

If the data is too large, you need a cluster for that: Apache Spark or Mahout, both of which can run on the Amazon EC2 cloud. Buy some capacity there and it is easy to use. Spark also has an API for Python.
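
A minimal PySpark sketch (the file name is assumed, and Spark itself has to be installed locally or provisioned on EC2 first):

from pyspark.sql import SparkSession

# read the competition CSV through Spark's Python API
spark = SparkSession.builder.appName("kaggle-example").getOrCreate()
df = spark.read.csv("train.csv", header=True, inferSchema=True)
print(df.count())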

Congratulant answered 14/9, 2014 at 6:0 Comment(0)
1

I loaded a 2 GB Kaggle dataset in R using H2O. H2O is built on Java and creates a virtual environment; the data is held in memory, and access is much faster because of that. You will just have to get used to H2O's syntax. It has many beautifully built ML algorithms and provides good support for distributed computing. You can also use all of your CPU cores easily. Check out h2o.ai for how to use it. It can handle 200 MB easily, given that you only have 4 GB of RAM. You should upgrade to 8 GB or 16 GB.
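
H2O also has a Python API; a minimal sketch (the file name is assumed, and the h2o package must be installed):

import h2o

h2o.init()                               # starts (or connects to) a local H2O instance
frame = h2o.import_file("train.csv")     # the data is loaded into H2O's own memory
print(frame.nrow, frame.ncol)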

Autosome answered 4/3, 2017 at 4:48 Comment(0)
0

The general technique is to "divide and conquer". If you can split your data into parts and process them separately, then a single machine can handle it. Some tasks can be solved that way (PageRank, Naive Bayes, HMMs, etc.) and some cannot (those that require global optimisation), such as logistic regression, CRFs, and many dimension-reduction techniques.
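
As an illustration, a minimal sketch of the pattern (the file name and the 'label' column are hypothetical): process a big CSV in independent chunks -- here just class counts, the kind of partial result Naive Bayes training can be built from -- and merge the partial results at the end:

from collections import Counter
import csv

def chunk_counts(rows):
    # a per-chunk computation that needs no global state
    return Counter(row['label'] for row in rows)

partials = []
chunk = []
with open('train.csv') as f:
    for row in csv.DictReader(f):
        chunk.append(row)
        if len(chunk) == 100000:
            partials.append(chunk_counts(chunk))
            chunk = []
if chunk:
    partials.append(chunk_counts(chunk))

totals = sum(partials, Counter())        # combine the independent partial results
print(totals)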

Goral answered 25/8, 2011 at 11:15 Comment(0)
