Amazon EC2 vs PiCloud [closed]
E

5

6

We are students trying to handle a dataset of about 140 million records and run a few machine learning algorithms on it. We are new to cloud solutions and Mahout implementations. Currently the data sits in a PostgreSQL database, but this setup doesn't scale, and read/write operations are still extremely slow after extensive performance tuning. Hence we are planning to move to cloud-based services.

We have explored a few possible alternatives.

  1. Amazon cloud-based services (Mahout implementation)
  2. PiCloud with scikit-learn (we were planning to use the HDF5 format with NumPy)
  3. Please recommend any other alternatives.

Here are our questions:

  1. Which would give us better results (turnaround time) and be more cost-effective? Please mention any other alternatives.
  2. If we set up Amazon services, what format should the data be in? If we use DynamoDB, will the cost shoot up?

Thanks

Eskil answered 11/3, 2012 at 6:55 Comment(0)
A
5

PiCloud is built on top of AWS, so either way you're going to be using Amazon at the end of the day. The question is how much infrastructure you'll have to write yourself to get everything wired together. PiCloud gives some free usage to put it through its paces, so you might give it a shot initially. I haven't used it myself, but clearly they're trying to provide ease of deployment for machine-learning type applications.

It seems like you are after results rather than building a cloud project, so I would either look into one of Amazon's services other than plain EC2, or use something like PiCloud, Heroku, or another service that takes care of the bootstrapping.

Aubervilliers answered 14/3, 2012 at 4:52 Comment(0)
R
7

It depends on the nature of the machine learning problem you want to solve. I would recommend that you first subsample your dataset to something that fits in memory (e.g. 100k samples with a few hundred non-zero features per sample, assuming a sparse representation).
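One way to draw such a subsample without loading all 140 million rows is reservoir sampling, which keeps a uniform random sample of fixed size while reading the records once. The sketch below is only an illustration: `reservoir_sample` and `fake_records` are hypothetical names, and the generator stands in for whatever database cursor or file reader you actually use.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k records drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def fake_records(n: int):
    """Hypothetical stand-in for a database cursor yielding (id, feature) rows."""
    for i in range(n):
        yield (i, i % 7)


subset = reservoir_sample(fake_records(200_000), k=10_000)
print(len(subset))
```

Memory use depends only on `k`, not on the size of the full table, so the same function works whether the source holds 200 thousand or 140 million rows.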

Then try a couple of machine learning algorithms in scikit-learn that scale to a large number of samples:

  • SGDClassifier or MultinomialNB if you want to do supervised classification (if you have categorical labels to predict in your dataset)
  • SGDRegressor if you want to do supervised regression (if you have a continuous target variable to predict)
  • MiniBatchKMeans if you want to do unsupervised clustering (but then there is, by default, no objective way to quantify the quality of the resulting clusters).
  • ...
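Part of what lets these estimators scale is that they can be trained incrementally, one chunk at a time, through `partial_fit`. The following is a rough sketch on synthetic data, assuming a binary problem with labels 0 and 1; `chunk_stream` and `true_w` are made up for the illustration and would be replaced by chunks read from your own storage.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
true_w = rng.normal(size=20)  # synthetic ground-truth weights (assumption)


def chunk_stream(n_chunks=20, chunk_size=1000):
    """Hypothetical data source yielding one in-memory chunk at a time."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 20))
        y = (X @ true_w > 0).astype(int)
        yield X, y


clf = SGDClassifier(alpha=1e-4, random_state=0)
classes = np.array([0, 1])  # all labels must be declared for partial_fit

for X_chunk, y_chunk in chunk_stream():
    # Each call performs one pass of SGD over the chunk and updates the weights.
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

X_test = rng.normal(size=(2000, 20))
y_test = (X_test @ true_w > 0).astype(int)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```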

Perform a grid search to find the optimal values of the model's hyperparameters (e.g. the regularizer alpha and the number of passes n_iter for SGDClassifier; recent scikit-learn releases call the latter max_iter) and evaluate the performance using cross-validation.
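A minimal sketch of such a grid search is shown here, assuming a recent scikit-learn release where `GridSearchCV` lives in `sklearn.model_selection` and the number of passes is set through `max_iter`. The data is synthetic and the grid values are arbitrary starting points, not recommendations.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, slightly noisy binary classification data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + rng.normal(scale=0.5, size=2000) > 0).astype(int)

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2],  # regularization strength
    "max_iter": [5, 20, 50],            # number of passes over the data
}

search = GridSearchCV(
    SGDClassifier(tol=None, random_state=0),  # tol=None: run exactly max_iter passes
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

`best_score_` is the mean cross-validated accuracy of the best combination, so it is the number to compare when you later repeat the experiment on a larger subsample.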

Once done, retry with a dataset twice as large (still fitting in memory) and see whether it improves your predictive accuracy significantly. If it doesn't, don't waste your time trying to parallelize this on a cluster to run on the full dataset, as it won't yield any better results.

If it does, what you could do is shard the data into pieces, put a slice of the data on each node, train an SGDClassifier or SGDRegressor model on each node independently with PiCloud, collect the weights (coef_ and intercept_), and then average them to build the final linear model, which you evaluate on a held-out slice of your dataset.
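The averaging step can be sketched locally as follows. This is an illustration under assumptions: the data is synthetic and binary (labels 0 and 1), `fit_shard` is a hypothetical function representing the work one node would do, and the list comprehension stands in for whatever remote map call your platform provides.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic linearly separable data (assumption), split into train and held-out parts.
rng = np.random.default_rng(0)
true_w = rng.normal(size=20)
X = rng.normal(size=(8000, 20))
y = (X @ true_w > 0).astype(int)
X_train, y_train = X[:6000], y[:6000]
X_held, y_held = X[6000:], y[6000:]


def fit_shard(X_shard, y_shard):
    """Work done on one node: fit a linear model and return only its weights."""
    model = SGDClassifier(alpha=1e-4, random_state=0).fit(X_shard, y_shard)
    return model.coef_, model.intercept_


# Four shards; on a cluster each pair would be sent to a different node.
shards = zip(np.array_split(X_train, 4), np.array_split(y_train, 4))
weights = [fit_shard(Xs, ys) for Xs, ys in shards]

# Average the per-shard weights to build the final linear model.
avg_coef = np.mean([coef for coef, _ in weights], axis=0)                 # shape (1, 20)
avg_intercept = np.mean([intercept for _, intercept in weights], axis=0)  # shape (1,)

# Evaluate the averaged model on the held-out slice.
scores = X_held @ avg_coef.ravel() + avg_intercept[0]
predictions = (scores > 0).astype(int)  # positive score -> class 1
held_out_accuracy = float(np.mean(predictions == y_held))
print(f"held-out accuracy of averaged model: {held_out_accuracy:.3f}")
```

Only the small weight arrays travel back from each node, which keeps the communication cost independent of the shard size.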

To learn more about error analysis, have a look at how to plot learning curves:
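As a rough illustration of the idea, scikit-learn's `learning_curve` helper computes cross-validated scores for increasing training set sizes. The data below is synthetic and noisy by construction, and the chosen sizes are arbitrary; the resulting numbers can be passed to any plotting library.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

# Synthetic noisy binary classification data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + rng.normal(scale=2.0, size=4000) > 0).astype(int)

train_sizes, train_scores, valid_scores = learning_curve(
    SGDClassifier(alpha=1e-4, random_state=0),
    X,
    y,
    train_sizes=[0.1, 0.2, 0.4, 0.8, 1.0],  # fractions of the available training data
    cv=5,
)

# One row per training size, one column per cross-validation fold.
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_valid):
    print(f"{int(n):5d} samples: train accuracy {tr:.3f}, validation accuracy {va:.3f}")
```

If the validation score has flattened out by the largest size, more data is unlikely to help; if it is still climbing, training on a bigger sample is worth the effort.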

Roemer answered 20/7, 2012 at 8:41 Comment(0)
T
0

AWS has a program in place for supporting educational users, so you might want to do some research into that program.

Tetreault answered 11/3, 2012 at 7:45 Comment(2)
Can you please comment on the PiCloud (HDF5 with scikit-learn) vs. AWS possibilities? Eskil
No, I'm not familiar with PiCloud. Tetreault
A
0

You should take a look at numba if you are looking for some NumPy speedups: https://github.com/numba/numba

It doesn't solve your cloud scaling issue, but it may reduce compute time.

Almena answered 30/8, 2012 at 15:24 Comment(0)
W
-1

I just made a comparison between PiCloud and Amazon EC2 that might be helpful.

Wade answered 27/5, 2013 at 15:17 Comment(0)
