Gaussian Mixture Models: Difference between Spark MLlib and scikit-learn

I'm trying to use Gaussian mixture models on a sample of a dataset. I used both MLlib (through pyspark) and scikit-learn, and I get very different results, with the scikit-learn ones looking more realistic.

import pandas as pd  # needed for pd.DataFrame below

from pyspark.mllib.clustering import GaussianMixture as SparkGaussianMixture
from sklearn.mixture import GaussianMixture
from pyspark.mllib.linalg import Vectors

Scikit-learn:

local = pd.DataFrame([ x.asDict() for x in df.sample(0.0001).collect() ])
model1 = GaussianMixture(n_components=3)
model1.fit([ [x] for x in local['field'].tolist() ])

model1.means_
array([[7.56123598e+00],
       [1.32517410e+07],
       [3.96762639e+04]])

model1.covariances_
array([[[6.65177423e+00]],
       [[1.00000000e-06]],
       [[8.38380897e+10]]])

MLlib:

# `spark` is the active SparkSession (createDataFrame is not on the SparkContext)
model2 = SparkGaussianMixture.train(
    spark.createDataFrame(local).rdd.map(lambda x: Vectors.dense(x.field)),
    k=3,
    convergenceTol=1e-4,
    maxIterations=100
)

model2.gaussians
[MultivariateGaussian(mu=DenseVector([28736.5113]), sigma=DenseMatrix(1, 1, [1094083795.0001], 0)),
 MultivariateGaussian(mu=DenseVector([7839059.9208]), sigma=DenseMatrix(1, 1, [38775218707109.83], 0)),
 MultivariateGaussian(mu=DenseVector([43.8723]), sigma=DenseMatrix(1, 1, [608204.4711], 0))]

However, I'm interested in running the entire dataset through the model, which I'm afraid would require parallelizing (and hence using MLlib) to get results in finite time. Am I doing anything wrong / missing something?

Data:

The complete data has an extremely long tail and looks like this: [histogram of the full dataset]

whereas the data has a clearly normal distribution, centered somewhere close to the one clustered by scikit-learn:

[zoomed-in histogram showing a roughly normal distribution]

I am using Spark 2.3.0 (AWS EMR).

Edit: Initialization params:

local = pd.DataFrame([ x.asDict() for x in df.sample(0.0001).collect() ])
model1 = GaussianMixture(n_components=3, init_params='random')
model1.fit([ [x] for x in local['field'].tolist() ])

model1.means_
array([[2.17611913e+04],
       [8.03184505e+06],
       [7.56871801e+00]])

model1.covariances_
array([[[1.01835902e+09]],
       [[3.98552130e+13]],
       [[6.95161493e+00]]])
Nakada answered 18/6, 2018 at 18:49 Comment(11)
I can see that at least the mu values are similar to three clusters in both approaches. However, these values can depend heavily on the initialization. – Mathre
The mu values are actually not even close. Updated the question with plots. – Nakada
scikit-learn uses k-means by default to initialize the weights, which could be related. – Nichol
@shaido updated the post with scikit-learn using random init. It does not change anything. – Nakada
Can you post a snapshot of your dataset, or some dummy data to begin with? – Idioplasm
When you say "would require parallelizing (and hence use MLlib)", do you mean on multiple machines (i.e. a cluster)? – Monoplegia
@Monoplegia yes, exactly; the dataset is HUGE. – Nakada
Is there a statistical reason why you couldn't just subsample the dataset? – Stodgy
@JeffEllen this result is on a (very small, compared to the dataset) sample; is that what you mean? – Nakada
@MohammedKashif I'm afraid I cannot do that, though I can try to generate some similar data; lemme try... – Nakada
@ixaxaar, a similar dataset would also work :-) – Idioplasm

This isn't a Python problem, per se; it seems to be more of a machine learning / data validation / data segmentation question, IMO. That being said, you are correct in thinking that you must parallelize your work, but how you do it matters. There are techniques such as 8-bit quantization and model parallelism that you can look into to help you get what you're after: training a model on large datasets, in a timely manner, without sacrificing data quality or fidelity.

Here is a blog post about quantization: https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
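To make the idea concrete, here is a minimal sketch of 8-bit linear quantization of an array of floats; the function names and the use of NumPy are my own for illustration, not taken from the linked posts:

import numpy as np

def quantize_uint8(x):
    # Map floats linearly onto the 256 representable uint8 values.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    # Recover an approximation of the original floats.
    return q.astype(np.float64) * scale + lo

Applied to model weights (or feature columns), this trades some precision for a roughly 4x smaller memory footprint and less data movement.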

Here is a blog post about model parallelism and 8-bit quantization from Tim Dettmers' Blog: http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/

and the associated paper: https://arxiv.org/pdf/1511.04561.pdf

Though you will want to keep in mind that, depending on the floating-point capabilities of your GPU, you may not see a substantial benefit from this route: https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664

HTH and YMMV.

Also, you may want to look into data folding, but I can't remember the details or the paper I read at this point in time. I'll leave this here as a reminder to fill in once I find it.

Telegu answered 10/7, 2018 at 20:29 Comment(0)

I really don't know which type of EM algorithm is used in scikit-learn or in Spark, but I know for sure that if they use SEM (Stochastic Expectation Maximization), it should converge faster than plain EM (see this).
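For reference, here is a minimal sketch of SEM for a 1-D Gaussian mixture, assuming x is a 1-D NumPy array; this is purely illustrative and not the implementation used by scikit-learn or Spark MLlib:

import numpy as np

def sem_gmm_1d(x, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # random initialization
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters
        log_p = (-0.5 * (x[:, None] - means) ** 2 / variances
                 - 0.5 * np.log(2 * np.pi * variances)
                 + np.log(weights))
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # S-step: sample a hard assignment per point; this sampling is
        # what distinguishes SEM from plain EM
        z = (p.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
        # M-step: re-estimate parameters from the sampled assignments
        for j in range(k):
            members = x[z == j]
            if len(members) > 1:
                weights[j] = len(members) / n
                means[j] = members.mean()
                variances[j] = members.var() + 1e-6
    return weights, means, variances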

However, a multiple-starts technique is always suggested, to avoid saddle points and local maxima.
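In scikit-learn, multiple starts are built in via the n_init parameter; a minimal sketch, reusing the data construction from the question:

from sklearn.mixture import GaussianMixture

# fit 10 models from different random starts and keep the one with
# the best lower bound on the log-likelihood
data = [[x] for x in local['field'].tolist()]
model = GaussianMixture(n_components=3, init_params='random', n_init=10)
model.fit(data)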

I really don't understand your plots; they are on different scales. Is the second a zoom-in of the first? By the way, I suggest choosing the number of components k by BIC (Bayesian Information Criterion).
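A minimal sketch of such a BIC sweep with scikit-learn, using the same data as above (lower BIC is better):

from sklearn.mixture import GaussianMixture

# fit mixtures with k = 1..8 components and keep the k with the lowest BIC
bics = {k: GaussianMixture(n_components=k, n_init=5).fit(data).bic(data)
        for k in range(1, 9)}
best_k = min(bics, key=bics.get)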

Compliment answered 2/8, 2019 at 16:10 Comment(1)
Please provide a code example for this specific situation! – Autopilot
