Why do LASSO results differ between sklearn (Python) and the MATLAB statistics toolbox?
I am using LassoCV from sklearn to select the best model by cross-validation. I found that cross-validation gives different results depending on whether I use sklearn or the MATLAB statistics toolbox.

I used MATLAB to replicate the example given in http://www.mathworks.se/help/stats/lasso-and-elastic-net.html and got a figure like this:

[Figure: MATLAB lassoPlot output, coefficient traces versus lambda on a log scale]

Then I saved the MATLAB data and tried to replicate the figure with lasso_path from sklearn. I got:

[Figure: coefficient paths from sklearn lasso_path versus alpha on a log scale]

Although the two figures are similar, there are clear differences. As far as I understand, the parameter lambda in MATLAB and alpha in sklearn are the same, but the figures suggest otherwise. Can somebody point out which one is correct, or what I am missing? The coefficients obtained are also different, which is my main concern.

Matlab Code:

rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
      X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;

save randomData.mat % To be used in python code

[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');

disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictors')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)

Python Code:

import scipy.io
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, LassoCV

# Load the X and Y arrays saved by the MATLAB script
data = scipy.io.loadmat('randomData.mat')
X = data['X']
Y = data['Y'].flatten()

model = LassoCV(cv=10, max_iter=1000).fit(X, Y)
print('alpha', model.alpha_)
print('coef', model.coef_)

eps = 1e-2  # the smaller it is the longer is the path
# Current scikit-learn: lasso_path returns (alphas, coefs, dual_gaps) and does
# not fit an intercept, so the data is centered here to mimic an intercept fit.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
alphas_lasso, coefs_lasso, _ = lasso_path(Xc, Yc, eps=eps)

plt.figure(1)
ax = plt.gca()
ax.set_prop_cycle(color=['b', 'r', 'g', 'c', 'k'])
# coefs_lasso has shape (n_features, n_alphas); transpose to plot one line per feature
plt.semilogx(alphas_lasso, coefs_lasso.T)
ax.invert_xaxis()
plt.xlabel('alpha')
plt.show()
Diagnostician answered 5/10, 2012 at 12:40 Comment(1)
I can just say I recall similar findings when working on real data. The MATLAB results were different and significantly better. I didn't explore very deeply what this problem stems from, though. - Counterespionage

I do not have MATLAB, but be aware that the value obtained with cross-validation can be unstable, because it depends on how you split the samples into folds.

Even if you run the cross-validation twice in Python, you can get two different results. Consider this example:

from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# x, y: the design matrix and response loaded earlier.
# The outputs below came from an older release that also accepted normalize=True.
kf = KFold(n_splits=10, shuffle=True)
cv = LassoCV(cv=kf).fit(x, y)
print(cv.alpha_)
kf = KFold(n_splits=10, shuffle=True)
cv = LassoCV(cv=kf).fit(x, y)
print(cv.alpha_)

0.00645093258722
0.00691712356467
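
If reproducible folds are wanted, fixing the shuffle seed removes this source of variation. A minimal sketch on synthetic data (the data, the seeds, and the helper name `cv_alpha` are assumptions for illustration, not taken from the question):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Synthetic stand-in for the question's data (assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = x @ np.array([0.0, 2.0, 0.0, -3.0, 0.0]) + 0.1 * rng.normal(size=200)

def cv_alpha(seed):
    """Return the alpha selected by LassoCV for one seeded 10-fold split."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    return LassoCV(cv=kf).fit(x, y).alpha_

print(cv_alpha(0), cv_alpha(0))  # same seed, same folds
print(cv_alpha(1))               # different seed, folds change and alpha may too
```

Comparing against another package is only meaningful once the fold assignment is either fixed or averaged over several seeds.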
Surbased answered 8/12, 2013 at 18:45 Comment(0)

It's possible that alpha = lambda / n_samples,
where n_samples = X.shape[0] in scikit-learn.

Another remark: your path is not as piecewise linear as it could/should be. Consider reducing tol and increasing max_iter.

Hope this helps.
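
One way to pin down the scaling is to check it numerically. scikit-learn documents the Lasso objective as (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1. For a single feature without an intercept this has a closed-form soft-thresholding solution, so the sketch below (synthetic data and the value of alpha are assumptions) compares that formula with what `Lasso` returns:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic one-feature problem (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + 0.1 * rng.normal(size=n)
alpha = 0.5

# Closed-form minimizer of (1/(2n)) * ||y - x*w||^2 + alpha * |w|
rho = x @ y / n
w_closed = np.sign(rho) * max(abs(rho) - alpha, 0.0) / (x @ x / n)

# Same problem solved by scikit-learn's coordinate descent
w_sklearn = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10).fit(x.reshape(-1, 1), y).coef_[0]

print(w_closed, w_sklearn)
```

If the two numbers agree, the 1 / (2 * n_samples) factor is already inside scikit-learn's loss. Comparing that formula term by term with the objective in MATLAB's lasso documentation then shows whether lambda needs rescaling at all.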

Gothicize answered 5/10, 2012 at 16:55 Comment(3)
I guess the issue is more than normalization. I tried the above and still got different curves. Further, the coefficients obtained by cross-validation are very different. - Diagnostician
This still looks like a parameterization issue to me: the two curves look similar but shifted along the X axis. A rescaling of alpha in scikit-learn, viewed in log space, can cause this. The parameterization used in scikit-learn is given in the documentation. You can also generate more data from the same distribution, compute a regression score (e.g. the coefficient of determination r^2 or the RMSE), and check that the optimal value of alpha is close to the cross-validated value of alpha. - Pasquil
@Diagnostician have you tried alpha = lambda / (2 * X.shape[0])? - Pasquil
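
The held-out check suggested in the comment above could look roughly like this. It is a sketch under assumptions: the generator `sample` is hypothetical and only mimics the question's setup (exponential features, coefficients [0, 2, 0, -3, 0]); it does not read the MATLAB file.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
coef = np.array([0.0, 2.0, 0.0, -3.0, 0.0])

def sample(n):
    """Hypothetical generator: exponential features with scales 1..5 plus small noise."""
    X = np.column_stack([rng.exponential(scale=s, size=n) for s in range(1, 6)])
    return X, X @ coef + 0.1 * rng.normal(size=n)

X_train, y_train = sample(200)
X_test, y_test = sample(5000)  # fresh data from the same distribution

cv_model = LassoCV(cv=10).fit(X_train, y_train)

# Score every alpha on the CV grid against the fresh data.
scores = [
    r2_score(y_test, Lasso(alpha=a, max_iter=10000).fit(X_train, y_train).predict(X_test))
    for a in cv_model.alphas_
]
best_alpha = cv_model.alphas_[int(np.argmax(scores))]
cv_score = r2_score(y_test, cv_model.predict(X_test))

print("CV alpha:", cv_model.alpha_, "best held-out alpha:", best_alpha)
print("held-out R^2 at CV alpha:", cv_score, "best held-out R^2:", max(scores))
```

If the cross-validated alpha scores close to the best held-out alpha, the selection itself is behaving sensibly, and any remaining mismatch with another package is more likely a matter of scaling or preprocessing.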

I know this is an old thread, but:

I'm working on moving from glmnet (in R) to LassoCV, and I found that LassoCV doesn't handle normalizing the X matrix well on its own (even if you specify the parameter normalize = True).

Try normalizing the X matrix first when using LassoCV.

If it is a pandas object,

(X - X.mean())/X.std()

It seems you also need to multiply alpha by 2.
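
A minimal sketch of standardizing before the fit and mapping the coefficients back to the original units. The synthetic data is an assumption standing in for the question's file, and `StandardScaler` is swapped in for the pandas one-liner above:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data (assumption: not the MATLAB file).
rng = np.random.default_rng(3)
X = np.column_stack([rng.exponential(scale=s, size=200) for s in range(1, 6)])
y = X @ np.array([0.0, 2.0, 0.0, -3.0, 0.0]) + 0.1 * rng.normal(size=200)

# Standardize each column to zero mean and unit variance before fitting.
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)
model = LassoCV(cv=10).fit(X_std, y)

# Map coefficients back to the original units of X.
coef_original = model.coef_ / scaler.scale_
print("alpha:", model.alpha_)
print("coef (original scale):", coef_original)
```

Note that `StandardScaler` divides by the population standard deviation (ddof=0), while pandas `.std()` defaults to ddof=1, so the two scalings differ slightly on small samples.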

Fong answered 27/1, 2018 at 18:50 Comment(0)

Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.

These are the facts:

  • Mathworks have selected an example and decided to include it in their documentation
  • Your MATLAB code produces exactly the same result as the example.
  • The alternative does not match the result, and has provided inaccurate results in the past

This is my assumption:

  • The chance that Mathworks have put an incorrect example in their documentation is negligible compared to the chance that a reproduction of this example by other means does not give the correct result.

The logical conclusion: Your matlab implementation of this example is reliable and the other is not. This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion would be that you should continue with Matlab to select your model.

Goshawk answered 10/10, 2012 at 11:23 Comment(3)
This is a very weak argument for advocating one technology over another. sklearn also provides examples. Would they be reproducible with MATLAB code? LASSO is really more a class of solvers than a precisely defined algorithm, so it is more likely that the algorithms differ slightly. Stating that scikit-learn is not reliable based on your arguments is quite harsh. - Oast
I did not want to imply this; I have rephrased my answer slightly to be clearer. - Goshawk
Thanks for the answer. scikit-learn is indeed a well-implemented module. However, the documentation and examples are still lacking, which caused the above problem. I could solve the issue by proper normalization. - Diagnostician
