Importance of PCA or SVD in machine learning

All this time (especially during the Netflix contest), I keep coming across blog posts and leaderboard forums where people mention how applying a simple SVD step to the data helped them reduce sparsity or, in general, improved the performance of their algorithm. I have been trying to work out why for a long time, but I can't. In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling techniques such as log transformation and mean normalization. But how does something like SVD help? So let's say I have a huge matrix of users rating movies, and on this matrix I implement some version of a recommendation system (say collaborative filtering):

1) Without SVD
2) With SVD

How does it help?

Eleonoraeleonore asked 6/3, 2012 at 19:0 Comment(2)
By "performance", do you mean speed or accuracy?Breaking
@larsmans Hi.. I meant accuracyEleonoraeleonore

SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. For example, if you have two variables, one a humidity index and the other the probability of rain, their correlation is so high that the second one does not contribute any additional information useful for a classification or regression task. The singular values in SVD help you determine which variables are most informative, and which ones you can do without.

The way it works is simple. You perform SVD on your training data (call it matrix A) to obtain U, S and V*. Then set to zero all values of S smaller than a certain (arbitrary) threshold, e.g. 0.1, and call this new matrix S'. Then obtain A' = US'V* and use A' as your new training data. The components associated with the zeroed singular values carry (almost) no information and can be dropped, sometimes without any performance penalty (depending on your data and the threshold chosen). This is called (rank-k) truncated SVD.
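For illustration, here is a minimal numpy sketch of the truncation step described above. The toy matrix is made up, and the threshold is taken relative to the largest singular value so that the example actually drops something (the absolute 0.1 from the text works the same way):

    import numpy as np

    # Toy "training data": 6 samples x 4 features; column 2 is roughly 2x column 1,
    # so the matrix is close to rank 3 (made-up numbers).
    A = np.array([
        [2.0, 4.1, 1.0, 0.0],
        [1.0, 2.0, 0.0, 1.0],
        [3.0, 6.1, 1.0, 0.5],
        [0.5, 1.0, 2.0, 1.0],
        [4.0, 8.2, 0.0, 0.0],
        [2.5, 5.0, 1.0, 2.0],
    ])

    # SVD: A = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Zero out the small singular values (threshold chosen arbitrarily).
    S_trunc = np.where(S >= 0.1 * S.max(), S, 0.0)

    # Reconstruct A' = U S' V* and use it as the new training data.
    A_prime = U @ np.diag(S_trunc) @ Vt

    print("kept", int(np.count_nonzero(S_trunc)), "of", len(S), "singular values")
    print("reconstruction error:", np.linalg.norm(A - A_prime))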

SVD doesn't help you with sparsity, though; it only helps when features are redundant. Two features can be both sparse and informative (relevant) for a prediction task, so you can't remove either one.

Using SVD, you go from n features to k features, where each of the k is a linear combination of the original n. It's a dimensionality reduction step, just like feature selection is. When redundant features are present, though, a feature selection algorithm may lead to better classification performance than SVD, depending on your data set (for example, maximum-entropy feature selection). Weka comes with a bunch of them.
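If you want the k combined features directly (each a linear combination of the original n) rather than the reconstructed matrix A', scikit-learn's TruncatedSVD does that in one step. A small sketch, where the random data, the deliberately redundant column and the choice k = 2 are all arbitrary:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    rng = np.random.default_rng(0)
    X = rng.random((100, 10))      # 100 samples, 10 features
    X[:, 1] = 2.0 * X[:, 0]        # make one feature deliberately redundant

    svd = TruncatedSVD(n_components=2)
    X_k = svd.fit_transform(X)     # 100 x 2: each new column mixes the original 10

    print(X_k.shape)
    print(svd.explained_variance_ratio_)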

See: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition

https://stats.stackexchange.com/questions/33142/what-happens-when-you-apply-svd-to-a-collaborative-filtering-problem-what-is-th

Overburden answered 6/3, 2012 at 19:9 Comment(11)
While it's true that SVD does dimensionality reduction, it's not really a feature selection step as you describe it. I believe it's more commonly used to speed up training algorithms. – Breaking
@larsmans: Can you please explain a bit more? How does it help? In the Netflix data, and in general, the data is always sparse (the curse of dimensionality), so how does running an SVD help? – Eleonoraeleonore
@larsmans: I don't think it's used to speed up the learning phase, as you describe it. It is indeed used for feature selection. – Overburden
@Franz: I edited my answer to explain a bit about sparsity. – Overburden
Andrew Ng in his online ML class describes it as a speedup measure. – Breaking
@larsmans: I am not sure what you are talking about. Andrew Ng describes it precisely as a dimensionality reduction technique. Can you please explain how it can be used as a speedup measure? Can you give us another reference? – Overburden
It's not feature selection, since doing an SVD from n to k features will not necessarily give you a subset of size k of the original n features. The speedup point is obvious: if a classifier has to be fit using an optimization routine whose complexity is linear in the number of features (and most are), then a smaller number of features gives a speedup, assuming you can compute the SVD quickly. Besides, @Edouard's answer seems much more to the point w.r.t. collaborative filtering. – Breaking
So, when you say "a smaller number of features gives a speedup", how is that different from dimensionality reduction? You are right, it will run faster, but most importantly, it may perform better (lower error rate). – Overburden
Is it a fair (and very oversimplified) statement to say that SVD shares similarities with factor analysis, in that both handle dimensionality reduction on their own type of data? – Paduasoy
@Diego, well, it's not feature selection, because feature selection means selecting a subset of the features at hand, while SVD maps the features into a lower dimension rather than picking a subset of them. – Mutt
Is there any specific set of guidelines regarding what the threshold should be? I.e., how generalizable is the 0.01 value? – Constant

The Singular Value Decomposition is often used to approximate a matrix X by a low rank matrix X_lr:

  1. Compute the SVD X = U D V^T.
  2. Form the matrix D' by keeping the k largest singular values and setting the others to zero.
  3. Form the matrix X_lr by X_lr = U D' V^T.

The matrix X_lr is then the best rank-k approximation of X in the Frobenius norm (the matrix analogue of the l2 norm). This representation is computationally convenient: if your matrix X is n by n and k << n, you can store the low-rank approximation with only (2n + 1)k coefficients, by keeping just the first k columns of U and V and the k nonzero entries of D'.
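A minimal numpy sketch of those three steps, using a made-up noisy matrix that is approximately rank 2 and an arbitrary choice of k = 2:

    import numpy as np

    n, k = 100, 2
    rng = np.random.default_rng(0)

    # A noisy matrix that is approximately rank 2 (made-up data).
    X = rng.random((n, k)) @ rng.random((k, n)) + 0.01 * rng.random((n, n))

    # 1. Compute the SVD X = U D V^T.
    U, D, Vt = np.linalg.svd(X, full_matrices=False)

    # 2. Keep the k largest singular values, set the others to zero.
    D_k = np.concatenate([D[:k], np.zeros(len(D) - k)])

    # 3. Form X_lr = U D' V^T.
    X_lr = U @ np.diag(D_k) @ Vt

    print("Frobenius error:", np.linalg.norm(X - X_lr))
    # Storing U[:, :k], D[:k] and Vt[:k, :] takes (2n + 1) * k numbers instead of n * n.
    print("dense entries:", n * n, "low-rank entries:", (2 * n + 1) * k)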

This was often used in matrix completion problems (such as collaborative filtering), because the true matrix of user ratings is assumed to be low rank (or well approximated by a low-rank matrix), so you wish to recover the true matrix by computing the best low-rank approximation of your data matrix. However, there are now better ways to recover low-rank matrices from noisy and missing observations, namely nuclear norm minimization. See, for example, the paper "The power of convex relaxation: Near-optimal matrix completion" by E. Candès and T. Tao.

(Note: the algorithms derived from this technique also store the SVD of the estimated matrix, but it is computed differently).
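The following is not the nuclear-norm method from the paper, just a naive sketch of the matrix-completion idea built on the same low-rank approximation: repeatedly refill the missing entries from the current rank-k approximation while keeping the observed entries fixed (a "hard impute" style iteration). The toy ratings matrix, the missing rate and k = 2 are all made up for the example:

    import numpy as np

    def complete_matrix(X_obs, mask, k=2, n_iters=50):
        # mask is True where an entry is observed; missing entries start at 0.
        X_hat = np.where(mask, X_obs, 0.0)
        for _ in range(n_iters):
            U, D, Vt = np.linalg.svd(X_hat, full_matrices=False)
            X_lr = U[:, :k] @ np.diag(D[:k]) @ Vt[:k, :]
            # Keep the observed ratings, replace the missing ones with the low-rank guess.
            X_hat = np.where(mask, X_obs, X_lr)
        return X_lr

    rng = np.random.default_rng(0)
    true = rng.random((20, 2)) @ rng.random((2, 15))   # rank-2 "true" ratings
    mask = rng.random(true.shape) > 0.4                # hide about 40% of the entries
    estimate = complete_matrix(np.where(mask, true, 0.0), mask, k=2)
    print("RMSE on hidden entries:", np.sqrt(np.mean((estimate - true)[~mask] ** 2)))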

Ethiopian answered 7/3, 2012 at 15:36 Comment(1)
Under this approach, if the matrix X is originally m x n, your reduced-rank matrix X_lr will still be m x n. If your goal is dimensionality reduction rather than matrix completion, you use U or V^T as your new training set (depending on whether your samples are oriented row-wise or column-wise in X), not X_lr. – Anuran

PCA or SVD, when used for dimensionality reduction, reduces the number of inputs. Besides saving the computational cost of learning and/or prediction, this can sometimes produce more robust models that are not optimal in a statistical sense but perform better in noisy conditions.

Mathematically, simpler models have less variance, i.e. they are less prone to overfitting. Underfitting, of course, can be a problem too; this is known as the bias-variance dilemma. Or, as Einstein put it in plain words: things should be made as simple as possible, but not simpler.
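As an illustration of the "fewer inputs" point, here is a small scikit-learn sketch that puts PCA in front of a classifier. The digits dataset and n_components = 20 are arbitrary choices, and the reduced model is not guaranteed to win; the point is simply that it sees far fewer inputs:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)

    full = LogisticRegression(max_iter=2000)
    reduced = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=2000))

    # Compare cross-validated accuracy on 64 raw pixels vs. 20 principal components.
    print("64 raw features   :", cross_val_score(full, X, y, cv=5).mean())
    print("20 PCA components :", cross_val_score(reduced, X, y, cv=5).mean())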

Diastasis answered 16/12, 2016 at 12:5 Comment(0)
