How do I use principal component analysis in supervised machine learning classification problems?

I have been working through the concepts of principal component analysis in R.

I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting out the most interesting first few principal components as numeric variables from my matrix.

The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.

More specifically, here's my real question:

I understand that principal components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone use PCA to reduce the dimensionality of a dataset, and THEN use these components with a supervised learner, say, an SVM?

I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" or "No" I haven't come across!)

Intramolecular answered 28/11, 2013 at 2:37 Comment(1)
do supervised PCA :)? – Jameljamerson

Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in your same shoes and had to hunt down the answer myself.

The goal of PCA is to represent your data X (an n × p matrix: n observations of p features) in an orthonormal basis W; the coordinates of your data in this new basis are Z, as expressed below:

X = ZW'

Because of orthonormality, we can invert W simply by transposing it and write:

XW = Z
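
As a quick sanity check of those two identities, here is a minimal R sketch (using the built-in iris data and prcomp purely for illustration): the score matrix that prcomp returns is exactly the centered data multiplied by the rotation matrix, which plays the role of W.

    pca <- prcomp(iris[, 1:4])                                # pca$rotation is W, pca$x is Z
    Xc  <- scale(iris[, 1:4], center = TRUE, scale = FALSE)   # prcomp centers X by default
    max(abs(Xc %*% pca$rotation - pca$x))                     # ~0, i.e. XW = Z
    max(abs(pca$x %*% t(pca$rotation) - Xc))                  # ~0, i.e. X = ZW'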

Now, to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest eigenvalue (i.e., the eigenvector corresponding to the largest eigenvalue comes first, and so on), this amounts to simply keeping the first k columns of W; call that p × k matrix W_k. Projecting onto it gives the reduced coordinates:

XW_k = Z_k

We now have a k-dimensional representation of our training data X. You then run your supervised classifier on the new features in Z_k; the labels Y are untouched and stay attached to their rows:

Y = f(Z_k)
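
For concreteness, here is a minimal R sketch of the training side. The iris data, the random split, the choice k = 2, and the e1071 package for the SVM are all assumptions made purely for illustration; prcomp's rotation matrix plays the role of W.

    library(e1071)                           # assumed available; provides svm()
    set.seed(1)
    idx   <- sample(nrow(iris), 100)         # arbitrary train/test split
    train <- iris[idx, ]
    test  <- iris[-idx, ]
    k     <- 2                               # number of components to keep
    pca   <- prcomp(train[, 1:4], center = TRUE, scale. = TRUE)
    Zk    <- pca$x[, 1:k]                    # Z_k: scores on the first k components
    f     <- svm(x = Zk, y = train$Species)  # labels stay attached to their rows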

The key is to realize that W_k is, in some sense, a canonical transformation from our space of p features down to a space of k features (or at least the best such transformation we could find from our training data). Thus, we can hit our test data X_test with the same W_k transformation, which yields a k-dimensional set of test features:

X_test W_k = Z_test

We can now use the same classifier f, trained on the k-dimensional representation of our training data, to make predictions on the k-dimensional representation of our test data:

Y_test = f(Z_test)
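
Continuing the sketch above (same assumed objects), predict() on a prcomp fit applies the training-set centering, scaling, and rotation to new data, which is exactly the "same W_k" step:

    Zk_test <- predict(pca, newdata = test[, 1:4])[, 1:k]   # X_test W_k = Z_test
    Y_hat   <- predict(f, Zk_test)                           # same classifier f, new features
    table(predicted = Y_hat, actual = test$Species)          # compare against the test labels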

The point of going through this whole procedure is that you may have thousands of features, but (1) not all of them carry a meaningful signal and (2) your supervised learning method may be far too expensive to train on the full feature set (either it would take too long or your computer wouldn't have enough memory for the calculations). PCA lets you dramatically reduce the number of features needed to represent your data without throwing away the structure in your data that truly adds value.

Particle answered 19/11, 2016 at 17:43 Comment(0)

After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.

This is useful when the intrinsic dimensionality of your data is much smaller than the number of original features, and when the speed-up you gain during classification is worth the possible loss in accuracy and the cost of running PCA. Also, keep in mind the limitations of PCA:

  • Because PCA is a linear transformation driven purely by variance, you implicitly assume that all input variables are expressed in comparable units; if they are not, standardize them first (the sketch below does this via scale. = TRUE).
  • Beyond variance, PCA is blind to the structure of your data. It may very well happen that the classes separate along low-variance directions; if those directions are discarded, the classifier won't learn anything from the transformed data.
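
A short R sketch of that workflow, again using prcomp and the iris data purely as stand-ins (scale. = TRUE standardizes the variables, which is one way to address the first caveat):

    idx <- seq(1, nrow(iris), by = 2)              # the "portion" used to fit the transformation
    pca <- prcomp(iris[idx, 1:4], scale. = TRUE)
    Ztr <- predict(pca, iris[idx,  1:4])[, 1:2]    # apply the same matrix to those points...
    Zte <- predict(pca, iris[-idx, 1:4])[, 1:2]    # ...and to every other point
    # Ztr / Zte, together with the untouched labels iris$Species[idx] and
    # iris$Species[-idx], are what you hand to your classifier.
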
Townie answered 28/11, 2013 at 4:49 Comment(2)
So, after I apply that matrix to each of my data points (in my training set), I then submit them to the classifier... keeping the labels associated with those data points? – Intramolecular
Exactly. The PCA transformation simply rotates your points around the origin. It does not affect their labels. – Townie
