Feature/Variable importance after a PCA analysis

I have performed a PCA analysis on my original dataset, and from the compressed dataset transformed by the PCA I have selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify which of the original features matter in the reduced dataset. How do I find out which features are important and which are not among the remaining principal components after the dimensionality reduction? Here is my code:

from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)

Furthermore, I also tried to run a clustering algorithm on the reduced dataset, but surprisingly the score is lower than on the original dataset. How is that possible?

Vapor answered 11/6, 2018 at 10:49 Comment(1)
For your second question: when you reduce the dimensionality, you lose some information that is available in the original data set. So it is no surprise (in most cases) that you fail to achieve better performance compared with the high-dimensional setting.Alenaalene
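
If you want to check this yourself, here is a minimal sketch comparing a clustering score in both settings. The use of KMeans and the silhouette score are my assumptions (not from the question), and it reuses scaledDataset and projection from the question's code; also note that scores computed in different feature spaces are not strictly comparable.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# cluster on the full scaled data and on the PCA projection
labels_full = kmeans.fit_predict(scaledDataset)
labels_reduced = kmeans.fit_predict(projection)

print(silhouette_score(scaledDataset, labels_full))
print(silhouette_score(projection, labels_reduced))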

First of all, I assume that by "features" you mean the variables, not the samples/observations. In that case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data.

Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.


Overview:

PART 1: I explain how to check the importance of the features and how to plot a biplot.

PART 2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.


PART 1:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target
# In general, it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)    

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    # score: the projected data (PC scores); coeff: the loadings (eigenvectors)
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # rescale the scores so they fit inside the [-1, 1] square of the loadings
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)
    # draw one arrow per original feature: its loadings on PC1 and PC2
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function. Use only the first 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()

Visualize what's going on using the biplot

[Biplot of the iris data: PC1 vs PC2 scores with one loading arrow per original feature]


Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude, higher importance).

Let's first see how much variance each PC explains.

pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]

PC1 explains 72% and PC2 23%. Together, if we keep PC1 and PC2 only, they explain 95%.
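
If you want the cumulative figure directly, here is a minimal sketch using the pca object fitted above:

print(np.cumsum(pca.explained_variance_ratio_))
# roughly [0.728, 0.958, 0.995, 1.0] for this iris example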

Now, let's find the most important features.

print(abs(pca.components_))

[[0.52237162 0.26335492 0.58125401 0.56561105]
 [0.37231836 0.92555649 0.02109478 0.06541577]
 [0.72101681 0.24203288 0.14089226 0.6338014 ]
 [0.26199559 0.12413481 0.80115427 0.52354627]]

Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible from the biplot (which is why we often use this plot to summarize the information in a visual way).

To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues (in sklearn the components are sorted by explained_variance_). The larger these absolute values are, the more a specific feature contributes to that principal component.
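
As a minimal sketch of that rule (reusing the pca object and the iris data loaded above; k and feature_names are just names I chose here), you could rank the features for each retained component like this:

k = 2  # number of components you decided to keep
feature_names = iris.feature_names

for i in range(k):
    loadings = np.abs(pca.components_[i])   # absolute loadings of PC i+1
    ranking = np.argsort(loadings)[::-1]    # feature indices, most important first
    print("PC{}:".format(i + 1), [(feature_names[j], round(loadings[j], 3)) for j in ranking])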


PART 2:

The important features are the ones that influence the components more and thus have a large absolute value/score on the component.

To get the most important feature on each PC by name and save the results into a pandas dataframe, use this:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

This prints:

     0  1
 0  PC1  e
 1  PC2  d

So on PC1 the feature named 'e' is the most important, and on PC2 it is 'd'.
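
If you also want to keep the magnitude of the winning loading next to the name, here is a small sketch on top of the same code (most_important_values and df_detailed are my own additions):

# absolute loading of the winning feature on each component
most_important_values = [np.abs(model.components_[i]).max() for i in range(n_pcs)]

df_detailed = pd.DataFrame({'PC': ['PC{}'.format(i + 1) for i in range(n_pcs)],
                            'feature': most_important_names,
                            'loading': most_important_values})
print(df_detailed)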



Nice article as well here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

Siemens answered 13/6, 2018 at 20:24 Comment(14)
Thank you @Siemens for the answer. This makes complete sense, but if I decide that keeping the first 3 PCs (instead of only PC1) is good enough, is selecting among [-0.72101681, 0.24203288, 0.14089226, 0.6338014] (the 3rd row) still meaningful for finding the most important features for that number of PCs? Moreover, as "important" would you only choose the features that have a positive loading, or is there a more accurate decision criterion?Vapor
Hello. You should keep PC1 and PC2, and this would be sufficient because they explain 95% of the variance. See my updated answer. Personally, I would not look at PC3 since it explains only 3%! Consider upvoting my answer. CheersSiemens
Yes, but I already know how many PCs I have to keep. The problem is still to find the important features for PCA(n_components=2); maybe I didn't get your point. Suppose I keep 3 PCs: do I have to look at the 3rd row of pca.components_ to get the relevance of each original feature for those PCs I want to keep?Vapor
You have to understand something important first. Each feature influences each PC in a different way. This means that you can only draw conclusions like the following: features 1, 3 and 4 are the most important/have the highest influence on PC1, and feature 2 is the most important/has the highest influence on PC2, etc. for N components. In my example, I would draw conclusions like these ONLY for PC1 and PC2, because these 2 PCs together explain 95% of the variance. Is it clear now?Siemens
Since I still have less than 15 reputation, the feedback is recorded but not publicly visible yet. It will be soon :)Vapor
Hi, @serafeim. I understand the second part is to find the most important feature on each component, but what if I want to rank the features using the average value of the weights across the PCs? Could you please give me some hints? Many thanks.Extender
@Extender You want to rank the features based on their weights on both (or multiple) components? This does not make much sense, but you could first take the absolute value of the PCA loadings, np.abs(pca.components_), and then take the average for each feature like: np.mean(np.abs(pca.components_), axis=0)Siemens
Hi thanks, but why this doesn't make sense? I want to identify the most important features but my data has no labels, this is the only way I can think ofExtender
It does not make sense to look only at the average loading coefficient across all PCs. In my second part, if you have no labels you can define some, e.g. ['0','1','2','3','4'], and then you can eventually map these labels back to the initial unlabeled features. For example, '0' would be the first column/feature in your dataset.Siemens
@serafeim Hi, by 'label' I mean my dataset is an unsupervised problem. I could use random forest or something to find the important features if it were a classification problem, but for my data I can only perform clustering, and the only thing I can think of for finding important features is calculating the average weights after PCA. Does this make sense?Extender
Feature importance is usually defined with respect to predicting a target variable (this can be classification or regression). In your case, you do not have a target variable, only a set of features. You can use PCA of course, and again, you can define names for these features (this is what I meant by labels), do a PCA and interpret the results as I have explained in my answer. You could say, for example, that the feature named 'a' is the most important for PC1 and PC1 explains 80% of the variance, etc.Siemens
@Siemens, great answer. May I ask for clarification on one point: how do you actually extract the most important features from the PCs? Each PC contains all features (just with different weights). I am wondering if it is possible at all, or if you set a manual threshold like "top 10 features of each significant PC".Forfar
Depending on the application, you could define the top 10 features of each PC. In that case, you would look at the top 10 features that have the highest absolute value for each component.Siemens
If I may make a suggestion: wouldn't it make sense to weight each feature by the variance explained, i.e. importance = np.abs(pca.components_).T.dot(explained_var)? That is, for feature 1 you get the weighted sum of the absolute component-feature-1 value times the ratio of explained variance for each component, e.g. 0.52*0.72 + 0.37*0.23 + 0.72*0.04 + 0.26*0.005 ≈ 0.49?Wellborn
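
A minimal sketch of the weighting idea from the last comment, assuming the pca object fitted on the scaled iris data in PART 1 above:

# weight the absolute loadings of each feature by the variance ratio of each PC
# and sum over the components; a higher value means a larger overall contribution
importance = np.abs(pca.components_).T.dot(pca.explained_variance_ratio_)
for name, value in zip(iris.feature_names, importance):
    print(name, round(value, 3))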

The pca library contains this functionality.

pip install pca

A demonstration of how to extract the feature importance is as follows:

# Import libraries
import numpy as np
import pandas as pd
from pca import pca

# Let's create a dataset with features that have decreasing variance.
# We want to extract feature f1 as the most important, followed by f2, etc.
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)

# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])

# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)

# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])

#     PC      feature
# 0  PC1      f1
# 1  PC2      f2
# 2  PC3      f3
# 3  PC4      f4
# 4  PC5      f5
# 5  PC6      f6
# 6  PC7      f7
# 7  PC8      f8
# 8  PC9      f9

Plot the explained variance

model.plot()

[Explained variance plot]

Make the biplot. It can be nicely seen that the feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second most variance (f2) is almost vertical. This is expected, because most of the variance is in f1, followed by f2, etc.

ax = model.biplot(n_feat=10, legend=False)

[Biplot]

Biplot in 3D. Here we see the expected f3 appear in the z-direction.

ax = model.biplot3d(n_feat=10, legend=False)

[3D biplot]

Tantrum answered 1/7, 2020 at 21:5 Comment(2)
How do you know most of the variance is in feature 1? @TantrumSpellman
Because the data for f1 is created in the range 0-100: f1=np.random.randint(0,100,250)Tantrum
# original_num_df is the original numeric dataframe
# pca is the fitted sklearn PCA model
import numpy as np
import pandas as pd

def create_importance_dataframe(pca, original_num_df):

    # Change the pca components ndarray to a dataframe
    importance_df = pd.DataFrame(pca.components_)

    # Assign columns
    importance_df.columns  = original_num_df.columns

    # Change to absolute values
    importance_df = importance_df.apply(np.abs)

    # Transpose
    importance_df = importance_df.transpose()

    # Change column names again

    ## First get number of pcs
    num_pcs = importance_df.shape[1]

    ## Generate the new column names
    new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]

    ## Now rename
    importance_df.columns = new_columns

    # Return importance df
    return importance_df

# Call function to create importance df
importance_df = create_importance_dataframe(pca, original_num_df)

# Show first few rows
display(importance_df.head())

# Sort depending on PC of interest

## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending = False)[:10]
print(), print(f'PC1 top 10 features are \n')
display(pc1_top_10_features)

## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending = False)[:10]
print(), print(f'PC2 top 10 features are \n')
display(pc2_top_10_features)
Roee answered 2/6, 2020 at 5:3 Comment(1)
It might be more efficient to transpose and get the absolute value on the numpy array, before creating the DataFrame.Disposed
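
A minimal sketch of that suggestion (same pca model and original_num_df as above), taking the absolute value and transposing on the numpy array before building the DataFrame once, already in its final shape:

import numpy as np
import pandas as pd

importance_df = pd.DataFrame(np.abs(pca.components_).T,
                             index=original_num_df.columns,
                             columns=[f'PC{i}' for i in range(1, pca.components_.shape[0] + 1)])
print(importance_df.head())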
