Principal components analysis using pandas dataframe

2

69

How can I calculate Principal Components Analysis from data in a pandas dataframe?

Posen answered 25/4, 2014 at 0:22 Comment(1)
I guess you too are trying to modify the w3schools example :)Downtown
105

Most sklearn objects work with pandas DataFrames just fine. Would something like this work for you?

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

df = pd.DataFrame(data=np.random.normal(0, 1, (20, 10)))

pca = PCA(n_components=5)
pca.fit(df)

You can access the components themselves with:

pca.components_ 
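
As a rough sketch of what the fitted object exposes, the snippet below repeats the fit above and prints the shape of `components_`, the `explained_variance_ratio_` attribute, and the result of `transform`. The seeded generator `rng` and the variable `scores` are illustrative names that are not part of the answer.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Seeded random data so the run is repeatable (the seed is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 1, (20, 10)))

pca = PCA(n_components=5)
pca.fit(df)

# One row per component, one column per original feature.
print(pca.components_.shape)

# Fraction of the total variance captured by each component, largest first.
print(pca.explained_variance_ratio_)

# Project the original rows onto the five components.
scores = pca.transform(df)
print(scores.shape)
```

With 20 rows, 10 columns and `n_components=5`, `components_` should have shape (5, 10) and the projected data shape (20, 5).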
Izaguirre answered 25/4, 2014 at 0:42 Comment(2)
This works great. Just an addition that might be of interest: it's often convenient to end up with a DataFrame as well, as opposed to an array. To do that one would do something like: pandas.DataFrame(pca.transform(df), columns=['PCA%i' % i for i in range(n_components)], index=df.index), where I've set n_components=5. Also, you have a typo in the text above the code, "panadas" should be "pandas". :)Giantism
In my case I wanted the components, not the transform, so taking @Moot's syntax I used df = pandas.DataFrame(pca.components_). One last note also, is that if you are going to try to use this new df with a dot product, make sure to check out this link: [#16473229Superfetation
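
The two comments above could be combined into something like the following sketch. It wraps both the projected data and the components in labelled DataFrames, then reproduces the projection by hand with a dot product. The names `pc_names`, `scores`, `components` and `manual` are illustrative only. Two things to watch when doing the dot product yourself: PCA centers the data before projecting, and `DataFrame.dot` aligns on labels, so the columns of the left operand have to match the index of the right operand.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

n_components = 5
rng = np.random.default_rng(0)  # seeded only to make the run repeatable
df = pd.DataFrame(rng.normal(0, 1, (20, 10)))

pca = PCA(n_components=n_components).fit(df)
pc_names = ['PCA%i' % i for i in range(n_components)]

# Projected data: same index as df, one column per component.
scores = pd.DataFrame(pca.transform(df), columns=pc_names, index=df.index)

# Components: one row per component, one column per original feature.
components = pd.DataFrame(pca.components_, index=pc_names, columns=df.columns)

# Manual projection: subtract the per-column means that PCA learned, then
# multiply by the transposed components. The labels line up because
# components.T is indexed by df.columns.
manual = (df - pca.mean_).dot(components.T)
print(np.allclose(manual, scores))
```

This equivalence holds for the default `whiten=False`; with whitening enabled, `transform` additionally rescales each component.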
8
import pandas
from sklearn.decomposition import PCA
import numpy
import matplotlib.pyplot as plot

df = pandas.DataFrame(data=numpy.random.normal(0, 1, (20, 10)))

# Standardize each column to zero mean and unit variance so that features on
# different scales contribute comparably (PCA itself only centers the data)
df_normalized = (df - df.mean()) / df.std()
pca = PCA(n_components=df.shape[1])
pca.fit(df_normalized)

# Reformat and view results: one row per original column, one column per component
loadings = pandas.DataFrame(pca.components_.T,
                            columns=['PC%s' % i for i in range(len(df_normalized.columns))],
                            index=df.columns)
print(loadings)

plot.plot(pca.explained_variance_ratio_)
plot.ylabel('Explained Variance Ratio')
plot.xlabel('Components')
plot.show()
Bellwort answered 1/8, 2021 at 21:36 Comment(2)
The whiten=True argument to PCA does the normalization for you, if you need it at all.Proponent
When in doubt, you normalize, otherwise you could have two different scales for your data. For example, if you had age in one column and population in another, those are two different measure scales and would need to be normalized in order to run PCA.Bellwort
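
To make the scaling discussion concrete, here is a small sketch on made-up data with an `age` column and a `population` column (both hypothetical, generated independently). It fits PCA three ways: on the raw columns, on the raw columns with `whiten=True`, and on columns standardized with `StandardScaler`. As far as I can tell, `whiten=True` rescales the projected components to unit variance rather than standardizing the input columns, so it should not change which direction dominates. Note also that `StandardScaler` divides by the population standard deviation, while `df.std()` uses the sample one; for PCA directions the difference does not matter.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # seeded only to make the run repeatable

# Hypothetical data: two independent columns on very different scales.
df = pd.DataFrame({
    'age': rng.normal(40, 10, 200),
    'population': rng.normal(1_000_000, 250_000, 200),
})

# PCA on the raw columns: the large-scale column dominates the variance.
raw = PCA(n_components=2).fit(df)

# Same raw input, but with whitening of the projected output.
whitened = PCA(n_components=2, whiten=True).fit(df)

# PCA on columns standardized to zero mean and unit variance.
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(df))

print('raw:     ', raw.explained_variance_ratio_)
print('whitened:', whitened.explained_variance_ratio_)
print('scaled:  ', scaled.explained_variance_ratio_)
```

On the raw and whitened fits the first component should carry nearly all of the variance, because `population` has by far the larger spread. After standardizing, the two independent columns should share the variance roughly evenly.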
