How can I calculate Principal Components Analysis from data in a pandas dataframe?
Principal components analysis using pandas dataframe
I guess you too are trying to modify the w3schools example :) –
Downtown
Most sklearn objects work with pandas DataFrames just fine. Would something like this work for you?
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
df = pd.DataFrame(data=np.random.normal(0, 1, (20, 10)))
pca = PCA(n_components=5)
pca.fit(df)
You can access the components themselves with
pca.components_
This works great. Just an addition that might be of interest: it's often convenient to end up with a DataFrame as well, as opposed to an array. To do that one would do something like: pandas.DataFrame(pca.transform(df), columns=['PCA%i' % i for i in range(n_components)], index=df.index), where I've set n_components=5. Also, you have a typo in the text above the code, "panadas" should be "pandas". :) –
Giantism
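For anyone who wants to run the one-liner from the comment above, here is a minimal sketch of it. It assumes n_components=5 and the same random 20x10 frame as the answer; the PCA0..PCA4 column names are just the ones the comment's format string produces.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

n_components = 5  # assumed, matching the answer above
df = pd.DataFrame(data=np.random.normal(0, 1, (20, 10)))

pca = PCA(n_components=n_components)

# fit_transform fits the model and returns the projected data as an array;
# wrapping it in a DataFrame keeps the original row index and names the columns
scores = pd.DataFrame(
    pca.fit_transform(df),
    columns=['PCA%i' % i for i in range(n_components)],
    index=df.index,
)
print(scores.head())
```

This gives one row per original row of df and one column per retained component.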
In my case I wanted the components, not the transform, so taking @Moot's syntax I used df = pandas.DataFrame(pca.components_). One last note: if you are going to use this new df with a dot product, make sure to check out this link: [#16473229 –
Superfetation
import pandas
from sklearn.decomposition import PCA
import numpy
import matplotlib.pyplot as plot
df = pandas.DataFrame(data=numpy.random.normal(0, 1, (20, 10)))
# You must normalize the data before applying the fit method
df_normalized=(df - df.mean()) / df.std()
pca = PCA(n_components=df.shape[1])
pca.fit(df_normalized)
# Reformat and view results
loadings = pandas.DataFrame(pca.components_.T,
                            columns=['PC%s' % i for i in range(len(df_normalized.columns))],
                            index=df.columns)
print(loadings)
plot.plot(pca.explained_variance_ratio_)
plot.ylabel('Explained Variance Ratio')
plot.xlabel('Components')
plot.show()
The whiten=True argument to PCA does the normalization for you, if you need it at all. –
Proponent
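A caveat on whiten, as far as I understand it: whiten=True rescales the projected output so each component has unit variance, but it does not standardize the input columns before the fit, so it is not quite the same as the z-score step in the answer above. A rough check on made-up data (the two columns and their scales are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two made-up, independent columns on very different scales
X = np.column_stack([
    rng.normal(40, 10, 200),          # small-scale column
    rng.normal(50_000, 10_000, 200),  # large-scale column
])

plain = PCA(n_components=2).fit(X)
white = PCA(n_components=2, whiten=True).fit(X)

# The directions are the same with or without whiten;
# the large-scale column still dominates the first component
print(plain.components_)
print(white.components_)

# whiten only changes the scale of the transformed output
print(white.transform(X).std(axis=0, ddof=1))
```

So if the columns are on different scales, standardizing them yourself before calling fit still seems to be needed.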
When in doubt, normalize; otherwise your columns could be on different scales. For example, if you had age in one column and population in another, those are two different measurement scales and would need to be normalized before running PCA. –
Bellwort
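To make the age/population point concrete, here is a small sketch with made-up numbers. StandardScaler is one way to do the normalization (it is swapped in here for the manual (df - df.mean()) / df.std() step in the answer); the column names and distributions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two unrelated columns on very different scales
df = pd.DataFrame({
    'age': rng.normal(40, 10, 200),
    'population': rng.normal(50_000, 10_000, 200),
})

# PCA on the raw columns: population's variance swamps age
raw = PCA(n_components=2).fit(df)

# PCA after scaling each column to zero mean and unit variance
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(df))

print(raw.explained_variance_ratio_)
print(scaled.explained_variance_ratio_)
```

Without scaling, the first component is essentially just the population column; after scaling, the two columns contribute on comparable terms.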