Keep pandas index while applying sklearn

Asked 1/2, 2017 at 13:50 Answered 26/9, 2024 at 12:7

I have a dataset which has a DateTime index and I'm using PCA from sklearn to reduce the number of dimensions.

The following question bugs me - will PCA keep the order of the points in my series so that I can reuse the index from the original dataframe?

df = pd.DataFrame(...)
df2 = pca.fit_transform(df)
df2.index = df.index

Moreover, is there a better (safer) approach than doing this?

Tour answered 1/2, 2017 at 13:50 Comment(3)

Maybe reindexing would help - pca.fit_transform(df).reindex(index=df.index)? – Clotheshorse 1/2, 2017 at 13:53

And is there any difference in what I am doing? – Tour 1/2, 2017 at 14:0

Not likely though. This would get rid of the unnecessary re-assignment of index axis. – Clotheshorse 1/2, 2017 at 14:4

PCA drops the index (the output is a plain NumPy array), but the underlying order of the rows is preserved (see the implementation of the transform function of PCA*). So it is safe to reuse the original index, i.e. df2.index = df.index once df2 is a DataFrame.

*fit_transform is equivalent to fit followed by transform. Neither of them reorders the rows.
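A quick way to convince yourself of this is to transform the rows one at a time and compare against the full output. This is only a sketch on made-up data: the toy frame, the column names, and n_components=2 are assumptions for illustration, not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data with a DatetimeIndex (not taken from the question).
rng = np.random.default_rng(0)
index = pd.date_range("2017-01-01", periods=6, freq="D")
df = pd.DataFrame(rng.normal(size=(6, 4)), index=index, columns=list("abcd"))

pca = PCA(n_components=2).fit(df)

# Transform the whole frame at once: a NumPy array with one row per input row.
full = pca.transform(df)

# Transform each row separately and stack the results in the original order.
row_by_row = np.vstack([pca.transform(df.iloc[[i]]) for i in range(len(df))])

# If transform reordered rows, these two arrays would not line up.
rows_match = np.allclose(full, row_by_row)
print(rows_match)
```

Row i of the output depends only on row i of the input, which is why the original index can be reattached positionally.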

Krill answered 21/3, 2017 at 13:32 Comment(0)

Moreover, is there a better (safer) approach than doing this?

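Relying on the row order is safe. Note, however, that fit_transform returns a NumPy array, which has no index to assign to, so a cleaner way to do this is to wrap the output in either a DataFrame or a Series and pass the original index. In your example: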

df = pd.DataFrame(...)
df2 = pd.DataFrame(pca.fit_transform(df), index=df.index)

This is very useful when dealing with prediction vectors (np.ndarrays) coming out of a scikit-learn model:

y_pred = pd.Series(clf.predict(X_train), index=X_train.index)

This matters most when you have a more complicated index, like a MultiIndex, that would be tedious to rebuild by hand.
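For instance, with a two-level MultiIndex the wrapping looks the same. The data, the level names, and the pc1/pc2 column labels below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical frame indexed by (group, date); not from the original question.
index = pd.MultiIndex.from_product(
    [["A", "B"], pd.date_range("2020-01-01", periods=3, freq="D")],
    names=["group", "date"],
)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 3)), index=index, columns=["x", "y", "z"])

pca = PCA(n_components=2)

# Wrap the NumPy output and hand it the original index in one step.
df2 = pd.DataFrame(pca.fit_transform(df), index=df.index, columns=["pc1", "pc2"])

# Both index levels survive unchanged.
same_index = df2.index.equals(df.index)
print(same_index)
```

Passing index=df.index to the constructor avoids mutating df2 after the fact and fails loudly if the number of rows does not match.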

Juanitajuanne answered 29/5, 2020 at 11:27 Comment(0)

With scikit-learn 1.2 or later, you can get a pandas DataFrame directly by calling the set_output method on your estimator:

df = pd.DataFrame(...)
pca = PCA().set_output(transform="pandas")
df2 = pca.fit_transform(df)

Whinny answered 26/9, 2024 at 12:7 Comment(0)
