Pandas: get_dummies vs categorical

I have a dataset which has a few columns with categorical data.

I've been using the Categorical function to replace categorical values with numerical ones.

data[column] = pd.Categorical(data[column]).codes

I recently ran across the pandas.get_dummies function. Are the two interchangeable? Is there an advantage to using one over the other?

Diglot answered 23/3, 2015 at 22:50 Comment(3)
If you just want to convert to numeric values for sklearn, why not DictVectorizer? - Cellulitis
To be honest, Ed, because I didn't know it existed :) - Diglot
You'll probably find that sklearn covers most of your data preprocessing needs. - Cellulitis

Why are you converting the categorical data to integers? I don't believe you save memory, if that is your goal.

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null category
dtypes: category(1)
memory usage: 78.0 bytes

>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes
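The six-row frames above are too small to say much, so here is a rough sketch of the same comparison on a larger column. The labels and row count are made up for illustration, and the exact byte counts will vary with your pandas version.

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times.
labels = ['red', 'green', 'blue'] * 10_000

as_object = pd.Series(labels, dtype=object)        # one Python string reference per row
as_category = pd.Series(labels, dtype='category')  # small integer codes plus a lookup table
as_int64 = as_category.cat.codes.astype('int64')   # the same codes widened to int64

# deep=True also counts the memory held by the Python string objects.
for name, s in [('object', as_object), ('category', as_category), ('int64', as_int64)]:
    print(name, s.memory_usage(index=False, deep=True), 'bytes')
```

With only a handful of categories, pandas stores the codes as int8, so the category column should come out smaller than the int64 one.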

The categorical codes are just integer values for the unique items in the given category. By contrast, get_dummies returns a new column for each unique item. The value in the column indicates whether or not the record has that attribute.

>>> pd.get_dummies(df, dtype=int)
   cat_a  cat_b  cat_c
0      1      0      0
1      1      0      0
2      1      0      0
3      0      1      0
4      0      1      0
5      0      0      1

To get the codes directly, you can use:

df['codes'] = df['cat'].cat.codes
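
As a small sketch of what those codes mean (the series below is invented for illustration): each code is a position in the categories index, and missing values get the code -1, so the original labels can be rebuilt from the two pieces.

```python
import pandas as pd

# Hypothetical column with one missing value.
s = pd.Series(['a', 'b', None, 'c', 'a'], dtype='category')

codes = s.cat.codes            # one integer per row; the missing value becomes -1
categories = s.cat.categories  # lookup table: position i holds the label for code i

# Rebuild the categorical values from the codes and the lookup table.
restored = pd.Categorical.from_codes(codes, categories=categories)

print(list(codes))       # [0, 1, -1, 2, 0]
print(list(categories))  # ['a', 'b', 'c']
```

If you overwrite the column with the codes, as in the question, keep the categories around somewhere, or you lose the mapping back to the labels.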
Parathyroid answered 23/3, 2015 at 23:41 Comment(5)
Thanks Alexander, I'm actually preparing the dataset for a Random Forest regression, so I need everything to be numerical. It turns out that get_dummies gives me memory errors, whereas Categorical does not. - Diglot
This is not an answer to the second part of the question, which was the key part I guess: "I've recently ran across the pandas.get_dummies function. Are these interchangeable? Is there an advantage of using one over the other?" - Arthropod
The second part of the question isn't a programming question. A machine learning algorithm will interpret the integer codes in df2 as having an order (e.g. green is greater than red). Whether or not this is desirable depends on your use case. To get around this issue, dummy variables (aka one-hot encoding) create a new feature for each of the categorical items. - Parathyroid
@Diglot In reference to the memory error, you can use the sparse=True option, which shouldn't use much more memory than the original categorical dataframe. - Lindahl
@Parathyroid As for the machine learning question, it seems that random forest is generally able to create trees that can ignore the implied order, though I haven't seen any rigorous proofs of it. It seems to be common practice, though. - Lindahl
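
A rough sketch of the sparse=True suggestion from the comments, assuming a made-up column with 200 distinct labels. High cardinality is where dense dummies get expensive; with only a few categories the sparse form may not be smaller at all.

```python
import pandas as pd

# Hypothetical high-cardinality column: 200 distinct labels over 20,000 rows.
labels = [f'v{i % 200}' for i in range(20_000)]
df = pd.DataFrame({'cat': pd.Categorical(labels)})

dense = pd.get_dummies(df)               # one fully materialised column per label
sparse = pd.get_dummies(df, sparse=True) # stores only the non-zero entries

dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()
print('dense :', dense_bytes, 'bytes')
print('sparse:', sparse_bytes, 'bytes')
```

Whether a given estimator accepts the sparse columns directly depends on the library, so check that before relying on it.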
