Pandas: get_dummies vs categorical - McMap

Pandas: get_dummies vs categorical

Asked 23/3, 2015 at 22:50 Answered 23/3, 2015 at 23:41

python pandas categorical-data dummy-data

I have a dataset which has a few columns with categorical data.

I've been using the Categorical function to replace categorical values with numerical ones.

data[column] = pd.Categorical(data[column]).codes

I recently ran across the pandas.get_dummies function. Are these interchangeable? Is there an advantage to using one over the other?

Diglot answered 23/3, 2015 at 22:50 Comment(3)

If you just want to convert to numeric values for sklearn, why not DictVectorizer? – Cellulitis 24/3, 2015 at 8:28

To be honest, Ed, because I didn't know it existed :) – Diglot 24/3, 2015 at 22:11

You'll probably find that sklearn has most of your data preprocessing needs – Cellulitis 24/3, 2015 at 22:13

Why are you converting the categorical data to integers? I don't believe you save memory, if that is your goal.

import pandas as pd
df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})  # category dtype
df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})  # plain int64 column

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null category
dtypes: category(1)
memory usage: 78.0 bytes

>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes

The categorical codes are just integer values for the unique items in the given category. By contrast, get_dummies returns a new column for each unique item. The value in the column indicates whether or not the record has that attribute.

>>> pd.get_dummies(df, dtype=int)
   cat_a  cat_b  cat_c
0      1      0      0
1      1      0      0
2      1      0      0
3      0      1      0
4      0      1      0
5      0      0      1

To get the codes directly, you can use:

df['codes'] = df['cat'].cat.codes
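As a minimal sketch of what the codes represent (reusing the same six-row example; the variable names `codes`, `categories` and `decoded` are only illustrative), the integers are positions in the column's list of categories, so they can be decoded back to the original labels:

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})

# Integer codes: one small integer per row, indexing into the categories.
codes = df['cat'].cat.codes
# The lookup table that the codes point into.
categories = df['cat'].cat.categories

# Decode the integers back to the original labels.
decoded = pd.Categorical.from_codes(codes, categories=categories)

print(codes.tolist())      # [0, 0, 0, 1, 1, 2]
print(list(categories))    # ['a', 'b', 'c']
print(list(decoded))       # ['a', 'a', 'a', 'b', 'b', 'c']
```

If you overwrite the column with the codes alone, keep the categories somewhere, otherwise the mapping back to labels is lost.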

Parathyroid answered 23/3, 2015 at 23:41 Comment(5)

Thanks Alexander, I'm actually preparing the dataset for a Random Forest regression, so I need everything to be numerical. It actually turns out that get_dummies will give me memory errors, whereas Categorical will not – Diglot 24/3, 2015 at 0:25

This is not an answer to the second part of the question, which was the key part I guess: I've recently ran across the pandas.get_dummies function. Are these interchangeable? Is there an advantage of using one over the other? – Arthropod 3/11, 2015 at 22:29

The second part of the question isn't a programming question. A machine learning algorithm will interpret categorical data encoded as integers (as in df2) as having an order (e.g. green is greater than red). Whether or not this is desirable depends on your use case. To get around this issue, dummy variables (a.k.a. one-hot encoding) create a new feature for each of the categorical items. – Parathyroid 6/11, 2015 at 17:39
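To illustrate the ordering point, here is a small sketch with a made-up `color` column (the data and names are assumptions, not from the question). With default settings, the codes follow the sorted category order, whereas get_dummies produces one independent 0/1 column per value:

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'], name='color')

# Integer codes follow the sorted category order (blue < green < red),
# so a model that treats them as numbers sees an ordering.
codes = colors.astype('category').cat.codes

# One-hot encoding gives each colour its own 0/1 column instead.
dummies = pd.get_dummies(colors, prefix='color', dtype=int)

print(codes.tolist())            # [2, 1, 0, 1]
print(list(dummies.columns))     # ['color_blue', 'color_green', 'color_red']
```

Each row of `dummies` has exactly one 1, so no colour is numerically "greater" than another.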

@Diglot In reference to the memory error, you can use the sparse=True option, which shouldn't use much more memory than the original categorical dataframe. – Lindahl 5/6, 2016 at 19:24
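A rough way to check the sparse=True suggestion on your own data is to compare memory footprints. The sketch below uses synthetic data (10,000 rows, up to 100 distinct labels, both numbers chosen arbitrarily); actual savings depend on the number of rows and categories:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 10,000 rows drawn from 100 distinct labels.
labels = pd.Series(rng.integers(0, 100, size=10_000).astype(str),
                   name='label').astype('category')

dense = pd.get_dummies(labels)                 # one dense column per label
sparse = pd.get_dummies(labels, sparse=True)   # same columns, sparse storage

cat_bytes = labels.memory_usage(deep=True)
dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(cat_bytes, dense_bytes, sparse_bytes)
```

The sparse frame only stores the non-zero entries of each column, which is why it tends to be much smaller than the dense one when there are many categories.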

@Parathyroid As for the machine learning question, it seems that random forest is generally able to create trees that can ignore the implied order, though I haven't seen any rigorous proofs of it. It seems to be common practice, though. – Lindahl 5/6, 2016 at 19:25
