drop_first=True during dummy variable creation in pandas
3

15

I have month data (Jan, Feb, Mar, etc.) in my dataset and I am generating dummy variables using the pandas library: pd.get_dummies(df['month'], drop_first=True)

Should I use drop_first=True in this case? Why is drop_first important, and for which types of variables does it apply?

Wellchosen answered 30/8, 2020 at 19:17 Comment(1)
Yes, you should. Imagine you are looking at a coin flip and have a feature called is_head: you do not need an is_tail column because you already know it from is_head=False. The same applies to other features like your month: if Jan to Nov are all false, it is clearly December. Why is that important? Because more dummy features make it harder for the algorithm to fit or, even worse, make it easier to overfit. - Aldas
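As a rough sketch of the month case (the data below is made up, since the original dataset is not shown), note which column actually gets dropped. With plain strings pandas sorts the levels, so the dropped month is "Apr", not "Jan" or "Dec". Casting to a categorical dtype with an explicit order is one way to choose the baseline yourself; the `order` list here is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data; the dataset from the question is not available.
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "Jan", "Mar"]})

# Plain strings: levels are sorted alphabetically, so "Apr" is the dropped level.
plain = pd.get_dummies(df["month"], drop_first=True, dtype=int)
print(list(plain.columns))

# Explicit category order: "Jan" comes first, so "Jan" becomes the dropped level.
order = ["Jan", "Feb", "Mar", "Apr"]
month_cat = df["month"].astype(pd.CategoricalDtype(categories=order))
ordered = pd.get_dummies(month_cat, drop_first=True, dtype=int)
print(list(ordered.columns))
```

A row with zeros in every remaining column belongs to the dropped month.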
19
  • drop_first=True is important because it removes the redundant extra column created during dummy variable creation. The full set of dummy columns sums to 1 in every row, so dropping one removes that built-in correlation among the dummy variables.
  • Let's say a categorical column has 3 possible values and we want to create dummy variables for it. If a row is neither furnished nor semi_furnished, then it is obviously unfurnished, so we do not need a third variable to identify unfurnished.

Hence, if we have a categorical variable with n levels, we only need n-1 columns to represent the dummy variables.
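A minimal sketch of the n versus n-1 point, using made-up rows for the three furnishing levels. It also checks the redundancy numerically: with an intercept column, the full encoding is rank deficient, while the reduced one is not. One caveat: pandas drops the first level in sorted order, which here is "furnished" rather than "unfurnished"; the reasoning is the same whichever level is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the three levels described above.
status = pd.Series(
    ["furnished", "semi_furnished", "unfurnished", "furnished", "unfurnished"],
    name="furnishing",
)

full = pd.get_dummies(status, dtype=int)                      # n = 3 columns
reduced = pd.get_dummies(status, drop_first=True, dtype=int)  # n - 1 = 2 columns

# Each row of the full encoding sums to 1, so adding an intercept column
# makes the columns linearly dependent; dropping one level removes that.
intercept = np.ones((len(status), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

print(list(reduced.columns))
print(np.linalg.matrix_rank(X_full), "of", X_full.shape[1], "columns")
print(np.linalg.matrix_rank(X_reduced), "of", X_reduced.shape[1], "columns")
```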

Leafy answered 30/10, 2020 at 15:44 Comment(4)
This should have been the default IMO. - Equity
Let's imagine that "unfurnished" is actually a very important feature in the model. How would I know this, given that unfurnished is removed as in your example above? Yes, we can still tell which rows are unfurnished by looking at your data, but if, for example, you plot the feature importances after model building, unfurnished will not appear there even though it is an important feature. How does this work? - Epigone
If we have 3 levels (unknown, True, False) in the column, how can I tell Python not just to drop_first but to drop exactly the 'unknown' column? - Sulfonal
@AmanBagrecha makes a good point. If drop_first=True is so important in reducing the dimensionality, shouldn't it be the default setting for the get_dummies() function? When would it be appropriate to ever set drop_first=False? - Dupre
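On the question above about dropping exactly the 'unknown' level: one possible approach (a sketch, not the only way) is to skip drop_first, encode every level, and then remove the chosen column by name. The series below is hypothetical and stores the three levels as strings.

```python
import pandas as pd

# Hypothetical column with three levels, stored as strings.
s = pd.Series(["unknown", "True", "False", "True", "unknown"], name="flag")

# Encode every level, then drop the chosen baseline column by name.
dummies = pd.get_dummies(s, dtype=int).drop(columns="unknown")
print(list(dummies.columns))
print(dummies)
```

Rows where both remaining columns are 0 are the 'unknown' rows.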
4

What is drop_first=True

drop_first=True drops the first column during dummy variable creation. Suppose you have a gender column that contains 4 values: "Male", "Female", "Other", "Unknown". A person who is not "Male", "Female", or "Other" must be "Unknown".

We do NOT need another column for "Unknown". (pandas itself drops the first level in sorted order, which here would be "Female", but the same reasoning applies to whichever level is dropped.)

Dropping a column can be necessary in some situations and not applicable in others. The goal is to reduce the number of columns by dropping one that carries no extra information. However, that is not always the case; in some situations we need to keep the first column.

Example

Suppose we have 5 unique values in a column called "Fav_genre": "Rock", "Hip hop", "Pop", "Metal", "Country". During dummy variable creation, we would usually generate 5 columns. In this case, drop_first=True is not applicable: a person may have more than one favorite genre, so the columns are not mutually exclusive and dropping any of them would lose information. Hence, drop_first=False is the default.
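To illustrate the multi-genre case, here is a small sketch. It assumes the answers are stored as "|"-separated strings (an assumption, the original data format is not given) and uses Series.str.get_dummies, which is a different function from pd.get_dummies and handles several labels per row.

```python
import pandas as pd

# Hypothetical survey answers; "|" separates multiple favorite genres.
fav_genre = pd.Series(["Rock|Pop", "Metal", "Country|Rock", "Hip hop"])

# One 0/1 column per genre; a row can have more than one 1.
genres = fav_genre.str.get_dummies(sep="|")
print(list(genres.columns))
print(genres.sum(axis=1).tolist())
```

Because a row can sum to more than 1, no column can be reconstructed from the others, so none of them is safe to drop.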

Pestilent answered 23/4, 2022 at 22:57 Comment(0)
0
  • What

One-hot encoding: a technique that converts categorical data into a form that an ML model can understand.

  • How

For example, if a column has 10 unique categorical values or labels, pd.get_dummies() converts them into binary vectors spread over 10 columns, one for each unique value of the original column. Wherever that value is present in a row it is indicated as 1, otherwise 0.

  • What

If drop_first is True, it removes the first column, the one created for the first unique value of the original column.

In our case that gives 9 columns instead of 10.

  • How

It is useful because it reduces the number of columns without losing information: when all the other columns are zero, the first column must be 1.
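The 10-versus-9 column count and the "all other columns are zero" rule can be checked with a short sketch. The labels "c0" to "c9" are invented for illustration.

```python
import pandas as pd

# Hypothetical column with 10 distinct labels, "c0" through "c9".
labels = pd.Series([f"c{i}" for i in range(10)] * 2, name="label")

full = pd.get_dummies(labels, dtype=int)                      # 10 columns
reduced = pd.get_dummies(labels, drop_first=True, dtype=int)  # 9 columns

# The dropped column is 1 exactly where all remaining columns are 0.
recovered = 1 - reduced.sum(axis=1)

print(full.shape[1], reduced.shape[1])
print((recovered == full["c0"]).all())
```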

For more detail, see: https://www.dataindependent.com/pandas/pandas-get-dummies/

Fluoride answered 26/1, 2022 at 8:52 Comment(0)
