Create dummies from column with multiple values in pandas

Asked 19/9, 2013 at 8:20 Answered 26/10, 2018 at 15:25

Solved python pandas dummy-data categorical-data

I am looking for for a pythonic way to handle the following problem.

The pandas.get_dummies() method is great to create dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly.

Now, I need to handle this situation. A single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D'] . get_dummies() creates 6 dummies, but I only want 4 of them, so that a row could have multiple 1s.

Is there a way to handle this in a pythonic way? I could only think of some step-by-step algorithm to get it, but that would not include get_dummies(). Thanks

Edited, hope it is more clear!

Plurality answered 19/9, 2013 at 8:20 Comment(3)

could you not just select the columns you want to call get_dummies() on? like df[['A', 'B', 'C','D']].get_dummies()? – Charest 19/9, 2013 at 8:24

A, B, C, C*D, and so on, are the values of a single column. – Plurality 19/9, 2013 at 8:46

OK understand, how about df[df.col.isin(['A','B','C'])].get_dummies() would this work? this would filter out the values you did not want to generate dummy values for – Charest 19/9, 2013 at 9:42

I know it's been a while since this question was asked, but there is (at least now there is) a one-liner that is supported by the documentation:

In [4]: df
Out[4]:
      label
0  (a, c, e)
1     (a, d)
2       (b,)
3     (d, e)

In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
Out[5]:
   a  b  c  d  e
0  1  0  1  0  1
1  1  0  0  1  0
2  0  1  0  0  0
3  0  0  0  1  1

Etem answered 8/8, 2014 at 17:25 Comment(4)

+1: You sir... are some sort of wizard! Just what I needed.. thx :) – Flamenco 20/8, 2014 at 18:25

Based on my question I think df['label'].str.get_dummies(sep='*') is enough. I guess the first part would be needed if the strings weren't already formated as the get_dummies function expects them to be – Plurality 19/9, 2016 at 8:48

The answer is quite old but, however I am strugling to understand how is works. Could anyone comment on the structure ? I have a similar case where each row is a list of strings. The solution works like a charm but I'd like to understand why.... – Heulandite 12/6, 2017 at 16:49

Oustanding solution @Etem – Grivation 4/3, 2019 at 12:4

I have a somewhat cleaner solution. Assume we want to transform the following dataframe

   pageid category
0       0        a
1       0        b
2       1        a
3       1        c

into

        a  b  c
pageid         
0       1  1  0
1       1  0  1

One way to do it is to make use of scikit-learn's DictVectorizer. I would, however, be interested in learning about other methods.

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))

grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
category_dicts = [dict(tuples) for tuples in grouped]
v = sklearn.feature_extraction.DictVectorizer(sparse=False)
X = v.fit_transform(category_dicts)

pd.DataFrame(X, columns=v.get_feature_names(), index=grouped.index)

Tithable answered 4/6, 2014 at 18:31 Comment(0)

You can generate the dummies dataframe with your raw data, isolate the columns that contains a given atom, and then store the result matches back to the atom column.

df
Out[28]: 
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D

dummies = pd.get_dummies(df['label'])

atom_col = [c for c in dummies.columns if '*' not in c]

for col in atom_col:
    ...:     df[col] = dummies[[c for c in dummies.columns if col in c]].sum(axis=1)
    ...:     

df
Out[32]: 
  label  A  B  C  D
0     A  1  0  0  0
1     B  0  1  0  0
2     C  0  0  1  0
3     D  0  0  0  1
4   A*C  1  0  1  0
5   C*D  0  0  1  1

Flagstaff answered 19/9, 2013 at 9:44 Comment(3)

Works like a charm! (although you need to edit 'd' into 'dummies' in the code) thanks a lot! – Plurality 19/9, 2013 at 10:14

I just found that the solution does not work if the values of the 'label' variable are of the form [A,B,C,D,A*C,C*D,E*F]. The way I got it to work anyway was by doing something like: atom_col = '*'.join(pd.Series(df['label'])).split('*').unique() (it's ugly in 1 line but the steps are all there) – Plurality 23/9, 2013 at 16:20

you can also use a set comprehension: atom_col = {c for raw_c in dummies.columns for c in raw_c.split('*') if '*' not in c} – Flagstaff 23/9, 2013 at 19:43

I believe this question needs an updated answer after coming across the MultiLabelBinarizer from sklearn.

The usage of this is as simple as...

# Instantiate the binarizer
mlb = MultiLabelBinarizer()

# Using OP's original data frame
df = pd.DataFrame(data=['A', 'B', 'C', 'D', 'A*C', 'C*D'], columns=["label"])

print(df)
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D

# Convert to a list of labels
df = df.apply(lambda x: x["label"].split("*"), axis=1)

print(df)
0       [A]
1       [B]
2       [C]
3       [D]
4    [A, C]
5    [C, D]
dtype: object

# Transform to a binary array
array_out = mlb.fit_transform(df)

print(array_out)
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 1 0]
 [0 0 1 1]]

# Convert back to a dataframe (unnecessary step in many cases)
df_out = pd.DataFrame(data=array_out, columns=mlb.classes_)

print(df_out)
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  1  0
5  0  0  1  1

This is also very fast, took virtually no time (.03 seconds) across 1000 rows and 50K classes.

Brehm answered 26/10, 2018 at 15:25 Comment(0)

Recommended topics

Hot tags