How to change multiple Pandas DF columns to categorical without a loop
Asked Answered
M

2

7

I have a DataFrame where I want to change several columns from type 'object' to 'category'.

I can change several columns at the same time for float,

dftest[['col3', 'col4', 'col5', 'col6']] = \
    dftest[['col3', 'col4', 'col5', 'col6']].astype(float)

For 'category' I can not do it the same, I need to do one by one (or in a loop like here).

for col in ['col1', 'col2']:
    dftest[col] = dftest[col].astype('category')

Question: Is there any way of doing the change for all wanted columns at once like in the 'float' example?

If I try to do several columns at the same time I have:

dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time

My current working test code:

import numpy as np
import pandas as pd 

factors= np.array([
        ['a', 'xx'],
        ['a', 'xx'],
        ['ab', 'xx'],
        ['ab', 'xx'],
        ['ab', 'yy'],
        ['cc', 'yy'],
        ['cc', 'zz'],
        ['d', 'zz'],
        ['d', 'zz'],
        ['g', 'zz'] 
        ])

values = np.random.randn(10,4).round(2)

dftest = pd.DataFrame(np.hstack([factors,values]), 
                  columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])

#dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time

## it works with individual astype
#dftest['col2'] = dftest['col2'].astype('category')
#dftest['col1'] = dftest['col1'].astype('category')

print(dftest)

## doing a loop
for col in ['col1', 'col2']:
    dftest[col] = dftest[col].astype('category')


dftest[['col3', 'col4', 'col5', 'col6']] = \
    dftest[['col3', 'col4', 'col5', 'col6']].astype(float)

dftest.dtypes

output:

col1    category
col2    category
col3     float64
col4     float64
col5     float64
col6     float64
dtype: object

== [update] ==

I don't have a problem using the loop now that I know the trick, but I asked the question because I wanted to learn/understand WHY I need to do a loop for 'category' and not for float, if there is no other way of doing it.

Monopoly answered 29/8, 2016 at 14:54 Comment(4)
Out of curiosity, what is the point? Speed?Remunerate
@Remunerate see my updateMonopoly
"Not implemented" generally means it is planned for a future version. Categoricals are relatively new so we may expect .astype('category) to work for more than 1 column in the future.Pattiepattin
FYI: DataFrame.astype('category') is implemented and availableVive
O
3

It's not immediately clear what the result of dftest[['col1','col2']].astype('category') should be, i.e. whether the resulting columns should share the same categories or not.

Looping over columns make each column have a separate set of categories. (I believe this is a desired result in your example.)

On the other hand, .astype(float) works differently: it ravels the underlying values to a 1d array, casts it to floats, and then reshape it back to the original shape. This way it may be faster than just iterating over columns. You can emulate this behaviour for category with higher level functions:

result = dftest[['col1', 'col2']].stack().astype('category').unstack()

But then you get a single set of categories shared by the both columns:

result['col1']
Out[36]: 
0     a
1     a
2    ab
3    ab
4    ab
5    cc
6    cc
7     d
8     d
9     g
Name: col1, dtype: category
Categories (8, object): [a < ab < cc < d < g < xx < yy < zz]
Obbard answered 30/8, 2016 at 16:21 Comment(1)
Thanks @ptrj, I see now that the problem is not the casting to another type but how to handle the levels in the case of categorical columns. I did not think on that, but make perfect sense to me now. Thanks for the enlightenment.Monopoly
W
1

you can do it this way:

In [99]: pd.concat([dftest[['col1', 'col2']].apply(lambda x: x.astype('category')), dftest.ix[:, 'col3':].astype('float')], axis=1)
Out[99]:
  col1 col2  col3  col4  col5  col6
0    a   xx  0.30  2.28  0.84  0.31
1    a   xx -0.13  2.04  2.62  0.49
2   ab   xx -0.34 -0.32 -1.87  1.49
3   ab   xx -1.18 -0.57 -0.57  0.87
4   ab   yy  0.66  0.65  0.96  0.07
5   cc   yy  0.88  2.43  0.76  1.93
6   cc   zz  1.81 -1.40 -2.29 -0.13
7    d   zz -0.05  0.60 -0.78 -0.28
8    d   zz -0.36  0.98  0.23 -0.17
9    g   zz -1.31 -0.84  0.02  0.47

In [100]: pd.concat([dftest[['col1', 'col2']].apply(lambda x: x.astype('category')), dftest.ix[:, 'col3':].astype('float')], axis=1).dtypes
Out[100]:
col1    category
col2    category
col3     float64
col4     float64
col5     float64
col6     float64
dtype: object

but it won't be much faster, as apply() method uses looping under the hood

Wolpert answered 29/8, 2016 at 15:11 Comment(1)
Thanks @MaxU, but the aim of the question was more about "why I can not change type to category for several columns like in float?". I wanted to know if it was a limitation of my pandas knlowledge. See updateMonopoly

© 2022 - 2024 — McMap. All rights reserved.