Python Pandas - Changing some column types to categories

8

105

I have fed the following CSV file into an IPython Notebook:

public = pd.read_csv("categories.csv")
public

I've also imported pandas as pd, numpy as np and matplotlib.pyplot as plt. The following data types are present (the below is a summary; there are about 100 columns):

In [36]:   public.dtypes
Out[36]:   parks          object
           playgrounds    object
           sports         object
           roading        object
           resident       int64
           children       int64

I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (they have Likert-scale responses in them, though each column has a different response set - e.g. one has "strongly agree", "agree", etc., while another has "very important", "important", etc.), leaving the remainder as int64.

I was able to create a separate dataframe - public1 - and change one of the columns to a category type using the following code:

public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')

However, when I tried to change a number of columns at once using this code, I was unsuccessful:

public1 = {'parks': public.parks,
           'playgrounds': public.playgrounds}
public1 = public1['parks', 'playgrounds'].astype('category')

Notwithstanding this, I don't want to create a separate dataframe with just the categories columns. I would like them changed in the original dataframe.

I tried numerous ways to achieve this, then tried the code here: Change column type in pandas.

public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')

and got the following error:

 NotImplementedError: > 1 ndim Categorical are not supported at this time

Is there a way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert-scale responses can then be analysed), leaving 'resident' and 'children' (and the 94 other columns that are strings, ints and floats) untouched?

I am using Python 2.7.

Drongo answered 7/3, 2015 at 2:51 Comment(0)
160

Sometimes, you just have to use a for-loop:

for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
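Since these columns hold Likert-scale responses, it can also be worth making the categories *ordered*, so that sorting and comparisons respect the scale. A minimal sketch using pd.CategoricalDtype (pandas 0.21+), assuming one column's labels are 'disagree'/'agree'/'strongly agree' (the actual labels in your data may differ):

```python
import pandas as pd

# Small stand-in for the 'public' DataFrame from the question
public = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'disagree'],
    'resident': [1, 2, 3],
})

# Ordered categorical: 'disagree' < 'agree' < 'strongly agree'
likert = pd.CategoricalDtype(
    categories=['disagree', 'agree', 'strongly agree'], ordered=True
)
public['parks'] = public['parks'].astype(likert)

print(public['parks'].dtype.ordered)            # True
print((public['parks'] > 'disagree').tolist())  # [True, True, False]
```

With an ordered dtype, min/max and comparisons like the one above work directly on the responses.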
Babblement answered 7/3, 2015 at 3:1 Comment(6)
Thank you so much @unutbu, this has worked a treat. I can't believe that it's that simple and I feel really stupid now! – Drongo
What if I'm getting an error? /Users/air/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/… – Hubble
@JanSila: You may get that UserWarning if public is a sub-DataFrame of another DataFrame and has data which was copied from that other DataFrame. Out of an abundance of caution, Pandas emits a UserWarning to warn you that modifying public does not modify that other DataFrame. If modifying that other DataFrame is not what you intend to do, or is not an issue, then you are free to ignore the UserWarning. If you wish to silence the UserWarning anyway, place public.is_copy = False before making assignments of the form public[col] = .... – Babblement
@Babblement thanks, I didn't quite get it from the documentation. This makes it clear. Kind of like deep and shallow copy in C++, right? I cast the original variables and then copied. Using .copy() when making the subset works as well, if anyone else faces this issue. – Hubble
@JanSila: Yes, that's right. public.copy() also works, but note that if public is a large DataFrame, public.copy() could be much slower than setting the flag public.is_copy = False. (Moreover, the UserWarning is relevant only when public is a copy, so it seems ironic that we would need to make yet another copy just to silence the warning.) On the other hand, I don't think public.is_copy = False is documented. I found it by reading the source code. So if sticking to the documented API is a priority, you may wish to use public = public.copy(). – Babblement
@unutbu: note it has been vectorized (df[sel_cols] = df[sel_cols].astype(new_type)); and there is a typo in your nick ;) – Contort
72

You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use

df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))

I don't know of a way to execute this inplace, so typically I'll end up with something like this:

df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))

Obviously you can replace .select_dtypes with explicit column names if you don't want to select all of a certain datatype (although in your example it seems like you wanted all object types).

Otila answered 13/1, 2017 at 22:4 Comment(3)
Excellent, +1! Really pythonic and concise solution. Also great as it uses a query to obtain the desired columns instead of passing them as a hard-coded array. – Aerology
df = df.apply(lambda s: s.astype('category') if s.name in ['parks', 'playgrounds', 'sports'] else s) – Varnish
df[categoricals] = df[categoricals].apply(lambda x: x.astype('category')) – Zeb
72

No need for loops: Pandas can now do it directly. Just pass a list of the columns you want to convert, and Pandas will convert them all.

cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')

df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})

>>     a  b
>>  0  a  c
>>  1  b  d
>>  2  c  e

df.dtypes
>> a    object
>> b    object
>> dtype: object

df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a    category
>> b    category
>> dtype: object
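astype also accepts a dict mapping column names to target dtypes, which converts only the listed columns in a single call; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'parks': ['agree', 'disagree'], 'resident': [1, 2]})

# Convert only 'parks'; 'resident' keeps its integer dtype
df = df.astype({'parks': 'category'})

print(df.dtypes['parks'])     # category
print(df.dtypes['resident'])  # int64
```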
Brenza answered 20/7, 2019 at 7:52 Comment(0)
14

As of pandas 0.19.0, the What's New notes describe that read_csv supports parsing Categorical columns directly. This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records:

import pandas as pd
import numpy as np

# Generate random data, four category-like columns, two int columns
N=10000
categories = pd.DataFrame({
            'parks' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'playgrounds' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'sports' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'roading' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'resident' : np.random.choice([1, 2, 3], size=N),
            'children' : np.random.choice([0, 1, 2, 3], size=N)
                       })
categories.to_csv('categories_large.csv', index=False)

<0.19.0 (or >=0.19.0 without specifying dtype)

pd.read_csv('categories_large.csv').dtypes # inspect default dtypes

children        int64
parks          object
playgrounds    object
resident        int64
roading        object
sports         object
dtype: object

>=0.19.0

For mixed dtypes, parsing as Categorical can be enabled by passing a dictionary dtype={'colname': 'category', ...} to read_csv.

pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                           'playgrounds': 'category',
                                           'sports': 'category',
                                           'roading': 'category'}).dtypes
children          int64
parks          category
playgrounds    category
resident          int64
roading        category
sports         category
dtype: object
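If the scale order matters, a CategoricalDtype (pandas 0.21+) can also be passed in the dtype mapping so the column is parsed as an ordered categorical straight from the file; a sketch, with the CSV inlined here to keep the example self-contained (labels assumed from the question):

```python
import io
import pandas as pd

csv = "parks,resident\nstrongly agree,1\ndisagree,2\n"

# Ordered scale: 'disagree' < 'agree' < 'strongly agree'
likert = pd.CategoricalDtype(['disagree', 'agree', 'strongly agree'], ordered=True)
df = pd.read_csv(io.StringIO(csv), dtype={'parks': likert})

print(df['parks'].dtype.ordered)         # True
print(list(df['parks'].cat.categories))  # ['disagree', 'agree', 'strongly agree']
```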

Performance

A slight speed-up (local Jupyter notebook), as mentioned in the release notes.

# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop

# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop
Isom answered 16/12, 2016 at 21:51 Comment(0)
9

To make things easier. No apply. No map. No loop.

cols = data.select_dtypes(exclude='int').columns.to_list()
data[cols] = data[cols].astype('category')
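A runnable sketch of the two lines above (the small frame here is invented for illustration):

```python
import pandas as pd

data = pd.DataFrame({
    'parks': ['agree', 'disagree'],
    'sports': ['important', 'not important'],
    'resident': [1, 2],
})

# Every non-integer column becomes a category
cols = data.select_dtypes(exclude='int').columns.to_list()
data[cols] = data[cols].astype('category')

print(cols)                       # ['parks', 'sports']
print(str(data.dtypes['parks']))  # category
```

Note that exclude='int' only skips integer columns; float or datetime columns would also be converted, so an explicit include=['object'] may be safer if the frame holds other dtypes.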
Kain answered 15/12, 2019 at 4:5 Comment(1)
That's what I was looking for. Thanks! – Transpire
1

Using a list comprehension (avoiding an explicit loop), this converts all columns with dtype object to category. I've used 'df' as the DataFrame name to be more generic.

df[[col for col in df.columns if df[col].dtypes == object]].astype('category', copy=False)

In case you'd like to avoid the copy=False argument for some reason (as the Python documentation tells us to be careful while using it), you may use the following line.

df[[col for col in df.columns if df[col].dtypes == object]] = df[[col for col in df.columns if df[col].dtypes == object]].astype('category')

This is my first answer on stack, so please be kind.

Strigil answered 15/2, 2021 at 5:32 Comment(0)
0

I found that using a for loop works well.

for col in ['col_variable_name_1', 'col_variable_name_2', ...]:
    dataframe_name[col] = dataframe_name[col].astype(float)
Reprise answered 19/7, 2018 at 17:47 Comment(0)
-1

Jupyter Notebook

In my case, I had a big DataFrame with many object columns that I wanted to convert to category.

What I did was select the object columns, fill any NA values with 'Missing', convert each column to category, and then write them back into the original DataFrame:

# Convert object columns to categories
obj_df = df.select_dtypes(include=['object']).copy()
obj_df = obj_df.fillna('Missing')
for col in obj_df:
    obj_df[col] = obj_df[col].astype('category')
df[obj_df.columns] = obj_df[obj_df.columns]
df.head()

I hope this might be a helpful resource for later reference

Wuhu answered 13/4, 2019 at 15:49 Comment(0)
