How to specify a variable in pandas as ordinal/categorical?
A

4

22

I am trying to run a machine learning algorithm on a dataset using scikit-learn. Some features in my dataset are categories. For example, feature A takes the values 1, 2, 3 to indicate the quality of something (1: Upper, 2: Second, 3: Third class), so it is an ordinal variable.

Similarly, I recoded a variable City, which has three values ('London', 'Zurich', 'New York'), into 1, 2, 3, but with no particular order among the values. So this is a nominal categorical variable.

How do I tell the algorithm to treat these as categorical, ordinal, etc. in pandas? In R, a categorical variable is declared with factor(a) and is therefore not treated as a continuous value. Is there anything like that in pandas/Python?

Argosy answered 9/4, 2015 at 2:18 Comment(0)
S
38

... years later (and because I think a good explanation of these issues is required not only for this question but to help remind myself in the future)

Ordinal vs. Nominal

In general, one translates categorical variables into dummy variables (or uses one of a host of other methods) when they are nominal, i.e. they have no sense of a > b > c. In the OP's original question, this would only be performed on the cities: London, Zurich, New York.

Dummy Variables for Nominal

For this type of issue, pandas provides -- by far -- the easiest transformation using pandas.get_dummies. So:

import numpy
import pandas

# create a sample of the OP's unique values
series = pandas.Series(
    numpy.random.randint(low=0, high=3, size=100))
mapper = {0: 'New York', 1: 'London', 2: 'Zurich'}
nomvar = series.replace(mapper)

# now let's use pandas.get_dummies
# (recent pandas versions return booleans by default; pass dtype=int for 0/1)
print(pandas.get_dummies(nomvar, dtype=int))

Out[57]:
    London  New York  Zurich
0        0         0       1
1        0         1       0
2        0         1       0
3        1         0       0
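
As a rough sketch of how this might look on a whole DataFrame rather than a single Series: the frame below, its column names, and the `city` prefix are made up for illustration. Only the nominal column is expanded, and the numeric `quality` column is left alone.

```python
import pandas as pd

# Hypothetical frame: 'city' is nominal, 'quality' is already numeric.
df = pd.DataFrame({
    "quality": [1, 2, 3, 1],
    "city": ["London", "Zurich", "New York", "London"],
})

# Expand only 'city'; dtype=int asks for 0/1 columns instead of booleans.
encoded = pd.get_dummies(df, columns=["city"], prefix="city", dtype=int)

print(encoded.columns.tolist())
```

The untouched columns come first, followed by one indicator column per city.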

Ordinal Encoding for Categorical Variables

However, in the case of ordinal variables, the user must be cautious in using pandas.factorize, because the mapping needs to preserve the relationship a > b > c.

So if I want to take a set of categorical variables where large > medium > small, and preserve that, I need to make sure that pandas.factorize preserves that relationship.

# leveraging the variables already created above
mapper = {0: 'small', 1: 'medium', 2: 'large'}
ordvar = series.replace(mapper)

print(pandas.factorize(ordvar))

Out[58]:
(array([0, 1, 1, 2, 1,...  0, 0]),
Index(['large', 'small', 'medium'], dtype='object'))

In fact, the relationship that needs to be preserved in order to maintain the concept of ordinal has been lost using pandas.factorize: it numbers the values in order of first appearance (or in sorted order with sort=True), not by their meaning. In an instance like this, I use my own mappings to ensure that the ordinal attributes are preserved.

preserved_mapper = {'large': 2, 'medium': 1, 'small': 0}
print(ordvar.replace(preserved_mapper))

Out[78]:
0     2
1     0
...
99    2
dtype: int64

Creating your own dict to map the values not only preserves the desired ordinal relationship but also helps keep the contents and mappings of your prediction algorithm organized: you have lost no ordinal information in the process, and you have a stored record of what the mapping for each variable is.
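
One way to keep such a record usable in both directions is to derive the inverse dict from the forward one. This is only a sketch: the names `size_to_code` and `code_to_size` are illustrative, and it uses Series.map rather than replace, which turns any label missing from the dict into NaN instead of silently leaving it as a string.

```python
import pandas as pd

# Forward mapping chosen by hand so that small < medium < large.
size_to_code = {"small": 0, "medium": 1, "large": 2}
# Inverse mapping derived from it, for decoding later on.
code_to_size = {code: label for label, code in size_to_code.items()}

sizes = pd.Series(["large", "small", "medium", "small"])
codes = sizes.map(size_to_code)      # labels -> ordered integers
restored = codes.map(code_to_size)   # integers -> original labels

print(codes.tolist())
```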

ints into sklearn

Lastly, the OP spoke about passing the information into scikit-learn classifiers, which means that ints are required. For that case, make sure you're aware of the astype(int) gotcha that is detailed here if you have any NaNs in your data.
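
The linked write-up is not reproduced here, so the following is only a guess at the kind of problem meant: a float column containing NaN cannot be cast with astype(int). The sketch shows the failure and two possible workarounds; the -1 sentinel is an arbitrary choice for illustration, and whether a given estimator accepts missing values is a separate question.

```python
import numpy as np
import pandas as pd

codes = pd.Series([2.0, np.nan, 0.0])

# A plain integer cast fails because NaN has no integer representation.
try:
    codes.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: pandas' nullable integer dtype keeps the gap as <NA>.
nullable = codes.astype("Int64")

# Option 2: fill the gap with a sentinel first, then cast.
filled = codes.fillna(-1).astype(int)

print(cast_failed, filled.tolist())
```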

Spirochete answered 26/1, 2017 at 14:1 Comment(2)
I know this is old. But do you one hot encode variables that you believe are ordinal?Scanties
No capacity to write a full answer at the moment, and this answer also provides a lot of context. Current pandas (Mar 2022 now) explicitly has an appropriate type for this: pandas.pydata.org/pandas-docs/stable/user_guide/…Pydna
A
1

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html and see this question How to reformat categorical Pandas variables for Sci-kit Learn

Angloirish answered 9/4, 2015 at 2:50 Comment(1)
It's better to flesh out your answer rather than post linksEuphonious
K
1

You should use the OneHotEncoder transformer with the categorical variables, and leave the ordinal variable untouched:

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> df = pd.DataFrame({'quality': [1, 2, 3], 'city': [3, 2, 1]}, columns=['quality', 'city'])
>>> # categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22
>>> enc = OneHotEncoder(categorical_features=[False, True])
>>> X = df.values
>>> enc.fit(X)
>>> enc.transform(X).todense()
matrix([[ 0.,  0.,  1.,  1.],
        [ 0.,  1.,  0.,  2.],
        [ 1.,  0.,  0.,  3.]])
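
On scikit-learn versions where the categorical_features argument is no longer available, roughly the same result can probably be obtained with ColumnTransformer. This is a sketch of that alternative, not the code from the answer above; the transformer name "city" is an arbitrary label.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"quality": [1, 2, 3], "city": [3, 2, 1]})

ct = ColumnTransformer(
    transformers=[("city", OneHotEncoder(), ["city"])],  # dummify 'city' only
    remainder="passthrough",   # keep 'quality' as it is
    sparse_threshold=0,        # always return a dense array
)
X = ct.fit_transform(df)

print(X)
```

The one-hot columns for city come first and the passed-through quality column comes last, which matches the layout of the matrix shown above.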
Keramics answered 9/4, 2015 at 7:25 Comment(3)
Hi, I have tried that in the past. So this method encodes all the categorical variables we specify as dummy variables, right? Is there a way to keep a sense of order? For the quality variable, 1, 2, 3 does make some sense with respect to the kind of qualityArgosy
OneHotEncoder lets you specify a subset of vars (array cols) to dummify in the categorical_features argument. This way you can leave the ordinal variable untouched and keep the sense of order.Keramics
but if you leave the ordinals untouched you did not accomplish anything either...Chabot
A
0

pd.Categorical() allows you to create categorical columns for your (pandas) DataFrames. For ordinal categorical data, pass the parameter ordered=True.

Minimal working example:

import pandas as pd

df = pd.DataFrame(
    {'one': ['d','d','a','c'],
     'two': ['b','d','b','a']
    })
df['one'] = pd.Categorical(df['one'],categories=list('abcd'), ordered=True)
df['two'] = pd.Categorical(df['two'],categories=list('abcd'), ordered=True)

In [1]: df
Out[1]: 
  one two
0   d   b
1   d   d
2   a   b
3   c   a

In [2]: df.dtypes
Out[2]: 
one    category
two    category
dtype: object

This allows you to represent the ordinal data similar to R.
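
To connect this back to the question, here is a small sketch of what an ordered categorical makes possible. The labels and their order are assumptions based on the quality levels in the question. The integer codes follow the declared category order, so they can be handed to an estimator that expects numbers, and comparisons respect that order too.

```python
import pandas as pd

# Hypothetical quality labels, listed from lowest to highest.
levels = ["Third", "Second", "Upper"]
quality = pd.Series(
    pd.Categorical(["Third", "Upper", "Second", "Upper"],
                   categories=levels, ordered=True))

codes = quality.cat.codes        # Third=0, Second=1, Upper=2
above_third = quality > "Third"  # element-wise comparison using the order
best = quality.max()             # highest level present

print(codes.tolist(), best)
```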

Attempt answered 21/6 at 15:57 Comment(0)
