pandas equivalent of Stata's encode

I'm looking for a way to replicate the encode behaviour in Stata, which will convert a categorical string column into a number column.

x = pd.DataFrame({'cat':['A','A','B'], 'val':[10,20,30]})
x = x.set_index('cat')

Which results in:

     val
cat     
A     10
A     20
B     30

I'd like to convert the cat column from strings to integers, mapping each unique string to an (arbitrary) integer 1-to-1. For example, ['A', 'A', 'B'] would become [1, 1, 2]:

     val
cat     
1     10
1     20
2     30

Any other consistent numbering would be just as good.

Any suggestions?

Many thanks as always, Rob

Cornice answered 16/12, 2013 at 20:3 Comment(7)

maybe: DataFrame([(i[1], i[0]) for i in enumerate(set(x.index))]) and then merge? – Vimineous 16/12, 2013 at 20:17

Important detail: this is not what Stata's encode does. It produces one-to-one mappings. – Swaggering 17/12, 2013 at 0:46

@NickCox I don't understand how this isn't a one-to-one mapping. Each instance of 'A' becomes 1, each instance of 'B' becomes 2 etc. – Cornice 17/12, 2013 at 14:55

That's not what I see in your example. I see A, A, B mapping to 10, 20, 30. Why does the first A get 10 and the second get 20? If that's what you want, I don't understand but that's up to you; my point remains that it's not what encode does in Stata. – Swaggering 17/12, 2013 at 15:8

@NickCox it's the cat column that's getting the mapping, not the val column. The val column remains unchanged and is of no relevance to the example. The important thing is that cat goes from ['A','A','B'] to [1,1,2] as per my example. – Cornice 17/12, 2013 at 15:30

Glad to hear it, but I don't see that being clear anywhere in your post. – Swaggering 17/12, 2013 at 15:35

Made the description of what I'm trying to do more explicit, in response to @NickCox's comments. – Cornice 17/12, 2013 at 16:6

Stata's encode command starts with a string variable and creates a new integer variable with labels mapped to the original string variable. The direct analog of this in pandas would now be the categorical variable type which became a full-fledged part of pandas starting in 0.15 (which was released after this question was originally asked and answered).

See documentation here.

To demonstrate for this example, the Stata command would be something like:

encode cat, generate(cat2)

whereas the pandas command would be:

x['cat2'] = x['cat'].astype('category')

  cat  val cat2
0   A   10    A
1   A   20    A
2   B   30    B

Just as Stata does with encode, the data are stored as integers, but display as strings in the default output.

You can verify this by using the categorical accessor cat to see the underlying integer. (And for that reason you probably don't want to use 'cat' as a column name.)

x['cat2'].cat.codes

0    0
1    0
2    1
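
A minimal sketch of how these codes could be lined up with 1-based numbering, assuming the frame from the question without the set_index call. When the categories are inferred from strings they are sorted, so shifting cat.codes by one gives integers in alphabetical order of the labels. The names cat_code and labels are illustrative only and do not come from the answer above.

```python
import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x['cat2'] = x['cat'].astype('category')

# cat.codes is 0-based; shift by one for 1-based integers.
x['cat_code'] = x['cat2'].cat.codes + 1

# Integer -> label lookup, similar in spirit to a Stata value label.
labels = dict(enumerate(x['cat2'].cat.categories, start=1))

print(x['cat_code'].tolist())
print(labels)
```

Missing values have code -1, so they would become 0 after the shift and may need separate handling.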

Melvin answered 23/9, 2015 at 22:1 Comment(2)

I've been trying to do this for hours! Was searching convert object to integer, or convert categorical to numeric and going crazy. I'm on pandas 0.16.2 (current version with anaconda). – Jennettejenni 15/12, 2015 at 4:46

+1000 df['a'].cat.codes is a lifesaver! Have been scouring the web to find as an alternative to using sklearn's DictVectorizer or LabelEncoder. This combined with OneHotEncoder works beautifully with sklearn-pandas – Keefer 16/12, 2015 at 4:58

You could use pd.factorize:

import pandas as pd

x = pd.DataFrame({'cat':('A','A','B'), 'val':(10,20,30)})
labels, levels = pd.factorize(x['cat'])
x['cat'] = labels
x = x.set_index('cat')
print(x)

yields

     val
cat     
0     10
0     20
1     30

You could add 1 to labels if you wish to replicate Stata's behaviour:

x['cat'] = labels+1
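
One caveat worth sketching: by default pd.factorize numbers values in order of first appearance, which need not be alphabetical. Assuming you want the sorted order that the comments further down describe as the default for Stata's encode, passing sort=True is one way to get it. The sample data below is made up for illustration.

```python
import pandas as pd

s = pd.Series(['B', 'A', 'B', 'C'])

# Default: codes follow order of first appearance.
codes, uniques = pd.factorize(s)
print(codes.tolist(), list(uniques))

# sort=True: uniques are sorted, and codes refer to the sorted order.
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)
print((codes_sorted + 1).tolist(), list(uniques_sorted))
```

For the three-row frame in the question both calls give the same result, because 'A' already appears before 'B'.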

Rubescent answered 16/12, 2013 at 20:10 Comment(7)

Another way to get at [0,0,1] is to look in pd.Categorical(seq).labels. – Kleiman 16/12, 2013 at 20:14

Thanks, @DSM. Looking at the source code, I see Categorical calls factorize. – Rubescent 16/12, 2013 at 20:18

Thanks @unutbu. FYI: this is a brilliant way to make beautiful categorised scatter plots, using a text column as the category. – Cornice 16/12, 2013 at 20:26

@Rubescent this should go in the docs, can you do a PR for somewhere around here: pandas.pydata.org/pandas-docs/dev/… – Octroi 16/12, 2013 at 20:42

use the main repo; stable docs will be updated when 0.13 is released – Octroi 17/12, 2013 at 14:58

@Jeff: I've grepped my clone of the repo but could not find strings such as dummy variables or bbacab which are used on pandas.pydata.org/pandas-docs/dev/…. What file in the repo should be edited to affect the docs? – Rubescent 17/12, 2013 at 15:4

pandas/doc/source/reshape.rst – Octroi 17/12, 2013 at 15:14

Assuming you have the fixed set of single capitalized English letters as your categorical variable, you can also do this:

x['cat'] = x['cat'].map(lambda s: ord(s) - 64)

I believe it is a bit of a hack. But then again, in Python, the best thing would be to define a mapping from characters to integers that you desire, such as

my_map = {"A":1, ...} 
# e.g.: {x:ord(x)-64  for x in string.ascii_uppercase}
# if that's the convention you happen to desire.

and then do

x['cat'] = x['cat'].map(lambda s: my_map[s])

or something similar.

This is superior to relying on the conventions of built-in functions for your integer mapping, for numerous reasons. Conversions like this may "feel like" a nuisance to the programmer-analyst, but (IMO) they represent important metadata about the software you are writing, and they expose the real weakness of global convenience functions in higher level languages like MATLAB, STATA, etc. Even if a built-in function happens to adhere to the particular convention you want (the arbitrary convention that "A" is mapped to 1, "B" to 2, etc.), that doesn't make it a good idea to use it.
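
A self-contained sketch of the explicit-mapping approach, assuming the frame from the question without the set_index call and the 'A' -> 1 convention mentioned above. The column name cat_code is illustrative. Unlike the lambda version, passing the dict straight to Series.map does not raise on unknown keys; it yields NaN for them instead.

```python
import string

import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})

# Explicit convention: 'A' -> 1, 'B' -> 2, ..., 'Z' -> 26.
my_map = {letter: i for i, letter in enumerate(string.ascii_uppercase, start=1)}

# Series.map accepts a dict directly; keys missing from the dict become NaN.
x['cat_code'] = x['cat'].map(my_map)

print(x['cat_code'].tolist())
```

Keeping the dictionary around also documents the convention in one place, which is the point the answer is making.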

Striation answered 16/12, 2013 at 20:15 Comment(7)

I leave comments on MATLAB to experienced users. The comments on Stata's encode command are puzzling. It defaults to mapping distinct string values in alphabetical order to integers 1 up, so "A", "B", "C" would be mapped to 1, 2, 3. But that default can be overridden through some specified string to integer translation scheme. If you don't want that, don't use it; there's no discernible issue of language design or philosophy implied. – Swaggering 17/12, 2013 at 0:51

int64('A') == 65 in MATLAB. int('A') raises a ValueError in Python, which makes more sense IMHO. Of course, if you only write code in MATLAB that doesn't ever talk to the outside world, then it's a moot point. – Nedi 17/12, 2013 at 2:25

@Phillip Cloud I suppose it's a matter of taste as to whether someone expects int to behave that way. Since int(x) in Python is just syntactical sugar for x.__int__(), I don't see it the same way you do. I don't expect single-length str variables to have a different __int__ than multi-character str variables, which provides the distinction for wanting a function like ord, but it's just my opinion. – Striation 17/12, 2013 at 14:30

@EMS Your experience with Stata doesn't extend to being able to spell its name correctly or to know the difference between a Stata command and a Stata function. If length of experience is an argument, feel the weight of my 22 years with Stata. More seriously, and more importantly, your comments about encode remain puzzling, as you have changed your argument (really an assertion) to arguing that a language feature is indicted if used in ways you can consider dubious. That's more a reflection of your personal taste than anything else. – Swaggering 17/12, 2013 at 14:40

I can only echo that as you have descended into criticising me, not my argument. – Swaggering 17/12, 2013 at 15:9

@EMS I think you're mistaken. I agree with you w.r.t. the behavior of __int__. I supplied an example for @NickCox. Guess I should have mentioned that. – Nedi 17/12, 2013 at 16:24

Oh I see, my bad. I misread your comment as saying that the MATLAB behavior was more desirable. – Striation 17/12, 2013 at 16:26
