pandas equivalent of Stata's encode

I'm looking for a way to replicate the encode behaviour in Stata, which will convert a categorical string column into a number column.

import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x = x.set_index('cat')

Which results in:

     val
cat     
A     10
A     20
B     30

I'd like to convert the cat column from strings to integers, mapping each unique string to an (arbitrary) integer 1-to-1. It would result in:

     val
cat     
1     10
1     20
2     30

Or, just as good:

  cat  val
0   1   10
1   1   20
2   2   30

Any suggestions?

Many thanks as always, Rob

Cornice answered 16/12, 2013 at 20:3 Comment(7)
maybe: DataFrame([(i[1], i[0]) for i in enumerate(set(x.index))]) and then merge? - Vimineous
Important detail: this is not what Stata's encode does. It produces one-to-one mappings. - Swaggering
@NickCox I don't understand how this isn't a one-to-one mapping. Each instance of 'A' becomes 1, each instance of 'B' becomes 2 etc. - Cornice
That's not what I see in your example. I see A, A, B mapping to 10, 20, 30. Why does the first A get 10 and the second get 20? If that's what you want, I don't understand but that's up to you; my point remains that it's not what encode does in Stata. - Swaggering
@NickCox it's the cat column that's getting the mapping, not the val column. The val column remains unchanged and is of no relevance to the example. The important thing is that cat goes from ['A','A','B'] to [1,1,2] as per my example. - Cornice
Glad to hear it, but I don't see that being clear anywhere in your post. - Swaggering
Made the description of what I'm trying to do more explicit, in response to @NickCox's comments. - Cornice
9

Stata's encode command starts with a string variable and creates a new integer variable with labels mapped to the original string variable. The direct analog of this in pandas would now be the categorical variable type which became a full-fledged part of pandas starting in 0.15 (which was released after this question was originally asked and answered).

See documentation here.

To demonstrate for this example, the Stata command would be something like:

encode cat, generate(cat2)

whereas the pandas command would be:

x['cat2'] = x['cat'].astype('category')

  cat  val cat2
0   A   10    A
1   A   20    A
2   B   30    B

Just as Stata does with encode, the data are stored as integers, but display as strings in the default output.

You can verify this by using the categorical accessor cat to see the underlying integer. (And for that reason you probably don't want to use 'cat' as a column name.)

x['cat2'].cat.codes

0    0
1    0
2    1
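
As a minimal sketch of how the codes and categories fit together (the names codes and value_labels are only illustrative, and the +1 shift is an assumption made to mimic Stata's 1-based numbering), something like this might be used:

```python
import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x['cat2'] = x['cat'].astype('category')

# The integer codes are 0-based; adding 1 gives codes that start at 1.
# Note: missing values get the code -1, so they would become 0 after the shift.
codes = x['cat2'].cat.codes + 1

# The categories play a role similar to Stata's value labels.
value_labels = dict(enumerate(x['cat2'].cat.categories, start=1))

print(codes.tolist())   # [1, 1, 2]
print(value_labels)     # {1: 'A', 2: 'B'}
```

The categorical column itself keeps both pieces of information, so the separate dictionary is only needed if the integer codes are stored on their own.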
Melvin answered 23/9, 2015 at 22:1 Comment(2)
I've been trying to do this for hours! Was searching convert object to integer, or convert categorical to numeric and going crazy. I'm on pandas 16.2 (current version with anaconda). - Jennettejenni
+1000 df['a'].cat.codes is a lifesaver! Have been scouring the web to find as an alternative to using sklearn's DictVectorizer or LabelEncoder. This combined with OneHotEncoder works beautifully with sklearn-pandas - Keefer
17

You could use pd.factorize:

import pandas as pd

x = pd.DataFrame({'cat':('A','A','B'), 'val':(10,20,30)})
labels, levels = pd.factorize(x['cat'])
x['cat'] = labels
x = x.set_index('cat')
print(x)

yields

     val
cat     
0     10
0     20
1     30

You could add 1 to labels if you wish to replicate Stata's behaviour:

x['cat'] = labels+1
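
By default pd.factorize numbers values in order of first appearance, whereas Stata's encode defaults to alphabetical order. A rough sketch of getting sorted, 1-based codes and mapping them back to the original strings (the sample data and the names codes, uniques and decoded are just for illustration):

```python
import pandas as pd

cats = pd.Series(['B', 'A', 'B', 'C'])

# sort=True numbers the unique values in sorted order
# instead of order of first appearance.
codes, uniques = pd.factorize(cats, sort=True)
codes = codes + 1  # shift to 1-based codes

# uniques holds the distinct values, so the codes can be decoded again.
decoded = uniques[codes - 1]

print(codes.tolist())   # [2, 1, 2, 3]
print(list(uniques))    # ['A', 'B', 'C']
print(list(decoded))    # ['B', 'A', 'B', 'C']
```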
Rubescent answered 16/12, 2013 at 20:10 Comment(7)
Another way to get at [0,0,1] is to look in pd.Categorical(seq).labels. - Kleiman
Thanks, @DSM. Looking at the source code, I see Categorical calls factorize. - Rubescent
Thanks @unutbu. FYI: this is a brilliant way to make beautiful categorised scatter plots, using a text column as the category. - Cornice
@Rubescent this should go in the docs, can you do a PR for somewhere around here: pandas.pydata.org/pandas-docs/dev/… - Octroi
use the main repo; stable docs will be updated when 0.13 is released - Octroi
@Jeff: I've grepped my clone of the repo but could not find strings such as dummy variables or bbacab which are used on pandas.pydata.org/pandas-docs/dev/…. What file in the repo should be edited to affect the docs? - Rubescent
pandas/doc/source/reshape.rst - Octroi
1

Assuming you have the fixed set of single capitalized English letters as your categorical variable, you can also do this:

x['cat'] = x['cat'].map(lambda c: ord(c) - 64)

I believe it is a bit of a hack. But then again, in Python, the best thing would be to define a mapping from characters to integers that you desire, such as

my_map = {"A":1, ...} 
# e.g.: {x:ord(x)-64  for x in string.ascii_uppercase}
# if that's the convention you happen to desire.

and then do

x['cat'] = x['cat'].map(lambda c: my_map[c])

or something similar.
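
A self-contained sketch of the explicit-mapping idea, assuming the hypothetical convention 'A' -> 1 through 'Z' -> 26 and writing the result to a new, illustratively named column cat_code:

```python
import string
import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})

# Hypothetical convention: 'A' -> 1, 'B' -> 2, ..., 'Z' -> 26
my_map = {letter: i for i, letter in enumerate(string.ascii_uppercase, start=1)}

# Series.map accepts a dict directly; values absent from the dict become NaN
# rather than raising a KeyError as the lambda version would.
x['cat_code'] = x['cat'].map(my_map)

print(x['cat_code'].tolist())   # [1, 1, 2]
```

Passing the dict itself is a design choice: it keeps the mapping visible as data, and unmapped values show up as NaN where they can be checked explicitly.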

This is preferable to relying on the conventions of built-in functions for your integer mapping, for several reasons. In my opinion, conversions like this "feel like" a nuisance to the programmer-analyst, but in reality they represent important metadata about the software you are writing, and they expose the real weakness of global convenience functions in higher-level languages like MATLAB, STATA, etc. Even if a built-in function happens to adhere to the particular convention you want (the arbitrary convention that "A" maps to 1, "B" maps to 2, etc.), that doesn't make it a good idea to use it.

Striation answered 16/12, 2013 at 20:15 Comment(7)
I leave comments on MATLAB to experienced users. The comments on Stata's encode command are puzzling. It defaults to mapping distinct string values in alphabetical order to integers 1 up, so "A", "B", "C" would be mapped to 1, 2, 3. But that default can be overridden through some specified string to integer translation scheme. If you don't want that, don't use it; there's no discernible issue of language design or philosophy implied. - Swaggering
int64('A') == 65 in MATLAB. int('A') raises a ValueError in Python, which makes more sense IMHO. Of course, if you only write code in MATLAB that doesn't ever talk to the outside world, then it's a moot point. - Nedi
@Phillip Cloud I suppose it's a matter of taste as to whether someone expects int to behave that way. Since int(x) in Python is just syntactical sugar for x.__int__(), I don't see it the same way you do. I don't expect single-length str variables to have a different __int__ than multi-character str variables, which provides the distinction for wanting a function like ord, but it's just my opinion. - Striation
@EMS Your experience with Stata doesn't extend to being able to spell its name correctly or to know the difference between a Stata command and a Stata function. If length of experience is an argument, feel the weight of my 22 years with Stata. More seriously, and more importantly, your comments about encode remain puzzling, as you have changed your argument (really an assertion) to arguing that a language feature is indicted if used in ways you can consider dubious. That's more a reflection of your personal taste than anything else. - Swaggering
I can only echo that as you have descended into criticising me, not my argument. - Swaggering
@EMS I think you're mistaken. I agree with you w.r.t. the behavior of __int__. I supplied an example for @NickCox. Guess I should have mentioned that. - Nedi
Oh I see, my bad. I misread your comment as saying that the MATLAB behavior was more desirable. - Striation
