Specifying which category to treat as the base with 'statsmodels'

I understand that when I have a categorical variable in a model passed to a statsmodels fit, dummy variables will automatically be generated for the categories. For example, if I have a variable 'Location' with values 'IndianOcean', 'Thailand', 'China' and 'Mars', I will get variables in my model of the form

Location[T.Thailand]

with one of the values not represented. By default the excluded value seems to be the least common one. Is there a way to specify, ideally within the model specification, which value is treated as the "base value" and excluded?

Peerage answered 16/3, 2014 at 0:28 Comment(2)
It seems that using C in the formula (as in ... + C(Location, Treatment) + ...) does the trick, but this results in some pretty ugly category names that I'd like to avoid.Peerage
I don't understand this. Do you write e.g. C(Location, 'IndianOcean') if you want 'IndianOcean' to be the reference category from the variable 'Location'?Guimpe

You can pass a reference arg to the Treatment contrast, using syntax like

"y ~ C(Location, Treatment(reference='China'))"

http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Treatment

If you have a better suggestion for naming conventions please file an issue with patsy.
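For anyone who wants to see it end to end, here is a minimal sketch. The toy DataFrame and its y values are made up for illustration, and 'Mars' is used as the reference only because 'China' sorts first among these labels and would therefore already be the default baseline.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "Location": ["IndianOcean", "Thailand", "China", "Mars"] * 2,
})

# Default treatment coding: the first level in sorted order ('China')
# is dropped and acts as the baseline.
default_fit = smf.ols("y ~ Location", data=df).fit()

# Explicit baseline: 'Mars' is dropped instead.
mars_fit = smf.ols(
    "y ~ C(Location, Treatment(reference='Mars'))", data=df
).fit()

default_names = list(default_fit.params.index)
mars_names = list(mars_fit.params.index)

print(default_names)
print(mars_names)
```

The second list of names should contain a [T.China] term and no [T.Mars] term, which is how you can check that the reference level was picked up.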

Kurtzig answered 16/3, 2014 at 16:53 Comment(6)
To be explicit, the syntax is "y ~ C(Location, Treatment(reference='China'))" .Faggoting
@PiotrMigdal thanks for clarifying. I wish the original answer actually included code.Abeu
"y ~ C(Location, Treatment('China'))" works as well.Kleist
@Kurtzig, I'm getting the following error with both of the above methods: PatsyError: Error evaluating factor: TypeError: 'Series' object is not callable. Do you have any idea why?Kamerman
I am having this problem as well. "TypeError: 'Series' object is not callable"Farce
"TypeError: 'Series' object is not callable" will occur if you get the syntax wrong and leave out the "Treatment" piece. If your variable is "Location", then to set the reference group to "China" you need "y ~ C(Location, Treatment('China'))". If you type "y ~ C(Location('China'))", you will get this error because you are trying to call the pd.Series directly.Pollard
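A small sketch of the failure described in the last comment, using patsy directly so no model fit is needed. The DataFrame is a made-up example, and 'Mars' is chosen as the reference only so that the result differs from the default.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({"Location": ["China", "Mars", "Thailand", "China"]})

# Wrong: Location('China') tries to call the Series itself.
try:
    dmatrix("C(Location('China'))", data=df)
    error_message = None
except Exception as exc:  # patsy wraps the underlying TypeError
    error_message = str(exc)

# Right: the reference level goes inside Treatment(...).
design = dmatrix("C(Location, Treatment('Mars'))", data=df)
column_names = design.design_info.column_names

print(error_message)
print(column_names)
```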

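If you wrap the formula string in single quotes, the argument to reference needs to be wrapped in double quotes (or the other way round). Otherwise the inner quote ends the Python string early and you get a syntax error. It is a very easy mistake to make; I was using single quotes for both.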

For example:

'y ~ C(Location, Treatment(reference="China"))'

is correct.

'y ~ C(Location, Treatment(reference='China'))'

is not correct.
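A quick sketch of the quoting options that are valid Python. The variable names are only for illustration, and escaping the inner quotes is an extra option not mentioned above.

```python
# Outer single quotes, inner double quotes.
double_inside = 'y ~ C(Location, Treatment(reference="China"))'

# Outer double quotes, inner single quotes.
single_inside = "y ~ C(Location, Treatment(reference='China'))"

# Same quote character on both levels, with the inner ones escaped.
escaped = 'y ~ C(Location, Treatment(reference=\'China\'))'

# The escaped form is exactly the same string as single_inside.
print(escaped == single_inside)
```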

Handley answered 30/11, 2020 at 6:20 Comment(0)

OK, maybe someone will find this helpful. I needed to set a new baseline category for the dependent variable and had no idea how to do it. I searched and found nothing, so I simply added a "_" prefix to the other categories. If you have 3 categories A, B, C, and you want your baseline to be C, you just change the labels A and B to _A and _B. It works. It appears that the baseline category is defined by sorted().

Maybe someone knows a proper way to do it; this is not very Pythonic, ha.
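To illustrate why the underscore trick moves the baseline, here is a tiny sketch. It assumes, as guessed above, that the first label in sorted() order becomes the baseline. The label lists are just examples.

```python
original = ["A", "B", "C"]
renamed = ["_A", "_B", "C"]

# Uppercase letters sort before "_" in ASCII, so "C" now comes first.
baseline_before = sorted(original)[0]
baseline_after = sorted(renamed)[0]

# Caveat: "_" sorts before lowercase letters, so the same trick
# does not move a lowercase label to the front.
lowercase = ["_a", "_b", "c"]
baseline_lowercase = sorted(lowercase)[0]

print(baseline_before, baseline_after, baseline_lowercase)
```

So the rename only works when the labels start with uppercase letters (or anything else that sorts before "_").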

Advice answered 28/4, 2021 at 19:33 Comment(0)
