Patsy: New levels in categorical fields in test data
I am trying to use Patsy (with sklearn, pandas) for creating a simple regression model. The R style formula creation is a major draw.

My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear only in one of the sets. A code snippet is given below:

from patsy import dmatrices, build_design_matrices

df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')

The last line throws the following error:

patsy.PatsyError: Error converting data to categorical: observation with value 'Kolkata' does not match any of the expected levels

I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.
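For comparison, here is a minimal sketch (with made-up city data) of the DictVectorizer behaviour described above: features unseen at fit time are silently dropped at transform time, so an unknown city simply becomes an all-zero row instead of raising an error.

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
# Fit on training cities only (hypothetical data)
X_train = vec.fit_transform([{"ship_city": "Mumbai"}, {"ship_city": "Delhi"}])
# 'Kolkata' was never seen during fit, so its row is all zeros
X_test = vec.transform([{"ship_city": "Kolkata"}])
print(X_test)  # [[0. 0.]]
```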

Is there any way I can make this work with Patsy?

Canary answered 2/12, 2015 at 6:2 Comment(0)
The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.

One way is by using the levels= argument to C(...), like:

# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))

dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)

Another option, if you're using pandas's categorical support, is to record the set of possible values when you set up your data frame. If patsy detects that the object you've passed it is a pandas categorical, it automatically uses the pandas categories attribute instead of trying to guess the possible categories by looking at the data.
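As a sketch of that approach (the city list and data here are invented), declaring the full set of categories on a pandas Categorical lets patsy emit contrast columns for levels that never occur in the training rows:

```python
import pandas as pd
from patsy import dmatrices

all_cities = ["Delhi", "Kolkata", "Mumbai"]  # hypothetical complete list
df_train = pd.DataFrame({
    "y": [1.0, 2.0],
    "ship_city": pd.Categorical(["Mumbai", "Delhi"], categories=all_cities),
})
y, X = dmatrices("y ~ ship_city", data=df_train, return_type="dataframe")
# X includes a column for 'Kolkata' even though no training row contains it
print(list(X.columns))
```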

Tsarevna answered 14/7, 2017 at 4:10 Comment(4)
I appreciate the answer, but it runs into the same problem faced by timctran's answer. Before putting the model into production, we would not know all possible levels we are going to encounter. The number of cities and towns in India is simply too big to include in a model. First of all, there is no exhaustive list of these places. And believe it or not, new ones keep cropping up. And cities is just one such example anyway.Canary
Well, what do you even want to happen in this case? Have patsy output a matrix that has grown new columns compared to the original matrix? What will you do with them?Tsarevna
As I mentioned in the question, I feel sklearn's DictVectorizer handles this nicely. It does not throw an error. It just tells the model "here's a record which does not match any city value I have seen before" by setting all existing columns to 0. I just thought Patsy, being a mature library, would also handle this very common use case. But apparently it doesn't.Canary
Ah, that's an interesting strategy. Believe it or not, you are the first person to ever request this – at least anywhere I've seen it :-). (And even here it took stumbling across a 1.5-year-old question...) However, returning all-zeros will give wildly incorrect results for any kind of linear-ish model, which is what patsy is designed for, so I'm not sure this is a good idea. Let's continue the discussion here: github.com/pydata/patsy/issues/110Tsarevna
I ran into a similar problem and I built the design matrices prior to splitting the data.

from patsy import dmatrices
from sklearn.model_selection import train_test_split

df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
    train_test_split(df_X, df_Y, test_size=test_size)

Then as an example of applying a fit:

import statsmodels.api as sm

model = sm.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)

Technically I haven't built a test case, but I haven't run into the "Error converting data to categorical" error again since implementing the above.

Wail answered 25/6, 2017 at 5:10 Comment(3)
Yes, that is a possible workaround. But the problem with that approach is that there is now leakage from the test set into the training set. Ideally, at training time, we should not use any information from the test data that is not already part of the training data. This way our tests simulate what will actually happen in production. To take the example from my question above, if there are cities that I will only learn about in production, it will not be correct (actually not even possible) to include such cities as columns in the data used to train the model. I hope that is clear.Canary
I see your concern. That being said, it is my understanding that my answer would not leak information in the sense of giving the model any additional information at training time. It simply makes the model compatible with the other items in your data set. In R, this would be equivalent to adjusting the levels without changing the data. If unseen cities occur frequently for your model, then perhaps lumping cities together to form a more inclusive model would be the way to go; but that leans towards model design, which is separate from your question about compatibility of the test and training sets.Wail
As an example, suppose we had a data on three cities A, B, C. And in production there will be a fourth city D and fifth city E. So in this case perhaps the approach would be dummy encode on A, B, C, so that not being A, B, or C is any other city. However, we could not expect the model to work optimally as it would clump all unseen cities together. Instead, we could design variables such as city_population (either numerical or categorical: sm, med, lg), city_density, etc and map each city to its characteristics. The model can then be applied to cities not in the original data set.Wail
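A rough sketch of that idea (the lookup table, city names, and density figures are all invented): map each city to its characteristics, so the model only ever sees features that can also be computed for cities absent from the training data.

```python
import pandas as pd

# Hypothetical lookup of per-city characteristics; in production this table
# would be maintained and extended as new cities appear
city_info = {
    "Mumbai": {"city_size": "lg", "city_density": 21000},
    "Delhi":  {"city_size": "lg", "city_density": 11000},
    "Shimla": {"city_size": "sm", "city_density": 160},
}

df = pd.DataFrame({"ship_city": ["Mumbai", "Shimla"]})
df["city_size"] = df["ship_city"].map(lambda c: city_info[c]["city_size"])
df["city_density"] = df["ship_city"].map(lambda c: city_info[c]["city_density"])
```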
