I am trying to use Patsy (with sklearn and pandas) to build a simple regression model. The R-style formula syntax is a major draw.
My data contains a field called 'ship_city', which can hold any city in India. Because I partition the data into train and test sets, several cities appear in only one of the two sets. A code snippet is given below:
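from patsy import dmatrices, build_design_matrices
# Build the design matrices from the training data and keep their design info
df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
# Reuse the training design info to encode the test data the same way
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')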
The last line (the build_design_matrices call) throws the following error:
patsy.PatsyError: Error converting data to categorical: observation with value 'Kolkata' does not match any of the expected levels
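For anyone who wants to reproduce this, here is a minimal sketch on made-up data. The column 'price', the toy values, and the formula are assumptions for illustration only; my real formula is longer. It also assumes a Patsy version where the design_info objects can be passed to build_design_matrices directly (older versions may need the .builder attribute, as in my snippet above).

```python
import pandas as pd
from patsy import PatsyError, build_design_matrices, dmatrices

# Toy data (made up): 'Kolkata' appears only in the test split.
df_train = pd.DataFrame({
    "price": [10.0, 12.0, 9.0, 11.0],
    "ship_city": ["Mumbai", "Delhi", "Mumbai", "Delhi"],
})
df_test = pd.DataFrame({
    "price": [13.0, 8.0],
    "ship_city": ["Delhi", "Kolkata"],
})

# Hypothetical formula, standing in for the real one.
formula = "price ~ ship_city"
train_y, train_X = dmatrices(formula, data=df_train, return_type="dataframe")

# Encoding the test split with the training design info fails,
# because 'Kolkata' was never seen as a level of ship_city.
try:
    build_design_matrices(
        [train_y.design_info, train_X.design_info],
        df_test,
        return_type="dataframe",
    )
    error_message = None
except PatsyError as exc:
    error_message = str(exc)

print(error_message)
```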
I believe this is a very common situation: the training data will rarely contain every level of every categorical field. Sklearn's DictVectorizer handles it without raising an error.
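To show the behaviour I mean, here is a small sketch with DictVectorizer on made-up rows (the city values are illustrative only). As far as I can tell, a category that was not seen during fitting is silently ignored at transform time, so the row is simply encoded as all zeros.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy rows (made up): 'Kolkata' is absent from the training rows.
train_rows = [{"ship_city": "Mumbai"}, {"ship_city": "Delhi"}]
test_rows = [{"ship_city": "Delhi"}, {"ship_city": "Kolkata"}]

vectorizer = DictVectorizer(sparse=False)
train_X = vectorizer.fit_transform(train_rows)

# No error here: the unseen 'ship_city=Kolkata' feature is dropped.
test_X = vectorizer.transform(test_rows)

# Column names learned from the training rows, in column order.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(feature_names)
print(test_X)
```

This is the behaviour I would like to get from Patsy when encoding the test set.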
Is there any way to make this work with Patsy, so that levels missing from the training data do not raise an error when building the test matrices?