I found a problem. For future readers: why can't you get get_preds to work on a new df?
(tested on Kaggle's House Prices: Advanced Regression Techniques)
The root of the problem was NaNs in categorical features. If you train your model with one set of categorical values, say color = red, green, blue, and your new df has colors red, green, blue, black, it will throw an error because it won't know what to do with the new class (black). On top of that, you need the same columns everywhere, which can be tricky: the FillMissing proc (which I used) creates new indicator columns (was missing or not). So you need to triple-check the NaNs in your categorical columns.
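The mismatch can be reproduced with plain pandas (a minimal sketch with made-up values, not fastai internals): once the category set is fixed from the training data, an unseen class simply has no code.

```python
import pandas as pd

# Hypothetical example: the category set is learned from train only.
train_colors = pd.Series(['red', 'green', 'blue'], dtype='category')
train_categories = train_colors.cat.categories  # sorted: ['blue', 'green', 'red']

# Encode a test column using the train-time categories; 'black' is unseen.
test_colors = pd.Categorical(['red', 'black'], categories=train_categories)

print(test_colors.codes.tolist())  # [2, -1] -> the unseen class gets code -1 (NaN)
```

That `-1` is exactly the kind of value the downstream embedding lookup can't handle.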
I really wanted to make it work start to finish with fastai:
Columns for train/test are identical; train just has one extra column, the target. At this point some cat columns have different classes in train and test. I decided to combine the two frames (just to make it work), but doesn't that introduce leakage?
combined = pd.concat([train, test]) # test will have nans at target, but we don't care
cont_cols, cat_cols = cont_cat_split(combined, max_card=50)
combined = combined[cat_cols]
Some tweaking while we're at it:
train[cont_cols] = train[cont_cols].astype('float')  # if the target is not float, there will be an error later
test[cont_cols[:-1]] = test[cont_cols[:-1]].astype('float')  # slice the target off (mine was the last entry of cont_cols)
Made it to TabularPandas:
procs = [Categorify, FillMissing]
to = TabularPandas(combined,
                   procs=procs,
                   cat_names=cat_cols)
train_to_cat = to.items.iloc[:train.shape[0], :] # transformed cat for train
test_to_cat = to.items.iloc[train.shape[0]:, :] # transformed cat for test. Need to separate them
to.items gives us the transformed cat columns. After that, we need to assemble everything back together:
train_imp = pd.concat([train_to_cat, train[cont_cols]], axis=1)  # assemble new cat and old cont together
test_imp = pd.concat([test_to_cat, test[cont_cols[:-1]]], axis=1)  # exclude SalePrice
train_imp['SalePrice'] = np.log(train_imp['SalePrice'])  # Kaggle's metric is on the log of the price
After that, we proceed as in the fastai tutorial.
dep_var = 'SalePrice'
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter(valid_pct=0.2)(range_of(train_imp))
to = TabularPandas(train_imp,
                   procs=procs,
                   cat_names=cat_cols,
                   cont_names=cont_cols[:-1],  # exclude the target
                   y_names='SalePrice',
                   splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, n_out=1, loss_func=F.mse_loss)
learn.lr_find()
learn.fit_one_cycle(20, slice(1e-2, 1e-1), cbs=[ShowGraphCallback()])
At this point we have a learner but still can't predict. I thought that after:
dl = learn.dls.test_dl(test_imp, bs=64)
preds, _ = learn.get_preds(dl=dl) # get prediction
it would just work (preprocess the cont values and predict), but no: it will not fill the NaNs.
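My understanding of why (a plain-pandas sketch of the idea, not fastai internals): fill values are learned from the training data, so a column that had no NaNs at training time has nothing learned to fill a test-time NaN with. Conceptually:

```python
import pandas as pd

# Hypothetical sketch of train-time statistics applied at test time.
train_col = pd.Series([1.0, 2.0, 3.0])   # no NaNs at fit time -> no fill value would be learned
test_col = pd.Series([2.0, None, 4.0])   # a NaN appears only at inference time

# If a fill value HAD been learned from train (e.g. the median), it would apply like this:
fill_value = train_col.median()
print(test_col.fillna(fill_value).tolist())  # [2.0, 2.0, 4.0]
```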
So just find and fill the NaNs in test:
missing = test_imp.columns[test_imp.isnull().any()].tolist()  # every column that still has NaNs
for c in missing:
    test_imp[c] = test_imp[c].fillna(test_imp[c].median())
After that we can finally predict:
dl = learn.dls.test_dl(test_imp, bs=64)
preds, _ = learn.get_preds(dl=dl) # get prediction
final_preds = np.exp(preds.flatten()).tolist()  # undo the log transform
sub = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
sub.SalePrice = final_preds
filename = 'submission.csv'
sub.to_csv(filename, index=False)
Apologies for the long narrative, but I'm relatively new to coding and this problem was hard to pin down; there is very little info online on how to solve it. In short, it was a pain.
Unfortunately, this is still a workaround. If the number of classes in any feature differs between train and test, it will freak out. It's also strange that it didn't fill the NaNs while fitting test into the dls.
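One alternative I've been considering instead of combining the frames (a hedged sketch, column names made up): coerce each test cat column to the train-time category set before building the test_dl, so unseen classes become NaN up front, where they can be handled explicitly, instead of erroring downstream.

```python
import pandas as pd

# Hypothetical column: align the test categories to the train-time set.
train_col = pd.Series(['red', 'green', 'blue'], dtype='category')
test_col = pd.Series(['red', 'black'])

aligned = test_col.astype(pd.CategoricalDtype(categories=train_col.cat.categories))
print(aligned.isna().tolist())  # [False, True] -> the unseen class 'black' becomes NaN
```

The NaNs produced this way could then be filled (e.g. with a mode or a dedicated "unknown" class) before prediction.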
Should you have any suggestions you are willing to share, please let me know.