I am trying to pickle a sklearn machine-learning model, and load it in another project. The model is wrapped in pipeline that does feature encoding, scaling etc. The problem starts when i want to use self-written transformers in the pipeline for more advanced tasks.
Let's say I have 2 projects:
- train_project: it has the custom transformers in src.feature_extraction.transformers.py
- use_project: it has other things in src, or has no src catalog at all
If in "train_project" I save the pipeline with joblib.dump(), and then in "use_project" i load it with joblib.load() it will not find something such as "src.feature_extraction.transformers" and throw exception:
ModuleNotFoundError: No module named 'src.feature_extraction'
I should also add that my intention from the beginning was to simplify usage of the model, so programist can load the model as any other model, pass very simple, human readable features, and all "magic" preprocessing of features for actual model (e.g. gradient boosting) is happening inside.
I thought of creating /dependencies/xxx_model/ catalog in root of both projects, and store all needed classes and functions in there (copy code from "train_project" to "use_project"), so structure of projects is equal and transformers can be loaded. I find this solution extremely inelegant, because it would force the structure of any project where the model would be used.
I thought of just recreating the pipeline and all transformers inside "use_project" and somehow loading fitted values of transformers from "train_project".
The best possible solution would be if dumped file contained all needed info and needed no dependencies, and I am honestly shocked that sklearn.Pipelines seem to not have that possibility - what's the point of fitting a pipeline if i can not load fitted object later? Yes it would work if i used only sklearn classes, and not create custom ones, but non-custom ones do not have all needed functionality.
Example code:
train_project
src.feature_extraction.transformers.py
from sklearn.pipeline import TransformerMixin
class FilterOutBigValuesTransformer(TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
self.biggest_value = X.c1.max()
return self
def transform(self, X):
return X.loc[X.c1 <= self.biggest_value]
train_project
main.py
from sklearn.externals import joblib
from sklearn.preprocessing import MinMaxScaler
from src.feature_extraction.transformers import FilterOutBigValuesTransformer
pipeline = Pipeline([
('filter', FilterOutBigValuesTransformer()),
('encode', MinMaxScaler()),
])
X=load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'path.x')
test_project
main.py
from sklearn.externals import joblib
pipeline = joblib.load('path.x')
The expected result is pipeline loaded correctly with transform method possible to use.
Actual result is exception when loading the file.