MLflow: how to save a sklearn pipeline with a custom transformer?
I am trying to use mlflow to save a sklearn machine-learning model (a pipeline containing a custom transformer I have defined) and load it in another project. My custom transformer inherits from BaseEstimator and TransformerMixin.

Let's say I have 2 projects:

  • train_project: it has the custom transformers in src.ml.transformers.py
  • use_project: it has other things in src, or has no src catalog at all

So in my train_project I do:

mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe')

and then when I try to load it into use_project:

preprocess_pipe = mlflow.sklearn.load_model(f'{ref_model_path}/preprocess_pipe')

An error occurs:

[...]
File "/home/quentin/anaconda3/envs/api_env/lib/python3.7/site-packages/mlflow/sklearn.py", line 210, in _load_model_from_local_file
    return pickle.load(f)
ModuleNotFoundError: No module named 'train_project'

I tried the mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE serialization format:

mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe', serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE)

but I get the same error during load.

I saw the code_path option in mlflow.pyfunc.log_model, but its use and purpose are not clear to me.

I thought mlflow provided an easy way to save models and serialize them so they can be used anywhere. Is that true only for native sklearn (or keras, ...) models?

It seems this issue is more related to how pickle works (mlflow uses it, and pickle needs all dependencies importable at load time).
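The failure mode can be reproduced with the standard library alone: pickle stores a class only as a module name plus class name, so unpickling fails as soon as the defining module is missing. A minimal sketch (the module name train_project_demo is made up for illustration):

```python
import pickle
import sys
import types

# Create a throwaway module holding a class, mimicking the
# train_project package (hypothetical name for illustration).
mod = types.ModuleType("train_project_demo")
exec("class MyTransformer:\n    pass\n", mod.__dict__)
sys.modules["train_project_demo"] = mod

# pickle stores only a reference: module name + class name.
blob = pickle.dumps(mod.MyTransformer())

# Drop the module, as if unpickling in a project that lacks it.
del sys.modules["train_project_demo"]
try:
    pickle.loads(blob)
except ModuleNotFoundError as exc:
    print(exc)  # No module named 'train_project_demo'
```

This is exactly the situation in use_project: the pickle refers to train_project, which is not importable there.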

The only solution I have found so far is to turn my transformer into a package and import it in both projects: save the version of my transformer library with the conda_env argument of log_model, and check that it is the same version when I load the model into my use_project. But that is painful if I have to change or debug my transformer...

Does anybody have a better, more elegant solution? Maybe there is some mlflow functionality I have missed?

Other information:

  • working on Linux (Ubuntu)
  • mlflow=1.5.0
  • python=3.7.3

I saw that the tests of the mlflow.sklearn API include one with a custom transformer, but they load it in the same file, so it does not seem to solve my issue; maybe it can help other people:

https://github.com/mlflow/mlflow/blob/master/tests/sklearn/test_sklearn_model_export.py

Chlorothiazide answered 4/3, 2020 at 16:8 Comment(1)
Did you manage to solve this issue? We have a similar process. Did you try mlflow.pyfunc? – Interval
What you are trying to do is serialize something "customized" that you've trained in a module outside of train.py, correct?

What you probably need to do is log your model with mlflow.pyfunc.log_model and its code_path argument (named code_paths in recent versions), which takes a list of strings containing the paths to the modules you need to deserialize the model and make predictions, as documented here.

What needs to be clear is that every mlflow model is a PyFunc by nature. Even when you log a model with mlflow.sklearn, you can load it with mlflow.pyfunc.load_model. A PyFunc standardizes all models and frameworks in a uniform way, guaranteeing that you always declare how to:

  1. de-serialize your model, with the load_context() method
  2. make your predictions, with the predict() method

If you make sure about both things in an object that inherits mlflow's PythonModel class, you can then log your model as a PyFunc.

What mlflow.sklearn.log_model does is basically wrap up the way you declare serialization and de-serialization. If you stick with sklearn's basic modules, such as basic transformers and pipelines, you will always be fine with it. But when you need something custom, you refer to PyFuncs instead.

You can find a very useful example here. Notice how it states exactly how to make the predictions, transforming the input into an XGBoost DMatrix.

Marsipobranch answered 23/11, 2021 at 19:4 Comment(2)
There is no code argument in mlflow.pyfunc.log_model – Gobble
mendonca there is a code_paths argument in mlflow.pyfunc.log_model and mlflow.sklearn.log_model. With pyfunc, we get this error: python_model must be a PythonModel instance or a callable object – Gobble

You can use the code_path parameter to save Python file dependencies (or directories containing file dependencies). These files are prepended to the system path when the model is loaded, and the model folder will contain a code directory that includes all of them.

Duodenary answered 30/7, 2020 at 13:2 Comment(0)

I was confronted with a very similar issue. The problem seems to lie with cloudpickle.

I imagine that your pipeline is defined as a class, e.g. in src.ml.transformers.py:

class PreprocessPipeline:
   ...

As explained in this question, cloudpickle by default serializes by value only objects it cannot import by reference; a class defined at the top level of a module is stored, just as with plain pickle, as a module name plus class name to be re-imported at load time. This means your class definition itself is never serialized, and loading fails wherever that module is missing.

The solution is to define the class inside a function (and return it), so cloudpickle is forced to serialize it by value, in src.ml.transformers.py:

def __PreprocessPipeline():
    class PreprocessPipeline:
        ...
    return PreprocessPipeline

PreprocessPipeline = __PreprocessPipeline()

With this approach you don't even need the code_path argument.
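A minimal runnable sketch of the trick (class body is a placeholder), showing that the class definition travels inside the byte stream and can be restored without the defining module:

```python
import cloudpickle


def __PreprocessPipeline():
    # Because the class is defined inside a function, cloudpickle
    # cannot refer to it by module path and serializes it by value.
    class PreprocessPipeline:
        def transform(self, X):
            return X  # placeholder for the real preprocessing

    return PreprocessPipeline


PreprocessPipeline = __PreprocessPipeline()

# The full class definition is embedded in the serialized bytes.
blob = cloudpickle.dumps(PreprocessPipeline())
restored = cloudpickle.loads(blob)
```

Here restored is a working instance even though no importable module defines PreprocessPipeline, which is what makes the loaded model usable in use_project.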

Photoluminescence answered 16/5, 2023 at 8:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.