Custom sklearn pipeline transformer giving "pickle.PicklingError"
I am trying to create a custom transformer for a Python sklearn pipeline based on guidance from this tutorial: http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/

Right now my custom class/transformer looks like this:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor


class SelectBestPercFeats(BaseEstimator, TransformerMixin):
    def __init__(self, model=RandomForestRegressor(), percent=0.8,
                 random_state=52):
        self.model = model
        self.percent = percent
        self.random_state = random_state


    def fit(self, X, y, **fit_params):
        """
        Find features with best predictive power for the model, and
        have cumulative importance value less than self.percent
        """
        # Check parameters
        if not isinstance(self.percent, float):
            print("SelectBestPercFeats.percent is not a float, it should be...")
        elif not isinstance(self.random_state, int):
            print("SelectBestPercFeats.random_state is not an int, it should be...")

        # If checks are good proceed with fitting...
        else:
            try:
                self.model.fit(X, y)
            except Exception:
                print("Error fitting model inside SelectBestPercFeats object")
                return self

            # Get feature importance
            try:
                feat_imp = list(self.model.feature_importances_)
                feat_imp_cum = pd.Series(feat_imp, index=X.columns) \
                    .sort_values(ascending=False).cumsum()

                # Get features whose cumulative importance is <= `percent`
                n_feats = len(feat_imp_cum[feat_imp_cum <= self.percent].index) + 1
                self.bestcolumns_ = list(feat_imp_cum.index)[:n_feats]
            except AttributeError:
                print("ERROR: SelectBestPercFeats can only be used with models"
                      " that expose a .feature_importances_ attribute")
        return self


    def transform(self, X, y=None, **fit_params):
        """
        Filter out only the important features (based on percent threshold)
        for the model supplied.

        :param X: Dataframe with features to be down selected
        """
        if not hasattr(self, "bestcolumns_"):
            print("Must call fit on SelectBestPercFeats object before transforming")
        else:
            return X[self.bestcolumns_]

I am integrating this Class into an sklearn pipeline like this:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define feature selection and model pipeline components
rf_simp = RandomForestRegressor(criterion='mse', n_jobs=-1,
                                n_estimators=600)
bestfeat = SelectBestPercFeats(rf_simp, feat_perc)
rf = RandomForestRegressor(n_jobs=-1,
                           criterion='mse',
                           n_estimators=200,
                           max_features=0.4,
                           )

# Build Pipeline
master_model = Pipeline([('feat_sel', bestfeat), ('rf', rf)])

# define GridSearchCV parameter space to search, 
#   only listing one parameter to simplify troubleshooting
param_grid = {
    'feat_sel__percent': [0.8],  # prefix must match the 'feat_sel' step name
}

# Set up the grid search over the pipeline
grid = GridSearchCV(master_model, cv=3, n_jobs=-1,
                    param_grid=param_grid)

# Search grid using CV, and get the best estimator
grid.fit(X_train, y_train)

Whenever I run the last line of code (grid.fit(X_train, y_train)) I get the following "PicklingError". Can anyone see what is causing this problem in my code?

EDIT:

Or is there something wrong in my Python setup? Might I be missing a package or something similar? I just checked that I can import pickle successfully.

Traceback (most recent call last):
  File "", line 5, in <module>
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\model_selection\_search.py", line 945, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\model_selection\_search.py", line 564, in _fit
    for parameters in parameter_iterable
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 768, in __call__
    self.retrieve()
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 719, in retrieve
    raise exception
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 682, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 608, in get
    raise self._value
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 385, in _handle_tasks
    put(task)
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\pool.py", line 371, in send
    CustomizablePickler(buffer, self._reducers).dump(obj)
_pickle.PicklingError: Can't pickle <class 'SelectBestPercFeats'>: attribute lookup SelectBestPercFeats on builtins failed

Stein answered 26/7, 2017 at 19:9 Comment(4)
Or is there something wrong in my Python setup? Might I be missing a package or something similar? I just checked that I can import pickle successfully. – Stein
I think I figured it out. The pickle package needs the definition of the custom class(es) to live in a separate module that gets imported. So I created another file called transformation.py and then imported it like this: from transformation import SelectBestPercFeats. That resolved the pickling error. – Stein
Also make sure that you can unpickle the saved estimators and that they work as expected. – Universalize
@VivekKumar, thanks for the heads-up. I checked and everything unpickles fine. However, in my experience that's not always the case, so I appreciate the reminder. – Stein
13

The pickle package needs the custom class(es) to be defined in a separate module and then imported. So, put the class in its own module file (e.g. transformation.py) and then import it like this: from transformation import SelectBestPercFeats. That will resolve the pickling error.
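For example, a minimal sketch using the names from the question (the file layout is an assumption, not a requirement):

# transformation.py -- importable module that holds the custom transformer
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor

class SelectBestPercFeats(BaseEstimator, TransformerMixin):
    ...  # class body exactly as in the question

And in the training script or notebook:

# train.py -- pickle can now resolve the class by module path + class name
from transformation import SelectBestPercFeats

bestfeat = SelectBestPercFeats(RandomForestRegressor(), 0.8)

This works because pickle serializes classes by reference (module path plus class name); a class defined in an interactive session or __main__ has no importable path for the worker processes to look up, which is exactly what "attribute lookup SelectBestPercFeats on builtins failed" is complaining about.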

Stein answered 2/8, 2017 at 5:58 Comment(6)
Thank you so much, I spent 4 hours trying to fix this because my custom class was in the same file. – Seditious
This does not work for me in Python 3 and I am about to go crazy. Please SOMEONE help me out haha! I have a package where I put my custom preprocessor into another file, created an __init__.py in the folder, and tried all possible combinations of importing the processor into the file where I load the dumped preprocessor. The package still does not recognize it. – Dutch
@eonurk, I'd recommend starting a new question that references this one and says this solution didn't fix your problem. – Stein
I solved it by defining the paths explicitly in the __init__ file. You can find the repo here: github.com/eonurk/sinkaf. Plus, I explained how to save the model in the sinkaf.ipynb file. Cheers. – Dutch
@Stein What about Jupyter? – Meggs
@Ofir, generally this solution works with Jupyter as well. You would just create the custom class file in a directory near the Jupyter notebook (simplest is the same directory) and then import that file into the notebook. – Stein
2

When you code your own transformer, and if that transformer contains code that can't be serialized, then the whole pipeline won't be serializable when you try to serialize it.

Not only that: such serialization is also required to parallelize your work, as with the n_jobs=-1 you've used, because the pipeline must be sent to other processes.

An awkward thing with scikit-learn is that every object should have its saver. Fortunately, there are solutions: make your object serializable (and hence remove the things you import from external libs), run with only 1 job (no parallelism), or give your object a custom saver that strips it down so it can be serialized. The last solution is explored here (the single-job fallback is shown just below for reference).
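The single-job fallback is just the question's own grid search with n_jobs=1, which runs everything sequentially in one process, so nothing has to be pickled and shipped to workers:

# Sequential grid search: no cross-process pickling required
grid = GridSearchCV(master_model, cv=3, n_jobs=1,
                    param_grid=param_grid)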

First, here is the definition of a problem, and its solution, taken from this source:

Problem: You can’t Parallelize nor Save Pipelines Using Steps that Can’t be Serialized “as-is” by Joblib

This problem will only surface past some point of using Scikit-Learn. This is the point of no-return: you’ve coded your entire production pipeline, but once you trained it and selected the best model, you realize that what you’ve just coded can’t be serialized.

This means once trained, your pipeline can’t be saved to disks because one of its steps imports things from a weird python library coded in another language and/or uses GPU resources. Your code smells weird and you start panicking over what was a full year of research development.

Hopefully, you’re nice enough to start coding your own open-source framework on the side because you’ll live this same situation in your next 100 coding projects, and you have other clients who will be in the same situation soon, and this sh** is critical.

Well, that’s out of shared need that Neuraxle was created.

Solution: Use a Chain of Savers in each Step

Each step is responsible for saving itself, and you should define one or many custom saver objects for your weird object. The saver should:

  1. Save what’s important in the step using a Saver (see: Saver)
  2. Delete that from the step (to make it serializable). The step is now stripped by the Saver.
  3. Then the default JoblibStepSaver will execute (in chain) past that point by saving all what’s left of the stripped object and deleting the object from your code’s RAM. This means you can have many partial savers before the final default JoblibStepSaver.

For instance, a Pipeline will do the following upon having the save() method called, as it has its own TruncableJoblibStepSaver:

  1. Save all its substeps in relative subfolders to the pipeline’s serialization’s subfolder
  2. Delete them from the pipeline object, except for their names to find them later when loading. The pipeline is now stripped.
  3. Let the default saver save the stripped pipeline.

You don’t want to do dirty code. Don’t break the Law of Demeter, they say. This is one of the most important (and easily overlooked) laws of programming, in my opinion. Google it, I dare you. Breaking this law is the root of most evil in your codebase.

I’ve come to the conclusion that the neatest way to not break this law here is by having a chain of Savers. It makes each object responsible for having special savers if it isn’t serializable with joblib. Neat. So just when things break, you have the option of creating your own serializer just for the object that breaks, this way you won’t need to break encapsulation at save-time to dig into your objects manually, which would break the Law of Demeter.

Note that the savers also need to be able to reload the object when loading the save, too. We already wrote a TensorFlow Neuraxle saver.

TL;DR: You can call the save() method on any pipeline in Neuraxle, and if some steps define a custom Saver, then the step will use that saver before using the default JoblibStepSaver.
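Outside Neuraxle, the same strip-before-saving idea can be sketched with plain pickle hooks. A minimal illustration (the GpuBackedStep class and its session attribute are hypothetical, not Neuraxle's actual API):

import pickle

class GpuBackedStep:
    """Hypothetical step wrapping a resource that pickle can't handle."""

    def __init__(self, config):
        self.config = config
        # Placeholder for an unpicklable handle (GPU context, socket, ...)
        self.session = self._open_session(config)

    def _open_session(self, config):
        return object()  # stand-in; a real external handle would live here

    def __getstate__(self):
        # Save what's important, strip what can't be serialized.
        state = self.__dict__.copy()
        del state["session"]
        return state

    def __setstate__(self, state):
        # On load, rebuild the stripped resource from the saved config.
        self.__dict__.update(state)
        self.session = self._open_session(state["config"])

restored = pickle.loads(pickle.dumps(GpuBackedStep({"device": 0})))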

Parallelization of your non-picklable pipeline

So you've done the things above using Neuraxle. Neat. Now use Neuraxle's classes for AutoML, random search, and the like. They have the proper abstractions for parallelization, using the savers to serialize things: the pipeline must be serialized to send your code to other Python processes for parallelization.

Outlaw answered 6/3, 2020 at 1:51 Comment(0)
0

I had the same problem, but in my case the issue was with function transformers: pickle sometimes has difficulty serializing the functions they wrap. The solution for me was to use dill instead, though it is a bit slower.
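For example, a minimal sketch assuming dill is installed (pip install dill) and the transformer wraps a lambda, which the stdlib pickle cannot serialize:

import dill
from sklearn.preprocessing import FunctionTransformer

# pickle.dumps would fail on the lambda; dill serializes it by value.
transformer = FunctionTransformer(lambda X: X + 1)

with open("transformer.dill", "wb") as f:
    dill.dump(transformer, f)

with open("transformer.dill", "rb") as f:
    restored = dill.load(f)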

Sanctimonious answered 6/3, 2019 at 15:29 Comment(0)
-1

In my case, I just had to restart the IPython session where I was testing the transformer. After restarting and re-running the code, it either works well or starts giving you a more meaningful error.

Susanasusanetta answered 25/3, 2022 at 23:2 Comment(0)
