fit_transform() takes 2 positional arguments but 3 were given with LabelBinarizer

I am totally new to machine learning and have been working with unsupervised learning techniques.

The image shows my sample data (after all the cleaning). Screenshot: Sample Data

I have these two pipelines built to clean the data:

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

print(type(num_attribs))

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer())
])

Then I did the union of these two pipelines; the code for that is shown below:

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

Now I am trying to do fit_transform on the data, but it's showing me this error.

Code for the transformation:

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

Error message:

fit_transform() takes 2 positional arguments but 3 were given

Wallenstein answered 11/9, 2017 at 19:12 Comment(2)
LabelBinarizer is not supposed to be used with X (features), but is intended for labels only. Hence the fit and fit_transform methods were changed to accept only the single object y. But the Pipeline (which works on features) will try sending both X and y to it. Hence the error.Nucleotidase
You should use LabelBinarizer outside of the pipeline to convert the categorical features to one-hot encoding, or maybe use pandas.get_dummies().Nucleotidase

The Problem:

The pipeline is assuming LabelBinarizer's fit_transform method is defined to take three positional arguments:

def fit_transform(self, x, y):
    ...rest of the code

while it is defined to take only two:

def fit_transform(self, x):
    ...rest of the code
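
You can reproduce the mismatch outside the pipeline; a minimal sketch (on scikit-learn 0.19, where LabelBinarizer.fit_transform accepts only y):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit_transform(["a", "b", "a"])        # fine: one positional argument (y)
lb.fit_transform(["a", "b", "a"], None)  # what Pipeline effectively does with (X, y):
# TypeError: fit_transform() takes 2 positional arguments but 3 were given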

Possible Solution:

This can be solved by making a custom transformer that can handle 3 positional arguments:

  1. Import and make a new class:

    from sklearn.base import TransformerMixin #gives fit_transform method for free
    class MyLabelBinarizer(TransformerMixin):
        def __init__(self, *args, **kwargs):
            self.encoder = LabelBinarizer(*args, **kwargs)
        def fit(self, x, y=0):
            self.encoder.fit(x)
            return self
        def transform(self, x, y=0):
            return self.encoder.transform(x)
    
  2. Keep your code the same, only instead of using LabelBinarizer(), use the class we created: MyLabelBinarizer() (see the snippet after the note below).


Note: If you want access to LabelBinarizer Attributes (e.g. classes_), add the following line to the fit method:
    self.classes_, self.y_type_, self.sparse_input_ = self.encoder.classes_, self.encoder.y_type_, self.encoder.sparse_input_
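
For reference, reusing the question's own pipeline code, the categorical pipeline then becomes:

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', MyLabelBinarizer())
])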
Faddish answered 7/10, 2017 at 10:56 Comment(8)
I suggest this alternative for the class body. What do you think (sorry for the formatting) ? def fit(self, X, y = None): \n return self \n def transform(self, X, y = None): \n return LabelBinarizer().fit_transform(X)Erastus
I am getting an error: '<' not supported between instances of 'str' and 'int'. What could be the reason for this? There are no missing values in the categorical columns.Dividivi
@Dividivi I need to see your code in order to help you, but this error can be generated when you are passing a string to one of the pos_label and neg_label parameters (i.e. LabelBinarizer(pos_label="good"))Faddish
@otonglet I think this works but having (fit_transform) in there means every time you call (transform) on the new class it will do the fitting all over again. This can lead to unexpected behavior if you are using it on a test set with few examples and many label categories. Also, the post is updated to have simpler code.Faddish
Excuse my ignorance, but can your example be used to fit 4 or 5 inputs? (Or is that bad practice)Coronal
The note doesn't seem to fully work AttributeError: 'MultiLabelBinarizer' object has no attribute 'sparse_input_' / 'y_type_'Dann
I am trying to put it in a Pipe, and all I get is: TypeError: Cannot clone object '<__main__.MyLabelBinarizer object at 0x7fa081579160>' (type <class 'main.MyLabelBinarizer'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method.Tympanic
@EliBorodach in order to use it in a Pipeline you need it to also inherit from BaseEstimator, similar to the solution from shyam padia below: https://mcmap.net/q/216743/-fit_transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarizerHutchison

I believe your example is from the book Hands-On Machine Learning with Scikit-Learn & TensorFlow. Unfortunately, I ran into this problem as well. A recent change in scikit-learn (0.19.0) changed LabelBinarizer's fit_transform method. LabelBinarizer was never intended to work the way that example uses it. You can see information about the change here and here.
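
You can check which version you are running with:

import sklearn
print(sklearn.__version__)  # per the above, 0.19.0 is where the behavior changed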

Until they come up with a solution for this, you can install the previous version (0.18.0) as follows:

$ pip install scikit-learn==0.18.0

After running that, your code should run without issue.

In the future, it looks like the correct solution may be to use a CategoricalEncoder class or something similar. They have apparently been trying to solve this problem for years. You can see the new class here and further discussion of the problem here.

Nunhood answered 11/9, 2017 at 22:43 Comment(5)
This is not a bug per se. LabelBinarizer is not supposed to be used with features (X), but only for labels (y). Hence they have stopped sending both X and y to the method.Nucleotidase
They are working on OneHotEncoder which supports string features. github.com/scikit-learn/scikit-learn/issues/4920Nucleotidase
Thank you for your reply, and yes, I am in learning mode with "Hands-On Machine Learning with Scikit-Learn & TensorFlow". So instead of using the previous version I got a custom binarizer that worked for me. Link for the code: github.com/scikit-learn/scikit-learn/pull/7375/…Wallenstein
I edited the question to further explain the issue and clarify that it isn't a bug.Nunhood
Thanks. Got stuck with the same issue and this worked.Desimone

I think you are going through the examples from the book: Hands on Machine Learning with Scikit Learn and Tensorflow. I ran into the same problem when going through the example in Chapter 2.

As mentioned by other people, the problem is to do with sklearn's LabelBinarizer. It takes fewer arguments in its fit_transform method compared to the other transformers in the pipeline (only y, where other transformers normally take both X and y; see here for details). That's why, when we run pipeline.fit_transform, we feed more arguments into this transformer than it requires.

An easy fix I used is to just use OneHotEncoder and set "sparse" to False, to ensure the output is a numpy array, the same as the num_pipeline output. (This way you don't need to code up your own custom encoder.)

Your original cat_pipeline:

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer())
])

You can simply change this part to:

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('one_hot_encoder', OneHotEncoder(sparse=False))
])

You can go from here and everything should work.
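
As a side note (a sketch, assuming scikit-learn >= 0.20, where OneHotEncoder accepts string categories): if you apply the encoder outside the pipeline, it expects 2-D input, so selecting with double brackets means no reshape() is needed (see the comments below):

from sklearn.preprocessing import OneHotEncoder

# housing[["ocean_proximity"]] is a 2-D dataframe, so no reshape() is required
housing_cat_1hot = OneHotEncoder(sparse=False).fit_transform(housing[["ocean_proximity"]])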

Text answered 29/12, 2018 at 10:58 Comment(3)
Some pages earlier the author uses a reshape() with the OneHotEncoder. How come we do not need to reshape() the categorical data here, when replacing the LabelBinarizer with the OneHotEncoder?Oleneolenka
@Oleneolenka probably because the DataFrameSelector returns a numpy array rather than a pandas dataframe. I assume this numpy array is in the correct dimensions and does not need to be reshaped.Tangle
@Oleneolenka Because they apply the encoder to housing["ocean_proximity"] (a 1-d numpy array) instead of housing[["ocean_proximity"]] (a pandas dataframe).Cherri

Since LabelBinarizer doesn't allow more than two positional arguments, you should create a custom binarizer, like this:

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

housing_prepared = full_pipeline.fit_transform(new_housing)
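
Note that transform above fits a fresh LabelBinarizer on every call, which the comment below points out can cause trouble. A sketch of a variant (the class name is mine) that fits once in fit and reuses the fitted encoder:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class CustomLabelBinarizerFitted(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        # fit once so transform reuses the same category mapping
        self.encoder_ = LabelBinarizer(sparse_output=self.sparse_output)
        self.encoder_.fit(X)
        return self
    def transform(self, X, y=None):
        return self.encoder_.transform(X)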
Lobachevsky answered 22/12, 2017 at 0:49 Comment(1)
This implementation of CustomLabelBinarizer causes a problem later in the chapter, when applying the pipeline to a subset of data. See https://mcmap.net/q/218695/-why-my-output-from-preprocessing-methods-in-sklearn-pipeline-does-not-align for a description of the problem and a better implementation of CustomLabelBinarizerSipple

I ran into the same problem and got it working by applying the workaround specified in the book's Github repo.

Warning: earlier versions of the book used the LabelBinarizer class at this point. Again, this was incorrect: just like the LabelEncoder class, the LabelBinarizer class was designed to preprocess labels, not input features. A better solution is to use Scikit-Learn's upcoming CategoricalEncoder class: it will soon be added to Scikit-Learn, and in the meantime you can use the code below (copied from Pull Request #9151).

To save you some grepping, here's the workaround; just paste and run it in a previous cell:

# Definition of the CategoricalEncoder class, copied from PR #9151.
# Just run this cell, or copy it to your code, do not try to understand it (yet).

import numpy as np  # the class below relies on numpy (np.float64, np.object, ...)

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_feature]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """

        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.encoding)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        """Transform X using one-hot encoding.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=np.int)
        X_mask = np.ones_like(X, dtype=np.bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
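
With that class defined, one way to slot it into the question's cat_pipeline (a sketch; encoding="onehot-dense" returns a dense array, like the numeric pipeline does):

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense"))
])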
Houston answered 17/4, 2018 at 1:32 Comment(0)

Simply, what you can do is define the following class just before your pipeline:

class NewLabelBinarizer(LabelBinarizer):
    def fit(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X)
    def transform(self, X, y=None):
        return super(NewLabelBinarizer, self).transform(X)
    def fit_transform(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X).transform(X)

Then the rest of the code is like the one mentioned in the book, with a tiny modification in cat_pipeline before the pipeline concatenation, as follows:

cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),
    ("label_binarizer", NewLabelBinarizer())])

You're done!

Endorsement answered 27/5, 2019 at 15:43 Comment(0)

Forget LabelBinarizer and use OneHotEncoder instead.

If you were using a LabelEncoder before the OneHotEncoder to convert categories to integers, you can now use the OneHotEncoder directly.
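
A sketch of the direct route, assuming scikit-learn >= 0.20 (where OneHotEncoder accepts string categories) and the question's housing dataframe:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)  # dense output; newer releases call this sparse_output
housing_cat_1hot = encoder.fit_transform(housing[["ocean_proximity"]])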

Deeprooted answered 20/1, 2019 at 11:59 Comment(1)
This could be a comment, but thanks anyway for your response.Abridgment

I also faced the same issue. The following link helped me fix it: https://github.com/ageron/handson-ml/issues/75

Summarizing the changes to be made:

1) Define following class in your notebook

class SupervisionFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(SupervisionFriendlyLabelBinarizer,self).fit_transform(X)

2) Modify following piece of code

cat_pipeline = Pipeline([('selector', DataFrameSelector(cat_attribs)),
                         ('label_binarizer', SupervisionFriendlyLabelBinarizer()),])

3) Re-run the notebook. It should run now.

Trutko answered 28/9, 2019 at 7:36 Comment(0)

The LabelBinarizer class is outdated for this example, and unfortunately was never meant to be used in the way that the book uses it.

You'll want to use the OrdinalEncoder class from sklearn.preprocessing, which is designed to

"Encode categorical features as an integer array." (sklearn documentation).

So, just add:

from sklearn.preprocessing import OrdinalEncoder

then replace all mentions of LabelBinarizer() with OrdinalEncoder() in your code.
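
For example, a sketch of the question's cat_pipeline with that substitution (note that OrdinalEncoder outputs one integer column per feature rather than one-hot columns, which is what the comment below is getting at):

from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('ordinal_encoder', OrdinalEncoder())
])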

Stymie answered 5/4, 2021 at 15:15 Comment(1)
It did not really solve the problem. It did get rid of the error message, but the function (one-hot encoding the category) did not work in the end. The shape of the OrdinalEncoder output is as if the function did not work. The answer from @Terrence solves it.Cityscape

I got the same issue, and it got resolved by using DataFrameMapper (you need to install sklearn_pandas):

from sklearn_pandas import DataFrameMapper
cat_pipeline = Pipeline([
    ('label_binarizer', DataFrameMapper([(cat_attribs, LabelBinarizer())])),
])
Flieger answered 22/9, 2018 at 21:54 Comment(1)
LabelBinarizer() will create OHE features. You can however use sklearn.preprocessing.LabelEncoder() directly in a DataFrameMapper pipeline. At least for me it worked fine.Cortege

You can create one more custom transformer which does the encoding for you:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # note: this fits a fresh LabelEncoder on every transform call
        return LabelEncoder().fit_transform(X)

In this example we have used LabelEncoder, but you can use LabelBinarizer as well.

Absorbance answered 6/3, 2019 at 11:47 Comment(0)

I've seen many custom label binarizers, but there is one from this repo that worked for me.

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    def fit(self, X, y=None):
        """This would allow us to fit the model based on the X input."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self  # return self so fit() also works on its own in a Pipeline

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)

Then edit the cat_pipeline to this:

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizerPipelineFriendly()),
    ])

Have a good one!

Disrespect answered 7/5, 2021 at 11:22 Comment(0)

You can use this modified LabelBinarizer class in your code instead:

class mod_LabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X) 

Now you can use mod_LabelBinarizer() instead of LabelBinarizer() in your cat_pipeline, so your code should look like this:

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', mod_LabelBinarizer())
])
Izak answered 7/4, 2023 at 13:40 Comment(0)

I ended up rolling my own:

class LabelBinarizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = self.prep(X)
        unique_vals = []
        for column in X.T:
            unique_vals.append(np.unique(column))
        self.unique_vals = unique_vals
        return self  # returning self is the sklearn convention for fit()
    def transform(self, X, y=None):
        X = self.prep(X)
        unique_vals = self.unique_vals
        new_columns = []
        for i, column in enumerate(X.T):
            num_uniq_vals = len(unique_vals[i])
            encoder_ring = dict(zip(unique_vals[i], range(len(unique_vals[i]))))
            f = lambda val: encoder_ring[val]
            f = np.vectorize(f, otypes=[np.int])
            new_column = np.array([f(column)])
            if num_uniq_vals <= 2:
                new_columns.append(new_column)
            else:
                one_hots = np.zeros([num_uniq_vals, len(column)], np.int)
                one_hots[new_column, range(len(column))]=1
                new_columns.append(one_hots)
        new_columns = np.concatenate(new_columns, axis=0).T        
        return new_columns

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)

    @staticmethod
    def prep(X):
        # accept a pandas Series/DataFrame or a numpy array; promote 1-D to 2-D
        shape = X.shape
        if len(shape) == 1:
            X = np.asarray(X).reshape(shape[0], 1)
        return X

Seems to work

lbn = LabelBinarizer()
thingy = np.array([['male','male','female', 'male'], ['A', 'B', 'A', 'C']]).T
lbn.fit(thingy)
lbn.transform(thingy)

returns

array([[1, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1]])
Cancel answered 29/1, 2018 at 5:2 Comment(0)

The easiest way is to replace LabelBinarizer() inside your pipeline with OrdinalEncoder().

Photoelectrotype answered 22/3, 2021 at 12:9 Comment(0)

In my case this helped:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

def _binarize(series: pd.Series) -> pd.Series:
    return series.astype(int)


binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('binary_encoder', FunctionTransformer(_binarize))
])
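
For illustration, usage might look like this (the dataframe and column name here are made up):

import pandas as pd

df = pd.DataFrame({"near_ocean": ["1", "0", "1"]})  # hypothetical 0/1-coded column
binary_transformer.fit_transform(df[["near_ocean"]])
# -> array([[1], [0], [1]])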
Gantry answered 27/8, 2022 at 14:51 Comment(0)

We can just add the attribute sparse_output=False:

cat_pipeline = Pipeline([
  ('selector', DataFrameSelector(cat_attribs)),
  ('label_binarizer', LabelBinarizer(sparse_output=False)),   
])
Norty answered 28/5, 2020 at 0:39 Comment(0)
