Sklearn ROC AUC Score : ValueError: y should be a 1d array, got an array of shape (15, 2) instead

I have an imbalanced dataset with the target LULUS. I'm trying to print the ROC AUC score for each fold of my data, but every fold raises ValueError: y should be a 1d array, got an array of shape (15, 2) instead. I'm confused about which part I got wrong, because I followed the documentation exactly. I understand that the score can't be computed for a fold whose test set contains only one label, but the remaining folds still fail with the 1d array error.

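import pandas as pd
from sklearn import tree
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

merged_df = pd.read_csv(r'C:\...\merged.csv')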

num_columns = merged_df.select_dtypes(include=['float64']).columns
cat_columns = merged_df.select_dtypes(include=['object']).drop(['TARGET','NAMA'], axis=1).columns

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('label', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_columns),
        ('cat', categorical_transformer, cat_columns)])

X = merged_df.drop(['TARGET','Unnamed: 0'], axis=1)
y = merged_df['TARGET']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

X_train = X_train.drop(['NIM', 'NAMA'], axis=1)
X_test = X_test.drop(['NIM', 'NAMA'], axis=1)

rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier',tree.DecisionTreeClassifier(class_weight='balanced', criterion='entropy'))])

rf.fit(X_train, y_train)

pred = rf.predict(X_test)

y_proba = rf.predict_proba(X_test)

from sklearn.model_selection import KFold

kf = KFold(n_splits=10)

for train, test in kf.split(X):
    X_train, X_test = X.loc[train], X.loc[test]
    y_train, y_test = y.loc[train], y.loc[test]
    model = rf.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)
    try:
        print(roc_auc_score(y_test, y_proba,average='weighted', multi_class='ovr'))
    except ValueError:
        pass

See my data in spreadsheet

Antwanantwerp answered 29/5, 2021 at 16:7 Comment(0)

The output of model.predict_proba() is a matrix with 2 columns, one for each class. To calculate the ROC AUC for a binary target, you need to pass only the probability of the positive class.

Using an example dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier()
model = rf.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)

The first rows of y_proba look like this:

array([[0.69, 0.31],
       [0.13, 0.87],
       [0.94, 0.06],
       [0.94, 0.06],
       [0.07, 0.93]])

Then pass only the second column, the probability of class 1:

roc_auc_score(y_test, y_proba[:,1])
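
Indexing column 1 relies on the positive class being the second entry of model.classes_. A minimal sketch of looking the column up explicitly, assuming the positive label is 1 (positive_label and pos_col are names introduced here for illustration, and random_state is added only for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)  # shape (n_samples, 2)

# The columns of predict_proba follow the order of model.classes_,
# so find the column that belongs to the assumed positive label.
positive_label = 1
pos_col = int(np.where(model.classes_ == positive_label)[0][0])

# Pass a 1d array of positive-class probabilities as y_score.
auc = roc_auc_score(y_test, y_proba[:, pos_col])
print(y_proba.shape, pos_col, auc)
```

For integer labels 0 and 1 this picks column 1, which matches the call above.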
Isochromatic answered 29/5, 2021 at 19:15 Comment(2)
Ah, I see. But I once calculated the ROC AUC score on the Iris dataset with just roc_auc_score(y_test, y_proba) and it still worked. Why is that? Sorry, I'm not familiar with the ROC AUC score. - Antwanantwerp
Works like butter. - Magnuson

As mentioned by @StupidWolf, this error occurs when you pass a 2-column matrix of probabilities as the y_score argument of roc_auc_score in binary classification.
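
Applied to a per-fold loop like the one in the question, the fix might look like the sketch below. It is not the asker's pipeline: it assumes synthetic imbalanced data from make_classification in place of the CSV, and it swaps KFold for StratifiedKFold so that each test fold is likely to contain both classes. With a pandas DataFrame, the positional indices returned by split() would be used with .iloc rather than .loc.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced binary data (an assumption standing in for the real dataset).
X, y = make_classification(n_samples=150, weights=[0.8, 0.2], random_state=0)

# Stratified folds keep the class ratio roughly constant in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier(
        class_weight="balanced", criterion="entropy", random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    y_proba = clf.predict_proba(X[test_idx])
    # Binary target: pass only the positive-class column as y_score.
    scores.append(roc_auc_score(y[test_idx], y_proba[:, 1]))

print(scores)
```

Letting a ValueError surface, instead of catching it and passing, also makes it easier to tell a single-class fold apart from a shape problem.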

As for why roc_auc_score does not raise an error when y_score is an n x n_classes matrix on datasets like iris: the docs for roc_auc_score explain this (see the description of y_score). For multiclass classification (as in iris), this is the format it expects. The reasoning is that there is no single 'positive class' in the multiclass setting, so it needs the per-class probabilities to perform a one-vs-rest calculation.
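
A small sketch of the multiclass case on iris, assuming a LogisticRegression classifier chosen only for illustration. In current scikit-learn versions the multi_class argument has to be set to 'ovr' or 'ovo' for a multiclass target, and the full probability matrix is passed as y_score:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)  # shape (n_samples, 3), rows sum to 1

# Multiclass target: pass all per-class probabilities and choose a strategy.
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr", average="weighted")
print(y_proba.shape, auc_ovr)
```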

Joannjoanna answered 10/7, 2024 at 17:35 Comment(0)
