ValueError: unknown is not supported in sklearn.RFECV [duplicate]

I was trying to narrow down the features that are really relevant for my classifier using RFECV. This is the code I have written:

import sklearn
import pandas as p
import numpy as np
import scipy as sp
import pylab as pl
from sklearn import linear_model, cross_validation, metrics
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.metrics import zero_one_loss
from sklearn import preprocessing
#from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.feature_selection import SelectKBest, chi2

modelType = "notext"

# ----------------------------------------------------------
# Prepare the Data
# ----------------------------------------------------------
training_data = np.array(p.read_table('F:/NYC/NYU/SM/3/SNLP/Project/Data/train.tsv'))
print ("Read Data\n")

# get the target variable and set it as Y so we can predict it
Y = training_data[:,-1]

print(Y)

# not all data is numerical, so we'll have to convert those fields
# fix "is_news":
training_data[:,17] = [0 if x == "?" else 1 for x in training_data[:,17]]

# fix -1 entries in hasDomainLink
training_data[:,14] = [0 if x =="-1" else x for x in training_data[:,10]]

# fix "news_front_page":
training_data[:,20] = [999 if x == "?" else x for x in training_data[:,20]]
training_data[:,20] = [1 if x == "1" else x for x in training_data[:,20]]
training_data[:,20] = [0 if x == "0" else x for x in training_data[:,20]]

# fix "alchemy category":
training_data[:,3] = [0 if x=="arts_entertainment" else x for x in training_data[:,3]]
training_data[:,3] = [1 if x=="business" else x for x in training_data[:,3]]
training_data[:,3] = [2 if x=="computer_internet" else x for x in training_data[:,3]]
training_data[:,3] = [3 if x=="culture_politics" else x for x in training_data[:,3]]
training_data[:,3] = [4 if x=="gaming" else x for x in training_data[:,3]]
training_data[:,3] = [5 if x=="health" else x for x in training_data[:,3]]
training_data[:,3] = [6 if x=="law_crime" else x for x in training_data[:,3]]
training_data[:,3] = [7 if x=="recreation" else x for x in training_data[:,3]]
training_data[:,3] = [8 if x=="religion" else x for x in training_data[:,3]]
training_data[:,3] = [9 if x=="science_technology" else x for x in training_data[:,3]]
training_data[:,3] = [10 if x=="sports" else x for x in training_data[:,3]]
training_data[:,3] = [11 if x=="unknown" else x for x in training_data[:,3]]
training_data[:,3] = [12 if x=="weather" else x for x in training_data[:,3]]
training_data[:,3] = [999 if x=="?" else x for x in training_data[:,3]]

print ("Corrected outliers data\n")

# ----------------------------------------------------------
# Models
# ----------------------------------------------------------
if modelType == "notext":
    print ("no text model\n")
    #ignore features which are useless
    X = training_data[:,list([3, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 19, 20, 22, 25])]
    scaler = preprocessing.StandardScaler()
    print("initialized scaler \n")
    scaler.fit(X,Y)
    print("fitted train data and labels\n")
    X = scaler.transform(X)
    print("Transformed train data\n")
    svc = SVC(kernel = "linear")
    print("Initialized SVM\n")
    rfecv = RFECV(estimator = svc, cv = 5, loss_func = zero_one_loss, verbose = 1)
    print("Initialized RFECV\n")
    rfecv.fit(X,Y)
    print("Fitted train data and label\n")
    rfecv.support_
    print ("Optimal Number of features : %d" % rfecv.n_features_)
    np.savetxt('rfecv.csv', rfecv.ranking_, delimiter=',', fmt='%f')

At the call to rfecv.fit(X, Y), my code throws an error from the metrics.py file: "ValueError: unknown is not supported".

The error is raised in sklearn.metrics.metrics:

# No metrics support "multiclass-multioutput" format
    if (y_type not in ["binary", "multiclass", "multilabel-indicator", "multilabel-sequences"]):
        raise ValueError("{0} is not supported".format(y_type))

This is a binary classification problem: the target values are only 0 or 1. The data set can be found at Kaggle Competition Data.

If anyone can point out where I am going wrong, I would appreciate it.

Rambow answered 27/11, 2013 at 5:48 Comment(1)
Welcome to SO! Your question is well formed and the error is easy to reproduce, but there are some improvements I would advise you to keep in mind for your next question, if any. First, your question contains a lot of redundant code, such as imports and code not relevant to the error; less code is more readable. Second, your data is too big and requires a login to download. You can check that the error persists with sample data (the first few lines) and include that in your question. That way it will get more attention and will be answered better and faster. Have a good experience with sklearn and SO!Exponible

RFECV requires the target to be one of the types binary, multiclass, multilabel-indicator or multilabel-sequences. scikit-learn distinguishes the following target types:

  • 'binary': y contains <= 2 discrete values and is 1d or a column vector.
  • 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
  • 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
  • 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.

Your Y, however, is classified as 'unknown', that is:

  • 'unknown': y is array-like but none of the above, such as a 3d array, or an array of non-sequence objects.

The reason is that your target data consists of strings (of the form "0" and "1") and ends up as an array of dtype object after being loaded with read_table:

>>> from sklearn.utils.multiclass import type_of_target
>>> training_data[:, -1].dtype
dtype('O')
>>> type_of_target(training_data[:, -1])
'unknown'

To solve the issue, you can convert to int:

>>> Y = training_data[:, -1].astype(int)
>>> type_of_target(Y)
'binary'
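The conversion above can be sketched without the Kaggle file. This is a minimal illustration that uses NumPy only: the small `training_data` array is a made-up stand-in for the real table, and `to_int_labels` is a hypothetical helper name, not part of scikit-learn. It shows the label column starting out with dtype object and becoming a plain integer array after the cast, with a sanity check that only 0 and 1 are present.

```python
import numpy as np


def to_int_labels(y_raw):
    """Cast a 1-d label column to an integer array and check that it is binary.

    Hypothetical helper for illustration only.
    """
    # astype(int) calls int() on every element of an object array, so an
    # entry such as "?" raises ValueError here instead of slipping through.
    y = np.asarray(y_raw).astype(int)
    extra = set(np.unique(y).tolist()) - {0, 1}
    if extra:
        raise ValueError("unexpected label values: %s" % sorted(extra))
    return y


# Made-up stand-in for the mixed-type array built from train.tsv.
training_data = np.array(
    [["business", "0.5", "1"],
     ["?", "0.1", "0"],
     ["sports", "0.9", "1"]],
    dtype=object,
)

y_raw = training_data[:, -1]
Y = to_int_labels(y_raw)

print(y_raw.dtype)   # object
print(Y.dtype.kind)  # i
print(Y.tolist())    # [1, 0, 1]
```

Casting once, right after slicing the label column, keeps the rest of the pipeline (scaler, SVC, RFECV) working on numeric data only.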
Exponible answered 27/11, 2013 at 7:19 Comment(3)
@Exponible - What type do you have to convert to in order to get multiclass? Do you need to use the factorize function? Also, the type_of_target function does not seem to be working for me. Is this a function you wrote?Subadar
Please refer to scikit-learn.org/stable/modules/multiclass.html and github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/…Exponible
This answer was the fix for me, but I am raising this in case others run into the same issue: this error can happen even if the datatype of your y isn't a string to begin with! My y dtype was Int32 and I got the same error as the OP until I explicitly converted it to int64 via .astype(int).Brassard
