Multiclass classification
To better illustrate the differences, let us assume that your goal is to classify SO questions into n_classes different, mutually exclusive classes. For the sake of simplicity, in this example we will only consider four classes, namely 'Python', 'Java', 'C++' and 'Other language'. Let us assume that you have a dataset formed by just six SO questions, and the class labels of those questions are stored in an array y as follows:
import numpy as np
y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
The situation described above is usually referred to as multiclass classification (also known as multinomial classification). In order to fit the classifier and validate the model with the scikit-learn library, you need to transform the text class labels into numerical labels. To accomplish that you could use LabelEncoder:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_numeric = le.fit_transform(y)
This is how the labels of your dataset are encoded:
In [220]: y_numeric
Out[220]: array([1, 0, 2, 3, 0, 3], dtype=int64)
where those numbers denote indices of the following array:
In [221]: le.classes_
Out[221]:
array(['C++', 'Java', 'Other language', 'Python'],
dtype='|S14')
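To make the mapping between numbers and class names concrete, here is a minimal sketch of how you could go back from the numeric labels to the text labels. It rebuilds the same y and le as above; the names decoded_by_index and decoded are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
le = LabelEncoder()
y_numeric = le.fit_transform(y)

# The numeric labels are indices into le.classes_, so plain indexing
# recovers the original text labels
decoded_by_index = le.classes_[y_numeric]

# inverse_transform performs the same lookup for you
decoded = le.inverse_transform(y_numeric)
```

Both arrays should contain the same labels as y, in the same order.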
An important particular case is when there are just two classes, i.e. n_classes = 2. This is usually called binary classification.
Multilabel classification
Let us now suppose that you wish to perform such multiclass classification using a pool of n_classes binary classifiers, where n_classes is the number of different classes. Each of these binary classifiers decides whether an item belongs to a specific class or not. In this case you cannot encode class labels as integers from 0 to n_classes - 1; you need to create a 2-dimensional indicator matrix instead. Suppose that sample n is of class k. Then the [n, k] entry of the indicator matrix is 1 and the rest of the elements in row n are 0. It is important to note that if the classes are not mutually exclusive there can be multiple 1's in a row. This setting, in which a sample may carry several labels at once, is named multilabel classification, and the indicator matrix can be easily built with MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])
The indicator looks like this:
In [225]: y_indicator
Out[225]:
array([[0, 1, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1],
[1, 0, 0, 0],
[0, 0, 0, 1]])
and the column numbers where the 1's appear are actually indices of this array:
In [226]: mlb.classes_
Out[226]: array(['C++', 'Java', 'Other language', 'Python'], dtype=object)
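In the example above every row has exactly one 1 because each question has a single label. As a rough sketch of the non-exclusive case, suppose some questions involve more than one language. The tag sets in y_tags below are hypothetical and are not part of the dataset used in this post.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag sets: the second question involves two languages at once
y_tags = [['Java'], ['C++', 'Python'], ['Python']]

mlb_tags = MultiLabelBinarizer()
y_tags_indicator = mlb_tags.fit_transform(y_tags)

print(mlb_tags.classes_)   # column order: C++, Java, Python
print(y_tags_indicator)    # the second row contains two 1's
```

Each row of the indicator matrix now has as many 1's as the question has labels, which is what distinguishes multilabel targets from multiclass ones.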
Multioutput classification
What if you want to classify a particular SO question according to two different criteria simultaneously, for instance language and application? In this case you intend to do multioutput classification. For the sake of simplicity I will consider only three application classes, namely 'Computer Vision', 'Speech Recognition' and 'Other Application'. The label array of your dataset should be 2-dimensional:
y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])
Again, we need to transform text class labels into numeric labels. As far as I know this functionality is not implemented in scikit-learn yet, so you will need to write your own code. This thread describes some clever ways to do that, but for the purposes of this post the following one-liner should suffice:
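# Encode each column independently; pass a list, since newer NumPy versions no longer accept a generator in vstack
y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T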
The encoded labels look like this:
In [229]: y_multi
Out[229]:
array([[1, 0],
[0, 2],
[2, 0],
[3, 1],
[0, 2],
[3, 0]], dtype=int64)
And the meaning of the values in each column can be inferred from the following arrays:
In [230]: le.fit(y2[:, 0]).classes_
Out[230]:
array(['C++', 'Java', 'Other language', 'Python'],
dtype='|S18')
In [231]: le.fit(y2[:, 1]).classes_
Out[231]:
array(['Computer Vision', 'Other Application', 'Speech Recognition'],
dtype='|S18')
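As a possible alternative to re-fitting a single LabelEncoder column by column, newer scikit-learn releases ship OrdinalEncoder, which encodes every column of a 2-dimensional array in one call. The sketch below assumes scikit-learn 0.20 or later; the names oe, y_multi_oe and y2_decoded are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])

# OrdinalEncoder learns one sorted set of categories per column
oe = OrdinalEncoder()
# fit_transform returns floats by default, so cast to integers
y_multi_oe = oe.fit_transform(y2).astype(int)

# categories_ holds one array of class names per column
language_classes, application_classes = oe.categories_

# Map the integer codes back to the original text labels
y2_decoded = oe.inverse_transform(y_multi_oe)
```

Because the fitted encoder keeps the categories of every column, you do not need to call fit again to interpret the encoded values.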