How to handle unbalanced data for multilabel classification using a CNN in Keras?

My dataset shape is (91149, 12).

I used a CNN to train my classifier for a text classification task.

I got a training accuracy of 0.5923 and a testing accuracy of 0.5780.

My target column has 9 labels, as shown below:

df['thematique'].value_counts()
Corporate                   42399
Economie collaborative      13272
Innovation                  11360
Filiale                      5990
Richesses Humaines           4445
Relation sociétaire          4363
Communication                4141
Produits et services         2594
Sites Internet et applis     2585

The model structure:

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
# vocab_size, embedding_matrix and maxlen come from my preprocessing step
embedding_layer = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=maxlen, trainable=False)
model.add(embedding_layer)
model.add(Conv1D(128, 7, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(9, activation='sigmoid'))
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])

My data for this multilabel classification task is imbalanced. I need to handle the imbalanced data for multilabel classification using a CNN in Keras.

Kimbra answered 27/12, 2019 at 14:14

I am not sure you need to handle the imbalance with Keras in particular, rather than with some common-sense preprocessing. One simple way is to use the same number of samples per class (see the sketch below). Of course, that causes another problem: you throw away a lot of samples. But it is still something worth checking. Also, with imbalanced data it is not a good idea to look only at the overall classification accuracy, since it does not show how well each class performs.
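
For example, a minimal undersampling sketch with pandas, assuming the df DataFrame and the 'thematique' column from the question (the random_state value is arbitrary):

# undersample every class down to the size of the rarest one
min_count = df['thematique'].value_counts().min()
df_balanced = (
    df.groupby('thematique', group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=42))
)
print(df_balanced['thematique'].value_counts())  # each class now has min_count rows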

You should also compute the confusion matrix, in order to see how well each class performs individually (see the sketch below). More detailed approaches to imbalanced-data problems can be found in this blog and here.
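
A minimal evaluation sketch with scikit-learn, assuming X_test, a binary indicator matrix y_test, and a list label_names from your own train/test split (all hypothetical names, not part of the question's code):

from sklearn.metrics import classification_report, multilabel_confusion_matrix

y_prob = model.predict(X_test)
y_pred = (y_prob >= 0.5).astype(int)  # threshold the sigmoid outputs per label

# one 2x2 confusion matrix per class
print(multilabel_confusion_matrix(y_test, y_pred))

# per-class precision / recall / F1
print(classification_report(y_test, y_pred, target_names=label_names))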

The most important thing is to use the right metrics to evaluate your classifier's performance, and to handle the input data as proposed in the links above.

Costly answered 27/12, 2019 at 14:25

Accuracy can be misleading as a metric for your problem. With such a high class imbalance, I would use the F1 score instead (see the sketch below).
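
For example, a minimal sketch with scikit-learn, assuming X_test and a binary indicator matrix y_test from your own split (hypothetical names):

from sklearn.metrics import f1_score

y_pred = (model.predict(X_test) >= 0.5).astype(int)  # threshold the sigmoid outputs

# 'macro' averages F1 over classes, so rare classes weigh as much as 'Corporate'
print('macro F1:', f1_score(y_test, y_pred, average='macro'))
print('per-class F1:', f1_score(y_test, y_pred, average=None))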

As for the loss, you could use focal loss. It is a variant of categorical cross-entropy that focuses on the least-represented classes. You can find an example here; in my experience, it helps a lot with small classes in NLP classification tasks. A minimal sketch is shown below.
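
A minimal sketch of a binary focal loss for the sigmoid outputs in the question; the gamma and alpha values are the common defaults from the focal loss paper, not something prescribed by this answer:

import tensorflow as tf
from tensorflow.keras import backend as K

def binary_focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = K.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t: predicted probability of the true value of each label
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        # cross-entropy term -log(p_t), down-weighted for easy examples by (1 - p_t)^gamma
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss

# usage: plug it into the compile call from the question
model.compile(optimizer='Adam', loss=binary_focal_loss(), metrics=['categorical_accuracy'])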

Listless answered 21/4, 2020 at 20:50
