Preventing overfitting in text classification with word embeddings and an LSTM

Objective:

  • Identify the class label from a user-entered question (like a question-answering system).
  • The data is extracted from a big PDF file, and the page number needs to be predicted based on the user's input.
  • Mainly used for policy documents, where the user has a question about the policy and the relevant page number needs to be shown.

Previous Implementation: Applied Elasticsearch, but accuracy was very low, because users enter free-form text (e.g. "I need" should be treated like "want to").


Dataset information: Each row of the dataset contains a Text (or paragraph) and a Label (the page number). The dataset is small; I have only 500 rows.

Current Implementation:

  • Applied word embeddings (GloVe) with an LSTM in Keras, with a TensorFlow back-end
  • Applied Dropout
  • Applied ActivityRegularization
  • Applied an L2 W_regularizer (from 0.1 to 0.001)
  • Tried different nb_epoch values, from 10 to 600
  • Changed EMBEDDING_DIM of the GloVe data from 100 to 300

Applied the following NLP preprocessing (a rough sketch of these steps follows the list):

  • Convert to lower case
  • Remove English stop words
  • Stemming
  • Remove numbers
  • Remove URLs and IP addresses
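
The preprocessing itself is not shown in the code below (it lives in DataGenerator), so here is only a rough sketch of how these steps could look, assuming NLTK is installed with its 'stopwords' and 'punkt' data downloaded; clean_text is a hypothetical helper, not part of my actual code:

# Rough sketch of the preprocessing steps listed above (assumes NLTK data is available).
# clean_text() is a hypothetical helper, not part of the question's code.
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

URL_RE = re.compile(r'https?://\S+|www\.\S+')
IP_RE = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')
NUM_RE = re.compile(r'\d+')

def clean_text(text):
    text = text.lower()           # convert to lower case
    text = URL_RE.sub(' ', text)  # remove URLs
    text = IP_RE.sub(' ', text)   # remove IP addresses
    text = NUM_RE.sub(' ', text)  # remove numbers
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]  # drop stop words/punctuation
    tokens = [STEMMER.stem(t) for t in tokens]                           # stemming
    return ' '.join(tokens)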

Result: Accuracy on the test data (or validation data) is 23%, but on the training data it is 91%.


Code:

import os  # needed below for os.path
import time
from time import strftime

import numpy as np
from keras.callbacks import CSVLogger, ModelCheckpoint
from keras.layers import Dense, Input, LSTM, ActivityRegularization
from keras.layers import Embedding, Dropout,Bidirectional
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.regularizers import l2
from keras.utils import to_categorical

import pickle
from DataGenerator import *

BASE_DIR = ''
GLOVE_DIR = 'D:/Dataset/glove.6B'  # BASE_DIR + '/glove.6B/'

MAX_SEQUENCE_LENGTH = 50
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 300
VALIDATION_SPLIT = 0.2

# first, build index mapping words in the embeddings set
# to their embedding vector
np.random.seed(1337)  # for reproducibility

print('Indexing word vectors.')

t_start = time.time()

embeddings_index = {}

if os.path.exists('pickle/glove.pickle'):
    print('Pickle found..')
    with open('pickle/glove.pickle', 'rb') as handle:
        embeddings_index = pickle.load(handle)
else:
    print('Pickle not found...')
    f = open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    with open('pickle/glove.pickle', 'wb') as handle:
        pickle.dump(embeddings_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

print('Found %s word vectors.' % len(embeddings_index))

# second, prepare text samples and their labels
print('Processing text dataset')

texts = []  # list of text samples
labels = []  # list of label ids
labels_index = {}  # dictionary mapping label name to numeric id

(texts, labels, labels_index) = get_data('D:/PolicyDocument/')

print('Found %s texts.' % len(texts))

# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))
print('Preparing embedding matrix. :', embedding_matrix.shape)
for word, i in word_index.items():
    if i > num_words:
        # skip words beyond the MAX_NB_WORDS cap; they have no row in embedding_matrix
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index are left as all-zeros
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            mask_zero=True,
                            trainable=False)

print('Training model.')

csv_file = "logs/training_log_" + strftime("%Y-%m-%d %H-%M", time.localtime()) + ".csv"
model_file = "models/Model_" + strftime("%Y-%m-%d %H-%M", time.localtime()) + ".mdl"
print("Model file:" + model_file)
csv_logger = CSVLogger(csv_file)

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

rate_drop_lstm = 0.15 + np.random.rand() * 0.25
num_lstm = np.random.randint(175, 275)
rate_drop_dense = 0.15 + np.random.rand() * 0.25

x = LSTM(num_lstm, return_sequences=True, W_regularizer=l2(0.001))(embedded_sequences)
x = Dropout(0.5)(x)
x = LSTM(64)(x)
x = Dropout(0.25)(x)
x = ActivityRegularization(l1=0.01, l2=0.001)(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

model_checkpoint = ModelCheckpoint(model_file, monitor='val_loss', verbose=0, save_best_only=True,
                                   save_weights_only=False, mode='auto')

model.fit(x_train, y_train,
          batch_size=1,
          nb_epoch=600,
          validation_data=(x_val, y_val), callbacks=[csv_logger, model_checkpoint])

score = model.evaluate(x_val, y_val, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

t_end = time.time()
total = t_end - t_start
ret_str = "Time needed(s): " + str(total)
print(ret_str)
Bungalow answered 8/5, 2017 at 8:56 Comment(6)
And the question is... how to prevent overfitting? This is not really a programming question... – Tollbooth
Have you ever had a closer look at the content of your test and/or validation set? – Boardwalk
You are doing very frequent updates by passing batch_size=1. Try changing it to values like 32, 64, etc. Too frequent updates can break your network. – Bacchanalia
Please clearly indicate the question. – Gauvin
Nice points @Nain, I also changed batch_size from 1 to 64. – Bungalow
Hi @onurgüngör, I tried to include everything I implemented here. If there are any specific points you need, please let me know. – Bungalow

Dropout and batch normalization are very effective with feedforward NNs. However, they can cause problems with RNNs (there are many papers published on this topic).

The best way to make your RNN model generalize better is to increase the dataset size. In your case (an LSTM with about 200 cells), you probably want on the order of 100,000 or more labeled samples to train on.

Nucleoprotein answered 17/5, 2017 at 10:38 Comment(0)

Besides simply reducing parameters such as the embedding size and the number of units in some layers, there is also the possibility of adjusting the recurrent dropout in LSTMs.

LSTMs seem to overfit quite easily (so I have read).

The Keras documentation describes dropout and recurrent_dropout as parameters of each LSTM layer.

Example with arbitrary numbers:

x = LSTM(num_lstm, return_sequences=True, W_regularizer=l2(0.001), recurrent_dropout=0.4)(embedded_sequences)
x = Dropout(0.5)(x)
x = LSTM(64, dropout=0.5, recurrent_dropout=0.3)(x)

Other causes may be wrong or insufficient data:

  • Have you tried shuffling the test and validation data together and creating new train and validation sets?

  • How many sentences do you have in the training data? Are you training on small subsets? Use the entire set, or try data augmentation (creating new sentences with their classifications, though this can be very tricky with text; a rough sketch of one simple approach follows this list).
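
One very simple form of augmentation is random word dropout: create extra copies of each training sentence with a few words randomly removed, keeping the original label. This is only a sketch; augment() is a hypothetical helper, the probabilities are arbitrary, and texts/labels are assumed to be the raw lists returned by get_data() (before to_categorical):

# Rough sketch of text augmentation by random word dropout.
# augment() is a hypothetical helper; n_copies and p_drop are arbitrary choices.
import random

def augment(sentence, n_copies=3, p_drop=0.1):
    words = sentence.split()
    copies = []
    for _ in range(n_copies):
        kept = [w for w in words if random.random() > p_drop]
        copies.append(' '.join(kept) if kept else sentence)
    return copies

augmented_texts, augmented_labels = [], []
for text, label in zip(texts, labels):
    augmented_texts.append(text)
    augmented_labels.append(label)
    for new_text in augment(text):          # extra, slightly corrupted copies
        augmented_texts.append(new_text)
        augmented_labels.append(label)      # same page label as the original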

Terza answered 12/5, 2017 at 18:22 Comment(2)
Thanks @Daniel, will check. I tried shuffling using np.random.shuffle(indices). I have 50 files and each file has 10 sentences; how can I create new sentences that are relevant to the current training data? – Bungalow
Got the error below when putting recurrent_dropout=0.4 in LSTM (currently using Keras 1.0): AssertionError: Keyword argument not understood: recurrent_dropout – Bungalow

What you describe sounds a lot like overfitting. Without more information about the data, the best suggestion is to try stronger regularization methods. @Daniel already suggested the dropout parameters that you haven't used yet (dropout and recurrent_dropout). You can also try increasing the rate of the Dropout layers and using stronger regularization via the W_regularizer parameter; a rough sketch follows below.

Other options could be considered with more information, such as whether you have tried Daniel's suggestion and what the results were.
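
For example, in the Keras 1.x API the question uses (where dropout_W and dropout_U, available in recent 1.x versions, play the role of dropout and recurrent_dropout in Keras 2), "stronger regularization" could look roughly like this. The values are arbitrary and untuned, and the variable names come from the question's code:

# Arbitrary, untuned values: higher dropout rates and a stronger L2 penalty than the original code.
x = LSTM(num_lstm, return_sequences=True,
         W_regularizer=l2(0.01), U_regularizer=l2(0.01),  # penalize input and recurrent weights
         dropout_W=0.5, dropout_U=0.5)(embedded_sequences)
x = Dropout(0.6)(x)
x = LSTM(64, W_regularizer=l2(0.01), dropout_W=0.5, dropout_U=0.5)(x)
x = Dropout(0.6)(x)
preds = Dense(len(labels_index), activation='softmax')(x)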

Runge answered 15/5, 2017 at 15:59 Comment(0)

Adversarial training methods (as a means of regularization) may be worth looking into: Adversarial Training Methods for Semi-Supervised Text Classification.
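
The core idea of that paper is to perturb the word embeddings in the direction that most increases the loss and also train on the perturbed inputs. Below is only a minimal sketch of that idea, written against modern TensorFlow 2 rather than the Keras 1.x code in the question; embedding_layer, model_body, loss_fn and optimizer are hypothetical placeholders for your own pieces:

# Minimal sketch of embedding-level adversarial training: add epsilon * g / ||g||
# to the embeddings, where g is the gradient of the clean loss w.r.t. the embeddings.
import tensorflow as tf

EPSILON = 1.0  # perturbation norm, a hyperparameter

def adversarial_train_step(x_tokens, y_true, embedding_layer, model_body,
                           loss_fn, optimizer):
    with tf.GradientTape() as outer_tape:
        emb = embedding_layer(x_tokens)
        # gradient of the clean loss w.r.t. the embedded inputs
        with tf.GradientTape() as inner_tape:
            inner_tape.watch(emb)
            clean_loss = loss_fn(y_true, model_body(emb))
        g = tf.stop_gradient(inner_tape.gradient(clean_loss, emb))
        # adversarial perturbation with L2 norm EPSILON per example
        r_adv = EPSILON * tf.math.l2_normalize(g, axis=[1, 2])
        adv_loss = loss_fn(y_true, model_body(emb + r_adv))
        total_loss = clean_loss + adv_loss
    variables = embedding_layer.trainable_variables + model_body.trainable_variables
    grads = outer_tape.gradient(total_loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return total_loss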

Stenophyllous answered 8/9, 2017 at 18:36 Comment(0)
