How to use Keras to build a Part-of-Speech tagger?
I'm trying to implement a Part-of-Speech tagger using a neural network, with the help of Keras.

I'm using a Sequential model, and training data from NLTK's Penn Treebank corpus (i.e. from nltk.corpus import treebank). As I understand it, building a neural network with Keras involves the following steps:

  • Load data
  • Define -> compile -> fit a model
  • Evaluate the model

Specifically, I'm not sure how to pre-process the tagged training data in order to use it in my model. The tagged data comes from NLTK's corpus as key-value pairs, where the key is the English word and the value is the corresponding POS tag.
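
For reference, this is what the tagged data looks like when loaded from NLTK:

from nltk.corpus import treebank

tagged_sents = treebank.tagged_sents()
print(tagged_sents[0][:3])
# [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]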

To be precise, I don't know how to arrange the data in the "data" and "labels" variables in the following code:

model.fit(data, labels, nb_epoch=50, batch_size=32)

Could someone please give me some hints? Thank you so much for your time; I really appreciate your help!

Asked 14/11, 2016 at 16:55

There are many ways to do this, and they depend on the amount of data you have and the time you want to invest. I'll try to give you the mainstream path, which you can improve upon yourself, while citing some of the alternatives. I won't assume prior knowledge of text modeling with deep learning.

One way is to model the problem as multi-class classification, where the classes/label types are all possible POS tags. The two most common ways to frame this with a deep learning model are a window model and a sequence tagger using a recurrent unit.

Let's assume the simpler of the two, the window model. Then you can do the following:

Structuring the data

  1. Chop your corpus into windows of W words (e.g. 3 words), where the center word is the one you want to classify and the other ones are context. Let's call this part of the data X.
  2. For each window, get the POS tag for the center word. Let's call this part of the data y. (See the sketch after this list.)
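
A minimal sketch of this windowing step, using NLTK's Treebank sample; the window size and the padding token are my own choices, not prescribed:

from nltk.corpus import treebank

W = 3              # window size; the center word is the classification target
PAD = '<PAD>'      # hypothetical token for padding at sentence boundaries

X_windows, y_tags = [], []
for sent in treebank.tagged_sents():
    words = [w for w, t in sent]
    tags = [t for w, t in sent]
    padded = [PAD] * (W // 2) + words + [PAD] * (W // 2)
    for i, tag in enumerate(tags):
        X_windows.append(padded[i:i + W])   # context window centered on word i
        y_tags.append(tag)                  # POS tag of the center word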

Encoding the data

Encoding X as vectors

Now neural nets need X encoded as a sequence of vectors. A common choice is to encode each word as a word embedding.

To do so, first you tokenize your text and encode each word as an integer word id (e.g. every occurrence of "cat" will be the number 7). If you don't have your own tokenizer you can use the one bundled with Keras. It takes text and returns a sequence of integers/word ids.
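
For example, with the Keras tokenizer (a minimal sketch; the sample sentence is made up):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['the cat sat on the mat'])   # build the word -> id vocabulary
print(tokenizer.texts_to_sequences(['the cat sat on the mat']))
# e.g. [[1, 2, 3, 4, 1, 5]] -- ids are assigned by word frequency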

Second, you may want to pad and truncate each sequence of word ids so that every instance has the same length (note: there are other ways of handling this). An example from imdb_lstm.py:

from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 20000  # vocabulary size: keep only the most frequent word ids
maxlen = 80           # pad/truncate every sequence to this length

(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Then you can use an Embedding layer to convert the sequence of padded/truncated word ids to a sequence of word embeddings. Example from imdb_lstm.py:

from keras.models import Sequential
from keras.layers import Embedding, LSTM

model = Sequential()
model.add(Embedding(max_features, 128, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))  # try using a GRU instead, for fun

Here the output of the Embedding layer is fed to an LSTM. I list other model options at the end.

Encoding y

To do multi-class classification with Keras one usually uses categorical_crossentropy, which expects the label to be a one-hot vector whose length is the number of possible categories (the number of possible POS tags in your case). You can use Keras' to_categorical. Note that it expects a vector of integers where each integer represents a class (e.g. NNP could be 0, VBD could be 1, and so on):

def to_categorical(y, nb_classes=None):
    '''Convert class vector (integers from 0 to nb_classes) to binary class matrix, for use with categorical_crossentropy.
    # Arguments
        y: class vector to be converted into a matrix
        nb_classes: total number of classes
    # Returns
        A binary matrix representation of the input.
    '''
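
For example (the tag-to-id mapping below is hypothetical; you'd build it from your own tag set):

from keras.utils.np_utils import to_categorical

tag_to_id = {'NNP': 0, 'VBD': 1, 'DT': 2}   # made-up mapping from tags to integers
y_int = [tag_to_id[t] for t in ['NNP', 'VBD', 'NNP']]
print(to_categorical(y_int, nb_classes=len(tag_to_id)))
# [[ 1.  0.  0.]
#  [ 0.  1.  0.]
#  [ 1.  0.  0.]]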

Model options

Since in this line of solution you would basically be doing multi-class classification, you can follow any of the imdb_ examples from the Keras examples. These are actually binary text classification examples. To make them multi-class you need to use a softmax instead of a sigmoid as the final activation function, and categorical_crossentropy instead of binary_crossentropy, like in the mnist_ examples:

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])
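
Putting it together for the window model described above, here's a sketch under my own assumptions: the vocabulary size, embedding size, and tag count are placeholders, and the random arrays only stand in for your real encoded windows and tags:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Activation, Dropout
from keras.utils.np_utils import to_categorical

vocab_size = 10000   # placeholder: number of distinct word ids from your tokenizer
nb_classes = 46      # placeholder: number of distinct POS tags in your data
W = 3                # window size

model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=W))  # word ids -> 128-dim embeddings
model.add(Flatten())                                   # (W, 128) -> (W * 128,)
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

# Random stand-ins for your real encoded windows ("data") and one-hot tags ("labels"):
data = np.random.randint(0, vocab_size, size=(128, W))
labels = to_categorical(np.random.randint(0, nb_classes, size=(128,)), nb_classes)
model.fit(data, labels, nb_epoch=5, batch_size=32)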

Answered 15/11, 2016 at 3:34
Comment: Thank you so much! Your answer is so elaborate and helpful! – Enactment
