Truncated Backpropagation in keras with one sequence per batch

Asked 8/11, 2018 at 9:14 Answered 19/8, 2021 at 13:56

python keras deep-learning backpropagation

If I understood correctly, to perform TBPTT in Keras we have to split our sequences into smaller parts of k timesteps. To reuse the state of our LSTM across all the parts of a sequence we have to use the stateful parameter. According to the Keras documentation:

You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch. This assumes a one-to-one mapping between samples in different successive batches.

So if I understand correctly, the 1st sample of the 1st batch is the 1st part of the 1st sequence, the 1st sample of the 2nd batch is the 2nd part of the 1st sequence, etc. I have 125973 sequences of length 1000, each of which I split into 40 parts of k=25 timesteps. So my model should train on 40 batches, each containing 125973 parts of 25 timesteps. My issue is the memory of my GPU (Quadro K2200, I'm poor): a batch size of 125973 seems to be too much. I'd like to know if it is possible to keep the state of the LSTM inside the same batch and reset it between batches, so that I would have a batch size of 40 and 125973 batches instead.

Here is my model:

from keras.models import Sequential
from keras.layers import Embedding, Dropout, LSTM, TimeDistributed, Dense

# batch_size, k (= 25 timesteps per part) and char_to_num are defined earlier
model = Sequential()
model.add(Embedding(len(char_to_num), 200, mask_zero=True, batch_input_shape=(batch_size, k)))
model.add(Dropout(0.5))
model.add(LSTM(512, activation='relu', return_sequences=True, stateful=True))
model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(len(char_to_num), activation='softmax')))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.summary()

EDIT 2021
Some answers were posted this year, but this is a fairly old question. The state of libraries, DL, and NLP has changed a lot in the meantime and I've moved on from LSTMs to Transformers. I haven't used an LSTM in years and I have neither the plans nor the time to test the answers posted.

Tetartohedral answered 8/11, 2018 at 9:14 Comment(1)

Did u get the answer? – Padding 9/5, 2019 at 16:14

Your batch size is flexible insofar as it must divide P = 125973. If no suitable divisor exists (for example because P is a prime number), just add dummy sequences filled with a thousand zeros each. If you add dummy sequences, make sure they are ignored during training by passing an appropriate "sample_weight" nd-array to model.fit() (where real sequences are weighted with "1" and dummy sequences with "0"), and call model.compile(.., sample_weight_mode='temporal').
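As a quick check of the padding arithmetic, here is a minimal sketch. The helper name `dummies_needed` is hypothetical, and the batch size of 50 is simply the value used in the snippet further down, not a requirement:

```python
def dummies_needed(num_sequences, batch_size):
    """Number of all-zero dummy sequences to append so batch_size divides the total."""
    return (-num_sequences) % batch_size

P = 125973        # real sequences (from the question)
batch_size = 50   # sequences processed in parallel (assumed value)
parts = 40        # parts of k=25 timesteps per sequence of length 1000

padded = P + dummies_needed(P, batch_size)  # sequences after padding
N = parts * padded                          # rows in input_data
batches_per_epoch = N // batch_size

print(padded, N, batches_per_epoch)  # 126000 5040000 100800
```

So 27 dummy sequences are enough here, which is where a figure like 126000 comes from.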

Then, to reset the states between batches, use a Keras callback:

import tensorflow as tf

# N must be divisible by batch_size
N = 40*126000  # number of time series snippets (sequences + dummies)
batch_size = 50  # processing 50 sequences at a time

class StateResetter(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        # reset states once all 40 snippets of the current set of sequences are processed
        if (batch+1) % 40 == 0:
            # assumes the LSTM layer was created with name='my_lstm_layer'
            self.model.get_layer('my_lstm_layer').reset_states()

# input_data.shape = (N, 25, num_features)
# shuffle=False keeps the batch order that stateful training relies on
model.fit(input_data, labels, batch_size=batch_size, shuffle=False,
          callbacks=[StateResetter()], sample_weight=sample_weight)

I guess you should be able to figure out how to shape input_data accordingly.
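One way to read the layout implied by the callback above: within each block of 40 consecutive batches, batch t holds part t of the same 50 sequences, always in the same order, so that sample i of one batch continues sample i of the previous one. A minimal sketch of that row ordering, in plain Python (the function name `tbptt_sample_order` and its arguments are hypothetical, not part of Keras):

```python
def tbptt_sample_order(num_sequences, seq_len, batch_size, k):
    """Return (sequence_index, start, stop) triples in the row order of input_data."""
    if num_sequences % batch_size != 0:
        raise ValueError("num_sequences must be divisible by batch_size (pad with dummies)")
    if seq_len % k != 0:
        raise ValueError("seq_len must be divisible by k")
    parts = seq_len // k
    order = []
    # one group = batch_size sequences that are processed in parallel
    for group_start in range(0, num_sequences, batch_size):
        # consecutive batches walk through the parts of that group in time order
        for part in range(parts):
            # same sequence order in every batch, so states line up sample by sample
            for seq in range(group_start, group_start + batch_size):
                order.append((seq, part * k, (part + 1) * k))
    return order

# toy example: 4 sequences of length 6, batch size 2, k=3 (2 parts per sequence)
order = tbptt_sample_order(num_sequences=4, seq_len=6, batch_size=2, k=3)
print(order[:4])  # [(0, 0, 3), (1, 0, 3), (0, 3, 6), (1, 3, 6)]
```

With the sequences held in a NumPy array `sequences` of shape (num_sequences, seq_len), the rows could then be gathered as `np.stack([sequences[s, a:b] for s, a, b in order])`. The labels and sample weights would need the same ordering.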

Quahog answered 18/2, 2021 at 10:40 Comment(0)

I'd like to know if it is possible to keep the state of the LSTM inside the same batch and reset it between batches...

This is the approach to take in order to train the LSTM model better: the samples in a batch are adjacent to each other in time, and the network trains well when it is trained in a stateful manner for each batch. The memory savings from the smaller batch size are a desirable side effect.

Resetting the state after every batch could be implemented as shown by @Kirgsn.

Mickiemickle answered 19/8, 2021 at 13:56 Comment(0)
