If I understood correctly, to perform truncated backpropagation through time (TBPTT) in Keras we have to split our sequences into smaller parts of k timesteps. To reuse the state of our LSTM across all the parts of a sequence we have to use the stateful parameter, according to the Keras documentation:
> You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch. This assumes a one-to-one mapping between samples in different successive batches.
So if I understand correctly, the 1st sample of the 1st batch is the 1st part of the 1st sequence, the 1st sample of the 2nd batch is the 2nd part of the 1st sequence, and so on. I have 125973 sequences of length 1000 that I split into 40 parts of k=25 timesteps each. So my model should train on 40 batches, each containing 125973 samples of 25 timesteps. My issue is the memory of my GPU (a Quadro K2200; I'm poor): a batch size of 125973 is too large. I'd like to know if it is possible to keep the state of the LSTM inside the same batch and reset it between batches, so that I would have a batch size of 40 and 125973 batches instead.
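To illustrate the splitting described above, here is a small NumPy sketch with hypothetical toy dimensions (6 sequences of length 100 instead of 125973 of length 1000) that produces the batches in the order the stateful mapping assumes:

```python
import numpy as np

# Hypothetical toy dimensions standing in for the real ones
# (125973 sequences of length 1000, k = 25 -> 40 parts).
n_sequences, seq_len, k = 6, 100, 25
n_parts = seq_len // k  # 4 parts per sequence here, 40 in the question

sequences = np.arange(n_sequences * seq_len).reshape(n_sequences, seq_len)

# Split each sequence into successive parts of k timesteps.
# parts[t] is the batch for step t, with shape (n_sequences, k),
# so sample i of batch t+1 continues sample i of batch t --
# the one-to-one mapping that stateful=True assumes.
parts = sequences.reshape(n_sequences, n_parts, k).transpose(1, 0, 2)

print(parts.shape)  # (4, 6, 25)
```

With the real dimensions this yields 40 batches of shape (125973, 25), which is exactly the batch size the GPU cannot hold.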
Here is my model:
from keras.models import Sequential
from keras.layers import Embedding, Dropout, LSTM, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(len(char_to_num), 200, mask_zero=True, batch_input_shape=(batch_size, k)))
model.add(Dropout(0.5))
model.add(LSTM(512, activation='relu', return_sequences=True, stateful=True))
model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(len(char_to_num), activation='softmax')))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.summary()
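For reference, the feeding order this model expects under stateful=True can be sketched as a generator that yields the t-th k-step part of every sequence as one batch, plus a flag marking where the state should be reset. This is a minimal sketch with hypothetical toy data, not the layout I am asking for:

```python
import numpy as np

def stateful_batches(sequences, k):
    """Yield the t-th k-step part of every sequence as one batch,
    in the order a stateful Keras RNN expects, plus a flag that is
    True at the start of a pass, when the state should be reset."""
    n_sequences, seq_len = sequences.shape
    n_parts = seq_len // k
    for t in range(n_parts):
        yield sequences[:, t * k:(t + 1) * k], t == 0

# Hypothetical toy data: 3 sequences of length 50, k = 10.
data = np.arange(150).reshape(3, 50)
flags = [reset for _, reset in stateful_batches(data, 10)]
print(flags)  # [True, False, False, False, False]
```

In a manual training loop one would call model.reset_states() whenever the flag is True and then model.train_on_batch(...) on each yielded batch; what I want instead is a reset after every batch with the state flowing within it.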
EDIT 2021
Some answers have been posted this year, but this is an old question. The state of libraries, deep learning, and NLP has changed a lot in the meantime, and I have moved on from LSTMs to Transformers. I haven't used an LSTM in years, and I have neither the plan nor the time to test the posted answers.