Keras - stateful vs stateless LSTMs
I'm having a hard time conceptualizing the difference between stateful and stateless LSTMs in Keras. My understanding is that at the end of each batch, the "state of the network is reset" in the stateless case, whereas in the stateful case, the state of the network is preserved across batches, and must then be manually reset at the end of each epoch.

My questions are as follows:

  1. In the stateless case, how is the network learning if the state isn't preserved between batches?

  2. When would one use the stateless vs. stateful modes of an LSTM?

Kagera answered 24/9, 2016 at 21:16 Comment(0)

I recommend first learning the concepts of BPTT (Backpropagation Through Time) and mini-batch SGD (Stochastic Gradient Descent); then you'll have a better understanding of the LSTM's training procedure.

For your questions:

Q1. In the stateless case, the LSTM updates its parameters on batch1 and then initializes fresh hidden states and cell states (usually all zeros) for batch2, while in the stateful case it uses batch1's last output hidden states and cell states as the initial states for batch2.

Q2. As you can see above, when the sequences in two consecutive batches are connected (e.g. successive prices of one stock), you should use stateful mode; otherwise (e.g. each sequence represents a complete sentence on its own), you should use stateless mode.
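
To make the contrast concrete, here is a minimal sketch of the two modes, assuming tensorflow.keras; the layer size, shapes, and batch size are illustrative, not prescriptive:

    import tensorflow as tf

    timesteps, features, batch_size = 10, 1, 32

    # Stateless (the default): hidden and cell states are re-initialized to
    # zeros for every batch, so batches are treated as independent.
    stateless = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(timesteps, features)),
        tf.keras.layers.Dense(1),
    ])

    # Stateful: the last hidden/cell states from batch i become the initial
    # states for batch i+1. A fixed batch size is required so Keras can keep
    # one state per sample slot across batches.
    stateful = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, stateful=True,
                             batch_input_shape=(batch_size, timesteps, features)),
        tf.keras.layers.Dense(1),
    ])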

BTW, @vu.pham said that if we use a stateful RNN, then in production the network is forced to deal with infinitely long sequences. This doesn't seem correct: as you can see in Q1, the LSTM won't learn on the whole sequence at once; it first learns the sequence in batch1, updates its parameters, and then learns the sequence in batch2.
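
This per-batch learning is easy to see in a stateful training loop: the weights are updated after every batch, the states are carried forward within an epoch, and you reset them yourself once a full pass over the sequence is done. A sketch, assuming the stateful model from above and pre-batched arrays X and y whose i-th batch immediately follows the (i-1)-th in time (the sample count must be a multiple of batch_size):

    stateful.compile(optimizer="adam", loss="mse")

    for epoch in range(10):                       # epoch count is arbitrary
        # shuffle=False keeps batches in chronological order, so the states
        # carried over from batch i are meaningful for batch i+1; the weights
        # are still updated batch by batch.
        stateful.fit(X, y, batch_size=batch_size, epochs=1, shuffle=False)
        stateful.reset_states()                   # start the next pass from zero states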

Benefic answered 29/3, 2017 at 10:12 Comment(1)
Regarding the note about what @vu.pham said: if the LSTM is stateful, information about previous batches is stored in the hidden states, so the updates on batch2 should depend on batch1, shouldn't they? (This can be regarded as the truncated BPTT of a vanilla RNN, I think: there the backprop uses just a few time steps, but the RNN can still learn long dependencies, longer than the length of the sequence on which gradients are computed.) – Lean
  1. The network still learns the connection between item i and item i+1 within every batch. So if you decide to go with a stateless RNN, very often you would split your series into multiple segments, each of length N. If you feed those segments into the network, it still learns to predict the next element given its knowledge of all the previous elements in the segment (see the sketch after this list).

  2. I believe most people use stateless RNNs in practice, because if we use a stateful RNN, then in production the network is forced to deal with infinitely long sequences, and this might be cumbersome to handle.
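
Here is a hypothetical windowing helper for the stateless setup described in point 1: it splits one long series into segments of length N so the network learns to predict element i+1 from the N elements before it. The helper name and the toy data are made up for illustration:

    import numpy as np

    def make_windows(series, n):
        # One window per position: series[i:i+n] is used to predict series[i+n].
        X = np.stack([series[i:i + n] for i in range(len(series) - n)])
        y = series[n:]                    # the next element after each window
        return X[..., np.newaxis], y      # add a feature axis for the LSTM

    series = np.sin(np.linspace(0, 20, 500))    # toy data
    X, y = make_windows(series, n=10)           # X: (490, 10, 1), y: (490,)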

Algebraist answered 28/9, 2016 at 18:43 Comment(0)
  1. With stateful vs. stateless, we often confuse state with weights. The state gets reset; the weights do not, and that is how the network keeps learning. The hidden states flow within one segment (a sentence, a paragraph, etc.), so the RNN learns the relations between words within a paragraph but ignores relationships between paragraphs (see the sketch after this list).

  2. It's a domain-knowledge question rather than one with a one-size-fits-all answer. If you believe the initial values still carry meaning for the final values, go for stateful. For example, ideas within a chapter may be connected, but is there any significant connection between the first and last chapters? And is that worth the additional compute?
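
A quick self-contained check of the state/weights distinction from point 1, assuming tensorflow.keras (the layer size and shapes are illustrative): reset_states() zeroes the recurrent state but leaves the learned weights untouched.

    import numpy as np
    import tensorflow as tf

    # Toy stateful LSTM with a fixed batch size of 1.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(8, stateful=True, batch_input_shape=(1, 5, 1)),
    ])
    model.predict(np.random.rand(1, 5, 1))    # populate the hidden/cell states

    before = [w.copy() for w in model.get_weights()]
    model.reset_states()                      # states -> zeros; weights untouched
    after = model.get_weights()

    assert all(np.array_equal(a, b) for a, b in zip(before, after))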

Quickel answered 12/6, 2022 at 4:56 Comment(0)
