I'm having a hard time conceptualizing the difference between stateful and stateless LSTMs in Keras. My understanding is that in the stateless case, the state of the network is reset at the end of each batch, whereas in the stateful case, the state is carried over from one batch to the next and must then be manually reset at the end of each epoch.
My questions are as follows:

1. In the stateless case, how is the network learning if the state isn't preserved between batches?
2. When would one use the stateless vs. stateful modes of an LSTM?
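To make the distinction concrete, here is a minimal sketch using the tf.keras 2.x API (the layer sizes, shapes, and random data are purely illustrative, not from any particular tutorial):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

batch_size, timesteps, features = 32, 10, 8  # illustrative shapes

# Stateless (the default): the LSTM's hidden and cell states are
# re-initialised to zero at the start of every batch.
stateless = keras.Sequential([
    keras.Input(shape=(timesteps, features)),
    layers.LSTM(64),
    layers.Dense(1),
])

# Stateful: the final states of batch i become the initial states of
# batch i+1, so the batch size must be fixed and sample k of each batch
# must be the continuation of sample k of the previous batch.
stateful = keras.Sequential([
    keras.Input(batch_shape=(batch_size, timesteps, features)),
    layers.LSTM(64, stateful=True),
    layers.Dense(1),
])
stateful.compile(optimizer="adam", loss="mse")

x = np.random.rand(batch_size * 4, timesteps, features)
y = np.random.rand(batch_size * 4, 1)

for epoch in range(3):
    # shuffle=False keeps consecutive batches in temporal order
    stateful.fit(x, y, batch_size=batch_size, epochs=1, shuffle=False)
    stateful.reset_states()  # manual reset at the end of each epoch
```

Note that in the stateless model nothing needs resetting; learning still happens because the weights (not the states) accumulate what is learned across batches.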
With `stateful`, the information about previous batches is stored in the hidden states, so the updates on batch 2 should depend on batch 1, shouldn't they? (This can be regarded as the truncated BPTT of a vanilla RNN, I think: the backprop uses only a few time-steps, but the RNN can still learn long dependencies, longer than the length of the sequence on which the gradients are computed.) – Lean
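The comment's point can be checked directly. Below is a small sketch (shapes and the 8-unit LSTM are assumptions for illustration) showing that with `stateful=True`, batch 2's forward pass starts from the state batch 1 left behind, even though gradients during training would only flow within each window, as in truncated BPTT:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# One long sequence of 20 steps, split into two consecutive 10-step windows.
long_seq = np.random.rand(1, 20, 4).astype("float32")
batch1, batch2 = long_seq[:, :10], long_seq[:, 10:]

model = keras.Sequential([
    keras.Input(batch_shape=(1, 10, 4)),
    layers.LSTM(8, stateful=True),
])

model(batch1)                         # window 1 leaves its final state behind
out_carried = model(batch2).numpy()   # window 2 starts from that state

model.reset_states()                  # zero the state and replay window 2 alone
out_fresh = model(batch2).numpy()

# The outputs differ: batch 2's output depended on batch 1's final state.
print(np.allclose(out_carried, out_fresh))  # expected: False
```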