In Keras' LSTM implementation, the default mode (stateful=False) treats all samples in a batch as independent: the state is not propagated from one sample to the next. As I understand it, the input sequence length (L) is then the only way to have the LSTM maintain state, which restricts state propagation to a fixed number of time steps, i.e. L. Theoretically, what advantage does this mode of operation have over a feed-forward NN with a fixed-size sliding input window, where each input to the NN is a vector of L consecutive input values?
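To make the comparison concrete, here is a minimal sketch of the two setups I mean (the layer sizes, L = 10, and the random data are arbitrary placeholders, not from a real task):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

L = 10  # window / sequence length (arbitrary choice)

# Stateless LSTM: each sample is a window of L consecutive time steps;
# state is carried across the L steps within a sample, then discarded.
lstm_model = Sequential([
    LSTM(32, stateful=False, input_shape=(L, 1)),
    Dense(1),
])

# Feed-forward baseline: the same window, flattened into one vector of
# length L, so all L values are consumed at once with no recurrence.
ffnn_model = Sequential([
    Dense(32, activation="relu", input_shape=(L,)),
    Dense(1),
])

# The same sliding windows feed both models; only the shape differs.
x = np.random.randn(100, L)             # 100 windows of length L
lstm_model.predict(x[..., np.newaxis])  # shape (100, L, 1)
ffnn_model.predict(x)                   # shape (100, L)
```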
In theory, LSTMs should be able to learn long-range dependencies spanning even 1000 time steps. But doesn't that require L = 1000, since there is no way to capture dependencies longer than the input sequence length? I know that one can use the stateful mode by formatting the input data so that the i-th sample of each batch continues the i-th sample of the previous batch. Still, I am having a hard time understanding what advantage the default LSTM mode has over a feed-forward NN with a sliding window over the input data.
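For reference, here is a rough sketch of the stateful setup I am referring to, where the final state after batch t seeds the i-th sequence of batch t+1, so dependencies can span many short windows (the batch size, layer width, and dummy data are again placeholder assumptions):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

batch_size, L = 4, 10

model = Sequential([
    # stateful=True requires a fixed batch size; sample i of each batch
    # must continue sample i of the previous batch.
    LSTM(32, stateful=True, batch_input_shape=(batch_size, L, 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Four long streams, chopped into consecutive length-L chunks.
streams = np.random.randn(batch_size, 5 * L, 1)
for t in range(5):
    chunk = streams[:, t * L:(t + 1) * L, :]  # batch t of each stream
    target = np.random.randn(batch_size, 1)   # dummy targets
    model.train_on_batch(chunk, target)       # state carries over to t+1

model.reset_states()  # reset manually at the end of each stream/epoch
```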