Problem

I am trying to use 5 years of consecutive, historical data to forecast values for the following year.

Data Structure

My input data input_04_08 looks like this where the first column is the day of the year (1 to 365) and the second column is the recorded input.

1,2
2,2
3,0
4,0
5,0

My output data output_04_08 looks like this, a single column with the recorded output on that day of the year.

I then normalise the values between 0 and 1 so the first sample given to the network would look like

Number of training patterns:  1825
Input and output dimensions:  2 1
First sample (input, target):
[ 0.00273973  0.04      ] [ 0.02185273]

Approach(s)

Feed Forward Network

I have implemented the following code in PyBrain

input_04_08 = numpy.loadtxt('./data/input_04_08.csv', delimiter=',')
input_09 = numpy.loadtxt('./data/input_09.csv', delimiter=',')
output_04_08 = numpy.loadtxt('./data/output_04_08.csv', delimiter=',')
output_09 = numpy.loadtxt('./data/output_09.csv', delimiter=',')

input_04_08 = input_04_08 / input_04_08.max(axis=0)
input_09 = input_09 / input_09.max(axis=0)
output_04_08 = output_04_08 / output_04_08.max(axis=0)
output_09 = output_09 / output_09.max(axis=0)
ds = SupervisedDataSet(2, 1)

for x in range(0, 1825):
    ds.addSample(input_04_08[x], output_04_08[x])

n = FeedForwardNetwork()
inLayer = LinearLayer(2)
hiddenLayer = TanhLayer(25)
outLayer = LinearLayer(1)
n.addInputModule(inLayer)
n.addModule(hiddenLayer)
n.addOutputModule(outLayer)
in_to_hidden = FullConnection(inLayer, hiddenLayer)
hidden_to_out = FullConnection(hiddenLayer, outLayer)
n.addConnection(in_to_hidden)
n.addConnection(hidden_to_out)
n.sortModules()

trainer = BackpropTrainer(n, ds, learningrate=0.01, momentum=0.1)

for epoch in range(0, 100000000): 
    if epoch % 10000000 == 0:
        error = trainer.train()  
        print 'Epoch: ', epoch
        print 'Error: ', error


result = numpy.array([n.activate(x) for x in input_09])

and this gave me the following result with final error of 0.00153840123381

Admittedly, this looks good. However, having read more about LSTM (Long Short-Term Memory) neural networks, and there applicability to time series data, I am trying to build one.

LSTM Network

Below is my code

input_04_08 = numpy.loadtxt('./data/input_04_08.csv', delimiter=',')
input_09 = numpy.loadtxt('./data/input_09.csv', delimiter=',')
output_04_08 = numpy.loadtxt('./data/output_04_08.csv', delimiter=',')
output_09 = numpy.loadtxt('./data/output_09.csv', delimiter=',')

input_04_08 = input_04_08 / input_04_08.max(axis=0)
input_09 = input_09 / input_09.max(axis=0)
output_04_08 = output_04_08 / output_04_08.max(axis=0)
output_09 = output_09 / output_09.max(axis=0)
ds = SequentialDataSet(2, 1)

for x in range(0, 1825):
    ds.newSequence()
    ds.appendLinked(input_04_08[x], output_04_08[x])


fnn = buildNetwork( ds.indim, 25, ds.outdim, hiddenclass=LSTMLayer, bias=True, recurrent=True)
trainer = BackpropTrainer(fnn, ds, learningrate=0.01, momentum=0.1)

for epoch in range(0, 10000000): 
    if epoch % 100000 == 0:
        error = trainer.train()  
        print 'Epoch: ', epoch
        print 'Error: ', error

result = numpy.array([fnn.activate(x) for x in input_09])

This results in a final error of 0.000939719502501, but this time, when I feed the test data, the output plot looks terrible.

Possible Problems

I have looked around here at pretty much all the PyBrain questions, these stood out, but haven't helped me figure things out

I have read a few blog posts, these helped further my understanding a bit, but obviously not enough

Naturally, I have also gone through the PyBrain docs but couldn't find much to help with the sequential dataset bar here.

Any ideas/tips/direction would be welcome.

I think what happened here is that you tried to assign hyperparameter values according to some rule-of-thumb which worked for the first case, but didn't for the second.

1) The error estimate that you're looking at is an optimistic prediction error estimate of the training set. The actual prediction error is high, but because you didn't test your model on unseen data there's no way of knowing it. Elements of statistical learning gives a nice description of this phenomenon. I would highly recommend this book. You can get it online for free.

2) To get an estimator with low prediction error you need to perform hyperparameter tuning. E.g. the number of hidden nodes, learning rate and momentum should be varied and tested on the unseen data to know which combination leads to the lowest prediction error. scikit-learn has GridSearchCV and RandomizedSearchCV to do just that, but they only work on sklearn`s estimators. You can roll your own estimator though, which is described in the documentation. Personally, I think that model selection and model evaluation are two different tasks. For the first one you can just run a single GridSearchCV or RandomizedSearchCV and get a set of best hyperparameters for your task. For model evaluation you need to run a more sophisticated analysis, such as nested cross-validation or even repeated nested cross-validation if you want an even more accurate estimate.

3) I don't know much about LSTM networks, but I see that in the first example you assigned 25 hidden nodes, but for LSTM you only provide 5. Maybe it's not enough to learn the pattern. You could also drop the output bias as done in the example.

P.S. I think this question actually belongs to http://stats.stackexchange.com, where you're likely to get a more detailed answer to your problem.

EDIT: I just noticed that you're teaching the model for 10 million epochs! I think it's a lot and probably part of the overfitting problem. I think it's a good idea to implement early stopping, i.e. stop training if some predefined error is achieved.