Panel data in Keras LSTM

I am looking at panel data, which is structured like this:

D = \{(x^{(k)}_{t}, y^{(k)}_{t}) \mid k=1,\dots,N,\ t=t_0,\dots,t_k \}

where x^{(k)} denotes the k'th sequence, x^{(k)}_{t} denotes the k'th sequence's value at time t, and x^{(k)}_{i,t} is the i'th entry of the vector x^{(k)}_{t}. That is, x^{(k)}_{t} is the feature vector of the k'th sequence at time t. The subscripts and superscripts mean the same for the labels y^{(k)}_{t}, except that y^{(k)}_{t} \in \{0,1\}.

In plain words: the data set contains individuals observed over time, and for each time point at which an individual is observed, it is recorded whether they bought an item or not (y \in \{0,1\}).

I would like to use a recurrent neural network with LSTM units from Keras to predict whether a person will buy an item at a given time point. I have only been able to find examples of RNNs where each sequence has a single label (philipperemy link), not examples where each sequence element has its own label, as in the problem I described.

My approach so far has been to create a tensor with dimensions (samples, timesteps, features), but I cannot figure out how to format the labels so that Keras can match them with the features. It should be something like (samples, timesteps, 1), where the last dimension holds the single label value of 0 or 1.

Furthermore, some of the approaches I have come across split sequences so that subsequences are added to the training data, which increases the memory requirement tremendously (mlmastery link). This is infeasible in my case, as I have multiple GB of data and would not be able to store it in memory if I added subsequences.

The model I would like to use is something like this:

mod = Sequential()
mod.add(LSTM(30,input_dim=116, return_sequences = True))
mod.add(LSTM(10))
mod.add(Dense(2))

Does anyone have experience working with panel data in Keras?

Thermogenesis answered 9/3, 2017 at 11:52 Comment(3)
Math mode doesn't seem to work. I followed this tutorial: meta.math.stackexchange.com/questions/5020/… - Thermogenesis
I am wondering if you are still on Stack Overflow and whether you would mind posting your data and full model? I am trying to learn Keras for panel data, and my data is similar to yours, but there are not many Keras examples for panel data out there. - Mckibben
Hi John, unfortunately I don't have access to the data or the model anymore. - Thermogenesis

Try:

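from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

# timesteps and features are the last two dimensions of the input tensor
mod = Sequential()
mod.add(LSTM(30, input_shape=(timesteps, features), return_sequences=True))
# Keep return_sequences=True so that one output is produced per time step
mod.add(LSTM(10, return_sequences=True))
mod.add(TimeDistributed(Dense(1, activation='sigmoid')))
# In newer Keras versions you can change the line above to mod.add(Dense(1, ..))

mod.compile(loss='binary_crossentropy', optimizer='rmsprop')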
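
To match a model that emits one prediction per time step, the labels need the shape (samples, timesteps, 1) mentioned in the question. The following is only a sketch of how such arrays might be assembled with NumPy. The toy data, the names `sequences`, `labels`, `lengths` and `observed`, and the choice to zero-pad shorter sequences are all assumptions made for illustration and are not part of the original answer.

```python
import numpy as np

# Hypothetical toy panel: 3 individuals observed at 4, 2 and 3 time points,
# each with 5 features (all sizes are made up for illustration).
rng = np.random.default_rng(0)
lengths = [4, 2, 3]
n_features = 5
sequences = [rng.normal(size=(n, n_features)) for n in lengths]
labels = [rng.integers(0, 2, size=n) for n in lengths]

samples = len(sequences)
timesteps = max(lengths)

# Assumption: shorter sequences are zero-padded to the longest one.
X = np.zeros((samples, timesteps, n_features), dtype="float32")
y = np.zeros((samples, timesteps, 1), dtype="float32")
# 1.0 where a time step was really observed, 0.0 where it is padding.
observed = np.zeros((samples, timesteps), dtype="float32")

for k, (x_k, y_k) in enumerate(zip(sequences, labels)):
    n = len(x_k)
    X[k, :n, :] = x_k      # features of individual k
    y[k, :n, 0] = y_k      # one 0/1 label per observed time step
    observed[k, :n] = 1.0

print(X.shape, y.shape, observed.shape)  # (3, 4, 5) (3, 4, 1) (3, 4)
```

Arrays shaped like `X` and `y` are what the model above would expect in `mod.fit(X, y, ...)`. The `observed` array is one possible way to keep track of which steps are padding. How it is passed to Keras (for example as per-step weights, or via the `Masking` layer) depends on the Keras version and is left open here.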
Duhamel answered 10/3, 2017 at 11:43 Comment(1)
Does it matter what batch size you use for panel data? Can the batch size be more than 1 individual? - Cryptogam

When the panel is unbalanced, which I assume is the case since the final time point t_k depends on k in your question, it looks like the only option is to run the LSTM separately for each individual (each individual being one sequence).
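
Feeding one individual at a time could look roughly like the generator below, which yields each sequence as its own batch of size 1. This is a sketch under assumptions: the function name `one_individual_batches` and the toy data are hypothetical, and the model's time dimension would have to be left unspecified (for example `input_shape=(None, features)`) for sequences of different lengths to be accepted.

```python
import numpy as np

def one_individual_batches(sequences, labels):
    """Yield (X, y) pairs with a batch dimension of 1, one per individual."""
    for x_k, y_k in zip(sequences, labels):
        x_k = np.asarray(x_k, dtype="float32")
        y_k = np.asarray(y_k, dtype="float32")
        # (t_k, features) -> (1, t_k, features) and (t_k,) -> (1, t_k, 1)
        yield x_k[np.newaxis, :, :], y_k[np.newaxis, :, np.newaxis]

# Hypothetical toy data: two individuals with 4 and 2 observed time points.
sequences = [np.ones((4, 5)), np.ones((2, 5))]
labels = [[0, 1, 0, 0], [1, 1]]

shapes = [(X.shape, y.shape) for X, y in one_individual_batches(sequences, labels)]
print(shapes)  # [((1, 4, 5), (1, 4, 1)), ((1, 2, 5), (1, 2, 1))]
```

Because only one individual is materialized at a time, no padding or subsequence copies are needed, at the cost of training with a batch size of 1.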

Bobsleigh answered 29/5, 2018 at 19:30 Comment(0)
