I am looking at panel data, which is structured like this:
D = \{(x^{(k)}_{t}, y^{(k)}_{t}) \,|\, t = t_0, \dots, t_k\}_{k=1}^{N}
where x^{(k)} denotes the k'th sequence, x^{(k)}_{t} denotes the k'th sequence's value at time t, and x^{(k)}_{i,t} is the i'th entry of the vector x^{(k)}_{t}. That is, x^{(k)}_{t} is the feature vector of the k'th sequence at time t. The sub- and superscripts mean the same for the label data y^{(k)}_{t}, but here y^{(k)}_{t} \in \{0,1\}.
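To make the structure concrete, here is a minimal sketch of what such a data set looks like in Python (the numbers of individuals, observation times, and features are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 3  # length of each feature vector x^(k)_t (made-up)

# Each individual k has its own observation times t_0, ..., t_k,
# so sequence lengths differ across individuals.
sequence_lengths = [4, 2, 5]  # three individuals, observed 4, 2 and 5 times

# X[k] has shape (t_k, n_features); Y[k] has shape (t_k,) with 0/1 labels.
X = [rng.normal(size=(T, n_features)) for T in sequence_lengths]
Y = [rng.integers(0, 2, size=T) for T in sequence_lengths]
```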
In plain words: the data set contains individuals observed over time, and for each time point at which an individual is observed, it is recorded whether they bought an item or not (y \in \{0,1\}).
I would like to use a recurrent neural network with LSTM units in Keras to predict, at each time point, whether a person will buy an item or not. I have only been able to find examples of RNNs where each sequence has a single label (philipperemy link), not examples where every element of a sequence has a label, as in the problem described above.
My approach so far has been to create a tensor with dimensions (samples, timesteps, features), but I cannot figure out how to format the labels so that Keras can match them with the features. They should presumably be shaped (samples, timesteps, 1), where the last dimension holds the single label value of 0 or 1.
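One way to build such a label tensor is to pad all sequences to a common length by hand in NumPy. This is only a sketch: the helper name pad_panel, the padding value, and the idea of returning a 0/1 mask for the padded steps are my own choices, not anything prescribed by Keras.

```python
import numpy as np

def pad_panel(X, Y, n_features, pad_value=0.0):
    """Pad variable-length sequences into dense tensors.

    X: list of arrays, X[k] with shape (t_k, n_features)
    Y: list of arrays, Y[k] with shape (t_k,)
    Returns features of shape (samples, timesteps, features),
    labels of shape (samples, timesteps, 1), and a 0/1 mask
    marking which timesteps are real observations.
    """
    max_len = max(len(x) for x in X)
    n = len(X)
    X_pad = np.full((n, max_len, n_features), pad_value)
    Y_pad = np.zeros((n, max_len, 1))
    mask = np.zeros((n, max_len))
    for k, (x, y) in enumerate(zip(X, Y)):
        T = len(x)
        X_pad[k, :T] = x
        Y_pad[k, :T, 0] = y
        mask[k, :T] = 1.0
    return X_pad, Y_pad, mask
```

If the padded timesteps should not contribute to the loss, the mask can be passed to Keras as per-timestep sample weights (compiling with sample_weight_mode='temporal' in the Keras versions of that era).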
Furthermore, some of the approaches I have come across split sequences so that the subsequences are added to the training data, which increases the memory requirements tremendously (mlmastery link). That is infeasible in my case: I have multiple GBs of data and would not be able to hold it in memory if I added subsequences.
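One way around the memory problem is a generator that assembles batches on the fly instead of materializing all subsequences. The sketch below assumes the sequences are already available as a list of per-individual arrays; the function name and batch layout are illustrative, not from any library:

```python
import numpy as np

def batch_generator(X, Y, batch_size, n_features, pad_value=0.0):
    """Yield (features, labels) batches, padded to the longest
    sequence *within each batch*, so nothing beyond one batch
    is ever held in dense form at once."""
    while True:  # Keras generator training expects an endless generator
        for start in range(0, len(X), batch_size):
            xs = X[start:start + batch_size]
            ys = Y[start:start + batch_size]
            max_len = max(len(x) for x in xs)
            xb = np.full((len(xs), max_len, n_features), pad_value)
            yb = np.zeros((len(xs), max_len, 1))
            for k, (x, y) in enumerate(zip(xs, ys)):
                xb[k, :len(x)] = x
                yb[k, :len(y), 0] = y
            yield xb, yb
```

The same pattern extends to reading each sequence from disk inside the loop, so the raw GBs never need to be resident in memory.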
The model I would like to use is something like this:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

mod = Sequential()
# input_shape=(None, 116) allows variable-length sequences with 116 features
mod.add(LSTM(30, input_shape=(None, 116), return_sequences=True))
mod.add(LSTM(10, return_sequences=True))  # keep the full sequence of outputs
# one sigmoid output per timestep, matching labels of shape (samples, timesteps, 1)
mod.add(TimeDistributed(Dense(1, activation='sigmoid')))
Does anyone have experience working with panel data in Keras?