How to structure panel data for LSTM and Keras?

I am trying to figure out how to structure my dataset and build the X and y such that it will work with Keras' Stacked LSTM for sequence classification.

I have panel data and want to predict a class label for each observation. I am not sure how to interpret timesteps, or how to shape the input arrays properly given the panel structure.

# Libraries
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
import pandas as pd

# Here is an example of my data
df = pd.read_csv('https://raw.githubusercontent.com/rocketfish88/democ/master/sample2.csv')
df
# Contains a handful of features, a target, year, and id of the observation
   id        year  x1 x2  x3  y
0   A       2015   1   1   1  1
1   A       2016   2   2   2  1
2   A       2017   3   3   3  2
3   A       2018   4   4   4  2
4   B       2015   1   1   1  3
5   B       2016   2   2   2  2
6   B       2017   3   3   3  1
7   B       2018   4   4   4  1
8   C       2015   1   1   1  2
9   C       2016   2   2   2  2
10  C       2017   3   3   3  3
11  C       2018   4   4   4  2

The Keras documentation (keras.io) gives the following example:

data_dim = 16
timesteps = 8
num_classes = 10

# expected input data shape: (batch_size, timesteps, data_dim)
model = Sequential()
model.add(LSTM(32, return_sequences=True,
               input_shape=(timesteps, data_dim)))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32))  # return a single vector of dimension 32
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Generate dummy training data
x_train = np.random.random((1000, timesteps, data_dim))
y_train = np.random.random((1000, num_classes))

# Generate dummy validation data
x_val = np.random.random((100, timesteps, data_dim))
y_val = np.random.random((100, num_classes))

model.fit(x_train, y_train,
          batch_size=64, epochs=5,
          validation_data=(x_val, y_val))

I am fairly lost as to how to transform my dataset into the required shape of (size, timesteps, dimensions).

I appreciate any help!
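One possible reading of the shape question, as a minimal sketch rather than a definitive answer: treat each `id` as one sample, each `year` as one timestep, and `x1`..`x3` as the feature dimension. The sketch rebuilds the sample frame inline instead of downloading it, and it assumes a balanced panel (every unit observed in every year). The choice of one label per unit (the `y` value in the unit's final year) is an assumption made only for illustration, as are the names `feature_cols` and `y_last`.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question so the sketch is self-contained.
df = pd.DataFrame({
    "id":   list("AAAABBBBCCCC"),
    "year": [2015, 2016, 2017, 2018] * 3,
    "x1":   [1, 2, 3, 4] * 3,
    "x2":   [1, 2, 3, 4] * 3,
    "x3":   [1, 2, 3, 4] * 3,
    "y":    [1, 1, 2, 2, 3, 2, 1, 1, 2, 2, 3, 2],
})

feature_cols = ["x1", "x2", "x3"]  # illustrative name, not from the question

# Sort so each unit's rows are contiguous and in time order.
df = df.sort_values(["id", "year"])

n_units = df["id"].nunique()      # samples: one per panel unit
timesteps = df["year"].nunique()  # assumes a balanced panel
n_features = len(feature_cols)

# (rows, features) -> (units, timesteps, features)
X = df[feature_cols].to_numpy(dtype="float32").reshape(n_units, timesteps, n_features)

# Assumption: one label per unit, taken from the unit's final year.
y_last = df.groupby("id")["y"].last().to_numpy()

# Classes are coded 1..3 in the data; shift to 0..2 and one-hot encode
# so the targets match a softmax output with categorical_crossentropy.
num_classes = 3
y = np.eye(num_classes, dtype="float32")[y_last - 1]

print(X.shape, y.shape)
```

Under these assumptions, `timesteps` would be 4 and `data_dim` would be 3 in the Keras example, with `Dense(num_classes, ...)` as the output layer. If a label is wanted for every year rather than one per unit, the targets would instead need a `(n_units, timesteps, num_classes)` shape and the last LSTM layer would need `return_sequences=True`.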

Engineer answered 2/2, 2019 at 19:40 Comment(4)
LSTM is good for modelling sequences, but it's unclear how your data is a sequential classification task. You can just use a feed-forward model with the features as input and y as output. – Gere
My data is time-series cross-sectional (panel) data, which means we have a sequence of repeated observations on the same groups of units over time. Feed-forward examples do not seem to have a time component or repeated observations on the same entity. – Engineer
If there is a fixed number of observations per unit, you can flatten the data so that it becomes (batch_size, timesteps*features) and use a feed-forward model, which is a good thing to try. Otherwise, it is not clear from the question what the timesteps in your data correspond to. – Gere
Interesting! Could you please offer some additional advice on batch_size and on how to translate my current dataset to fit the LSTM expectations with timesteps*features? From reading about this, it seems batch_size means the number of sequences trained together, so I am not sure whether I train all the year 2014 entries together or all of unit A's entries together. Then, how would I go about computing timesteps*features? It doesn't make sense to multiply the timestep by x1, x2, etc. – Engineer
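A short sketch of the flattening suggested in the comments, under the assumption that the data has already been arranged as a 3D array of shape (n_units, timesteps, n_features). Here `timesteps*features` is a column count, not a multiplication of feature values: the 4 yearly rows of 3 features for each unit are laid side by side into one row of 12 columns. The array `X` below is a hypothetical stand-in that mimics the sample data (3 units, 4 years, 3 features).

```python
import numpy as np

# Hypothetical stand-in for the panel: 3 units, 4 timesteps, 3 features,
# where every feature equals 1, 2, 3, 4 over the four years.
X = np.tile(np.arange(1, 5, dtype="float32")[None, :, None], (3, 1, 3))

# Flatten the time and feature axes into one axis of length timesteps * n_features.
X_flat = X.reshape(X.shape[0], -1)

print(X.shape)       # (3, 4, 3)
print(X_flat.shape)  # (3, 12)
print(X_flat[0])     # year-1 features, then year-2 features, and so on
```

With this layout each unit is one sample, so a batch is a set of units (for example A and B together), not a set of rows from the same year.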
