I am trying to figure out how to structure my dataset and build the X and y so that they work with Keras' stacked LSTM for sequence classification.
I have panel data where I am trying to predict classifications. I am not entirely sure how to interpret timesteps, or how to shape the data properly given that it is panel data.
# Libraries
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
import pandas as pd
# Here is an example of my data
df = pd.read_csv('https://raw.githubusercontent.com/rocketfish88/democ/master/sample2.csv')
df
# Contains a handful of features, a target, year, and id of the observation
    id  year  x1  x2  x3  y
0    A  2015   1   1   1  1
1    A  2016   2   2   2  1
2    A  2017   3   3   3  2
3    A  2018   4   4   4  2
4    B  2015   1   1   1  3
5    B  2016   2   2   2  2
6    B  2017   3   3   3  1
7    B  2018   4   4   4  1
8    C  2015   1   1   1  2
9    C  2016   2   2   2  2
10   C  2017   3   3   3  3
11   C  2018   4   4   4  2
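For context, here is one way the panel data above could be reshaped, assuming each `id` is treated as one sequence and each `year` as one timestep (a sketch of one possible framing, not the only valid one; the sample data is rebuilt inline so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data inline (same values as the table above)
df = pd.DataFrame({
    'id':   list('AAAABBBBCCCC'),
    'year': [2015, 2016, 2017, 2018] * 3,
    'x1':   [1, 2, 3, 4] * 3,
    'x2':   [1, 2, 3, 4] * 3,
    'x3':   [1, 2, 3, 4] * 3,
    'y':    [1, 1, 2, 2, 3, 2, 1, 1, 2, 2, 3, 2],
})

features = ['x1', 'x2', 'x3']

# Ensure timesteps are ordered within each sequence
df = df.sort_values(['id', 'year'])

# Each id's 4 yearly rows become one (timesteps, features) block
X = np.stack([g[features].to_numpy() for _, g in df.groupby('id')])
print(X.shape)  # (3, 4, 3) -> (samples, timesteps, features)

# One label per sequence, e.g. the last year's class for each id
y = df.groupby('id')['y'].last().to_numpy()
print(y)  # [2 1 2]
```

Taking the last year's `y` per id is just one choice for collapsing a per-row target into a per-sequence label; which label is "the" target depends on what is being predicted.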
The Keras documentation (keras.io) presents the following example:
data_dim = 16
timesteps = 8
num_classes = 10

# expected input data shape: (batch_size, timesteps, data_dim)
model = Sequential()
model.add(LSTM(32, return_sequences=True,
               input_shape=(timesteps, data_dim)))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32))  # returns a single vector of dimension 32
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Generate dummy training data
x_train = np.random.random((1000, timesteps, data_dim))
y_train = np.random.random((1000, num_classes))

# Generate dummy validation data
x_val = np.random.random((100, timesteps, data_dim))
y_val = np.random.random((100, num_classes))

model.fit(x_train, y_train,
          batch_size=64, epochs=5,
          validation_data=(x_val, y_val))
I am fairly lost as to how to take my dataset and transform it into the proper (size, timesteps, dimensions) shape.
I appreciate any help!
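One detail worth noting: the doc snippet above uses `categorical_crossentropy`, which expects one-hot targets of shape (samples, num_classes), whereas the panel data's `y` column holds integer class labels 1..3. A minimal NumPy sketch of that encoding (the label values here are hypothetical, taken from the sample data):

```python
import numpy as np

y = np.array([2, 1, 2])  # one class label per id, classes coded 1..3
num_classes = 3

# One-hot encode: index an identity matrix by (label - 1)
y_onehot = np.eye(num_classes)[y - 1]
print(y_onehot.shape)  # (3, 3) -> (samples, num_classes)
print(y_onehot)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

Keras also ships a helper for this (`keras.utils.to_categorical`), which does the same thing given zero-based labels.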
Comments:
- You could flatten to `(batch_size, timesteps*features)` to use a feed-forward network, which is a good thing to try. Otherwise, it is not clear from the question what the timesteps in your data correspond to. – Gere
- What is `batch_size`, and how do I translate my current dataset to fit the LSTM expectations with `timesteps*features`? From reading about this, it seems `batch_size` means the number of sequences trained together. So I am not sure if I train all the year `2014` entries together, or all of unit `A`'s entries together. Then, how would I go about computing `timesteps*features`? It doesn't make sense to multiply the timestep by x1, x2, etc. – Engineer