Why Bother With Recurrent Neural Networks For Structured Data?

I have been developing feedforward neural networks (FNNs) and recurrent neural networks (RNNs) in Keras with structured data of the shape [instances, time, features], and the performance of FNNs and RNNs has been the same (except that RNNs require more computation time).

I have also simulated tabular data (code below) where I expected an RNN to outperform an FNN because the next value in the series depends on the previous value; however, both architectures predict correctly.

With NLP data, I have seen RNNs outperform FNNs, but not with tabular data. Generally, when would one expect an RNN to outperform an FNN with tabular data? Specifically, could someone post simulation code with tabular data demonstrating an RNN outperforming an FNN?

Thank you! If my simulation code is not ideal for my question, please adapt it or share a more ideal one!

from keras import models
from keras import layers

from keras.layers import Dense, LSTM

import numpy as np
import matplotlib.pyplot as plt

Two features were simulated over 10 time steps, where the value of the second feature depends on the values of both features at the prior time step.

## Simulate data.

np.random.seed(20180825)

# Start with one random feature, duplicated so both features share an initial value.
X = np.random.randint(50, 70, size = (11000, 1)) / 100
X = np.concatenate((X, X), axis = 1)

# At each time step, append a new random first feature and a second feature
# equal to the mean of both features from the prior time step.
for i in range(10):
    X_next = np.random.randint(50, 70, size = (11000, 1)) / 100
    X = np.concatenate((X, X_next, (0.50 * X[:, -1].reshape(len(X), 1))
        + (0.50 * X[:, -2].reshape(len(X), 1))), axis = 1)

print(X.shape)

## Training and validation data.

split = 10000

# The target is the final value of the second feature; the inputs exclude
# the last time step's two columns.
Y_train = X[:split, -1:].reshape(split, 1)
Y_valid = X[split:, -1:].reshape(len(X) - split, 1)
X_train = X[:split, :-2]
X_valid = X[split:, :-2]

print(X_train.shape)
print(Y_train.shape)
print(X_valid.shape)
print(Y_valid.shape)

FNN:

## FNN model.

# Define model.

network_fnn = models.Sequential()
network_fnn.add(layers.Dense(64, activation = 'relu', input_shape = (X_train.shape[1],)))
network_fnn.add(Dense(1, activation = None))

# Compile model.

network_fnn.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fit model.

history_fnn = network_fnn.fit(X_train, Y_train, epochs = 10, batch_size = 32, verbose = False,
    validation_data = (X_valid, Y_valid))

plt.scatter(Y_train, network_fnn.predict(X_train), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

plt.scatter(Y_valid, network_fnn.predict(X_valid), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

LSTM:

## LSTM model.

# Reshape the flat rows into [instances, time steps, features] for the LSTM.
X_lstm_train = X_train.reshape(X_train.shape[0], X_train.shape[1] // 2, 2)
X_lstm_valid = X_valid.reshape(X_valid.shape[0], X_valid.shape[1] // 2, 2)

# Define model.

network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(64, activation = 'relu', input_shape = (X_lstm_train.shape[1], 2)))
network_lstm.add(layers.Dense(1, activation = None))

# Compile model.

network_lstm.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fit model.

history_lstm = network_lstm.fit(X_lstm_train, Y_train, epochs = 10, batch_size = 32, verbose = False,
    validation_data = (X_lstm_valid, Y_valid))

plt.scatter(Y_train, network_lstm.predict(X_lstm_train), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

plt.scatter(Y_valid, network_lstm.predict(X_lstm_valid), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Vange asked 25/8/2018 at 19:50
Added +1 and hope it'll encourage someone, although I don't expect a useful answer, unfortunately: your question is a bit too broad, and opinionated answers are against the rules here: stackoverflow.com/help/on-topic (that may explain someone's -1). Some say RNNs are good for sequences only, others that CNNs are even better and less computationally expensive, etc. The truth is that finding a good method is still a bit of an art rather than "plumbing", so there are no guaranteed recipes, just experience and analogies. I hope someone will share those. Stack Exchange might be a better place. – Fiddlehead
@fromkerasimportmichael Your question is more concerned with theoretical aspects of machine learning. Please ask these kinds of questions on Cross Validated or Data Science SE. – Stoat
Thanks, @Fiddlehead and @today. I agree that my question isn't strictly concerned with programming; I didn't know SO wasn't for these types of questions, so thanks for telling me. Every time I have a question, I end up on SO, so I took it for granted that it was for everything. I now wish I hadn't offered half of my reputation as a bounty! When I use a CNN, I find that the performance is similar to an RNN but much faster, as you note, isp-zax. Still, an FNN performs just as well, and I haven't been able to find or simulate structured data where an RNN or CNN outperforms an FNN. – Vange
Cross-posted: datascience.stackexchange.com/q/37690/8560, https://mcmap.net/q/458424/-why-bother-with-recurrent-neural-networks-for-structured-data/781723. Please do not post the same question on multiple sites. Each community should have an honest shot at answering without anybody's time being wasted. – Dynah
Sorry! I thought I was asked here to post my question on Data Science SE. I definitely do not want to waste anyone's time. My hope is that the answer to my question might save people time, if the extra compute time of an RNN is not necessary for structured data. – Vange
@today, may I make a request for the future? If you're going to suggest another site, please let the poster know not to cross-post. You can suggest they delete the copy here before they post elsewhere. Hopefully this will provide a better experience for all. Thank you for listening! – Dynah
@Dynah I totally understand, and it was all my fault. Thanks for bringing this up and letting me know. I will certainly keep this in mind in the future. – Stoat
Actually, it was my fault, @Stoat, for not knowing I was asking an out-of-scope question. I appreciate everyone trying to help! – Vange
@fromkerasimportmichael I'm embarrassed to say that my answer had a couple of fatal errors, which made most of the results erroneous. However, not all of what I said is wrong. I'll fix my errors and write a new answer soon. – Eudemonia
@qusai-alothman That sounds great! If the problem is with my simulation, please feel free to adapt it or change it. – Vange
In practice, even in NLP, you see that RNNs and CNNs are often competitive; here's a 2017 review paper that shows this in more detail. In theory, RNNs might handle the full complexity and sequential nature of language better, but in practice the bigger obstacle is usually training the network properly, and RNNs are finicky.

Another problem that might have a chance of working is the balanced-parenthesis problem (either with just parentheses in the strings, or with parentheses alongside other distractor characters). This requires processing the inputs sequentially and tracking some state, and might be easier to learn with an LSTM than with an FFN; a data-generation sketch follows.
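
For what it's worth, here is a minimal sketch (my addition, not part of the original answer) of how such balanced-parenthesis data could be generated; the encoding, string length, and class handling are all arbitrary choices.

import numpy as np

np.random.seed(20180825)

def is_balanced(bits):
    # A string is balanced if the running depth never dips below zero
    # and ends at exactly zero (1 = '(', 0 = ')').
    depth = 0
    for b in bits:
        depth += 1 if b == 1 else -1
        if depth < 0:
            return 0
    return 1 if depth == 0 else 0

n, length = 10000, 20

# Encode '(' as 1 and ')' as 0; shape [instances, time, features] for an RNN.
X = np.random.randint(0, 2, size = (n, length))
Y = np.array([is_balanced(row) for row in X]).reshape(n, 1)
X_rnn = X.reshape(n, length, 1)

# Note: uniformly random strings are mostly unbalanced, so in practice
# one would rebalance the classes before training.
print(X_rnn.shape, Y.mean())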

Update: Some data that looks sequential might not actually have to be treated sequentially. For example, even if you provide a sequence of numbers to add, an FFN will do just as well as an RNN, since addition is commutative; a small check follows. This could also be true of many health problems where the dominating information is not sequential in nature. Suppose a patient's smoking habits are measured every year. From a behavioral standpoint the trajectory is important, but if you're predicting whether the patient will develop lung cancer, the prediction will be dominated simply by the number of years the patient smoked (maybe restricted to the last 10 years for the FFN).
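
To make that concrete, a small check (my addition) showing that the addition target is permutation-invariant, so the ordering carries no information:

import numpy as np

np.random.seed(20180825)

# Ten numbers per instance; the target is their sum.
X = np.random.uniform(0, 1, size = (10000, 10))
Y = X.sum(axis = 1)

# Permuting the "time steps" leaves the target unchanged, so an FFN
# loses nothing by ignoring the ordering.
X_shuffled = X[:, np.random.permutation(X.shape[1])]
print(np.allclose(Y, X_shuffled.sum(axis = 1)))  # True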

So you want to make the toy problem more complex and require taking the ordering of the data into account. Maybe some kind of simulated time series where you want to predict whether there was a spike in the data, but where you don't care about absolute values, only about the relative nature of the spike; one possible simulation is sketched below.
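
Here is one way such data could be simulated (my own sketch; defining a spike as a jump of five row-level standard deviations is an assumption):

import numpy as np

np.random.seed(20180825)

n, steps = 10000, 50

# Random-walk series with a different scale per row, so absolute values
# carry no information about the label.
scales = np.random.uniform(1, 100, size = (n, 1))
X = np.cumsum(np.random.normal(size = (n, steps)), axis = 1) * scales

# Label half the rows and inject a jump of 5 row-level standard deviations
# at a random position, i.e. a spike defined relative to each series.
Y = (np.random.uniform(size = n) < 0.5).astype(int)
spike_pos = np.random.randint(1, steps, size = n)
row_std = X.std(axis = 1)
for i in np.where(Y == 1)[0]:
    X[i, spike_pos[i]:] += 5 * row_std[i]

X_rnn = X.reshape(n, steps, 1)  # [instances, time, features]
print(X_rnn.shape, Y.mean())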

Update 2

I modified your code to show a case where RNNs perform better. The trick was to use more complex conditional logic that is more naturally modeled in LSTMs than in FFNs. The code is below. For 8 columns, we see that the FFN trains in 1 minute and reaches a validation loss of 6.3. The LSTM takes 3x longer to train, but its final validation loss is 6x lower, at 1.06.

As we increase the number of columns, the LSTM has a larger and larger advantage, especially if we add more complicated conditions. For 16 columns, the FFN's validation loss is 19 (and you can see the training curve more clearly, as the model isn't able to instantly fit the data). In comparison, the LSTM takes 11 times longer to train but reaches a validation loss of 0.31, roughly 60 times smaller than the FFN's! You can play around with even larger matrices to see how far this trend extends.

from keras import models
from keras import layers

from keras.layers import Dense, LSTM

import numpy as np
import matplotlib.pyplot as plt
import time

np.random.seed(20180908)

rows = 20500
cols = 10

# Randomly generate Z.
Z = 100 * np.random.uniform(0.05, 1.0, size = (rows, cols))

# Max of the first half and of the second half of each row
# (// for integer indexing).
larger = np.max(Z[:, :cols // 2], axis = 1).reshape((rows, 1))
larger2 = np.max(Z[:, cols // 2:], axis = 1).reshape((rows, 1))
smaller = np.min((larger, larger2), axis = 0)

# The appended last column (the target) is the max of the first half of each row.
Z = np.append(Z, larger, axis = 1)
# Alternative target: the min of the maxes of the two halves.
# Z = np.append(Z, smaller, axis = 1)

# Shuffle the rows.
np.random.shuffle(Z)

## Training and validation data.

split = 10000

# The last column of Z is the target; the rest are the features.
X_train = Z[:split, :-1]
X_valid = Z[split:, :-1]
Y_train = Z[:split, -1:].reshape(split, 1)
Y_valid = Z[split:, -1:].reshape(rows - split, 1)

print(X_train.shape)
print(Y_train.shape)
print(X_valid.shape)
print(Y_valid.shape)

print("Now setting up the FNN")

## FNN model.

tick = time.time()

# Define model.

network_fnn = models.Sequential()
network_fnn.add(layers.Dense(32, activation = 'relu', input_shape = (X_train.shape[1],)))
network_fnn.add(Dense(1, activation = None))

# Compile model.

network_fnn.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fit model.

history_fnn = network_fnn.fit(X_train, Y_train, epochs = 500, batch_size = 128, verbose = False,
    validation_data = (X_valid, Y_valid))

tock = time.time()

print()
print(str('%.2f' % ((tock - tick) / 60)) + ' minutes.')

print("Now evaluating the FNN")

loss_fnn = history_fnn.history['loss']
val_loss_fnn = history_fnn.history['val_loss']
epochs_fnn = range(1, len(loss_fnn) + 1)
print("train loss: ", loss_fnn[-1])
print("validation loss: ", val_loss_fnn[-1])

plt.plot(epochs_fnn, loss_fnn, 'black', label = 'Training Loss')
plt.plot(epochs_fnn, val_loss_fnn, 'red', label = 'Validation Loss')
plt.title('FNN: Training and Validation Loss')
plt.legend()
plt.show()

plt.scatter(Y_train, network_fnn.predict(X_train), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('training points')
plt.show()

plt.scatter(Y_valid, network_fnn.predict(X_valid), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('valid points')
plt.show()

print("LSTM")

## LSTM model.

# Treat each input column as one time step with a single feature.
X_lstm_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_lstm_valid = X_valid.reshape(X_valid.shape[0], X_valid.shape[1], 1)

tick = time.time()

# Define model.

network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(32, activation = 'relu', input_shape = (X_lstm_train.shape[1], 1)))
network_lstm.add(layers.Dense(1, activation = None))

# Compile model.

network_lstm.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fit model.

history_lstm = network_lstm.fit(X_lstm_train, Y_train, epochs = 500, batch_size = 128, verbose = False,
    validation_data = (X_lstm_valid, Y_valid))

tock = time.time()

print()
print(str('%.2f' % ((tock - tick) / 60)) + ' minutes.')

print("now eval")

loss_lstm = history_lstm.history['loss']
val_loss_lstm = history_lstm.history['val_loss']
epochs_lstm = range(1, len(loss_lstm) + 1)
print("train loss: ", loss_lstm[-1])
print("validation loss: ", val_loss_lstm[-1])

plt.plot(epochs_lstm, loss_lstm, 'black', label = 'Training Loss')
plt.plot(epochs_lstm, val_loss_lstm, 'red', label = 'Validation Loss')
plt.title('LSTM: Training and Validation Loss')
plt.legend()
plt.show()

plt.scatter(Y_train, network_lstm.predict(X_lstm_train), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('training')
plt.show()

plt.scatter(Y_valid, network_lstm.predict(X_lstm_valid), alpha = 0.1)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title("validation")
plt.show()
Cattegat answered 7/9/2018 at 4:31
Thanks, @emschorsch! Could you suggest how I could add interactions and dependencies to a simulation that might lead to an RNN outperforming an FNN? My interest is specifically in non-language data. With actual structured data (health data over time with a number of features), an RNN takes around 12 times longer than an FNN to train for no increase in performance (which I did not expect, since it is known that past values affect future values). I thought it might be best to untangle why the performance was equivalent, and whether RNNs are worth the compute time, by starting with simulated data. – Vange
And just to add to that, a 1D CNN runs as fast as an FNN and has the same performance. – Vange
I added some more ideas. It's a good idea to try to apply these to simulated data to understand the methods better, out of curiosity. But I would say if you have a real problem in mind, it's better to just work directly on that. CNNs are probably a great place to start, and I wouldn't bother with RNNs unless you're seeing a bottleneck in performance and you think there's some sequential information that the CNN is failing on. You could try predicting with just subsets of the features and see which ones the network is not actually able to learn and make use of. – Cattegat
Thanks! As part of your answer, would you be willing to adapt my simulation code, or provide a better one? I have tried a number of other simulations, similar to what you have suggested, and in each case the performance of FNNs, RNNs, and CNNs is the same. I am starting to wonder if structured data might be too simple to benefit from RNNs. If you think about the complexity of language and the number of features in an embedding, it makes sense that a complex model is necessary; but maybe structured data is not complex enough to require a recurrent architecture? Sorry for the back and forth. – Vange
I probably won't have time, but if I do, I'll try. Could you add descriptions of the other simulations to the question so I know what hasn't worked for you? – Cattegat
Sure! I have tried variations on sequences, basically. Variation 1: if value A comes before value B in the sequence, then the output value is different than if B comes before A. Variation 2: the function itself depends on a binary feature somewhere else in the sequence, kind of like a switch, where I expected an RNN to remember the switch value but an FFN not to. Variation 3: the current feature value depends on a large number of its prior values. Variation 4: one feature determines how many time steps of the other features the output considers. – Vange
I can't remember everything I tried because I kept updating the code with each new attempt, so I don't have code to refer to. I guess I didn't expect it to be this tricky. – Vange
Thank you for the simulation code! I was able to modify it to find a problem where LSTMs outperform FFNs. I hope that was what you were looking for. – Cattegat
Thanks, @emschorsch! I am excited to try this! So, in your simulation code, Z = np.append(Z, larger, axis = 1), the output Y is the largest value in the first 4 columns? For me, Z[0] looks like: array([23.8178429, 57.98392158, 13.89135917, 39.3997332, 91.5659519, 17.42502143, 25.13522941, 45.0659933, 57.98392158]). – Vange
Yes, exactly: Y is the max of the first 4 columns (the first half, in general). I actually meant to make Y the smaller of the maxes of the two halves, but it seems to demonstrate that the LSTM is better in either case. – Cattegat
Looks really good!! Thanks!! It seems like one can also use the max of the series as Y, and an RNN will outperform an FNN. Adding additional layers to an FNN will help its performance, but the predictions still seem to have higher variability at the high end than at the low end. Neat! – Vange
From playing around some more, if Y is the max of two consecutive numbers in the series, an FNN will model it correctly. If Y is the max of three or more consecutive numbers, an RNN will outperform an FNN. – Vange
