TensorFlow dynamic_rnn deprecation

It seems that tf.nn.dynamic_rnn has been deprecated:

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Please use keras.layers.RNN(cell), which is equivalent to this API

I have checked out keras.layers.RNN(cell), and it says that it supports masking, which I assume can act as a replacement for dynamic_rnn's sequence_length parameter?

This layer supports masking for input data with a variable number of timesteps. To introduce masks to your data, use an Embedding layer with the mask_zero parameter set to True.

But there is no further information, even in the Embedding docs, on how I can use mask_zero=True to accommodate variable sequence lengths. Also, if I am using an Embedding layer just to add a mask, how do I prevent the Embedding from changing my input and from being trained?

This is similar to the question RNN in Tensorflow vs Keras, depreciation of tf.nn.dynamic_rnn(), but I specifically want to know how to use the mask to replace sequence_length.
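
For reference, the pattern I am trying to replace looks roughly like this (a minimal TF 1.x sketch; the shapes and names are illustrative, not from my actual code):

import tensorflow as tf  # TF 1.x

max_len, num_features = 5, 2
inputs = tf.placeholder(tf.float32, [None, max_len, num_features])
seq_len = tf.placeholder(tf.int32, [None])  # true length of each padded sequence
cell = tf.nn.rnn_cell.GRUCell(num_units=1)
# sequence_length stops the recurrence at each sequence's true end
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len, dtype=tf.float32)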

Vishinsky answered 20/3, 2019 at 15:39 Comment(0)

I needed an answer to this too, and figured out what I needed through the link at the bottom of your question.

In short, you do as the answer in the link says, but you 'simply' leave out the Embedding layer if you're not interested in using one. I'd highly recommend reading and understanding the linked answer, as it goes into more detail, as well as the docs on Masking, but here's a modified version which uses a Masking layer over the sequence inputs to replace sequence_length:

import numpy as np
import tensorflow as tf

pad_value = 0.37
# This is our input to the RNN, in [batch_size, max_sequence_length, num_features] shape
test_input = np.array(
[[[1.,   1.  ],
  [2.,   2.  ],
  [1.,   1.  ],
  [pad_value, pad_value], # <- a row/time step which contains all pad_values will be masked through the masking layer
  [pad_value, pad_value]],

 [[pad_value, pad_value],
  [1.,   1.  ],
  [2.,   2.  ],
  [1.,   1.  ],
  [pad_value, pad_value]]])

# Define the mask layer, telling it to mask all time steps that contain all pad_value values
mask = tf.keras.layers.Masking(mask_value=pad_value)
rnn = tf.keras.layers.GRU(
    1,
    return_sequences=True,
    activation=None, # <- these values and below are just used to initialise the RNN in a repeatable way for this example
    recurrent_activation=None,
    kernel_initializer='ones',
    recurrent_initializer='zeros',
    use_bias=True,
    bias_initializer='ones'
)

x = tf.keras.layers.Input(shape=test_input.shape[1:])
m0 = tf.keras.Model(inputs=x, outputs=rnn(x))
m1 = tf.keras.Model(inputs=x, outputs=mask(x))
m2 = tf.keras.Model(inputs=x, outputs=rnn(mask(x)))

print('raw inputs\n', test_input)
print('raw rnn output (no mask)\n', m0.predict(test_input).squeeze())
print('masked inputs\n', m1.predict(test_input).squeeze())
print('masked rnn output\n', m2.predict(test_input).squeeze())

out:

raw inputs
 [[[1.   1.  ]
  [2.   2.  ]
  [1.   1.  ]
  [0.37 0.37]
  [0.37 0.37]]

 [[0.37 0.37]
  [1.   1.  ]
  [2.   2.  ]
  [1.   1.  ]
  [0.37 0.37]]]
raw rnn output (no mask)
 [[  -6.        -50.       -156.       -272.7276   -475.83362 ]
 [  -1.2876     -9.862801  -69.314    -213.94202  -373.54672 ]]
masked inputs
 [[[1. 1.]
  [2. 2.]
  [1. 1.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [1. 1.]
  [2. 2.]
  [1. 1.]
  [0. 0.]]]
masked rnn output
 [[  -6.  -50. -156. -156. -156.]
 [   0.   -6.  -50. -156. -156.]]

Notice how, with the mask applied, the calculations are not performed for time steps where the mask is active (i.e. where the sequence is padded out). Instead, the state from the previous time step is carried forward.
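
You can also see this carry-forward by taking only the final output. Continuing the script above (this extra model is my own addition, not part of the linked answer), a masked RNN with return_sequences=False should return the state after the last unmasked step of each sequence:

rnn_last = tf.keras.layers.GRU(
    1,
    return_sequences=False,  # only return the final output per sequence
    activation=None,
    recurrent_activation=None,
    kernel_initializer='ones',
    recurrent_initializer='zeros',
    use_bias=True,
    bias_initializer='ones'
)
m3 = tf.keras.Model(inputs=x, outputs=rnn_last(mask(x)))
# Both sequences end at their last unmasked step, so both should print the
# final value of their masked output rows above (e.g. -156. for both)
print('masked rnn final output\n', m3.predict(test_input).squeeze())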

A few other points to note:

  • In the linked (and this) example, the RNN is created with various activation and initializer parameters. I assume this is to initialise the RNN to a known state for repeatability of the example. In practice, you would initialise the RNN however you like.
  • The pad value can be any value you specify; padding with zeros is typical. The linked (and this) example uses 0.37, presumably an arbitrary value chosen to show the difference between the raw and masked RNN outputs: with this example's RNN initialisation, a zero input produces little to no change in the output, so 'some' non-zero value (i.e. 0.37) demonstrates the effect of the masking.
  • The Masking docs state that a row/time step is masked only if all of the values for that time step equal the mask value. For example, in the above, a time step of [0.37, 2] would still be fed to the network with those values, whereas a time step of [0.37, 0.37] would be skipped over. A quick check of this rule follows this list.
  • An alternative to masking is to batch sequences of the same length together and train on each bucket separately. For example, with a mix of sequence lengths of 10, 20, and 30, instead of padding them all out to 30 and masking, train on all your length-10 sequences, then the length-20s, then the length-30s. Or if you have, say, lots of length-100 sequences and also lots of length-3, -4, and -5 sequences, you may want to pad the shorter ones to length 5 and train twice: once on the 100s and once on the padded/masked 5s. You will likely gain training speed, at the cost of some accuracy, since you won't be able to shuffle between batches of different sequence lengths (a sketch of this appears after the comments below).
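
To verify the all-values rule, you can inspect the boolean mask the Masking layer computes (this snippet is my own addition, not from the linked answer; True means the time step is kept):

import numpy as np
import tensorflow as tf  # TF 2.x, eager mode

masking = tf.keras.layers.Masking(mask_value=0.37)
t = np.array([[[0.37, 2.  ],    # only one feature equals the mask value -> kept
               [0.37, 0.37]]])  # all features equal the mask value -> masked
print(masking.compute_mask(t).numpy())  # expect [[ True False]]
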
Madonnamadora answered 19/6, 2019 at 9:27 Comment(2)
Do you have any example of your last approach? I can't quite see how that would work without padding. model.fit() has a fixed batch size, right? What if sequences of different lengths form different batch sizes?Spruce
@Spruce The batch size is how many examples the network sees per update (you can use train_on_batch too). The time dimension (the sequence length of each item in the batch) is what matters for an RNN. Each call to fit() requires a batch where all sequences have the same length. Therefore, you would partition your data into, for example, data_seq_len_1 and data_seq_len_2, and call fit(data_seq_len_1) and fit(data_seq_len_2) (or however you're feeding data to the model). However, as you can see, the data then can't be shuffled between the different sequence lengths...Madonnamadora
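
A rough sketch of the bucketing approach from the comment above (model size, data, and variable names are illustrative only):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.GRU(8, input_shape=(None, 2)),  # None: any number of time steps
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# Partition the data by sequence length; each fit() call sees one length only
data_seq_len_10, y_10 = np.random.rand(32, 10, 2), np.random.rand(32, 1)
data_seq_len_20, y_20 = np.random.rand(32, 20, 2), np.random.rand(32, 1)
model.fit(data_seq_len_10, y_10, epochs=1)
model.fit(data_seq_len_20, y_20, epochs=1)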
