tf.datasets input_fn getting error after 1 epoch

So I am trying to switch to an input_fn() using tf.datasets as described in this question. While I have been able to get superior steps/sec using tf.datasets with the input_fn() below, I appear to run into an error after 1 epoch when running this experiment on GCMLE. Consider this input_fn():

def input_fn(...):
    files = tf.data.Dataset.list_files(filenames).shuffle(num_shards)

    dataset = files.apply(tf.contrib.data.parallel_interleave(lambda filename: tf.data.TextLineDataset(filename).skip(1), cycle_length=num_shards))
    dataset = dataset.apply(tf.contrib.data.map_and_batch(lambda row:
        parse_csv_dataset(row, hparams = hparams), 
        batch_size = batch_size, 
        num_parallel_batches = multiprocessing.cpu_count())) 
    dataset = dataset.prefetch(1)
    if shuffle:
        dataset = dataset.shuffle(buffer_size = 10000)
    dataset = dataset.repeat(num_epochs)

    iterator = dataset.make_initializable_iterator()
    features = iterator.get_next()
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)

    labels = {key: features.pop(key) for key in LABEL_COLUMNS}

    return features, labels

I receive the following error on GCMLE:

InvalidArgumentError (see above for traceback): Inputs to operation loss/sparse_softmax_cross_entropy_loss/num_present/Select of type Select must have the same size and shape. Input 0: [74] != input 1: [110]
  [[Node: loss/sparse_softmax_cross_entropy_loss/num_present/Select = Select[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss/sparse_softmax_cross_entropy_loss/num_present/Equal, loss/sparse_softmax_cross_entropy_loss/num_present/zeros_like, loss/sparse_softmax_cross_entropy_loss/num_present/ones_like)]]
  [[Node: global_step/add/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3099_global_step/add", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

This implies that there is a shape mismatch (Input 0: [74] != input 1: [110]), however my old queue-based input_fn() works fine on the exact same data, so I do not believe it is an issue with the underlying data. This happens at what I believe to be the end of the first epoch (the number of steps at which the GCMLE error occurs is right around num_train_examples/batch_size), so I am guessing that the issue might be that the final batch does not equal the batch_size of 110 (as it shows up in the error) and instead contains only 74 examples. Can anybody confirm that this is the error? Assuming that it is, is there some other flag that I need to set so that the last batch can be something other than the specified batch size of 110?

For what it's worth, I have replicated this behavior with two different datasets (each trains for multiple epochs with the old queue-based input_fn, but gets hung up at the end of the first epoch with the tf.data input_fn).

Enfeeble answered 10/5, 2018 at 0:50 Comment(0)

As Robbie suggests in the other answer, it looks like your old implementation used fixed batch sizes throughout (presumably using an API like tf.train.batch() or one of its wrappers with the default argument of allow_smaller_final_batch=False), and the default behavior of batching in tf.data (via tf.data.Dataset.batch() and tf.contrib.data.map_and_batch()) is to include the smaller final batch.
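
To make that difference concrete, here is a minimal standalone sketch (not the question's pipeline; the numbers simply mirror the 110 and 74 in the error message) showing that tf.data keeps the short final batch by default:

import numpy as np
import tensorflow as tf

# 294 elements batched by 110 yields batches of 110, 110, and finally 74.
dataset = tf.data.Dataset.from_tensor_slices(np.arange(294))
dataset = dataset.batch(110)

next_batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    while True:
        try:
            print(sess.run(next_batch).shape)  # (110,), (110,), (74,)
        except tf.errors.OutOfRangeError:
            break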

The bug is most likely in the model_fn. Without seeing that function, it is difficult to guess, but I suspect that there is either an explicit (and incorrect) assertion of a tensor's shape via Tensor.set_shape() (possibly in library code) or a bug in the implementation of tf.losses.sparse_softmax_cross_entropy().

First, I am assuming that the features and labels tensors returned from input_fn() have statically unknown batch size. Can you confirm that by printing the features and labels objects, and ensuring that their reported Tensor.shape properties have None for the 0th dimension?

Next, locate the call to tf.losses.sparse_softmax_cross_entropy() in your model_fn. Print the object that is passed as the weights argument to this function, which should be a tf.Tensor, and locate its static shape. Given the error you are seeing, I suspect it will have a shape like (110,), where 110 is your specified batch size. If that is the case, there is a bug in model_fn that incorrectly asserts that the shape of the weights is a full batch, when it might not be. (If that is not the case, then there's a bug in tf.losses.sparse_softmax_cross_entropy()! Please open a GitHub issue with an example that enables us to reproduce the problem.)
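
A minimal way to do both checks (features, labels, and weights here stand in for whatever names your model_fn actually uses):

# Near the top of model_fn:
print('FEATURES:', features)   # each tensor should report shape=(?, ...)
print('LABELS:', labels)

# Immediately before the tf.losses.sparse_softmax_cross_entropy(...) call:
print('WEIGHTS:', weights)     # shape=(?,) is fine; shape=(110,) points at the bug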

Aside: Why would this explain the bug? The code that calls the failing tf.where() op looks like this (edited for readability):

num_present = tf.where(tf.equal(weights, 0.0),  # This input is shape [74]
                       tf.zeros_like(weights),  # This input is shape [110]
                       tf.ones_like(weights)    # This input is probably [110]
)

This flavor of tf.where() op (named "Select" in the error message for historical reasons) requires that all three inputs have the same size. Superficially, tf.equal(weights, 0.0), tf.ones_like(weights), and tf.zeros_like(weights) all have the same shape, which is the shape of weights. However, if the static shape (the result of Tensor.shape) differs from the dynamic shape, then the behavior is undefined.

What actually happens? In this particular case, let's say the static shape of weights is [110], but the dynamic shape is [74]. The static shape of our three arguments to tf.where() will be [110]. The implementation of tf.equal() doesn't care that there's a mismatch, so its dynamic shape will be [74]. The implementations of tf.zeros_like() and tf.ones_like() use an optimization that ignores that dynamic shape when the static shape is fully defined, and so their dynamic shapes will be [110], causing the error you are seeing.
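
Here is a standalone sketch of that failure mode (assuming TF 1.x graph mode; the identity plus set_shape() below stands in for whatever in your model_fn is asserting a full batch):

import tensorflow as tf

weights_in = tf.placeholder(tf.float32, shape=[None])  # true batch size unknown
weights = tf.identity(weights_in)
weights.set_shape([110])  # incorrect static assertion: "every batch has 110 elements"

# Same structure as the op shown in the aside above.
num_present = tf.where(tf.equal(weights, 0.0),        # dynamic shape follows the feed: [74]
                       tf.zeros_like(weights),        # constant-folded to shape [110]
                       tf.ones_like(weights))         # constant-folded to shape [110]

with tf.Session() as sess:
    # Feeding 74 elements raises: "Inputs to operation ... of type Select must
    # have the same size and shape. Input 0: [74] != input 1: [110]"
    sess.run(num_present, feed_dict={weights_in: [0.0] * 74})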

The proper fix is to locate the code that is asserting a fixed batch size in your model_fn, and remove it. The optimization and evaluation logic in TensorFlow is robust to variable batch sizes, and this will ensure that all of your data is used in the training and evaluation processes.
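
For example, if the model_fn (or a helper it calls) contains something like the hypothetical line below, removing it is the whole fix; the downstream ops already handle a None batch dimension:

# Hypothetical example of the kind of assertion to look for and delete:
weights.set_shape([hparams.batch_size])  # wrongly promises a full batch on every step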

A less desirable short-term fix would be to drop the small batch at the end of the data. There are a couple of options here (a sketch of both follows this list):

  • Drop some data randomly at the end of each epoch:

    • With TF 1.8 or later, pass drop_remainder=True to tf.contrib.data.map_and_batch().
    • With TF 1.7 or earlier, use dataset = dataset.filter(lambda features: tf.equal(tf.shape(features[LABEL_COLUMNS[0]])[0], batch_size)) after the map_and_batch.
  • Drop the very last batch of data:

    • Move the dataset.repeat(NUM_EPOCHS) before the map_and_batch() and then apply one of the two fixes mentioned above.
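
Here is what those options look like against the question's input_fn (a sketch; drop_remainder only exists on tf.contrib.data.map_and_batch() from TF 1.8, and parse_csv_dataset, hparams, etc. come from the question's code):

# TF 1.8+: drop the short batch inside the fused map-and-batch.
dataset = dataset.apply(tf.contrib.data.map_and_batch(
    lambda row: parse_csv_dataset(row, hparams=hparams),
    batch_size=batch_size,
    num_parallel_batches=multiprocessing.cpu_count(),
    drop_remainder=True))

# TF 1.7: keep only batches that contain exactly batch_size examples.
dataset = dataset.filter(
    lambda features: tf.equal(tf.shape(features[LABEL_COLUMNS[0]])[0], batch_size))

# To lose only the very last batch of the whole run (instead of one batch per
# epoch), move dataset.repeat(num_epochs) above the map_and_batch() call and
# keep one of the two fixes above.
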
Babara answered 18/5, 2018 at 15:12 Comment(4)
Sorry that I took so long to reply, and I really appreciate your thoughtful response. The thing is that I was actually using tf.train.batch() with allow_smaller_final_batch=True, not False, and that code works fine. That is why I am a little confused: if my previous code had required allow_smaller_final_batch=False I would agree that the batch size was hard-coded, but since it works fine while allowing a smaller final batch, wouldn't you expect the tf.data approach to work as well, even with a smaller batch, since the queue approach does? Nothing in the model_fn changes between approaches...Enfeeble
I also printed features and labels -- my input_fn() is returning a dict of tensors, and I can see that the shapes all have ? in the 0th dimension. For example: FEATURES: {'input_sequence': <tf.Tensor 'IteratorGetNext:1' shape=(?, 100) dtype=int64>, 'input_x': <tf.Tensor 'IteratorGetNext:2' shape=(?,) dtype=float32>} LABELS: {'label1': <tf.Tensor 'IteratorGetNext:3' shape=(?,) dtype=int64>, 'label2': <tf.Tensor 'IteratorGetNext:4' shape=(?,) dtype=int64>}. Moving on to the sparse_softmax call next...Enfeeble
The weights that are being passed also have a ? dimension: WEIGHTS: Tensor("IteratorGetNext:3", shape=(?,), dtype=float32, device=/device:CPU:0) -- I have also kicked off a new experiment with drop_remainder=True to see if this solves the issue. However, given that the queue-based method works with allow_smaller_final_batch=True and all of the shapes appear to have None as the 0th dimension, is it possible that the dimension constraint is being introduced by something in my tf.data input_fn?Enfeeble
Update: In TF 1.8 (recently released for GCMLE) if I pass drop_remainder=True then everything works, but I'm still suspicious of what was going on previously that was causing issues...Enfeeble

It seems that some operation in your graph (from the error message, likely sparse_softmax_cross_entropy_loss) is expecting a fixed batch size. It may be your code (not part of the input_fn) that is enforcing this (e.g. passing batch_size as the shape of some tensor that is used in an op), or it may be one of the TF libraries.

This is not always a problem per se. However, note the documented behavior of tf.data.Dataset.batch:

NOTE: If the number of elements (N) in this dataset is not an exact multiple of batch_size, the final batch will contain smaller tensors with shape N % batch_size in the batch dimension. If your program depends on the batches having the same shape, consider using the tf.contrib.data.batch_and_drop_remainder transformation instead.

As currently written, your (non-input_fn) code falls into the category of depending on the batches having the same shape.

Your options are to track down where the code is passing through a static batch size or to "drop the remainder". I believe the former is preferable, but more work.

If you choose the latter, note that you are not actually using tf.data.Dataset.batch, but rather tf.contrib.data.map_and_batch which accepts a drop_remainder parameter.
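
Note that the drop_remainder parameter only appeared on map_and_batch() in TF 1.8 (see the comments below); on TF 1.7 the equivalent is the batch_and_drop_remainder transformation mentioned in the quoted note. A sketch against the question's pipeline, splitting the fused transformation into a separate map and batch step:

dataset = dataset.map(lambda row: parse_csv_dataset(row, hparams=hparams),
                      num_parallel_calls=multiprocessing.cpu_count())
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))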

Cookgeneral answered 10/5, 2018 at 5:37 Comment(7)
Thank you for the response! In my old input_fn I found that tf.train.shuffle_batch() is set to allow_smaller_final_batch=True, which "allows the final batch to be smaller if there are insufficient items left in the queue." So it makes sense to me why the old input_fn works, but I'm not sure that drop_remainder gives the same behavior, as it is a boolean flag "representing whether the last batch should be dropped". I would prefer not to just drop the last batch -- is there a way to implement the same allow_smaller_final_batch behavior?Enfeeble
Did you try using dataset.make_one_shot_iterator() instead of dataset.make_initializable_iterator()? (I think you would also be able to drop the line that adds the initializer to the collection.)Cookgeneral
I got an error when running drop_remainder=True as it doesn't appear to be in TF 1.7 and I believe that is the newest GCMLE version (github.com/tensorflow/tensorflow/blob/r1.7/tensorflow/contrib/…).Enfeeble
I also tried to use dataset.make_one_shot_iterator() as I know it should be easier, but that also led to an error: "Dataset.make_one_shot_iterator() does not support datasets that capture stateful objects, such as a Variable or LookupTable. In these cases, use Dataset.make_initializable_iterator(). (Original error: Cannot capture a stateful node (name:create_lookup_tables/string_to_index/hash_table, type:HashTableV2) by value.)"Enfeeble
I still haven't had any luck here. Maybe the fused map_and_batch() is forcing some sort of batch size and perhaps if I split them up it might not be as rigid?Enfeeble
You could try to track down the part of your code that requires a fixed batch size. That's a bit frustrating since your old code worked, but it may be the best solution since we can't seem to get the others to quite work as you'd like. Also, TF 1.8 should be available in ~1 week.Cookgeneral
Contacted mrry@ who has a much more thorough answer :)Cookgeneral