So I am trying to switch to an input_fn() that uses the tf.data Dataset API, as described in this question. While I have been able to get superior steps/sec with the input_fn() below, I appear to run into an error after 1 epoch when running this experiment on GCMLE. Consider this input_fn():
def input_fn(...):
    # List the input shards and shuffle the file order
    files = tf.data.Dataset.list_files(filenames).shuffle(num_shards)
    # Interleave reads across shards, skipping each file's header row
    dataset = files.apply(tf.contrib.data.parallel_interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length=num_shards))
    # Parse and batch the CSV rows in one fused op
    dataset = dataset.apply(tf.contrib.data.map_and_batch(
        lambda row: parse_csv_dataset(row, hparams=hparams),
        batch_size=batch_size,
        num_parallel_batches=multiprocessing.cpu_count()))
    dataset = dataset.prefetch(1)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_initializable_iterator()
    features = iterator.get_next()
    # Ensure the iterator gets initialized along with the table initializers
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)
    labels = {key: features.pop(key) for key in LABEL_COLUMNS}
    return features, labels
I receive the following error on GCMLE:
InvalidArgumentError (see above for traceback): Inputs to operation loss/sparse_softmax_cross_entropy_loss/num_present/Select of type Select must have the same size and shape. Input 0: [74] != input 1: [110]
[[Node: loss/sparse_softmax_cross_entropy_loss/num_present/Select = Select[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss/sparse_softmax_cross_entropy_loss/num_present/Equal, loss/sparse_softmax_cross_entropy_loss/num_present/zeros_like, loss/sparse_softmax_cross_entropy_loss/num_present/ones_like)]]
[[Node: global_step/add/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3099_global_step/add", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
This implies that there is a shape mismatch (Input 0: [74] != input 1: [110]); however, my old queue-based input_fn() works fine on the exact same data, so I do not believe it is an issue with the underlying data. This happens at what I believe to be the end of the first epoch (the step count at which the GCMLE error appears is right around num_train_examples/batch_size), so my guess is that the final batch does not contain the full batch_size of 110 (as it shows up in the error) and instead has only 74 examples. Can anybody confirm that this is the error? Assuming that it is, is there some other flag that I need to set so that the last batch can be something other than the specified batch size of 110?
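In case it helps frame the question, the change I am considering (assuming the short final batch really is the culprit) is to explicitly drop the remainder when batching. This is only a sketch against the snippet above, and the drop_remainder argument on map_and_batch only exists in newer TF 1.x releases, so the batch_and_drop_remainder transformation is the more conservative variant:

    # Option A: keep the fused op but drop the short final batch
    # (drop_remainder is only available in newer versions of tf.contrib.data.map_and_batch)
    dataset = dataset.apply(tf.contrib.data.map_and_batch(
        lambda row: parse_csv_dataset(row, hparams=hparams),
        batch_size=batch_size,
        num_parallel_batches=multiprocessing.cpu_count(),
        drop_remainder=True))

    # Option B: map and batch separately, dropping the remainder with the dedicated transformation
    dataset = dataset.map(
        lambda row: parse_csv_dataset(row, hparams=hparams),
        num_parallel_calls=multiprocessing.cpu_count())
    dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))

Dropping the remainder would avoid the 74-example batch, but it also silently discards up to batch_size - 1 examples per epoch; since repeat() currently comes after the batching step, another option would be to move repeat() ahead of it so batches span epoch boundaries and only the very last batch of training is short.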
For what it's worth, I have replicated this behavior with two different datasets (both train for multiple epochs with the old queue-based input_fn, and both get hung up at the end of the first epoch with the tf.data input_fn).
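For reference, the batching in the old queue-based input_fn looks roughly like this (a minimal sketch; read_csv_example stands in for my actual reader/parsing code and is not the real function name):

    # Rough shape of the old queue-based pipeline I am comparing against
    example = read_csv_example(filenames, hparams)
    features = tf.train.batch(
        example,
        batch_size=batch_size,
        num_threads=multiprocessing.cpu_count(),
        capacity=batch_size * 10,
        allow_smaller_final_batch=True)  # the last batch of each epoch can be < 110 and training still runs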
My old queue-based pipeline uses tf.train.batch() with allow_smaller_final_batch=True instead of False, and the code works fine. That is why I am a little confused: if my previous code had required allow_smaller_final_batch=False, I would agree that the batch size was hard-coded somewhere, but since it works fine when a smaller final batch is allowed, wouldn't you expect the tf.data approach to work as well, even with a smaller batch, since the queue approach does? Nothing in the model_fn changes between approaches... – Enfeeble