When writing checkpoint files using a tf.train.MonitoredTrainingSession, it somehow writes multiple metagraphs. What am I doing wrong?
I stripped it down to the following code:
import tensorflow as tf

output_path = "../train/"  # assumed checkpoint root, inferred from the --logdir used below

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

with tf.train.MonitoredTrainingSession(master='',
                                       is_chief=True,
                                       checkpoint_dir=None,
                                       hooks=hooks,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=None,
                                       save_summaries_secs=None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Running this gives duplicate metagraphs, as evidenced by the TensorBoard warning:
$ tensorboard --logdir ../train/test1/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
This is in tensorflow 1.2.0 (I cannot upgrade).
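To see the duplication independently of TensorBoard, here is a small diagnostic sketch (not from the original run) that counts the graph-bearing events in the event files; it assumes tf.train.summary_iterator from TF 1.x and the test1/ckpt directory used above:

import glob
import tensorflow as tf

def count_graph_events(events_dir):
    # Count events that carry a graph_def or meta_graph_def; more than one
    # such event per run is what triggers the TensorBoard warning.
    for path in glob.glob(events_dir + "/events.out.tfevents.*"):
        graphs = sum(1 for e in tf.train.summary_iterator(path)
                     if e.WhichOneof("what") in ("graph_def", "meta_graph_def"))
        print(path, "->", graphs, "graph event(s)")

count_graph_events("../train/test1/ckpt")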
Running the same thing without a monitored session gives the right checkpoint output:
import tensorflow as tf

output_path = "../train/"  # same assumed checkpoint root as above

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(30):
        gs, _ = sess.run([global_step, train])
        print(gs)
        if i % 10 == 0:
            saver.save(sess, output_path + '/test2/my-model', global_step=gs)
            print("Saved ckpt")
This results in no TensorBoard warnings:
$ tensorboard --logdir ../train/test2/ --port=6006
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
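As a side check (not part of the original post), the checkpoint state written by the plain-Session run can be listed with tf.train.get_checkpoint_state:

import tensorflow as tf

# Directory derived from output_path + '/test2/' in the snippet above.
ckpt = tf.train.get_checkpoint_state("../train/test2/")
if ckpt:
    print(ckpt.model_checkpoint_path)       # newest checkpoint
    print(ckpt.all_model_checkpoint_paths)  # all retained checkpoints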
I'd like to fix this, as I suspect I'm missing something fundamental and this error may be connected to other issues I have in distributed mode. I have to restart TensorBoard any time I want to update the data, and TensorBoard gets really slow over time when it emits many of these warnings.
There is a related question: "tensorflow Found more than one graph event per run". In that case the errors were due to multiple runs (with different parameters) being written to the same output directory. The case here is a single run writing to a clean output directory.
Running the MonitoredTrainingSession version in distributed mode gives the same errors.
Update Oct-12
@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:
import tensorflow as tf

output_path = "../train/"  # same assumed checkpoint root as above

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                            master='',
                                            config=None,
                                            checkpoint_dir=None,
                                            checkpoint_filename_with_path=None)

with tf.train.MonitoredSession(session_creator=chiefsession,
                               hooks=hooks,
                               stop_grace_period_secs=120) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Unfortunately this still gives the same TensorBoard warning:
$ tensorboard --logdir ../train/test3/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
By the way, each code block is stand-alone: copy-paste it into a Jupyter notebook and you will replicate the problem.
Set checkpoint_dir = your_path when creating the MonitoredTrainingSession and it will just work fine. – Gifu
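A minimal sketch of that suggestion, assuming the same global_step/train graph as above and a hypothetical your_path checkpoint directory; instead of attaching an explicit CheckpointSaverHook, the session's built-in checkpointing is used:

import tensorflow as tf

your_path = "../train/test4/ckpt/"  # hypothetical checkpoint directory

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)

# Let MonitoredTrainingSession manage the Saver and the checkpoint schedule itself.
with tf.train.MonitoredTrainingSession(master='',
                                       is_chief=True,
                                       checkpoint_dir=your_path,
                                       save_checkpoint_secs=60,
                                       save_summaries_steps=None,
                                       save_summaries_secs=None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        gs, _ = mon_sess.run([global_step, train])
        print(gs)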