MonitoredTrainingSession writes more than one metagraph event per run
When writing checkpoint files using a tf.train.MonitoredTrainingSession it somehow writes multiple metagraphs. What am I doing wrong?

I stripped it down to the following code:

import tensorflow as tf

output_path = "../train/"  # base directory; the TensorBoard command below points at ../train/test1/
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

with tf.train.MonitoredTrainingSession(master='',
                                       is_chief=True,
                                       checkpoint_dir=None,
                                       hooks=hooks,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=None,
                                       save_summaries_secs=None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError):
            break

Running this produces duplicate metagraph events, as evidenced by the TensorBoard warning:

$ tensorboard --logdir ../train/test1/ --port=6006

WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at localhost:6006 (Press CTRL+C to quit)

This is on TensorFlow 1.2.0 (I cannot upgrade).

Running the same thing without a monitored session gives the right checkpoint output:

import tensorflow as tf

output_path = "../train/"  # same base directory as above
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(30):
        gs, _ = sess.run([global_step, train])
        print(gs)
        if i % 10 == 0:
            saver.save(sess, output_path + 'test2/my-model', global_step=gs)
            print("Saved ckpt")

This produces no TensorBoard warnings:

$ tensorboard --logdir ../train/test2/ --port=6006

Starting TensorBoard 54 at localhost:6006 (Press CTRL+C to quit)

I'd like to fix this, as I suspect I'm missing something fundamental, and this error may be connected to other issues I have in distributed mode. I have to restart TensorBoard any time I want to update the data, and TensorBoard seems to get really slow over time when it emits many of these warnings.

There is a related question: "tensorflow Found more than one graph event per run". In that case the errors were due to multiple runs (with different parameters) being written to the same output directory. The case here is a single run into a clean output directory.

Running the MonitoredTrainingSession version in distributed mode gives the same errors.

Update Oct-12

@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:

import tensorflow as tf

output_path = "../train/"  # base directory, as above
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                            master='',
                                            config=None,
                                            checkpoint_dir=None,
                                            checkpoint_filename_with_path=None)
with tf.train.MonitoredSession(session_creator=chiefsession,
                               hooks=hooks,
                               stop_grace_period_secs=120) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError):
            break

Unfortunately, this still gives the same TensorBoard warning:

$ tensorboard --logdir ../train/test3/ --port=6006

WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at localhost:6006 (Press CTRL+C to quit)

BTW, each code block is stand-alone; copy-paste it into a Jupyter notebook and you will replicate the problem.

Misdemeanor answered 8/10, 2017 at 22:10 Comment(2)
I'm sure you're aware of this, but just in case... if you specify checkpoint_dir=your_path when creating the MonitoredTrainingSession it will just work fine.Gifu
Thanks for the suggestion; I just tried adding the checkpoint_dir to both the MonitoredSession and the hook. No difference though. It "works just fine", kind of, but I still have the issue of multiple graph events.Misdemeanor

I wonder if this is because every node in your cluster is running the same code, declaring itself as a chief, and saving out graphs and checkpoints.

I don't know if is_chief = True is just illustrative in the post here on Stack Overflow or exactly what you are using... so I'm guessing a bit here.

I personally used MonitoredSession instead of MonitoredTrainingSession and created a list of hooks based on whether the code is running on the master/chief or not. Example: https://github.com/TensorLab/tensorfx/blob/master/src/training/_trainer.py#L94
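
A minimal sketch of that pattern in TF 1.x, assuming an is_chief flag derived from your cluster spec; output_path, the hook parameters, and the WorkerSessionCreator fallback are my illustrative assumptions, not something taken from the question:

import tensorflow as tf

output_path = "../train/"  # hypothetical base directory
is_chief = True            # e.g. job_name == "worker" and task_index == 0

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)

# Only the chief gets a CheckpointSaverHook, so only one replica ever
# writes graph/metagraph events into the checkpoint directory.
hooks = []
if is_chief:
    hooks.append(tf.train.CheckpointSaverHook(
        checkpoint_dir=output_path + "test_chief/ckpt/",
        save_steps=10,
        saver=tf.train.Saver()))

# The chief initializes variables; workers wait for the chief's session.
creator = (tf.train.ChiefSessionCreator(master='') if is_chief
           else tf.train.WorkerSessionCreator(master=''))
with tf.train.MonitoredSession(session_creator=creator, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        gs, _ = mon_sess.run([global_step, train])
        if gs >= 30:
            break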

Heida answered 11/10, 2017 at 5:20 Comment(3)
The code above replicates the problem in a single process; yes, ultimately I want to run distributed, but this question is about getting it right in the simple case first. I'll take a look at MonitoredSession; do you know what the difference is, conceptually?Misdemeanor
MonitoredTrainingSession adds various hooks based on is_chief vs. not ... but in my case, I just wanted full control + use my own hook implementations, so I didn't use the derived class.Heida
Thanks for the suggestion, I tried MonitoredSession but it gives the same errors still. I updated the question with your suggestions.Misdemeanor

You should pass the hooks through the chief_only_hooks parameter of MonitoredTrainingSession, as follows:

hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

with tf.train.MonitoredTrainingSession(master='',
                                       is_chief=True,
                                       checkpoint_dir=None,
                                       chief_only_hooks=hooks,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=None,
                                       save_summaries_secs=None) as mon_sess:
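
If I read the TF 1.x behavior correctly, hooks passed as chief_only_hooks are only attached when is_chief=True, so non-chief replicas never create the saver hook and only one worker writes graph events; this is essentially the same idea as building the hook list conditionally with MonitoredSession in the answer above.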
Wandering answered 23/10, 2018 at 4:6 Comment(0)
