How to make Google Cloud AI Platform detect `tf.summary.scalar` calls during training?

(Note: I have also asked this question here)

Problem

I have been trying to get Google Cloud's AI Platform to display the accuracy of a Keras model trained on the AI Platform. I configured hyperparameter tuning with hptuning_config.yaml and it works. However, I can't get AI Platform to pick up tf.summary.scalar calls during training.

Documentation

I have been following these documentation pages:

1. Overview of hyperparameter tuning

2. Using hyperparameter tuning

According to [1]:

"How AI Platform Training gets your metric: You may notice that there are no instructions in this documentation for passing your hyperparameter metric to the AI Platform Training training service. That's because the service monitors TensorFlow summary events generated by your training application and retrieves the metric."

And according to [2], one way of generating such a TensorFlow summary event is by creating a callback class like so:

class MyMetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        tf.summary.scalar('metric1', logs['RootMeanSquaredError'], epoch)

My code

So in my code I included:

# hptuning_config.yaml

trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 4
    maxParallelTrials: 2
    hyperparameterMetricTag: val_accuracy
    params:
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.01
      scaleType: UNIT_LOG_SCALE
# model.py

class MetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs):
        # In TF 2.x this is a no-op unless a default summary writer is active.
        tf.summary.scalar('val_accuracy', logs['val_accuracy'], epoch)

I even tried:

# model.py

class MetricCallback(tf.keras.callbacks.Callback):
    def __init__(self, logdir):
        super().__init__()
        self.writer = tf.summary.create_file_writer(logdir)

    def on_epoch_end(self, epoch, logs):
        with self.writer.as_default():
            tf.summary.scalar('val_accuracy', logs['val_accuracy'], epoch)

This successfully saved the 'val_accuracy' metric to Google Storage (I can also see it with TensorBoard). But the metric does not get picked up by AI Platform, despite the claim made in [1].
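For completeness, here is roughly how I wire the callback into training. The toy model and data below are placeholders to exercise the callback, not my actual training code:

import numpy as np
import tensorflow as tf

# Placeholder model and data, just to exercise MetricCallback above.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])

x, y = np.random.rand(64, 4), np.random.rand(64, 1)
logdir = '/tmp/logs'  # in the real job this lives under the GCS job directory

# validation_split ensures logs['val_accuracy'] exists in on_epoch_end.
model.fit(x, y, validation_split=0.25, epochs=2,
          callbacks=[MetricCallback(logdir)])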

Partial solution:

Using the Cloud ML Hypertune package, I created the following class:

# model.py

import hypertune

class MetricCallback(tf.keras.callbacks.Callback):
    def __init__(self):
        super().__init__()
        self.hpt = hypertune.HyperTune()

    def on_epoch_end(self, epoch, logs):
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='val_accuracy',
            metric_value=logs['val_accuracy'],
            global_step=epoch
        )

which works! But I don't see how, since all it seems to do is write to a file on the AI Platform worker at /tmp/hypertune/*. There is nothing in the Google Cloud documentation that explains how this gets picked up by AI Platform...
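From reading the package source, it seems to do little more than append one JSON object per report to that file, which the service presumably polls on each worker. Here is a rough sketch of what report_hyperparameter_tuning_metric appears to do; the file name and record keys are my assumptions from the source, not a documented API:

import json
import os
import time

def report_metric(tag, value, step):
    # Default path is an assumption; the cloudml-hypertune source reads the
    # CLOUD_ML_HP_METRIC_FILE environment variable to override it.
    path = os.environ.get('CLOUD_ML_HP_METRIC_FILE',
                          '/tmp/hypertune/output.metrics')
    os.makedirs(os.path.dirname(path), exist_ok=True)
    record = {
        'timestamp': time.time(),
        'trial': os.environ.get('CLOUD_ML_TRIAL_ID', '0'),  # set per trial
        'global_step': step,
        tag: value,  # e.g. 'val_accuracy': 0.93
    }
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')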

Am I missing something in order to get tf.summary.scalar events to be displayed?

Exeunt answered 28/4, 2020 at 12:20 Comment(8)
For the cloudml-hypertune case, the file is read by the service to report hyperparameter tuning metrics for your job. This is the recommended way to report hyperparameter tuning metrics if the summary events aren't getting picked up. For the tf.summary.scalar case, which runtime version are you using? This call is only monitored for runtime version 2.1 or above.Perrotta
I am getting the same issue: somehow tf.summary.scalar doesn't seem to be propagated to the HP tuning engine. I am using runtime 2.1 and Python 3.7. Yes, there are other ways to feed the metric directly to HyperTune(). I also realized that "region" in the yaml file is not propagated for training, for exampleAbyss
I can see the new metric in TensorBoard with the proper name, which is the same as in the yaml file.Abyss
@Perrotta Thanks. I am using runtime version 2.1. For the cloudml-hypertune case, do you mean that AI platform is pre-configured to read from the /tmp/hypertune folder in the replicas?Exeunt
Yes, that's correct.Perrotta
"The way hyper-tuning works is in two ways: Using hyper-tune client : When to use this : - If using custom container - If using python package and using a framework that does not call TF summary. Out-of-box (no-changes) - If you are using a Python package and using a framework that writes TF summary files then hyper-parameter tuning will just work out of the box." GCP support is looking at the issue with TF summary (out of the box). The env variable 'CLOUD_ML_HP_METRIC_FILE': '/var/hypertune/output.metric' define the flocation of the file but it doesn't exit here or somwhere else.Abyss
@JulianFerry do you create your Keras model within a tf.distribute.MirroredStrategy() scope? In my case, after removing "with strategy.scope():" the metric is now displayed in the AI Platform training job dashboard. It is NOT the case if I create my Keras model within the strategy scope. I still get a lot of warnings and errors related to file caching, but it seems to work. Very strange! I am using TensorFlow 2.2.0 but it should be the same issue with TF 2.1.0. I reported this issue to the GCP AI Platform team.Abyss
That's interesting. It makes sense that the training distribution could affect it in some way. I'll report back when I get the chance to experiment with it again.Exeunt

I am having the same issue: I can't get AI Platform to pick up tf.summary.scalar. I tried to debug it with GCP support and the AI Platform engineering team for the last two months. They didn't manage to reproduce the issue even though we were using almost the same code. We even did one coding session together but still got different results.

Recommendation from the GCP AI Platform engineering team: "don't use tf.summary.scalar". The main reasons are that, with the other method:

  • it works fine for everybody
  • you can control and see what happens (it is not a black box)

They will update the documentation to reflect this new recommendation.

Setup:

  • TensorFlow 2.2.0
  • TensorBoard 2.2.2
  • Keras model created within the tf.distribute.MirroredStrategy() scope
  • Keras callback for TensorBoard

With the following setup the "issue" is observed:

  • when using TensorBoard with update_freq='epoch' and with 1 epoch only

It seems to work with other setups. Anyway, I will follow the recommendation from GCP and use the custom solution to avoid the issue.
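For reference, a minimal sketch of the setup described above, with a toy model and data standing in for the real ones (the real job writes TensorBoard logs to the GCS job directory):

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():  # removing this scope made the metric show up for me
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])

# The "issue" appears with update_freq='epoch' and a single epoch.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir='/tmp/logs',
                                       update_freq='epoch')
x, y = np.random.rand(64, 4), np.random.rand(64, 1)
model.fit(x, y, validation_split=0.25, epochs=1, callbacks=[tb_cb])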


Abyss answered 20/7, 2020 at 17:41 Comment(1)
Here is the public ticket concerning the release of the updated recommendation for hyperparameter tuning on GCP [1]. [1] issuetracker.google.com/issues/162324970Abyss

We tested this in TF 2.1 with tf.keras and AI Platform and it works successfully:

class CustomCallback(tf.keras.callbacks.TensorBoard):
    """Callback to write out a custom metric used by CAIP for HP Tuning."""

    def on_epoch_end(self, epoch, logs=None):  # pylint: disable=no-self-use
        """Write tf.summary.scalar on epoch end."""
        tf.summary.scalar('epoch_accuracy', logs['accuracy'], epoch)

# Setup TensorBoard callback.
custom_cb = CustomCallback(os.path.join(args.job_dir, 'metric_tb'),
                           histogram_freq=1)

# Train model
keras_model.fit(
        training_dataset,
        steps_per_epoch=int(num_train_examples / args.batch_size),
        epochs=args.num_epochs,
        validation_data=validation_dataset,
        validation_steps=1,
        verbose=1,
        callbacks=[custom_cb])

# hptuning_config.yaml

trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 4
    maxParallelTrials: 2
    hyperparameterMetricTag: epoch_accuracy
    params:
    - parameterName: batch-size
      type: INTEGER
      minValue: 8
      maxValue: 256
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: learning-rate
      type: DOUBLE
      minValue: 0.01
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
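
For completeness, the trainer receives each trial's hyperparameters as command-line flags named after the parameterName entries above. A sketch of the argument parsing (the flag names must match the yaml; --num-epochs is my assumption to match args.num_epochs in the code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job-dir', required=True)  # forwarded from gcloud --job-dir
parser.add_argument('--batch-size', type=int, default=32)
parser.add_argument('--learning-rate', type=float, default=0.01)
parser.add_argument('--num-epochs', type=int, default=10)
args = parser.parse_args()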

This seems to be identical to your code, except that I can't see how you are passing the callbacks. I remember seeing some issues when the callbacks were not specified directly.

Code here

Suellensuelo answered 28/4, 2020 at 21:34 Comment(2)
Sorry but I can't get this to work. I am also using TF 2.1. I removed all my callbacks and implemented it exactly as you did but the metric was not picked up by AI platform. Furthermore (I think?) your example only logs training accuracy. Or at least, that's the only folder I see at the path specified by os.path.join(args.job_dir, 'metric_tb'). I also don't see how this callback is doing anything additional to what we could do by just creating a TensorBoard callback without wrapping a class around it.Exeunt
You have also implemented this differently from the docs. The docs [2] create a callback which inherits from tf.keras.callbacks.Callback, whereas you are wrapping tf.keras.callbacks.TensorBoard. If I include your callback alongside my existing TensorBoard callback, I get the following error: ValueError: Must enable trace before export. I can fix this by removing my TensorBoard callback and adding super().on_epoch_end(epoch, logs) to keep the TensorBoard functionality. But this doesn't seem very clean and, as I mentioned above, seems unnecessary.Exeunt
