How to control the number of checkpoints kept by a TensorFlow Estimator?

I've noticed that the new Estimator API automatically saves checkpoints during training and automatically resumes from the last checkpoint when training is interrupted. Unfortunately, it seems to keep only the last 5 checkpoints.

Do you know how to control the number of checkpoints that are kept during the training?

Jiujitsu answered 29/12, 2017 at 20:46 Comment(0)

TensorFlow's tf.estimator.Estimator takes config as an optional argument, which can be a tf.estimator.RunConfig object that configures runtime settings. You can achieve this as follows:

import tensorflow as tf

# Change the maximum number of checkpoints kept to 25 (the default is 5)
run_config = tf.estimator.RunConfig()
run_config = run_config.replace(keep_checkpoint_max=25)

# Build your estimator (model_fn and job_dir are defined elsewhere in your code)
estimator = tf.estimator.Estimator(model_fn,
                                   model_dir=job_dir,
                                   config=run_config,
                                   params=None)

The config parameter is available in all classes (DNNClassifier, DNNLinearCombinedClassifier, LinearClassifier, etc.) that extend estimator.Estimator.
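
To make the effect of keep_checkpoint_max concrete, here is a minimal pure-Python sketch of a "keep only the newest N" retention policy. It is not TensorFlow's implementation: the helper name prune_checkpoints, the one-file-per-checkpoint layout, and the model.ckpt-<step> naming are assumptions made for illustration (real checkpoints consist of several files per step).

```python
import os
import tempfile


def prune_checkpoints(directory, prefix="model.ckpt-", keep_max=5):
    """Hypothetical helper: delete all but the `keep_max` highest-step files.

    Returns the sorted list of step numbers that survive. A `keep_max` of 0
    or None keeps everything.
    """
    # Collect the step number of every file that looks like a checkpoint.
    steps = sorted(
        int(name[len(prefix):])
        for name in os.listdir(directory)
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    )
    # Everything except the last `keep_max` entries is considered stale.
    stale = steps[:-keep_max] if keep_max else []
    for step in stale:
        os.remove(os.path.join(directory, f"{prefix}{step}"))
    return steps[len(stale):]


# Simulate ten checkpoints written at steps 100, 200, ..., 1000.
with tempfile.TemporaryDirectory() as ckpt_dir:
    for step in range(100, 1100, 100):
        open(os.path.join(ckpt_dir, f"model.ckpt-{step}"), "w").close()
    kept = prune_checkpoints(ckpt_dir, keep_max=3)
    remaining_files = sorted(os.listdir(ckpt_dir))

print(kept)  # [800, 900, 1000]
```

With keep_max=3 only the three most recent steps remain on disk, which mirrors why the default setting leaves just the last 5 checkpoints in the model directory.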

Birl answered 30/12, 2017 at 7:59 Comment(1)
Exactly the info I needed, and the RunConfig has additional params like save_checkpoints_secs and save_checkpoints_steps, perfect! Thank you! - Jiujitsu

As a side note, I would like to add that in TensorFlow 2 the situation is a little bit simpler. To keep a certain number of checkpoint files, you can modify the model_main_tf2.py source code (the training script of the TensorFlow Object Detection API). First, define an integer flag:

# Keep last 25 checkpoints
flags.DEFINE_integer('checkpoint_max_to_keep', 25,
                     'Integer defining how many checkpoint files to keep.')

Then use this pre-defined value in a call to model_lib_v2.train_loop:

# Ensure training loop keeps last 25 checkpoints
model_lib_v2.train_loop(...,
                        checkpoint_max_to_keep=FLAGS.checkpoint_max_to_keep,
                        ...)

The symbol ... above denotes other options to model_lib_v2.train_loop.

Impractical answered 8/9, 2021 at 10:3 Comment(0)
