How to checkpoint a long-running function pythonically?

Asked 8/12, 2015 at 12:24 Answered 28/8, 2020 at 15:11

The typical situation in computational sciences is to have a program that runs for several days/weeks/months straight. As hardware/OS failures are inevitable, one typically utilize checkpointing, i.e. saves the state of the program from time to time. In case of failure, one restarts from the latest checkpoint.

What is the pythonic way to implement checkpointing?

For example, one can dump function's variables directly.

Alternatively, I am thinking of transforming such function into a class (see below). Arguments of the function would become arguments of a constructor. Intermediate data that constitute state of the algorithm would become class attributes. And pickle module would help with the (de-)serialization.

import pickle

# The file with checkpointing data
chkpt_fname = 'pickle.checkpoint'

class Factorial:
    def __init__(self, n):
        # Arguments of the algorithm
        self.n = n

        # Intermediate data (state of the algorithm)
        self.prod = 1
        self.begin = 0

    def get(self, need_restart):
        # Last time the function crashed. Need to restore the state.
        if need_restart:
            with open(chkpt_fname, 'rb') as f:
                self = pickle.load(f)

        for i in range(self.begin, self.n):
            # Some computations
            self.prod *= (i + 1)
            self.begin = i + 1

            # Some part of the computations is completed. Save the state.
            with open(chkpt_fname, 'wb') as f:
                pickle.dump(self, f)

            # Artificial failure of the hardware/OS/Ctrl-C/etc.
            if (not need_restart) and (i == 3):
                return

        return self.prod


if __name__ == '__main__':
    f = Factorial(6)
    print(f.get(need_restart=False))
    print(f.get(need_restart=True))

Mcdowell answered 8/12, 2015 at 12:24 Comment(2)

I think that your idea with making the long running function into a class is a good one. – Affer 8/12, 2015 at 13:6

If the state of your long running function is well-defined then you could store it explicitly (create a class and implement pickle support if necessary). It is likely to fail in a more complex case; unless you design your program from the ground up to support it (Lisp, Smalltalk)¶ To remove power temporarily, you could hibernate your computer¶ In general, you could run your computations inside a VM and then move it if necessary. There is an experimental checkpoint & restore support in docker. – Kotto 26/11, 2016 at 3:25

Usually the answer is serialize with your favourite serialization method be that cpickle json or xml. Pickle has the advantage that you can deserialize a whole object without much extra work.

Additionally it's a good idea to separate your process from your state, so you simply serialize your state object. Lots of objects can't be pickled for example threads, but you may want to run many workers(although beware of the GIL), so pickling will throw an exception of you try to pickle them. You can work around this with _getstate_ and _setstate_ deleting entries which cause a problem- but if you just keep process and state separate this is no longer a problem.

To checkpoint, save your checkpoint file to a known location, when your program begins, check if this file exists, if it doesn't the process hasn't started, otherwise load and run it. Create a thread that periodically checkpoints your running task by draining a queue any worker threads are processing and then saving your state object, then reuse the resume logic you use in case to resume from after checkpointing.

To safely checkpoint you need to insure your program doesn't corrupt the checkpoint file by dying mid pickle. To do this

Create checkpoint at temporary location and finish writing
Rename last checkpoint to checkpoint.old
Rename new checkpoint to checkpoint.pickle
Remove checkpoint.old
If you start the program and there is no checkpoint but there is a checkpoint.old, then the program died after step 2, so load checkpoint.old, rename checkpoint.old to checkpoint.pickle and run as normal. If the program died anywhere else, you can simply reload checkpoint.pickle.

Inclinatory answered 15/9, 2017 at 4:3 Comment(0)

If you're willing to write your code with checkpointing as a top concern, the class structure you propose is a good one. If you want to introduce checkpointing as an afterthought though you can use a special purpose checkpointing Linux package like this one or a Python-specific one like this one (which I wrote).

Piassava answered 19/5, 2020 at 2:46 Comment(1)

The Python package does exactly what is requested with a whole explanation article in the description. One question: does it work with Jupyter Notebooks? – Puisne 12/10, 2023 at 8:35

This is a comment, not a answer.

A general purpose mechanism to checkpoint and restore a Python function, class or program at any time will be hard to build. The easy part would be a mechanism to checkpoint and restore a class instance whose attributes are all elementary datatypes or collections of elementary datatypes. Then a get_state method could package up the state of a running instance of the class and a restore_state(checkpointed_state) method could restore the state.

But many parts of a program are not collections of elementary datatypes. These include:

The program pointer
References to methods and functions
Open file handles, network sockets, etc.
Running subprocesses

These are all hard to checkpoint and restore because it is difficult to obtain the entirety of their states, and it is difficult to restore their states.

Therefore, given your desire to checkpoint/restore long-running analyses I recommend that you use a higher-level mechanism, such as checkpoint/restore of Docker containers or Linux processes CRIU. Note that these do not completely checkpoint & restore all state, but they do more than you would be able to build.

Vernation answered 28/8, 2020 at 15:11 Comment(0)

Recommended topics

Hot tags