A program resistent to power/hardware/OS failures
Asked Answered
M

1

6

I need to write a program that performs a parallel search in a large space of possible states, with new areas being discovered (and their exploration started) in the process, and exploration of some areas being terminated early as intermediate results obtained elsewhere eliminate a possibility of discovering new useful results in them. The search is performed using multiple threads running in a heavy cooperation with each other to avoid recalculation of intermediate data.

A complex internal state (including call stacks of several threads and state synchronization primitives they use) has to be maintained and updated during the whole process, and there is no apparent way to split the computation into isolated chunks that can be executed sequentially, each saving and passing a small intermediate result to the next. Also, there is no way to split the computation into independent parallel threads not communicating with each other, without imposing a prohibitive overhead due to recalculation of large amount of intermediate data.

Because of the large search domain, the program would possibly run for months before producing a final result. Hence, there is a significant risk of power, hardware or OS failure during the program execution that can lead to a complete loss of all work that has been done to the moment. In such a case the program will need to restart all its computations from scratch.

I need a solution that can prevent a complete data loss in such cases. I thought of an execution engine/platform that continuously saves the current state of the process to a failure-resistant storage like a redundant disk array or database. But I understand that this approach can significantly slow down the process, even to a degree when there would be no benefit compared to an expected computation time including restarts due to possible failures.

In fact, I do not need an ideal solution that continuously saves the program state, and I can easily bear a loss of hours or maybe even days of work. A possible heavyweight solution that comes to my mind is to run the program inside a virtual machine, saving its snapshots from time to time, and restoring the machine after a possible host failure from a recent snapshot. This approach can also help to recover the program state after a random or preventable guest OS failure.

Is there a similar, but more lightweight solution limited to preserving a state of a single process? Or could you suggest any other approaches that can solve my problem?

Molech answered 5/8, 2013 at 21:11 Comment(2)
The VM solution is not really heavyweight IMO (although i'm thinking from a developer's perspective, and adding complex state-saving and reloading to a multi-threaded process is making me cringe), and if you can integrate a simple "pause" mechanism to the exploration process while the VM is saving a snapshot, just to be extra-safe - that would be my preferred choice.Fridlund
There is a variety of software to save "checkpoints" out there... but you don't mention what OS or even programming language your intend to use.Saffron
S
1

You may want to look at using Erlang which allows large numbers of threads to run at relatively low cost. Because the thread cost is low, redundancy can be used to achieve increased reliability.

For the problem you present, a triple-redundancy scheme may be the way to go, where periodic checks for synchronization across the three (or more) systems would determine by vote who has failed.

Stormi answered 10/9, 2013 at 19:8 Comment(1)
And using mnesia tables you could store your state redundantly across >1 machines.Magdamagdaia

© 2022 - 2024 — McMap. All rights reserved.