Perhaps the parallel computation machinery could be used for this? Here is a crude set-up that illustrates the idea:
Needs["SubKernels`LocalKernels`"]
doSomeWork[input_] := {$KernelID, Length[input], RandomReal[]}
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, DistributeDefinitions[resultSoFar]
; Quiet[ParallelEvaluate[doSomeWork[resultSoFar], subkernel]] /.
{ $Failed :> (Print@"Ouch!"; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
This is an over-elaborate setup to generate a list of 1,000 triples of numbers. getTheJobDone
runs a loop that continues until the result list contains the desired number of elements. Each iteration of the loop is evaluated in a subkernel. If the subkernel evaluation fails, the subkernel is relaunched. Otherwise, its return value is added to the result list.
To try this out, evaluate:
getTheJobDone[]
To demonstrate the recovery mechanism, open the Parallel Kernel Status window and kill the subkernel from time-to-time. getTheJobDone
will feel the pain and print Ouch! whenever the subkernel dies. However, the overall job continues and the final result is returned.
The error-handling here is very crude and would likely need to be bolstered in a real application. Also, I have not investigated whether really serious error conditions in the subkernels (like running out of memory) would have an adverse effect on the main kernel. If so, then perhaps subkernels could kill themselves if MemoryInUse[]
exceeded a predetermined threshold.
Update - Isolating the Main Kernel From Subkernel Crashes
While playing around with this framework, I discovered that any use of shared variables between the main kernel and subkernel rendered Mathematica unstable should the subkernel crash. This includes the use of DistributeDefinitions[resultSoFar]
as shown above, and also explicit shared variables using SetSharedVariable
.
To work around this problem, I transmitted the resultSoFar
through a file. This eliminated the synchronization between the two kernels with the net result that the main kernel remained blissfully unaware of a subkernel crash. It also had the nice side-effect of retaining the intermediate results in the event of a main kernel crash as well. Of course, it also makes the subkernel calls quite a bit slower. But that might not be a problem if each call to the subkernel performs a significant amount of work.
Here are the revised definitions:
Needs["SubKernels`LocalKernels`"]
doSomeWork[] := {$KernelID, Length[Get[$resultFile]], RandomReal[]}
$resultFile = "/some/place/results.dat";
getTheJobDone[] :=
Module[{subkernel, initsub, resultSoFar = {}}
, initsub[] :=
( subkernel = LaunchKernels[LocalMachine[1]]
; DistributeDefinitions["Global`"]
)
; initsub[]
; While[Length[resultSoFar] < 1000
, Put[resultSoFar, $resultFile]
; Quiet[ParallelEvaluate[doSomeWork[], subkernel]] /.
{ $Failed :> (Print@"Ouch!"; CloseKernels[subkernel]; initsub[])
, r_ :> AppendTo[resultSoFar, r]
}
]
; CloseKernels[subkernel]
; resultSoFar
]
MathLink
- but my way really very hard, extremely non-elegant, relies on great number of undocumented features which are potentially version-specific. And the code is really huge! I hope that newScheduledTasks
functionality can give much more elegant way to solve this problem. – Overtrick