I just found that with Amazon's Elastic MapReduce, I can give each step one of three ActionOnFailure values:
- TERMINATE_JOB_FLOW
- CANCEL_AND_WAIT
- CONTINUE
TERMINATE_JOB_FLOW is the default, and its behavior is obvious: it shuts down the entire cluster when the step fails.
What is the difference between CANCEL_AND_WAIT and CONTINUE? It appears to me that both will keep the cluster running and simply move on to the next step when it is added.
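
For concreteness, here is a minimal sketch of where this setting lives in a step definition. It assumes the dictionary shape that boto3's EMR client accepts for steps; `build_step` is a hypothetical helper, and the jar path and arguments are placeholders, not values from my actual job.

```python
# The three ActionOnFailure values listed in the question.
ACTIONS_ON_FAILURE = ("TERMINATE_JOB_FLOW", "CANCEL_AND_WAIT", "CONTINUE")


def build_step(name, action_on_failure, jar, args):
    """Return a step definition dict (hypothetical helper, boto3-style shape assumed)."""
    if action_on_failure not in ACTIONS_ON_FAILURE:
        raise ValueError(f"unknown ActionOnFailure: {action_on_failure!r}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


# One step per setting; the jar location and arguments are placeholders.
steps = [
    build_step(
        f"step-{action.lower()}",
        action,
        "s3://example-bucket/job.jar",
        ["--input", "in", "--output", "out"],
    )
    for action in ACTIONS_ON_FAILURE
]

for step in steps:
    print(step["Name"], "->", step["ActionOnFailure"])
```

With boto3, a list like `steps` would be passed as the `Steps` argument of the EMR client's `add_job_flow_steps` call, along with the cluster's `JobFlowId`. The sketch only shows how the value is specified; it does not demonstrate how the two settings behave differently after a failure.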