I've just started using the AWS Ruby SDK to manage a simple workflow. One behavior I noticed right away is that at least one relevant worker and one relevant decider must be running before a new workflow execution is submitted.
If I submit a new workflow execution before starting my worker and decider, the tasks are never picked up, even when I'm still well within the timeout limits. Why is this? Based on the description of how HTTP long polling works, I would expect either process to receive the relevant tasks as soon as it reaches its poll() call.
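For context, this is roughly how I'm submitting the execution; the workflow type name 'hello-workflow' and version '1' are placeholders for my registered type, and it assumes swf/domain are set up as in the worker script below:

# Rough sketch of submitting an execution; 'hello-workflow' and version '1'
# are placeholders for a previously registered workflow type.
workflow_type = domain.workflow_types['hello-workflow', '1']
execution = workflow_type.start_execution(
  :task_list => 'hello-tasklist',
  :input     => 'hello')
puts execution.run_id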
I also run into apparent deadlocks after a job fails (e.g., due to a bug in the worker or decider, or because a process was terminated). Sometimes re-running, or even starting an entirely new workflow execution, produces a deadlocked execution: the initial decision tasks appear in the workflow execution history in the AWS console, but the decider never receives them. Admittedly, I'm having trouble reducing this to a reliable test case, but I suspect it is related to the issue above. It happens roughly 10 to 20% of the time; the rest of the time, everything works.
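When this happens, I can inspect the stuck execution's history through the SDK as well as the console; a sketch, where 'my-workflow-id' and 'my-run-id' are placeholders:

# Sketch: dump a stuck execution's history to confirm that DecisionTaskScheduled
# events were recorded but never followed by DecisionTaskStarted.
execution = domain.workflow_executions.at('my-workflow-id', 'my-run-id')
execution.history_events.each do |event|
  puts "#{event.event_id}: #{event.event_type}"
end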
Some other things worth mentioning: I'm using a single task list for two separate activity tasks that run in sequence, and both the worker and the decider poll that same task list.
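For completeness, my decider follows the same polling pattern as the worker; a rough sketch of its logic, where 'first-activity', 'second-activity', and version '1' are placeholders for my real registrations:

# Rough sketch of the decider: poll the shared task list for decision tasks,
# schedule the two activities in sequence, then close the execution.
domain.decision_tasks.poll('hello-tasklist') do |decision_task|
  decision_task.new_events.each do |event|
    case event.event_type
    when 'WorkflowExecutionStarted'
      # kick off the first of the two sequential activities
      decision_task.schedule_activity_task(
        domain.activity_types['first-activity', '1'],
        :task_list => 'hello-tasklist')
    when 'ActivityTaskCompleted'
      # look up the matching scheduled event to see which activity finished
      scheduled = decision_task.events.find do |e|
        e.event_id == event.attributes.scheduled_event_id
      end
      if scheduled.attributes.activity_type.name == 'first-activity'
        decision_task.schedule_activity_task(
          domain.activity_types['second-activity', '1'],
          :task_list => 'hello-tasklist')
      else
        decision_task.complete_workflow_execution(:result => 'done')
      end
    end
  end
end

(As I understand it, SWF keeps decision task lists and activity task lists in separate namespaces, so giving them the same name shouldn't cause them to collide.)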
Here is my worker:
require 'yaml'
require 'aws'

# load SWF credentials and region from config.yaml alongside this script
config_file_path = File.join(File.dirname(File.expand_path(__FILE__)), 'config.yaml')
config = YAML::load_file(config_file_path)

swf = AWS::SimpleWorkflow.new(config)
domain = swf.domains['test-domain']

puts("waiting for an activity")
domain.activity_tasks.poll('hello-tasklist') do |activity_task|
  puts activity_task.activity_type.name
  # mark the task complete, reporting the activity type name as the result
  activity_task.complete!(:result => activity_task.activity_type.name)
  puts("waiting for an activity")
end
EDIT
Another user on the AWS forums commented:
I think the cause is that SWF does not immediately recognize when a long-poll connection shuts down. When you kill a worker, the service can consider its connection open for some time, so it can still dispatch a task to it. To you it looks like the new worker never gets it. The way to verify this is to check the workflow history: you'll see an activity task started event whose identity field contains the host and pid of the dead worker. Eventually such a task times out and can be retried by the decider.
Note that this condition is common during unit tests, which frequently terminate connections, and is not really a problem for production applications. The common workaround is to use a different task list for each unit test.
This seems to be a pretty reasonable explanation. I'm going to try to confirm this.
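If that explanation is right, the ActivityTaskStarted events in a stuck run's history should carry the host/pid of the dead worker in their identity field; something like this should show it (the workflow/run ids are placeholders):

# Sketch of the suggested check: print the identity recorded on each
# ActivityTaskStarted event in the execution's history.
execution = domain.workflow_executions.at('my-workflow-id', 'my-run-id')
execution.history_events.each do |event|
  next unless event.event_type == 'ActivityTaskStarted'
  puts "started by: #{event.attributes.identity}"
end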