I've just started using the AWS Ruby SDK to manage a simple workflow. One behavior I noticed right away is that at least one relevant worker and one relevant decider must be running before a new workflow execution is submitted.
If I submit a new workflow execution before starting my worker and decider, the tasks are never picked up, even when I'm still well within the timeout limits. Why is this? Based on the description of how HTTP long polling works, I would expect either process to receive the relevant tasks as soon as it reaches its poll() call.
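For context, this is roughly how I'm submitting the execution; the workflow type name 'hello-workflow' and version '1' are placeholders for my registered type, and it assumes swf/domain are set up as in the worker script below:

# Rough sketch of submitting an execution; 'hello-workflow' and version '1'
# are placeholders for a previously registered workflow type.
workflow_type = domain.workflow_types['hello-workflow', '1']
execution = workflow_type.start_execution(
  :task_list => 'hello-tasklist',
  :input     => 'hello')
puts execution.run_id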
I also run into apparent deadlocks after a job fails (e.g., due to a bug in the worker or decider, or because a process was terminated). Sometimes re-running, or even starting an entirely new workflow execution, produces a deadlocked execution: the initial decision tasks appear in the workflow execution history in the AWS console, but the decider never receives them. Admittedly, I'm having trouble reducing this to a reliable test case, but I suspect it is related to the issue above. It happens roughly 10 to 20% of the time; the rest of the time, everything works.
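When this happens, I can inspect the stuck execution's history through the SDK as well as the console; a sketch, where 'my-workflow-id' and 'my-run-id' are placeholders:

# Sketch: dump a stuck execution's history to confirm that DecisionTaskScheduled
# events were recorded but never followed by DecisionTaskStarted.
execution = domain.workflow_executions.at('my-workflow-id', 'my-run-id')
execution.history_events.each do |event|
  puts "#{event.event_id}: #{event.event_type}"
end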
Some other things worth mentioning: I'm using a single task list for two separate activity tasks that run in sequence, and both the worker and the decider poll that same task list.
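For completeness, my decider follows the same polling pattern as the worker; a rough sketch of its logic, where 'first-activity', 'second-activity', and version '1' are placeholders for my real registrations:

# Rough sketch of the decider: poll the shared task list for decision tasks,
# schedule the two activities in sequence, then close the execution.
domain.decision_tasks.poll('hello-tasklist') do |decision_task|
  decision_task.new_events.each do |event|
    case event.event_type
    when 'WorkflowExecutionStarted'
      # kick off the first of the two sequential activities
      decision_task.schedule_activity_task(
        domain.activity_types['first-activity', '1'],
        :task_list => 'hello-tasklist')
    when 'ActivityTaskCompleted'
      # look up the matching scheduled event to see which activity finished
      scheduled = decision_task.events.find do |e|
        e.event_id == event.attributes.scheduled_event_id
      end
      if scheduled.attributes.activity_type.name == 'first-activity'
        decision_task.schedule_activity_task(
          domain.activity_types['second-activity', '1'],
          :task_list => 'hello-tasklist')
      else
        decision_task.complete_workflow_execution(:result => 'done')
      end
    end
  end
end

(As I understand it, SWF keeps decision task lists and activity task lists in separate namespaces, so giving them the same name shouldn't cause them to collide.)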
Here is my worker:
require 'yaml'
require 'aws'

# load SWF credentials and region from config.yaml alongside this script
config_file_path = File.join(File.dirname(File.expand_path(__FILE__)), 'config.yaml')
config = YAML::load_file(config_file_path)

swf = AWS::SimpleWorkflow.new(config)
domain = swf.domains['test-domain']

puts("waiting for an activity")
domain.activity_tasks.poll('hello-tasklist') do |activity_task|
  puts activity_task.activity_type.name
  # mark the task complete, reporting the activity type name as the result
  activity_task.complete!(:result => activity_task.activity_type.name)
  puts("waiting for an activity")
end
EDIT
Another user on the AWS forums commented:
I think the cause is that SWF does not immediately recognize when a long-poll connection shuts down. When you kill a worker, the service can consider its connection open for some time, so it can still dispatch a task to it. To you it looks like the new worker never gets it. The way to verify this is to check the workflow history: you'll see an activity task started event whose identity field contains the host and pid of the dead worker. Eventually such a task times out and can be retried by the decider.
Note that this condition is common during unit tests, which frequently terminate connections, and is not really a problem for production applications. The common workaround is to use a different task list for each unit test.
This seems to be a pretty reasonable explanation. I'm going to try to confirm this.
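If that explanation is right, the ActivityTaskStarted events in a stuck run's history should carry the host/pid of the dead worker in their identity field; something like this should show it (the workflow/run ids are placeholders):

# Sketch of the suggested check: print the identity recorded on each
# ActivityTaskStarted event in the execution's history.
execution = domain.workflow_executions.at('my-workflow-id', 'my-run-id')
execution.history_events.each do |event|
  next unless event.event_type == 'ActivityTaskStarted'
  puts "started by: #{event.attributes.identity}"
end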