Cloudformation template for creating ECS service stuck in CREATE_IN_PROGRESS
U

9

61

I am creating an AWS ECS service using Cloudformation.

Everything seems to complete successfully, I can see the instance being attached to the load-balancer, the load-balancer is declaring the instance as being healthy, and if I hit the load-balancer I am successfully taken to my running container.

Looking at the ECS control panel, I can see that the service has stabilised, and that everything is looking OK. I can also see that the container is stable, and is not being terminated/re-created.

However, the CloudFormation stack never completes: it stays in CREATE_IN_PROGRESS for about 30-60 minutes and then rolls back, claiming that the service did not stabilise. Looking at CloudTrail, I can see a number of RegisterInstancesWithLoadBalancer calls initiated by ecs-service-scheduler, all with the same parameters, i.e. the same instance ID and load balancer. I am using the standard IAM roles and permissions for ECS, so it should not be a permissions issue.

Has anyone had a similar issue?

Undesirable answered 22/9, 2015 at 21:48 Comment(7)
What fails in the CloudFormation stack? Do you have any failed events? Can you copy and paste the CloudFormation event log?Kuo
this typically means your instances/tasks haven't come up properly.Evacuee
@Kuo It is the ECS service creation that fails with a message saying that service failed to stabilise. However looking in the ECS control panel there is a contradicting message saying that the service stabilised.Undesirable
@tedder42 That is what I would suspect. However, if I disable rollback of the stack I can access my service/container/task successfully, so it does seem to be able to come up. In terms of instances, the cluster and instances are already up, as they are created in a different template. I have also been able to verify that they work as expected.Undesirable
There seems to be other people having the same issue: forums.aws.amazon.com/thread.jspa?threadID=190250Undesirable
@Undesirable Did you ever manage to get this solved?Pyrrho
@Pyrrho No, in the end I resorted to writing a script that would invoke the CLI, and gave up on that particular Cloudformation templateUndesirable
D
30

Your AWS::ECS::Service needs to register the full ARN for the TaskDefinition (Source: See the answer from ChrisB@AWS on the AWS forums). The key thing is to set your TaskDefinition with the full ARN, including revision. If you skip the revision (:123 in the example below), the latest revision is used, but CloudFormation still goes out to lunch with "CREATE_IN_PROGRESS" for about an hour before failing. Here's one way to do that:

"MyService": {
    "Type": "AWS::ECS::Service",
    "Properties": {
        "Cluster": { "Ref": "ECSClusterArn" },
        "DesiredCount": 1,
        "LoadBalancers": [
            {
                "ContainerName": "myContainer",
                "ContainerPort": "80",
                "LoadBalancerName": "MyELBName"
            }
        ],
        "Role": { "Ref": "EcsElbServiceRoleArn" },
        "TaskDefinition": {
            "Fn::Join": ["", [
                "arn:aws:ecs:", { "Ref": "AWS::Region" },
                ":", { "Ref": "AWS::AccountId" },
                ":task-definition/my-task-definition-name:123"
            ]]
        }
    }
}

Here's a nifty way to grab the latest revision number of MyTaskDefinition via the AWS CLI and jq (--sort DESC puts the newest revision first):

aws ecs list-task-definitions --family-prefix MyTaskDefinition --sort DESC | jq --raw-output '.taskDefinitionArns[0] | split(":") | last'
Destiny answered 18/2, 2016 at 4:49 Comment(2)
my command to retrieve the latest revision: aws ecs list-task-definitions --family-prefix dev-device-settings --sort DESC | jq --raw-output .taskDefinitionArns[0] | tr ':' '\n' | tail -1Cantrell
A much simpler way would be to use the !Ref function to return the ARN of your AWS::ECS::TaskDefinition. Building the ARN like that is very overly complicated. Look at the return values on this page: docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/…Mandarin
A
27

I found another related scenario that will cause this and thought I'd put it here in case anyone else runs into it. If you define a TaskDefinition with an Image that doesn't actually exist in its ContainerDefinition and then you try to run that TaskDefinition as a Service, you'll run into the same hang issue (or at least something that looks like the same issue).

NOTE: The example YAML chunks below were all in the same CloudFormation template

So as an example, I created this Repository:

MyRepository:
    Type: AWS::ECR::Repository

And then I created this Cluster:

MyCluster:
    Type: AWS::ECS::Cluster

And this TaskDefinition (abridged):

MyECSTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
        # ...
        ContainerDefinitions:
            # ...
              Image: !Join ["", [!Ref "AWS::AccountId", ".dkr.ecr.", !Ref "AWS::Region", ".amazonaws.com/", !Ref MyRepository, ":1"]]
            # ...

With those defined, I went to create a Service like this:

MyECSServiceDefinition:
    Type: AWS::ECS::Service
    Properties:
        Cluster: !Ref MyCluster
        DesiredCount: 2
        PlacementStrategies:
            - Type: spread
              Field: attribute:ecs.availability-zone
        TaskDefinition: !Ref MyECSTaskDefinition

Which all seemed sensible to me, but it turns out there are two issues with this as written/deployed that caused it to hang.

  1. The DesiredCount is set to 2 which means it will actually try to spin up the service and run it, not just define it. If I set DesiredCount to 0, this works just fine.
  2. The Image defined in MyECSTaskDefinition doesn't exist yet. I made the repository as part of this template, but I didn't actually push anything to it. So when the MyECSServiceDefinition tried to spin up the DesiredCount of 2 instances, it hung because the image wasn't actually available in the repository (because the repository literally just got created in the same template).
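
One way to confirm the second point before scaling the service up is to ask ECR whether the tag is actually there. This is a minimal sketch, assuming the AWS CLI is installed and configured; the helper name image_exists and the repository name and tag in the usage line are placeholders, not part of the template above:

```shell
# Succeeds (exit status 0) only if the given tag exists in the ECR repository.
# image_exists is a hypothetical helper name; repo and tag are placeholders.
image_exists() {
  local repo="$1" tag="$2"
  aws ecr describe-images \
    --repository-name "$repo" \
    --image-ids "imageTag=$tag" >/dev/null 2>&1
}

# Usage (placeholder values):
#   image_exists my-repository 1 && echo "image found" || echo "image missing"
```

Note that this sketch treats any CLI failure (for example missing credentials) the same as a missing image, so a "missing" result is worth double-checking by hand.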

So, for now, the solution is to create the CloudFormation stack with a DesiredCount of 0 for the Service, upload the appropriate Image to the repository and then update the CloudFormation stack to scale up the service. Or alternately, have a separate template that sets up core infrastructure like the repository, upload builds to that and then have a separate template to run that sets up the Services themselves.
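
The first approach could be scripted roughly as follows. This is only a sketch: it assumes the template exposes a parameter named DesiredCount that the Service's DesiredCount property references (the template above hard-codes the value instead), and the stack name, template file and helper name are placeholders:

```shell
# Deploys (creates or updates) the stack with a given DesiredCount.
# deploy_with_count is a hypothetical helper; it assumes the template
# declares a "DesiredCount" parameter used by the AWS::ECS::Service.
deploy_with_count() {
  local stack="$1" template="$2" count="$3"
  aws cloudformation deploy \
    --stack-name "$stack" \
    --template-file "$template" \
    --parameter-overrides "DesiredCount=$count"
}

# Usage (placeholder values):
#   deploy_with_count my-stack template.yml 0   # create everything, run nothing
#   ...build and push the image to the new repository...
#   deploy_with_count my-stack template.yml 2   # now scale the service up
```

If the template also creates IAM resources, the deploy call would additionally need a --capabilities flag.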

Hope that helps anyone having this issue!

Abstriction answered 16/6, 2017 at 18:8 Comment(5)
Also if the Task Definition doesn't have the appropriate ExecutionRole permissions, the service will hang in the CREATING state. I had this happen when I tried creating a LogConfiguration.Roundabout
Also happens if image tag doesn't exist in the repository, e.g. perhaps a typoPaphian
"Hope that helps anyone having this issue!" It indeed did! Thank you so much!Takeover
I have everything in one stack. Setting DesiredCount to 0 fixed the AWS::ECS::Service resource sitting in CREATE_IN_PROGRESS for a long time and then failing the build, thanks :)Mischief
An alternative if you just want to have one script that doesn't have to be updated is to take advantage of the long time CloudFormation hangs for (it is actually retrying and retrying to find the image when it hangs). This gives ample time to manually upload the image to ECR and then CloudFormation will find it pretty much as soon as it has been uploaded.Blow
B
14

No need to register the full ARN for the TaskDefinition, because when the logical ID of this resource is provided to the Ref intrinsic function, Ref returns the Amazon Resource Name (ARN).

In the following sample, the Ref function returns the ARN of the MyTaskDefinition task definition, including its revision, such as arn:aws:ecs:us-west-2:123456789012:task-definition/TaskDefinitionFamily:1.

{ "Ref": "MyTaskDefinition" }

Source http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ecs-taskdefinition.html

Billen answered 19/7, 2016 at 9:39 Comment(1)
works great ... as long as the task definition is in the same stack. Otherwise, the Fn::ImportValue is a nice way to do this across stacks. docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/…Clearstory
N
8

I think I had a similar issue. Try looking at the "DesiredCount" property in the Service template. I think CloudFormation will indicate that the creation/update is still in progress until the Service reaches that "DesiredCount" number of running tasks in your cluster.
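
To compare the numbers while the stack is still in progress, something like the following might help. It is a sketch assuming a configured AWS CLI; service_counts is a made-up helper name and the cluster and service names are placeholders:

```shell
# Prints desiredCount, runningCount and pendingCount for one service,
# tab-separated on a single line. service_counts is a hypothetical helper.
service_counts() {
  local cluster="$1" service="$2"
  aws ecs describe-services \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].[desiredCount,runningCount,pendingCount]' \
    --output text
}

# Usage (placeholder names):
#   service_counts my-cluster my-service
#   aws ecs wait services-stable --cluster my-cluster --services my-service
```

The second usage line blocks until ECS itself reports the service as stable or the waiter gives up, which gives a rough independent check of what CloudFormation appears to be waiting for.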

Nagel answered 11/10, 2015 at 11:30 Comment(1)
The service is reporting as stabilised in the ECS UI, and both the desired count and the running count are set to 1. Hitting the container works as expected as well, and the ELB is reporting the instance correctly. It is as if the notification just is not getting through to CloudFormation.Undesirable
M
6

Anything that prevents the ECS service from reaching the desired count can cause this. One example is missing permissions in the policies attached to the role used by the instances. Check the instances' ECS agent logs (/var/log/ecs/ecs-agent.log.timestamp).

Another example: the instances don't have enough memory available to run the requested desired count. The service events will show something like this:

"...service myService was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 123456789 has insufficient memory available..."

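The same service events can be pulled from the command line instead of the console. A minimal sketch, assuming a configured AWS CLI; recent_service_events is a hypothetical helper name and the cluster and service names are placeholders:

```shell
# Prints the five most recent events for a service, one per line,
# as "timestamp<TAB>message". recent_service_events is a hypothetical helper.
recent_service_events() {
  local cluster="$1" service="$2"
  aws ecs describe-services \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].events[:5].[createdAt,message]' \
    --output text
}

# Usage (placeholder names):
#   recent_service_events my-cluster my-service
```
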
Machute answered 14/8, 2017 at 15:50 Comment(0)
T
4

To add another data point, I've seen AWS::ECS::Service get permanently stuck in CREATE_IN_PROGRESS unless the Docker image is both a) available from the ECR repo and b) passing the health check.

I've tried multiple times to boot an AWS::ECS::Service with a valid-image-hash-but-failing-health-check container, then fix the image and do the various "set desired count to zero", "set it back", etc., and nothing AFAICT gets it unstuck.

I eventually have to delete the stack, and start over with an image that immediately passes the health check. Then it works fine.

Super flakey.

Triangular answered 4/10, 2019 at 20:43 Comment(1)
This was it for my Django application - #37032249 was the underlying issueRodeo
R
1

I had the same problem. I solved it by increasing my allocated memory size for the task definition.

The container(s) you're running must not exceed the available memory on your ECS instance.

Retired answered 27/6, 2018 at 2:11 Comment(0)
R
0

To add another possibility, I once ran into this issue when everything was fine with the template, the desired task count equalled the number of running tasks, etc. It turned out that one of the underlying EC2 instances was stuck at close to 100% CPU (but EC2 saw it as "healthy"). It was preventing CloudFormation from validating that particular instance. I killed the bad EC2 instance, and ECS spun up a truly healthy one.

Rosenstein answered 11/12, 2018 at 17:39 Comment(0)
C
0

For me the problem was that the tasks for the ECS service were stopped.

To see the failed tasks, go to the "Tasks" tab of the ECS service and change the value of "Filter desired status" to "Any desired status".

Now you can click on the "Status" of the container and a popup will show you the exact issue.
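
The same information should also be available from the AWS CLI. This is a sketch under the assumption that the CLI is configured for the right account and region; stopped_task_reasons is a made-up helper name and the cluster and service names are placeholders:

```shell
# Lists the stopped tasks of a service and prints "taskArn<TAB>stoppedReason"
# for each of them. stopped_task_reasons is a hypothetical helper.
stopped_task_reasons() {
  local cluster="$1" service="$2" arns
  arns="$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
            --desired-status STOPPED --query 'taskArns' --output text)"
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no stopped tasks found"
    return 0
  fi
  # Word splitting on $arns is intentional: --tasks takes one ARN per argument.
  # shellcheck disable=SC2086
  aws ecs describe-tasks --cluster "$cluster" --tasks $arns \
    --query 'tasks[].[taskArn,stoppedReason]' --output text
}

# Usage (placeholder names):
#   stopped_task_reasons my-cluster my-service
```

Stopped tasks are only retained by ECS for a limited time, so this is most useful while the stack is still hanging.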

Chainsmoke answered 9/7 at 16:28 Comment(0)
