Why are AWS Batch Jobs stuck in RUNNABLE?
I use a compute environment of 0-256 m3.medium on-demand instances. My job definition requires 1 vCPU and 3 GB of RAM, which an m3.medium has.

What are possible reasons why AWS Batch Jobs are stuck in state RUNNABLE?

AWS says:

A job that resides in the queue, has no outstanding dependencies, and is therefore ready to be scheduled to a host. Jobs in this state are started as soon as sufficient resources are available in one of the compute environments that are mapped to the job’s queue. However, jobs can remain in this state indefinitely when sufficient resources are unavailable.

but that does not answer my question.

Aeneus answered 8/1, 2018 at 13:29 Comment(0)
There are other reasons why a Job can get stuck in RUNNABLE:

  • Insufficient permissions for the role associated with the Compute Environment
  • No internet access from the Compute Environment instance. You will need to associate a NAT Gateway or Internet Gateway with the Compute Environment's subnet.
    • Make sure to check the "Enable auto-assign public IPv4 address" setting on your Compute Environment's subnet. (Pointed out by @thisisbrians in the comments.)
  • Problems with your image. You need to use an ECS-optimized AMI or make sure the ECS container agent is running. More info in the AWS docs.
  • You're trying to launch instances for which your account's limit is 0 (EC2 console > Limits, in the left menu). (Read more in gergely-danyi's comment.)
  • And, as mentioned above, insufficient resources
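
On the insufficient-resources point, a quick sanity check is to compare the job definition's requirements against what the instance type actually offers; ECS reserves some memory for the agent and OS, so usable memory is less than the nominal figure. A minimal Python sketch — the instance spec and reserve figure below are illustrative assumptions, not authoritative values:

```python
# Rough check: does a Batch job definition fit on a given instance type?
# Specs are illustrative; confirm real values in the EC2 documentation.
INSTANCE_SPECS = {
    "m3.medium": {"vcpus": 1, "memory_mib": 3840},
}

ECS_AGENT_RESERVE_MIB = 256  # assumed overhead for the ECS agent/OS

def job_fits(instance_type, job_vcpus, job_memory_mib):
    """Return True if the job's vCPU and memory requests fit on the instance."""
    spec = INSTANCE_SPECS[instance_type]
    usable_memory = spec["memory_mib"] - ECS_AGENT_RESERVE_MIB
    return job_vcpus <= spec["vcpus"] and job_memory_mib <= usable_memory

# A 1-vCPU / 3 GiB job fits on m3.medium, but a 4 GiB job does not:
print(job_fits("m3.medium", 1, 3072))  # True
print(job_fits("m3.medium", 1, 4096))  # False
```

If the job asks for more vCPUs or memory than any instance type in the compute environment can supply, it will sit in RUNNABLE forever.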

Also, make sure to read the AWS Batch troubleshooting guide.

Solley answered 9/2, 2018 at 12:37 Comment(6)
In my particular case, I had to check the "Enable auto-assign public IPv4 address" setting on my Compute Environment's subnet to get my jobs to run. – Applicable
For me, Batch tried to launch an instance type that was limited to 0 instances in the EC2 limits settings. Check this: forums.aws.amazon.com/thread.jspa?threadID=263152. I had the c5 instance family specified in my compute resources, and even though c5.large had a limit of 5, Batch decided to launch a bigger type with a limit of 0 (instead of spinning up multiple c5.large instances). I narrowed my compute resources down to c5.large, which resolved the issue. Alternatively, you can request a limit adjustment. – Persistence
On "Problems with your image": this does not refer to the Docker image, does it? If I select managed Batch instances, it will automatically spin up AWS Linux AMIs and then run the Docker image defined in my job definition on the auto-generated ECS cluster? Do I need to specify or have running any ECS or EC2 resources when I select the managed Batch option? So it should be fine if my Docker image runs from openjdk:8-jre-slim? – Aksel
In my AWS console I can see that the new EC2 instance is created; however, the Batch job is still stuck in RUNNABLE, and the instance is just left running with no jobs executed. – Aksel
If you are using a Docker image, make sure you include the tag (for me it was the ":latest" suffix) in the image reference. As soon as I fixed that, AWS Batch was able to detect that my ComputeEnvironment was invalid: I had used EcsInstanceRole as the InstanceRole, but you should use EcsInstanceProfile (which references EcsInstanceRole) instead. After fixing these two, it was no more than 5 minutes before the jobs kicked off. – Cainozoic
In my particular case, I was requesting more CPUs than I had available (due to a mistake, 1024 in this case :D). – Minimalist
The roles should be defined with, at least, the following policies and trust relationships. Otherwise, jobs will get stuck in RUNNABLE because the roles don't have enough privileges to start them:

 AWSBatchServiceRole

  • Attached policies: AWSBatchServiceRole
  • Trusted relationship: batch.amazonaws.com

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
             "Service": "batch.amazonaws.com"
           },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    

ecsInstanceRole

  • Attached policies: AmazonEC2ContainerServiceforEC2Role
  • Trusted relationship: ec2.amazonaws.com

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
             "Service": "ec2.amazonaws.com"
           },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
Aga answered 5/7, 2018 at 13:1 Comment(2)
Are these two roles maintained by AWS? – Littell
In my case, I needed a few additional policies: AmazonECS_FullAccess, AmazonEC2ContainerRegistryReadOnly, AmazonEKSFargatePodExecutionRolePolicy (all AWS managed). I also needed an additional Allow trust relationship for "Service": "ecs-tasks.amazonaws.com", "Action": "sts:AssumeRole". – Cheeky
I just fought with this for a while, and found the answer.

One possible reason jobs get stuck in RUNNABLE is that there are no instances available to run them on. If this is the case, looking at the Auto Scaling group, as mentioned in the answer above, can show you the actual error that's preventing instances from being started, guiding you to the exact problem rather than leaving you to try any number of solutions to problems you don't have. Error messages are our friends.

Galloot answered 12/9, 2018 at 19:8 Comment(0)
In case it is useful, I wanted to share this really helpful video from an AWS Cloud Support Engineer:

https://aws.amazon.com/premiumsupport/knowledge-center/batch-job-stuck-runnable-status/

Mercurialize answered 27/3, 2020 at 15:29 Comment(0)
Your compute environment might be invalid. Check AWS Batch -> Compute Environments -> Status column. Mine said invalid, and this symbol was next to the compute environment name:

(screenshot: warning icon next to the compute environment name)

Clicking on the compute environment gave me more information - my AMI ID was wrong.

Trackless answered 8/1, 2019 at 19:33 Comment(0)
I fumbled with this one last night, pulling out the last hairs on my head, when I realized something. I had checked everything everyone mentioned above with no success. As a last resort, I decided to create a new compute environment just in case (I had used a CloudFormation template to create my resources, but more on that later) and BOOM, my newly submitted job ran right away! So I ran "aws batch describe-compute-environments" to compare it with the CloudFormation-created one. The only thing that differed was the number of subnets associated with the VPC: in the environment I created with the console I used the default selection (3 subnets), but in my CloudFormation template I was lazy, so I entered just one! To confirm this, I modified my original compute environment to add the 2 other subnets, and guess what? My new jobs also ran right away.

The thing is, I have used this CloudFormation template in many other accounts/projects and never had this issue. But suddenly, with this particular setup (the CA-CENTRAL-1 region?!), one subnet wouldn't cut it. And this VPC contains no running EC2 machines, so it's impossible that the subnet ran out of IPs or anything like that. So I went back to my template and added the missing subnets so I don't get bitten again! Hope this helps someone and their hair.
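
The comparison done here by eye can be scripted. A sketch that diffs the subnets of two compute environments from "aws batch describe-compute-environments" output; the sample data below is made up to illustrate the shape (real output contains many more fields):

```python
import json

# Illustrative, abbreviated output of `aws batch describe-compute-environments`.
sample = json.loads("""
{
  "computeEnvironments": [
    {"computeEnvironmentName": "console-created",
     "computeResources": {"subnets": ["subnet-a", "subnet-b", "subnet-c"]}},
    {"computeEnvironmentName": "cfn-created",
     "computeResources": {"subnets": ["subnet-a"]}}
  ]
}
""")

# Map each environment name to its set of subnets.
envs = {e["computeEnvironmentName"]: set(e["computeResources"]["subnets"])
        for e in sample["computeEnvironments"]}

# Subnets present in the working environment but missing from the broken one.
missing = envs["console-created"] - envs["cfn-created"]
print(sorted(missing))
```

Piping the real CLI output into a script like this makes the subnet mismatch obvious instead of requiring a line-by-line visual diff.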

Detoxify answered 23/1, 2023 at 2:46 Comment(0)
We followed the workshop at https://github.com/aws-samples/aws-genomics-nextflow-workshop

and used the below 2 CloudFormation templates:

https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=Nextflow&templateURL=https://s3.amazonaws.com/pwyming-demo-templates/nextflow-workshop/cloud9.cfn.yaml

https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=Nextflow&templateURL=https://s3.amazonaws.com/pwyming-demo-templates/nextflow-workshop/nextflow/nextflow-aio.template.yaml

We followed these steps:

Accessed an IAM user role with admin privilege / accessed the root user.

The "VpcStack template not available" error was cleared by following the second CloudFormation template. Successfully created all resources from both CloudFormation stacks.

Covered all the points, with the expected results, in the Cloud9 Environment Setup: docs/modules/module-0__cloud9-environment.md from the GitHub source code.

Verified all the resources created, as per AWS Resources: docs/modules/module-2__aws-resources.md from the GitHub source code. All the resources were created automatically by the second CloudFormation template.

Ran the bash command "nextflow run hello" from Running Nextflow: docs/modules/module-1__running-nextflow.md from the GitHub source code. The AWS Batch job showed as "Runnable" in the AWS Batch dashboard for an hour. Basically, the AWS Batch job is stuck in RUNNABLE status (after creating the ASG and EC2 instances).

The expected result is to complete the workshop as per the GitHub source code, process the .fastq files, and get results.

Benedict answered 24/3, 2023 at 13:10 Comment(1)
This is a very old and out-of-date workshop, and it is in the process of being retired. This workshop using the Amazon Genomics CLI may be more useful for you: catalog.workshops.aws/agc-pipelines/en-US – Quaver
In my case, I was using an ECS-optimized image that didn't have GPU support. To find the recommended GPU-enabled AMI, I had to run the following (source):

    aws ssm get-parameter --name /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended --region ap-southeast-2 --output json

...and use the image_id to set up my Launch Template AMI. Note that you should replace the region with your own.
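
The command returns the AMI details as a JSON string nested inside the parameter value, so extracting image_id takes two parses. A short Python sketch; the payload below is abbreviated and the AMI ID is made up for illustration:

```python
import json

# Abbreviated sample of the `aws ssm get-parameter` output; the real
# parameter value contains additional fields and a real AMI ID.
raw = """
{
  "Parameter": {
    "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended",
    "Value": "{\\"image_id\\": \\"ami-0123456789abcdef0\\"}"
  }
}
"""

# First parse: the CLI response. Second parse: the JSON string in "Value".
parameter = json.loads(raw)["Parameter"]
ami = json.loads(parameter["Value"])["image_id"]
print(ami)  # the AMI to put in the Launch Template
```

The double json.loads is the key step; the parameter's "Value" field is itself a JSON document, not a plain AMI ID.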

Cheeky answered 3/4, 2023 at 19:47 Comment(0)
I tried all these suggestions, but in my case something was wrong with the role definition for the instance role in the compute environment. I tried creating a new compute environment with the "create a new role" option for the Instance Role field, and it worked. To be honest, I did not discover the reason for the issue, because the policies and trust relationships are the same as in the role created by the form.

Bukovina answered 9/6, 2023 at 20:41 Comment(0)
In my case, the assumed role did not have permissions on ECS. Adding the required ECS permissions resolved the issue, and all jobs went from RUNNABLE to RUNNING.

Korry answered 16/4 at 18:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.