I'm trying to run a 100-node AWS Batch job. When I set my compute environment to use only m4.xlarge and m5.xlarge instances, everything works fine and my job is picked up and runs.
However, when I begin to include other instance types in my compute environment, such as m5.2xlarge, the job is stuck in the RUNNABLE state indefinitely. The only variable I am changing in these updates is the instance types in the compute environment.
I'm not sure why the job is not picked up when I include other instance types in the compute environment. In the documentation for Compute Environment Parameters, the only note is:
When you create a compute environment, the instance types that you select for the compute environment must share the same architecture. For example, you can't mix x86 and ARM instances in the same compute environment.
The JobDefinition is multi-node:
- Node 0:
  - vCPUs: 1
  - Memory: 15360 MiB
- Node 1:
  - vCPUs: 2
  - Memory: 15360 MiB
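The node layout above could be expressed roughly as the following `nodeProperties` structure, written as a Python dict in the shape that boto3's `register_job_definition` accepts. This is only a sketch under assumptions: the job definition name and container image are hypothetical placeholders, and it assumes the "Node 1" group covers every node after the main node (`"1:"`), which the listing does not state explicitly.

```python
# Sketch of a multi-node job definition matching the node listing above.
# The name, image, and the "1:" node range are assumptions for illustration.
NUM_NODES = 100  # the job is described as a 100-node job


def node_range(target_nodes, vcpus, memory_mib, image="example/image:latest"):
    """Build one nodeRangeProperties entry; `image` is a placeholder."""
    return {
        "targetNodes": target_nodes,
        "container": {
            "image": image,
            "resourceRequirements": [
                # AWS Batch expects these values as strings.
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }


job_definition = {
    "jobDefinitionName": "example-multinode-job",  # hypothetical name
    "type": "multinode",
    "nodeProperties": {
        "numNodes": NUM_NODES,
        "mainNode": 0,
        "nodeRangeProperties": [
            node_range("0:0", vcpus=1, memory_mib=15360),  # Node 0
            node_range("1:", vcpus=2, memory_mib=15360),   # assumed: nodes 1 and up
        ],
    },
}
```

With credentials configured, something like `boto3.client("batch").register_job_definition(**job_definition)` would register it; the dict itself can be inspected without calling AWS.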
My compute environment's max vCPUs is set to 10,000, and the environment is always in a VALID state and always ENABLED. My EC2 vCPU limit is 6,000. CloudWatch provides no logs because the job has not started, so I'm not sure what else to try here. I am also not using the optimal setting for instance types because I ran into issues with not getting enough instances.
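As a rough sanity check on the limits above, the worst-case vCPU demand can be estimated. This sketch assumes one EC2 instance per node (multi-node parallel jobs run a single container per instance), assumes every node lands on the largest instance type listed, and assumes nothing else is consuming the EC2 vCPU quota. The per-instance vCPU counts come from the published EC2 specs; the helper name is made up for illustration.

```python
# vCPUs per instance type, from the published EC2 instance specifications.
VCPUS_PER_INSTANCE = {"m4.xlarge": 4, "m5.xlarge": 4, "m5.2xlarge": 8}

NUM_NODES = 100                 # nodes in the job
MAX_VCPUS_COMPUTE_ENV = 10_000  # compute environment max vCPUs
EC2_VCPU_LIMIT = 6_000          # account-level EC2 vCPU limit


def worst_case_vcpus(num_nodes, instance_types):
    """Assume one instance per node and the largest allowed type for every node."""
    return num_nodes * max(VCPUS_PER_INSTANCE[t] for t in instance_types)


needed = worst_case_vcpus(NUM_NODES, ["m4.xlarge", "m5.xlarge", "m5.2xlarge"])
fits = needed <= min(MAX_VCPUS_COMPUTE_ENV, EC2_VCPU_LIMIT)
print(needed, fits)  # prints "800 True"
```

Under those assumptions the job needs at most 800 vCPUs, well below both ceilings, which suggests the vCPU limits alone do not explain the job staying in RUNNABLE.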