Best practices to manage docker containers in AWS ECS service

2

5

Tech Stack

Python (monolithic API), Flask framework, PostgreSQL

We have deployed our Docker containers as follows:

  • Docker images are stored in ECR
  • Docker containers are deployed in ECS
  • In total, 25 Docker containers are deployed across 3 r5.large EC2 instances (2 vCPU, 16 GB each)
  • Each container is allocated a minimum/maximum memory of 1024/3072 MB, so each EC2 instance can hold 15 containers

We have recently been facing downtime caused by OOM (out of memory) errors, after which the containers on the affected EC2 instance start moving to another EC2 instance. This happens when, for some reason, the 3rd EC2 instance is not available, so until the 2nd instance is up and running we have downtime for that set of containers.

So we want to check whether the strategy we are using is the correct one.

We are also planning to move to smaller EC2 instances that each hold fewer containers, so that if the issue happens only a small number of sites go down instead of all 15. Are we going in the right direction?

Should we move to Fargate? What would the cost implications be compared to running ECS on EC2?

It would be great if somebody could help me find a good solution for this kind of issue.

In the near future we will have hundreds of containers and may reach 500 or more, so we have to decide on the best strategy for deployment, failover, and high availability.

Alphitomancy answered 3/2, 2021 at 6:57 Comment(3)
Fargate is usually more expensive than your own EC2 cluster; you can run the calculations yourself. The rest of this question is a bit too vague to be able to say much about, really. - Inconveniency
Thanks, Deceze, for the quick reply. OK about Fargate. What more information is required to get a proper answer? - Alphitomancy
Having smaller instances and slightly over-provisioning your autoscaling group is probably a good idea (i.e. you should scale out before you hit the wall, not when it's already too late). It sounds to me, though, that you should look into your memory limits first: those numbers don't add up if you assume that each container can actually use the maximum allocated memory (i.e. 3 GB). - Pleasurable
8

If you're getting OOM errors it means that your EC2 machines are overcommitted: you're running too many containers on them. At 15 containers per instance and a maximum of 3072 MB for each container, you're talking about 45 GB of possible memory usage (15 × 3072 MB) on machines that cap out at 16 GB. Once enough containers use the memory they're allowed, your machines are going to fall over, taking all the other tasks with them.
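The arithmetic can be checked with a short sketch. The figures are the ones from the question (15 containers per instance, a 3072 MB maximum, 16 GB instances); the function name is only illustrative.

```python
def worst_case_memory_mb(containers: int, hard_limit_mb: int) -> int:
    """Memory needed if every container reaches its maximum at the same time."""
    return containers * hard_limit_mb


instance_memory_mb = 16 * 1024               # r5.large: 16 GB
worst_case = worst_case_memory_mb(15, 3072)  # figures taken from the question
overcommit_ratio = worst_case / instance_memory_mb

print(f"worst case: {worst_case} MB on a {instance_memory_mb} MB instance")
print(f"overcommit ratio: {overcommit_ratio:.2f}x")
```

Any ratio above 1 means the instance can run out of memory if enough containers grow toward their maximum together.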

So the first thing you can do right now is lower the number of tasks per machine, or lower the max memory so the tasks have less memory to use overall. Since you only have 2 CPUs on each machine, I would suggest tuning it so each machine runs two tasks in total with the memory split between them, making sure other settings (max connections, workers, etc.) are raised accordingly.
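As a rough sketch of what lowering the limits could look like: in an ECS container definition, `memoryReservation` is the soft limit used when placing tasks and `memory` is the hard limit above which the container is killed. This assumes the 1024/3072 figures in the question map to those two fields, which the question does not confirm. The container name, image URI, and the 2048 MB value are placeholders, not recommendations.

```python
def container_definition(name: str, image: str, soft_mb: int, hard_mb: int) -> dict:
    """Build one entry for the containerDefinitions list of an ECS task definition."""
    if hard_mb <= soft_mb:
        # ECS expects the hard limit to be greater than the soft limit when both are set.
        raise ValueError("hard limit must be greater than the soft limit")
    return {
        "name": name,
        "image": image,
        "memoryReservation": soft_mb,  # soft limit, used when placing tasks
        "memory": hard_mb,             # hard limit, container is killed above this
        "essential": True,
    }


definition = container_definition(
    name="flask-api",  # placeholder name
    image="<account>.dkr.ecr.<region>.amazonaws.com/flask-api:latest",  # placeholder URI
    soft_mb=1024,
    hard_mb=2048,      # example of a lowered maximum
)
print(definition)
```

A dictionary like this could then be passed in the `containerDefinitions` list of boto3's `register_task_definition`, or written as JSON in the task definition file.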

You also asked about Fargate. My company uses both EC2 and Fargate for our containers, and we have a policy that if there isn't a specific reason to run things on EC2 (such as needing GPUs) we put them on Fargate. While it costs a bit more (less so with Compute Savings Plans, but still more), the benefits are really nice. Each task runs separately, reducing the chances of one task taking out a bunch of others. It also means a faster scale-up period, because we don't have to wait for the EC2 instances to scale up and join the cluster, which is really important if you're using app autoscaling to respond to a sudden influx of traffic.

The biggest benefit of Fargate is decreased complexity, which in turn means our team has less to worry about; the time and stress savings for the devs can be far more valuable than the extra money spent. The simple fact that we never have to worry about things like upgrading the ECS Agent or integrating with Patch Manager for security updates, and that we don't need to cycle machines regularly to replace them with new builds, means we can spend time on other parts of our infrastructure instead.

As I mentioned above, though, there are cases where Fargate isn't appropriate. For us the biggest use case for running on EC2 instances is being able to select the GPU types we use for running ML. For this we built our own Amazon Machine Image (AMI) that works with the various GPU instances AWS offers. This is basically the only place where I'm not using Fargate, as those models need the EC2 instance GPUs.

Syndactyl answered 8/2, 2021 at 6:46 Comment(0)
1

joisar,

We also faced the same thing, so here is some info on how I see it.

From your specs I can draw some numbers. You mentioned you are using 3 EC2 instances of type r5.large (2 vCPU, 16 GB memory each). This means you have:

Total CPU = 6 vCPUs and total memory = 48 GB

The max memory specified in your configuration is 3072 MB, and you mention 25 containers deployed across these 3 instances. [Not sure how, unless some of the containers have less memory.]

First of all, a single EC2 instance cannot hold more than 5 containers with these specs. The calculation is as follows:

16 GB = 16 * 1024 = 16384 MB

16384 / 3072 = 5.33 [so at most 5 containers in a single EC2 instance]

But remember that you are launching containers on ECS-managed EC2 instances, and the instance itself needs some free space and memory for its own operations. You are NOT leaving much free memory for the instance, because you have allocated all of the memory to your containers. [I am assuming the worst case, when all 5 containers are using 3072 MB of memory.] At that point you are out of memory. You have to choose the max memory value so that the EC2 instance keeps some free memory for its own operation.
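A minimal sketch of that sizing check follows. The amount of memory to leave for the host is an assumption here (1024 MB and 2048 MB are example values only); the right figure depends on the AMI and what else runs on the instance.

```python
def max_containers(instance_mb: int, per_container_mb: int, host_reserved_mb: int = 0) -> int:
    """Number of containers that fit once some memory is set aside for the host."""
    usable_mb = instance_mb - host_reserved_mb
    return max(usable_mb // per_container_mb, 0)


instance_mb = 16 * 1024  # r5.large: 16 GB

# Sized by the 3072 MB maximum, with and without an assumed 2048 MB for the host.
at_max_no_headroom = max_containers(instance_mb, 3072)
at_max_with_headroom = max_containers(instance_mb, 3072, host_reserved_mb=2048)

# Sized by the 1024 MB minimum, with an assumed 1024 MB for the host.
at_min_with_headroom = max_containers(instance_mb, 1024, host_reserved_mb=1024)

print(at_max_no_headroom, at_max_with_headroom, at_min_with_headroom)
```

Sizing by the minimum explains how 15 containers land on one instance, while sizing by the maximum shows how few of them can safely reach their limit at the same time.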

The advantages of reducing max memory are:

  1. There is more memory left for the EC2 instance itself
  2. You can run 2 smaller tasks for each service in ECS, which gives you high availability

Try to analyze which containers use more memory, allocate more to those, and specify less for the others. You have to balance memory across the containers. That can be a pain point for many, and this is where Fargate can help.

You also mention that you are planning to change the EC2 instance size. Go for memory-optimized instances. And yes, Fargate can be the best option, but it comes at a higher cost.

For scalability, define autoscaling policies. The policies should also reflect that traffic is usually lower at night, so you can reduce the number of EC2 machines in the cluster during those hours. This saves cost, and you can spend the savings on more EC2 capacity during peak hours.

In the end, you have to come up with your own numbers and monitor them. It's not a one-day process; it keeps evolving.
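One possible way to express the night and day schedule is with scheduled actions on the EC2 Auto Scaling group behind the cluster. The sketch below only builds the arguments for boto3's `put_scheduled_update_group_action`; the group name, cron times (UTC), and capacities are all hypothetical and would need to come from your own traffic data.

```python
def scheduled_action(group: str, name: str, cron_utc: str,
                     desired: int, min_size: int, max_size: int) -> dict:
    """Build keyword arguments for put_scheduled_update_group_action."""
    if not min_size <= desired <= max_size:
        raise ValueError("desired capacity must lie between min and max size")
    return {
        "AutoScalingGroupName": group,
        "ScheduledActionName": name,
        "Recurrence": cron_utc,  # cron expression, evaluated in UTC by default
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }


GROUP = "ecs-cluster-asg"  # hypothetical Auto Scaling group name

# Example values only: fewer instances at night, more during the day.
night = scheduled_action(GROUP, "scale-in-night", "0 22 * * *",
                         desired=2, min_size=2, max_size=6)
day = scheduled_action(GROUP, "scale-out-day", "0 6 * * *",
                       desired=4, min_size=3, max_size=6)

# To apply (requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("autoscaling")
# for action in (night, day):
#     client.put_scheduled_update_group_action(**action)
```

Scheduled actions cover predictable daily patterns; metric-based policies would still be needed for unexpected spikes.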

Ovariectomy answered 12/2, 2021 at 15:35 Comment(1)
Thanks, Bhavuk, for the detailed reply. If you look at my question, when I say 3072, it is the maximum memory one container can use on the EC2 instance. The minimum is 1024, and that's why one EC2 instance can hold 15 containers. - Alphitomancy
