Tech Stack
Python (monolith API) - Flask framework, PostgreSQL
We have deployed our Docker containers as follows:
- The Docker image is stored in ECR
- The Docker containers are deployed in ECS
- In total, 25 Docker containers are deployed on 3 R5 large EC2 instances (2 vCPU, 16 GB each)
- 1024/3072 minimum and maximum memory is allocated to each container, so each EC2 instance can hold 15 containers
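To make the memory picture concrete, here is a rough back-of-the-envelope sketch of the numbers above. It assumes the 1024/3072 figures are in MiB and behave like ECS's soft (`memoryReservation`) and hard (`memory`) limits; the function name is just for illustration. It also ignores the memory used by the OS and the ECS agent, so the real headroom is somewhat smaller.

```python
# Figures taken from the post; the MiB unit is an assumption.
INSTANCE_MEMORY_MIB = 16 * 1024   # R5 large: 16 GB
SOFT_LIMIT_MIB = 1024             # minimum (reserved) memory per container
HARD_LIMIT_MIB = 3072             # maximum memory per container
CONTAINERS_PER_INSTANCE = 15


def memory_summary(instance_mib, soft_mib, hard_mib, containers):
    """Compare reserved memory with the worst case where every container bursts."""
    reserved = containers * soft_mib
    worst_case = containers * hard_mib
    return {
        "reserved_mib": reserved,
        "worst_case_mib": worst_case,
        # How many times the instance memory could be demanded in the worst case.
        "overcommit_ratio": round(worst_case / instance_mib, 2),
        # Containers that fit if each one is allowed to reach its hard limit.
        "containers_fitting_at_hard_limit": instance_mib // hard_mib,
    }


if __name__ == "__main__":
    summary = memory_summary(
        INSTANCE_MEMORY_MIB, SOFT_LIMIT_MIB, HARD_LIMIT_MIB, CONTAINERS_PER_INSTANCE
    )
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Under these assumptions, 15 containers fit by reservation, but if several of them grow towards the maximum at the same time, the instance can run out of memory well before any single container hits its own limit.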
Lately we have been facing downtime caused by OOM (Out of Memory) errors, after which the containers on the affected EC2 instance start moving to another EC2 instance. This happens when, for some reason, the 3rd EC2 instance is not available, so until the 2nd instance is up and running we face downtime for that set of containers.
So we want to check: is the strategy we are using the correct one?
We are also planning to move to smaller EC2 instances, each holding fewer containers, so that if the issue happens only a small number of sites go down instead of all 15. Are we going in the right direction?
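As a rough illustration of the "smaller instances" idea, the sketch below compares how many sites go down when a single instance is lost. The figure of 5 containers per smaller instance is a made-up example, not something we have decided on, and the function name is hypothetical.

```python
def worst_case_outage(total_sites, sites_per_instance):
    """Return (sites down, percent of all sites) when one full instance is lost."""
    down = min(total_sites, sites_per_instance)
    return down, round(100 * down / total_sites, 1)


if __name__ == "__main__":
    # 25 sites in total, as in the current setup.
    print("15 per instance:", worst_case_outage(25, 15))
    # Hypothetical smaller instance holding 5 containers.
    print("5 per instance:", worst_case_outage(25, 5))
```

This only measures the blast radius of one instance failure. It says nothing about cost or about how quickly the lost containers get rescheduled.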
Should we move to Fargate? What would the cost implications be compared to our current ECS-on-EC2 setup?
It would be great if somebody could help me find a good solution for this kind of issue.
In the near future we will have hundreds of containers, possibly up to 500, so we have to decide on the best strategy for deployment, failover, and high availability.
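To get a feel for the scale involved, here is a small sketch that estimates how many EC2 instances would be needed for a given number of containers, keeping one spare instance so the cluster can absorb a single instance failure. The one-spare rule and the function name are assumptions for illustration only; the 15-per-instance figure is our current packing.

```python
import math


def instances_needed(containers, per_instance, spare=1):
    """Instances required to place all containers, plus spare failover capacity."""
    return math.ceil(containers / per_instance) + spare


if __name__ == "__main__":
    for containers in (25, 100, 500):
        count = instances_needed(containers, per_instance=15)
        print(f"{containers} containers -> {count} instances (15 per instance)")
```

The same function can be called with a smaller `per_instance` value to see how the instance count grows if we pack fewer containers on each machine.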