We hosted a site in AWS EC2 of type c4.8xlarge. It is a fairly large system with lot of memory and compute resources. Thousands of users tried to access the system during a 2 hour timeframe this weekend. While it did not crash, it slowed down quite a bit and failed to perform at the expected level. Analyzing the stats showed that limited network bandwidth is the main cause of the slowdown. The CPU usage stayed below 6%, but NetworkIn and NetworkOut seem to have peaked at 60MB and 200MB respectively during that timeframe. While I'm not an networking expect, some reading online seemed to indicate that all the traffic going through one NIC could be the main cause of limited network bandwidth. Is this true? Would hosting the site on a different type of EC2 instance help increase the network bandwidth? Here is how the networkIn and networkOut metrics looked like under heavy load.
If you were limited by bandwidth, that graph would become flat when you hit the limit. Further, as others pointed out that is only 1 MB/s out and 3 MB/s in, and I can do more than that on a t2.micro to the external internet.
What is the system doing with each request? Here are a list of things I would look at, in order:
- Threading: are there bottlenecks in your application where only one thread can access a resource? That would keep CPU use low but cause exactly the pattern you saw.
- Bad concurrency patterns in your application or server. Load test and look for it getting slower and slower as connections increase, while doing nothing.
- Individual CPU: is one CPU loaded to 100% while others are mostly idle? (with 30+ cores, a saturated CPU would only give you 3% CPU use). One saturated CPU + others idle usually means a concurrency issue, probably in connection handling.
- What is memory use like? Are you using swap at all? (that is a VERY bad sign if so, and would cause the issue). If memory use is excessive, often session storage in memory or excessively sized handler thread pools are at fault.
- Disk I/O or external network requests: are you reading or writing with each request? vmstat will tell if you are spending a long time waiting for I/O to be serviced. If that is the case, I'd look at logging before anything.
- The c4.8xlarge instances use EBS only, if the storage is magnetic and you write to access logs, you get a few hundred writes per second. General purpose SSDs give you 3 IO/s per GB base, but can burst to 3000 until they run out of IO credits.
- The OS will try to combine writes, but with thousands of concurrent
It's not impossible but very unlikely that you might be bottlenecked at the network layer with connection creation or packets-per-second, if your requests are very small.
Yes Amazon has a concept of ENI - Elastic Network Interface. While you can additional NIC to the instance; it is still a logical interface. The provisioning and availability of the network pipe highly depends on (well purely depends on) the type instance you choose. Amazon has several types / family of instance like R, I, C, D, G - optimized at Memory, IO, Compute, Dense Storage, GPU respectively. You can see if you can squeeze max. out of them.
Irrespective of what ever you choose as instance type, you would essentially hit a threshold and wouldn't be able to scale beyond a certain point. Scalability particularly unique against other scalability factors like Memory / CPU.
Modify your architecture and rather than having very big / bigger instance have several of the medium or large instances behind and ELB.
Your NetworkIn and Out is actually >50mb/s. If your CPU and Memory stayed within reasonable ranges then your instance is fine. You should also check the connection log on your database (assuming you running an RDB with your system) the slow down actually could be caused from the slow response on your database that makes the web server respond slower.
Also, you should run your system with a AWS Loadbalancer and setup and autoscaler with a trigger on the network in/out. That way a secondary instance gets launched to assist with the temporary increase in load on the network. If the root cause is indeed the increase in connection on your database, then the load balancer will not help with the problem. Instead you want to improve your cache setup so there is less burden on the database per user/connection to your website.
© 2022 - 2024 — McMap. All rights reserved.