How to increase network bandwidth of AWS EC2 instance?

Asked 31/3, 2015 at 14:37 Answered 25/10, 2016 at 21:20

amazon-web-services amazon-ec2 bandwidth

We hosted a site on an AWS EC2 instance of type c4.8xlarge. It is a fairly large system with a lot of memory and compute resources. Thousands of users tried to access the system during a 2-hour timeframe this weekend. While it did not crash, it slowed down quite a bit and failed to perform at the expected level. Analyzing the stats showed that limited network bandwidth is the main cause of the slowdown. The CPU usage stayed below 6%, but NetworkIn and NetworkOut seem to have peaked at 60MB and 200MB respectively during that timeframe. While I'm not a networking expert, some reading online seemed to indicate that all the traffic going through one NIC could be the main cause of the limited network bandwidth. Is this true? Would hosting the site on a different type of EC2 instance help increase the network bandwidth? Here is how the NetworkIn and NetworkOut metrics looked under heavy load.

networkIn and networkOut metrics chart

Estival answered 31/3, 2015 at 14:37 Comment(6)

Why just one instance? Can you scale horizontally? – Caltrop 31/3, 2015 at 14:39

I could, and maybe I should. I understand the risks associated with a single instance, but the application has little business value and those are acceptable risks. It's a once-a-year thing. Scaling horizontally to get around CPU, memory, or storage limitations is understandable, but having to do that just to achieve higher bandwidth seems like a bummer. 200MB NetworkIn and 60MB NetworkOut seems too low though; maybe I'm wrong. And I'm not even sure if it is per second. AWS CloudWatch doesn't specify that clearly. – Estival 31/3, 2015 at 17:16

While your instance does have a 10 Gbit network interface, it's unclear whether it should be able to achieve that performance from EC2 to the internet, or whether that performance is limited to inter-instance communication. The throughput you are getting is around 1.8 Gbps with overhead. Have you enabled enhanced networking? docs.aws.amazon.com/AWSEC2/latest/UserGuide/… – Logan 31/3, 2015 at 23:11

Apparently AWS measures bandwidth in 60-second intervals by default. So in common terms, what I really got from the EC2 instance at its peak usage is 1MB/sec NetworkOut and 3.3MB/sec NetworkIn. Wow! That's unbelievably low. Still not sure how to fix it though. forums.aws.amazon.com/message.jspa?messageID=389391 – Estival 2/4, 2015 at 11:36

@MikeBrant How would scaling horizontally help if you still have to go through a load balancer with similar or even lower bandwidth limitations? – Clean 20/12, 2015 at 10:36

@MikeBrant While scale-out for failover is a good idea, something is very wrong if a server with 36 cores, 60 GB of RAM, and a 10 GBit interface can't handle thousands of users without breaking a sweat. A reasonably configured c4.large should handle a few thousand requests per second easily, and it is 1/16 as powerful (I have seen them do this, and also less-potent VMs). – Centiliter 1/4, 2016 at 23:41

If you were limited by bandwidth, that graph would go flat when you hit the limit. Further, as others pointed out, that is only 1 MB/s out and 3 MB/s in, and I can do more than that on a t2.micro to the external internet.
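The per-second figures above come from simple arithmetic on the CloudWatch numbers. Here is a minimal sketch of that conversion, assuming each datapoint is a sum of bytes over a 60-second period (as the comments suggest), that "MB" means 10^6 bytes, and that the peaks were 60MB out and 200MB in as quoted in the comments. The function names are made up for illustration.

```python
# Assumption: a CloudWatch NetworkIn/NetworkOut datapoint is the total number
# of bytes transferred during the period (60 seconds here), not a rate.
MB = 1_000_000  # decimal megabyte, assumed


def per_second_rate(bytes_in_period: float, period_seconds: int = 60) -> float:
    """Convert a per-period byte total into bytes per second."""
    return bytes_in_period / period_seconds


def to_megabits_per_second(bytes_per_second: float) -> float:
    """Convert bytes per second into megabits per second."""
    return bytes_per_second * 8 / 1_000_000


network_out = per_second_rate(60 * MB)   # peak NetworkOut per the comments
network_in = per_second_rate(200 * MB)   # peak NetworkIn per the comments

for name, rate in (("NetworkOut", network_out), ("NetworkIn", network_in)):
    print(f"{name}: {rate / MB:.2f} MB/s = {to_megabits_per_second(rate):.1f} Mbit/s")
```

Both rates are a tiny fraction of a 10 Gbit interface, which is why the network is an unlikely culprit here.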

What is the system doing with each request? Here is a list of things I would look at, in order:

Threading: are there bottlenecks in your application where only one thread can access a resource? That would keep CPU use low but cause exactly the pattern you saw.
Bad concurrency patterns in your application or server. Load test and look for it getting slower and slower as connections increase, while the server appears to be doing nothing.
Individual CPU: is one CPU loaded to 100% while others are mostly idle? (with 30+ cores, a saturated CPU would only give you 3% CPU use). One saturated CPU + others idle usually means a concurrency issue, probably in connection handling.
What is memory use like? Are you using swap at all? (that is a VERY bad sign if so, and would cause the issue). If memory use is excessive, often session storage in memory or excessively sized handler thread pools are at fault.
Disk I/O or external network requests: are you reading or writing with each request? vmstat will tell if you are spending a long time waiting for I/O to be serviced. If that is the case, I'd look at logging before anything.
- The c4.8xlarge instances use EBS only. If the storage is magnetic and you write to access logs, you get a few hundred writes per second. General purpose SSDs give you 3 IO/s per GB base, but can burst to 3000 until they run out of IO credits.
- The OS will try to combine writes, but with thousands of concurrent

It's not impossible, but very unlikely, that you are bottlenecked at the network layer on connection creation or packets per second, if your requests are very small.
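Two of the checks in the list above are easy to sanity-check with arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the per-core numbers are made up rather than measured, the 36-core count comes from the comments in this thread, and the 100 GB volume size is an assumed example.

```python
from typing import Sequence, Tuple


def average_cpu_percent(per_core_percent: Sequence[float]) -> float:
    """Overall CPU utilisation as the mean of the per-core utilisations."""
    return sum(per_core_percent) / len(per_core_percent)


def busiest_core(per_core_percent: Sequence[float]) -> Tuple[int, float]:
    """Return (index, utilisation) of the most loaded core."""
    index = max(range(len(per_core_percent)), key=lambda i: per_core_percent[i])
    return index, per_core_percent[index]


def gp2_baseline_iops(volume_size_gb: int, iops_per_gb: int = 3) -> int:
    """Baseline IOPS using the 3 IO/s per GB figure quoted above."""
    return volume_size_gb * iops_per_gb


# Hypothetical sample: one saturated core, 35 idle ones (36 cores in total).
cores = [100.0] + [0.0] * 35
print(f"average: {average_cpu_percent(cores):.1f}%")  # low overall figure
print(f"busiest core: {busiest_core(cores)}")         # reveals the hot core

# Assumed 100 GB general purpose SSD volume, before any burst credits.
print(f"baseline IOPS: {gp2_baseline_iops(100)}")
```

A low average therefore says little on its own; the per-core breakdown (for example from `top` or `mpstat`) is what shows a single-threaded bottleneck.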

Centiliter answered 1/4, 2016 at 23:7 Comment(0)

Yes, Amazon has a concept of ENI (Elastic Network Interface). While you can add additional NICs to the instance, each is still a logical interface. The provisioning and availability of the network pipe depends entirely on the instance type you choose. Amazon has several types / families of instances, such as R, I, C, D, and G, optimized for memory, IO, compute, dense storage, and GPU respectively. You can see if you can squeeze the maximum out of them.

Irrespective of whichever instance type you choose, you will eventually hit a threshold and won't be able to scale beyond a certain point. Network scalability in particular differs from other scalability factors like memory / CPU.

Modify your architecture: rather than having one very big instance, put several medium or large instances behind an ELB.

Yerga answered 31/3, 2015 at 15:4 Comment(3)

Thanks. Any other thoughts based on my comments above? – Estival 31/3, 2015 at 18:14

How would having several instances help if you still have to go through a load balancer with similar or even lower bandwidth limitations? (Assuming you would still use an ec2 instance as your load balancer with something like haproxy installed). – Clean 20/12, 2015 at 10:41

While it's not hip, scale-up is a viable solution. This entire site and all of Stack Exchange runs on just 25 servers. They've stated that they can actually run with just a single web server, and their servers have specs very similar to the c4.8xlarge (but with better storage). I seriously doubt that they have hit a vertical scaling limit here. It's probably a configuration or code problem, not a hardware limitation. – Centiliter 1/4, 2016 at 23:22

Your NetworkIn and Out is actually >50mb/s. If your CPU and memory stayed within reasonable ranges, then your instance is fine. You should also check the connection log on your database (assuming you are running a relational database with your system). The slowdown could actually be caused by slow responses from your database, which make the web server respond more slowly.

Also, you should run your system behind an AWS load balancer and set up an autoscaler with a trigger on network in/out. That way a secondary instance gets launched to assist with a temporary increase in load on the network. If the root cause is indeed the increase in connections to your database, then the load balancer will not help with the problem. Instead, you want to improve your cache setup so there is less burden on the database per user/connection to your website.
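One way the network-based autoscaling trigger could be expressed is as a target tracking policy via boto3. This is only a sketch under assumptions: the group name, policy name, and target value are hypothetical, and the unit of the target follows the underlying CloudWatch metric, so check the AWS documentation before picking a number. The policy is built as a plain dict first so it can be inspected without touching AWS.

```python
def network_target_tracking_policy(
    group_name: str,
    target_value: float,
    policy_name: str = "scale-on-network-out",  # hypothetical name
) -> dict:
    """Build keyword arguments for the Auto Scaling put_scaling_policy call."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": policy_name,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                # Average bytes sent per instance in the group.
                "PredefinedMetricType": "ASGAverageNetworkOut",
            },
            "TargetValue": float(target_value),
        },
    }


def apply_policy(policy: dict) -> dict:
    """Send the policy to AWS; needs boto3 and valid credentials."""
    import boto3  # imported lazily so building the dict works without boto3

    client = boto3.client("autoscaling")
    return client.put_scaling_policy(**policy)


# "web-asg" and the target value are placeholders, not recommendations.
policy = network_target_tracking_policy("web-asg", 500_000_000)
print(policy["TargetTrackingConfiguration"])
```

Calling `apply_policy(policy)` would register it against a real Auto Scaling group. As noted above, this only helps if the network (and not the database) is the actual bottleneck.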

Annalisaannalise answered 25/10, 2016 at 21:20 Comment(0)
