Lambda throttling below concurrency limit

Asked 21/6, 2018 at 4:5 Answered 26/10, 2018 at 22:42

Solved amazon-web-services aws-lambda aws-api-gateway throttling

We use Lambda to power APIs (via API Gateway) that are accessed by news media websites, so we receive a high but fluctuating volume of traffic. We began seeing throttling, so we raised our concurrency limit to 2000. However, we are still throttled multiple times per day.

Oddly, in CloudWatch metrics, concurrent executions peak at around 600 or lower when we're throttled. See this CloudWatch chart as an example:

Has anyone experienced this before? Why do you think this is happening? What can we do about it?

More Information

This chart is across all Lambdas for our entire region.
When throttling occurs, it affects all of our Lambda functions.
We primarily trigger Lambdas via API Gateway, but a few are triggered via SNS (at a fairly high data rate).
We have CloudFront in front of all APIs, and on some of them we use a 5-second cache time (for the most frequently requested APIs; it saves us $$$).

Additionally, here's an image that also shows total invocation count and average duration over the same time period. It's hard to know what's causal: duration could be up because of throttling, or vice versa, since some of our Lambdas call other Lambdas. Please check the appropriate axis for each series, because the scales are quite different.
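One way to reason about the causality question: concurrency is roughly invocation rate multiplied by average duration (Little's law), so longer durations raise concurrency even when the invocation count stays flat. A Lambda that synchronously waits on another Lambda also holds its own concurrency slot while the callee holds a second one. The sketch below uses hypothetical numbers, not figures from the charts in this question.

```python
import math


def estimated_concurrency(invocations: int, period_seconds: float,
                          avg_duration_ms: float) -> float:
    """Little's law estimate: average concurrency = arrival rate * time in system."""
    rate_per_second = invocations / period_seconds
    return rate_per_second * (avg_duration_ms / 1000.0)


def total_concurrency(functions) -> float:
    """Sum the estimates for several functions. A caller waiting on a callee
    counts in both entries, because both occupy a slot at the same time."""
    return sum(estimated_concurrency(*f) for f in functions)


if __name__ == "__main__":
    # Hypothetical: 60,000 invocations per minute at 300 ms each.
    print(estimated_concurrency(60_000, 60, 300))
    # Same invocation count, but duration doubles: concurrency doubles too.
    print(estimated_concurrency(60_000, 60, 600))
    # Hypothetical caller (500 ms, includes waiting) plus the callee it invokes (300 ms).
    print(total_concurrency([(60_000, 60, 500), (60_000, 60, 300)]))
```

Under these assumptions, a latency increase in a downstream Lambda can push the whole region toward the limit without any change in traffic, which is one reason the arrow of causality is hard to read from the charts alone.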

Swaney asked 21/6, 2018 at 4:5 (5 comments)

Details about each of the metrics can be found here: docs.aws.amazon.com/lambda/latest/dg/… – Swaney 21/6, 2018 at 4:15

Is the retry option enabled for your Lambda function? – Coussoule 21/6, 2018 at 7:6

I don't think API Gateway retries invoking the Lambdas; it just returns an error code to the client. – Swaney 21/6, 2018 at 7:43

One thought I had is that maybe it's just a CloudWatch visualisation issue? If the concurrency count spikes to 2000 for a few seconds, Lambda will throttle, but perhaps the spike isn't sustained long enough to be reported. – Swaney 25/6, 2018 at 5:56

We still don't have a solution for this, but our next line of investigation is that maybe CloudWatch is misleading us about the peak Lambda concurrency. Trying to get an answer out of AWS support on that one. – Swaney 27/6, 2018 at 3:7
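The visualisation hypothesis in the comments above can be illustrated with hypothetical numbers. If a chart plots the Average statistic over one-minute periods, a few seconds at the limit are diluted by the rest of the minute, while the Maximum statistic keeps the spike. This sketch only models that averaging; it assumes per-second samples exist and does not describe how CloudWatch actually samples Lambda's concurrency metric.

```python
def average(window):
    """Arithmetic mean of one window of samples."""
    return sum(window) / len(window)


def per_minute(samples, stat):
    """Collapse per-second samples into one value per 60-second window."""
    return [stat(samples[i:i + 60]) for i in range(0, len(samples), 60)]


if __name__ == "__main__":
    # Hypothetical minute: 55 seconds at 500 concurrent, then 5 seconds at a 2000 limit.
    samples = [500] * 55 + [2000] * 5
    print(per_minute(samples, average))  # [625.0]
    print(per_minute(samples, max))      # [2000]
```

With these made-up numbers, a minute that hit the limit for five seconds averages out to 625, which is why it is worth checking which statistic and period a concurrency chart is using before concluding the limit was never reached.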

I think this has to do with Lambda concurrency burst limits.

Basically, there's a limit on how many instances of your Lambda function you can run concurrently under sudden load and this limit is different to the overall per-region Lambda concurrency limit.

You can find more information about it here:

https://docs.aws.amazon.com/lambda/latest/dg/scaling.html

The relevant part:

AWS Lambda dynamically scales function execution in response to increased traffic, up to your concurrency limit. Under sustained load, your function's concurrency bursts to an initial level between 500 and 3000 concurrent executions that varies per region. After the initial burst, the function's capacity increases by an additional 500 concurrent executions each minute until either the load is accommodated, or the total concurrency of all functions in the region hits the limit.
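A simplified model of the scaling rule quoted above shows how throttling can happen well below a 2000 limit. It assumes the initial burst is 500 (the low end of the quoted range; the real value depends on the region), that the +500-per-minute ramp starts at the moment of the spike, and that demand jumps instantly. All function names are illustrative, not an AWS API.

```python
def burst_capacity(minutes_since_spike: int, initial_burst: int,
                   account_limit: int, step_per_minute: int = 500) -> int:
    """Concurrency available N whole minutes after a sudden spike (simplified model)."""
    if minutes_since_spike < 0:
        raise ValueError("minutes_since_spike must be non-negative")
    ramped = initial_burst + step_per_minute * minutes_since_spike
    # Scaling never goes past the region-wide concurrency limit.
    return min(ramped, account_limit)


def throttled_requests(demand: int, minutes_since_spike: int,
                       initial_burst: int, account_limit: int) -> int:
    """Concurrent requests that exceed the capacity available at that minute."""
    capacity = burst_capacity(minutes_since_spike, initial_burst, account_limit)
    return max(0, demand - capacity)


if __name__ == "__main__":
    # Hypothetical spike: demand jumps straight to 1200 concurrent executions.
    for minute in range(4):
        cap = burst_capacity(minute, initial_burst=500, account_limit=2000)
        dropped = throttled_requests(1200, minute, initial_burst=500, account_limit=2000)
        print(minute, cap, dropped)
```

Running it prints capacities of 500, 1000, 1500 and 2000 for minutes 0 to 3, with 700 and then 200 requests over capacity in the first two minutes. In this model, raising the regional limit does nothing for the first minutes of a sudden spike; only the burst level and ramp rate matter there.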

Pneumococcus answered 26/10, 2018 at 22:42

This seems very familiar. We had the exact same issue and were baffled, because our concurrency limit had been increased. Unfortunately, raising that limit isn't a magic fix for infinitely scaling serverless apps.

My guess is that you're running out of ENIs (Elastic Network Interfaces), as each Lambda function configured to run in a VPC needs one before it can initialize. The default limit for this is 350 concurrently attached ENIs.

Your figure of 600 concurrent Lambdas is aggregated per minute, so I imagine enough of them overlap within a minute to need more than 350 ENIs at once.

To investigate this, go into the global settings for API Gateway and provide an IAM role ARN that has permission to write logs to CloudWatch (for example, a role with the AWS managed policy AmazonAPIGatewayPushToCloudWatchLogs). Then go into the individual API Gateway API and enable verbose logging for its stage.

Any errors that occur when API Gateway tries to invoke a Lambda function should then show up in these logs instead of being hidden, as they are by default.
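The two console steps above could also be scripted. The sketch below assumes boto3 is installed and credentials are configured; the role ARN, API ID and stage name you pass in are your own values, and the helper names are hypothetical. It sets the account-level CloudWatch role, then sets INFO-level logging with full request/response data for every method in one stage.

```python
def account_patch(role_arn: str) -> list:
    # Account-level setting: the IAM role API Gateway uses to write to CloudWatch Logs.
    return [{"op": "replace", "path": "/cloudwatchRoleArn", "value": role_arn}]


def stage_logging_patch(level: str = "INFO", data_trace: bool = True) -> list:
    # "/*/*" applies the method setting to every resource and method in the stage.
    return [
        {"op": "replace", "path": "/*/*/logging/loglevel", "value": level},
        {"op": "replace", "path": "/*/*/logging/dataTrace",
         "value": str(data_trace).lower()},
    ]


def enable_logging(role_arn: str, rest_api_id: str, stage_name: str) -> None:
    # Imported lazily so the patch builders above work without boto3 installed.
    import boto3

    client = boto3.client("apigateway")
    client.update_account(patchOperations=account_patch(role_arn))
    client.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=stage_logging_patch(),
    )
```

Usage would be a single call such as enable_logging(your_role_arn, your_api_id, "prod"), where "prod" stands in for whatever stage name you deploy to. Full data tracing logs request and response bodies, so it may be worth switching it off again once the investigation is done.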

Look for an error similar to this:

{
    "Message": "Lambda was not able to create an ENI in the VPC of the Lambda function because the limit for Network Interfaces has been reached.",
    "Type": "User"
}

If you see this error, you'll need to request an increase to your ENI limit.
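Once logging is on, a quick way to tell the two failure modes discussed in this thread apart is to count log messages by cause. This is a minimal sketch that assumes the messages contain either the ENI error text quoted above or the "ConcurrentInvocationLimitExceeded" string the asker reported finding; matching on substrings is an assumption, since the surrounding log format may differ. The sample messages are hypothetical.

```python
import json
from collections import Counter

# Substrings for the two failure modes discussed in this thread.
ENI_LIMIT_TEXT = "limit for Network Interfaces has been reached"
CONCURRENCY_LIMIT_TEXT = "ConcurrentInvocationLimitExceeded"


def classify(log_message: str) -> str:
    """Label one API Gateway execution-log message by failure cause."""
    if ENI_LIMIT_TEXT in log_message:
        return "eni_limit"
    if CONCURRENCY_LIMIT_TEXT in log_message:
        return "concurrency_limit"
    return "other"


def summarize(log_messages) -> Counter:
    """Count how often each failure cause appears."""
    return Counter(classify(message) for message in log_messages)


if __name__ == "__main__":
    # Hypothetical messages; the first reuses the error body quoted above.
    eni_error = json.dumps({
        "Message": "Lambda was not able to create an ENI in the VPC of the Lambda "
                   "function because the limit for Network Interfaces has been reached.",
        "Type": "User",
    })
    messages = [eni_error, "... ConcurrentInvocationLimitExceeded ...", "unrelated line"]
    print(summarize(messages))
```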

Photoemission answered 21/6, 2018 at 13:28 (3 comments)

Thanks @Tom, we'll investigate this week and let you know if this is our issue! – Swaney 24/6, 2018 at 23:30

We turned on all the logging, and unfortunately it wasn't about ENIs for us. The message in the logs contained "ConcurrentInvocationLimitExceeded", so it's definitely Lambda throttling. – Swaney 27/6, 2018 at 3:6

Glad to hear you got a step closer to resolving it. Hopefully another limit increase resolves your issue. – Photoemission 27/6, 2018 at 8:54