AWS Elastic Load Balancing: Seeing extremely long initial connection time
B

11

28

For the past couple of days, we have often seen an extremely long initial connection time (15 s to 1.3 minutes) to our ELBs when making any request via SSL. Oddly, I was only able to observe this in Google Chrome (not in Safari, Firefox, or curl).

It does not occur on every request, but on around 50% of them. It occurs with the first request (the OPTIONS call).

Our setup is the following: a cross-zone ELB that connects to a node.js backend (currently in 2 AZs in eu-west-1). All instances are healthy, and once the request comes through, it is processed normally. Currently, there is basically no load on the system. CloudWatch for the ELB does not report any backend connection errors, surge queue length (value 0), or spillover count. The ELB metrics show low latency (< 100 ms). We have Route53 configured to route to the ELB (we don't see any DNS trouble, see attached screenshot).

We have different REST APIs that all have this setup. The problem occurs on all of the ELBs (each of them connects to an independent node.js backend). All of these ELBs are set up the same way via our CloudFormation template.

The ELBs also do our SSL-termination.

What could lead to such a behavior? Is it possible that the ELBs are not configured properly? And why could it only appear on Google Chrome?

[Screenshot: request timing]

Bing answered 20/2, 2016 at 12:43 Comment(4)
You should install Wireshark on the machine with the browser and try to identify at what point in the TCP handshake the latency appears. This seems very unusual.Ramillies
@gboda good find, pity it has no answers, either. Maybe we have another one here somewhere that does.Ramillies
Weird, here's probably another one also unanswered. Strange Chrome + ELB interaction?Ramillies
I just posted the same issue here, but for an ALB rather than an ELB. We found a solution, and interestingly, all the symptoms were exactly the same as in this question.Reference
S
40

I think this is possibly an ELB misconfiguration. I had the same problem when I attached private subnets to the ELB, and fixed it by replacing the private subnets with public ones. See https://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-manage-subnets.html

Scheers answered 25/2, 2016 at 17:45 Comment(1)
For public facing ELBs, select only public subnets. For private facing ELBs, select only private subnets.Palmy
O
13

Just to follow up on @Nikita Ogurtsov's excellent answer; I had the same problem except that it was just one of my subnets that happened to be private and the rest public.

Even if you think your subnets are public, I recommend you double-check the route tables to ensure that they all have a route to an Internet Gateway.

You can use a single Route Table that has a Gateway for all your LB subnets if this makes sense for your setup.

VPC/Subnets/(select subnet)/Route Table/Edit
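
The same check can be scripted. As a rough sketch: the function below takes one route table in the shape returned by the EC2 DescribeRouteTables API (for example via boto3's `describe_route_tables`) and reports whether it has an active default route through an Internet Gateway. The function name and the sample tables are made up for illustration, and fetching the real tables is left out.

```python
from typing import Any, Dict


def has_internet_gateway_route(route_table: Dict[str, Any]) -> bool:
    """Return True if the table has an active default route via an Internet Gateway."""
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        via_igw = (route.get("GatewayId") or "").startswith("igw-")
        # A "blackhole" route points at a deleted target, so it does not count.
        is_active = route.get("State", "active") == "active"
        if is_default and via_igw and is_active:
            return True
    return False


# Hypothetical sample data, not real resource IDs.
public_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123456789abcdef0", "State": "active"},
]}
private_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0", "State": "active"},
]}

print(has_internet_gateway_route(public_table))
print(has_internet_gateway_route(private_table))
```

A subnet whose table only has a default route through a NAT gateway, like the second sample, is private from the load balancer's point of view.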

Oswaldooswalt answered 8/5, 2016 at 6:9 Comment(1)
In my case, the ACL of one of the subnets was configured to deny all traffic.Fickle
K
3

For me the issue was that I had an unused "Availability Zone" in my Classic Load Balancer. Once I removed the unhealthy and unused Availability Zone the consistent 20 or 21 second delay in "Initial Connection" dropped to under 50ms.

Note: You may need to give it time to update. I had my DNS TTL set to 60 seconds so I would see the fix within a minute of removing the unused Availability Zone.

Kith answered 20/4, 2019 at 20:28 Comment(0)
C
1

This can be a problem with Amazon's ELB. The ELB scales its number of instances with the number of requests, so you should see a peak in requests at those times. Amazon adds instances to handle the load. The new instances are already reachable while they are still launching, so your clients get those timeouts. It is completely random, so you should:

  • ping the ELB to find all the IPs in use

  • run mtr against every IP found

  • keep an eye on CloudWatch

  • look for clues

Cretaceous answered 4/4, 2016 at 8:53 Comment(0)
C
1

Solution: if your DNS is configured to point directly at the ELB, you should reduce the TTL of the (IP, DNS) association. The ELB's IPs can change at any time, so a long TTL can seriously disrupt your traffic.

Clients keep some of the ELB's IPs in cache, which can cause this kind of trouble.

Scaling Elastic Load Balancers

Once you create an elastic load balancer, you must configure it to accept incoming traffic and route requests to your EC2 instances. These configuration parameters are stored by the controller, and the controller ensures that all of the load balancers are operating with the correct configuration. The controller will also monitor the load balancers and manage the capacity that is used to handle the client requests. It increases capacity by utilizing either larger resources (resources with higher performance characteristics) or more individual resources. The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS. The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds. By default, Elastic Load Balancing will return multiple IP addresses when clients perform a DNS resolution, with the records being randomly ordered on each DNS resolution request. As the traffic profile changes, the controller service will scale the load balancers to handle more requests, scaling equally in all Availability Zones.

Best Practices ELB on AWS
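
To illustrate the re-lookup expectation from the quoted passage: a client that caches the load balancer's addresses should throw them away once the TTL has passed. A minimal sketch follows. The `TtlResolver` class, `lookup_ipv4`, and their parameters are hypothetical names for illustration; the 60-second default is the value from the quote, and the system resolver does not expose the record's real TTL, so it is simply assumed here.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def lookup_ipv4(host: str) -> List[str]:
    """Resolve host to its current IPv4 addresses via the system resolver."""
    infos = socket.getaddrinfo(host, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class TtlResolver:
    """Cache DNS answers, but never for longer than ttl seconds."""

    def __init__(self, ttl: float = 60.0,
                 lookup: Callable[[str], List[str]] = lookup_ipv4,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._lookup = lookup
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, host: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Entry is missing or expired: look it up again so that addresses
        # the load balancer has stopped using are not reused.
        ips = self._lookup(host)
        self._cache[host] = (now, ips)
        return ips
```

The lookup function and clock are injectable only so the expiry logic can be exercised without real DNS or real waiting.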

Cretaceous answered 17/2, 2017 at 14:41 Comment(3)
You can't set up the TTL in Route53 if the entry is an ELB alias.Fickle
Yes, but I didn't talk about Route53. Of course, Amazon preconfigured the DNS for its own ELB otherwise you'll have the error previously presented.Cretaceous
I've recently solved this problem by setting TTL of the alias record to 60 seconds.Castleman
G
1

An ALB needs at least 2 Availability Zones. If you use a private/public/NAT VPC setup, all public subnets must have a connection to the Internet.

Gerfen answered 13/2, 2021 at 15:47 Comment(0)
R
0

Check the security group too. That was the issue in my case.

Retroactive answered 10/4, 2019 at 21:6 Comment(2)
Can you elaborate?Smelly
@MathiasLykkegaardLorenzen A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. So make sure it is configured properly. Unfortunately I don't remember details.Retroactive
S
0

For me the issue was that the ALB was pointing to an Nginx instance, which had a misconfigured DNS resolver. This meant that Nginx tried to use the resolver, timed out, and then actually started working a bit later.

This is not really related to the load balancer itself, but maybe it helps someone figure out the issue in their own setup.

Stria answered 18/10, 2019 at 5:30 Comment(0)
M
0

I see a similar problem in my Chrome logs (1.3 min lag). It happens on an OPTIONS request, and in Wireshark I don't even see the request leaving the PC in the first place. Any suggestions as to what Chrome might be doing are welcome.

Monmouth answered 30/5, 2022 at 20:31 Comment(0)
G
0

Thanks to @Nikita Ogurtsov for the answer!

In my case, I am using Traefik, cert-manager, ExternalDNS and the DNS01 challenge. I want this load balancer to use only private subnets.

service.beta.kubernetes.io/aws-load-balancer-internal: "true" by itself is not enough in our case: we have 8 private subnets, and 4 of them are reserved for a special Amazon RDS configuration.

In my case, I need to set service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx to select the normal private subnets. Without this, it randomly selects 4 of the 8 subnets, which sometimes causes the page not to load.

Here is my final custom Traefik values.yaml:

service:
  enabled: true
  type: LoadBalancer
  annotations:
    # https://cloud-provider-aws.sigs.k8s.io/service_controller
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx
Gnaw answered 1/7 at 5:51 Comment(0)
E
-1

We recently encountered Chrome taking 1.3 minutes to load pages, but the cause was slightly different. Just posting it here in case it helps someone.

1.3 minutes seems to be how long Chrome will wait when trying to connect to a specific IP. Our domain name has multiple IP addresses in the A record (similar to a CNAME setup), and one of those IPs belonged to a server that had crashed. So sometimes the browser would connect quickly because it used a valid IP, and sometimes we would get the long wait as it tried to connect to the invalid IP, timed out, and then retried with a valid IP.

So it is worth checking that all the IPs listed when you dig your domain are actually reachable.
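
One way to run that check without a browser is to resolve the name and try a TCP connection to each address in turn. The sketch below uses only Python's standard `socket` module; the function names, the default port 443, and the 3-second timeout are my own choices for illustration, and it only looks at IPv4 addresses.

```python
import socket
from typing import Dict, List


def resolve_all(host: str, port: int) -> List[str]:
    """Return every distinct IPv4 address the resolver reports for host."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP handshake with one specific IP; False on timeout or refusal."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_host(host: str, port: int = 443, timeout: float = 3.0) -> Dict[str, bool]:
    """Map each resolved IP to whether it accepted a TCP connection."""
    return {ip: can_connect(ip, port, timeout) for ip in resolve_all(host, port)}
```

Calling `check_host` with your own domain returns a dict of IP to True/False; any False entry is a candidate for the address the browser occasionally hangs on.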

Ellerd answered 24/8, 2022 at 12:22 Comment(0)
