nginx reverse proxy not detecting dropped load balancer

Asked 11/11, 2019 at 5:29 Answered 25/11, 2019 at 14:36

We have the following config for our reverse proxy:

location ~ ^/stuff/([^/]*)/stuff(.*)$ {
    set $sometoken $1;
    set $some_detokener "foo";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Authorization "Basic $do_token_decoding";
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_redirect https://place/ https://place_with_token/$1/;
    proxy_redirect http://place/ http://place_with_token/$1/;
    resolver 10.0.0.2 valid=10s;
    set $backend https://real_storage$2;
    proxy_pass $backend;
}

Now, all of this works until real_storage rotates a server. For example, say real_storage comes from foo.com, a load balancer that directs traffic to two servers: 1.1.1.1 and 1.1.1.2. Then 1.1.1.1 is removed and replaced with 1.1.1.3. However, nginx continues to try 1.1.1.1, resulting in:

epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream, client: ..., server: ..., request: "GET ... HTTP/1.1", upstream: "https://1.1.1.1:443/...", host: "..."

Note that the upstream is the old server, shown by a previous log:

[debug] 1888#1888: *570837 connect to 1.1.1.1:443, fd:60 #570841

Is this something misconfigured on our side, or on the host of our real_storage?

*The best I could find that sounds even close to my issue is https://mailman.nginx.org/pipermail/nginx/2013-March/038119.html ...

Further Details

We added proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504; and it still failed. I am now beginning to suspect that, since two ELBs are involved (ours and theirs), the resolver we are using is the problem: it is Amazon-specific (per https://serverfault.com/a/929517/443939), and Amazon still sees it as valid, but it won't resolve externally (our server trying to hit theirs).

I have removed the resolver altogether from one configuration and will see where that goes. We have not been able to reproduce this using internal servers, so we must rely on waiting for the third party servers to cycle (about once per week).

I'm a bit uncertain about this resolver being the issue only because a restart of nginx will solve the problem and get the latest IP pair :/

Is it possible that I have to set the DNS variable without the https?

    set $backend real_storage$2;
    proxy_pass https://$backend;

I know that you have to use a variable or else the re-resolve won't happen, but maybe it matters which part of the URL goes into the variable. I have only ever seen it set up as above in my searches, and no reason was ever given. I'll set that up on a second server and see what happens.

And for my third server I am trying this comment and moving the set outside of the location. Of course, if anybody else has a concrete idea, I'm open to changing my testing for this go-round :D

set $rootbackend https://real_storage;
location ~ ^/stuff/([^/]*)/stuff(.*)$ {
    set $backend $rootbackend$2;
    proxy_pass $backend;
}

Note that I still have to set $backend inside the location, though, because it uses a dynamic variable ($2).

Incongruent answered 11/11, 2019 at 5:29 Comment(6)

Please show upstream configuration – Putative 11/11, 2019 at 10:54

@AlexC It is a third party system, so it is outside of our control. – Incongruent 12/11, 2019 at 6:56

How did you remove 1.1 and replace it with 1.3? Are there any health check servers in proxy_pass? – Nowt 19/11, 2019 at 4:33

It is being removed by Amazon as far as I can tell: https://mcmap.net/q/364783/-does-amazon-ec2-elastic-load-balancer-39-s-ip-ever-change I am unsure about the health check servers. The health of the domain is fine, but the underlying IP is what times out @ThanhNguyenVan – Incongruent 19/11, 2019 at 4:37

Are you using AWS ELB, or just using nginx for load balancing? I'm not sure about the link you gave me. – Nowt 19/11, 2019 at 4:43

We are using ELB to route to 3 EC2 servers which are using nginx to redirect traffic...and in this case redirect to a third party, which as far as I can tell is using an ELB @ThanhNguyenVan – Incongruent 19/11, 2019 at 4:45

As @cnst correctly noted, using a variable in proxy_pass makes nginx resolve the address of real_storage for every request, but there are further details:

Before version 1.1.9, nginx always cached DNS answers for 5 minutes.

Since version 1.1.9, nginx caches DNS answers for a duration equal to their TTL, and the default TTL of Amazon ELB is 60 seconds.

So it is entirely expected that nginx keeps using the old address for some time after a rotation. As per the documentation, the expiration time of the DNS cache can be overridden with the valid parameter:

resolver 127.0.0.1 [::1]:5353 valid=10s;

resolver 127.0.0.1 ipv6=off valid=10s;

Individually answered 25/11, 2019 at 14:36 Comment(0)

There's nothing special about using variables within proxy_pass (http://nginx.org/r/proxy_pass): any variable use makes nginx involve the resolver on each request, unless the name is found in a server group (perhaps you have a clash?). You can even get rid of $backend, since you're already using $2 in there.
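As a rough sketch of that simplification, assuming the location and resolver from the question and assuming no upstream block named real_storage is defined elsewhere in the configuration, the intermediate variable could be dropped like this:

```nginx
location ~ ^/stuff/([^/]*)/stuff(.*)$ {
    # Resolver address and cache override copied from the question.
    resolver 10.0.0.2 valid=10s;

    # $2 is itself a variable, so its presence in proxy_pass is enough
    # for the host name to be looked up at request time via the resolver.
    # Assumption: there is no "upstream real_storage { ... }" group,
    # which would take precedence over the resolver.
    proxy_pass https://real_storage$2;
}
```

The other proxy_set_header and proxy_redirect directives from the question are omitted here only to keep the sketch short.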

As to interpreting the error message — you have to figure out whether this happens because the existing connections get dropped, or whether it's because nginx is still trying to connect to the old addresses.
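One way to tell the two cases apart is to log which upstream address each request actually used. The following is only a sketch: the format name upstream_debug and the log path are hypothetical, and $upstream_connect_time requires nginx 1.9.1 or later.

```nginx
# Goes in the http context.
# "upstream_debug" is a hypothetical format name chosen for this example.
log_format upstream_debug '$remote_addr [$time_local] "$request" '
                          'status=$status upstream_addr=$upstream_addr '
                          'upstream_status=$upstream_status '
                          'connect_time=$upstream_connect_time '
                          'response_time=$upstream_response_time';

# Goes in the http, server, or location context; the path is an assumption.
access_log /var/log/nginx/upstream_debug.log upstream_debug;
```

If the old address keeps showing up in upstream_addr well after the DNS record has changed, nginx is still connecting to it; if only the new addresses appear, the errors more likely come from connections that were already open.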

You might also want to look into lowering the _timeout values within http://nginx.org/en/docs/http/ngx_http_proxy_module.html; they all appear to default to 60s, which may be too long for your use case.
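A minimal sketch of what lowering them might look like, placed in the location from the question. The specific values below are illustrative assumptions, not recommendations, and would need tuning against how the third-party backend actually behaves:

```nginx
location ~ ^/stuff/([^/]*)/stuff(.*)$ {
    # Existing directives from the question go here unchanged.

    # Give up on establishing a connection to a dead address sooner
    # (default is 60s).
    proxy_connect_timeout 5s;

    # Maximum time between two successive write or read operations
    # on the upstream connection (both default to 60s).
    proxy_send_timeout 15s;
    proxy_read_timeout 15s;
}
```

A shorter proxy_connect_timeout mainly helps when a removed server silently drops packets, because the attempt then fails before the client gives up.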

I'm not surprised that you're not able to reproduce this issue, because there doesn't seem to be anything wrong with your existing configuration; perhaps the problem manifested itself in an earlier revision?

Fathead answered 22/11, 2019 at 21:30 Comment(0)
