Azure Application Gateway hitting 504 Gateway timeout randomly while doing JMeter load test
T

3

9

I have one Application Gateway with two backends (Azure VMs), each hosting an ASP.NET Core REST API on IIS. Both communicate over port 80.

Everything works fine in manual testing, but when we use JMeter to run a load test of POST requests with 2500 threads, some of the requests get "504 Gateway Timeout" as the response.

I ran exactly the same load test directly against the backends and did not receive any bad responses.

Have I misconfigured something on my Application Gateway?

[Image: Configurations]

[Image: HTTP Settings]

[Image: Probes]

Tupiguarani answered 3/7, 2021 at 3:46 Comment(1)
Do you have Premium SKU resources with full isolation? (Ciliata)
P
5

By default, Azure Application Gateway returns a 504 error when a request takes longer than 20 seconds. In my view, these random 504 errors are explained by the system being overloaded. Possible solutions are to increase that timeout, to increase the performance of the backend, or to send fewer requests in parallel.

Polytheism answered 9/8, 2022 at 7:39 Comment(1)
Increasing the time worked for me. (Ergot)
C
1

I believe I found a bug in the Azure web gateway.

I see two ways to get the 504 error:

  1. Your web page request takes longer than the request timeout configured for the back end.
  2. What I believe to be a bug, where the web gateway reuses sockets too quickly.

For #2, you can run the following query against the web gateway's monitoring logs:

AzureDiagnostics | where timeTaken_d > 55 and httpStatus_d in (504) | take 400

If you see a lot of requests throwing 504 at right around 60 seconds, it is probably this issue.
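To make the distinction between the two kinds of 504 concrete, here is a minimal Python sketch that buckets exported log rows the same way the query does. The column names `httpStatus_d` and `timeTaken_d` come from the query above; the function name, the 55-second cutoff as a parameter, and the sample rows are hypothetical and only for illustration.

```python
from collections import Counter

def classify_504(rows, near_60s_cutoff=55.0):
    """Count 504 rows that took longer than the cutoff versus those that did not."""
    counts = Counter()
    for row in rows:
        if row.get("httpStatus_d") != 504:
            continue  # only 504 responses are of interest here
        took = row.get("timeTaken_d", 0.0)
        bucket = "near_60s" if took > near_60s_cutoff else "earlier"
        counts[bucket] += 1
    return counts

# Hypothetical sample rows, not real log data.
sample = [
    {"httpStatus_d": 504, "timeTaken_d": 60.0},
    {"httpStatus_d": 504, "timeTaken_d": 20.1},
    {"httpStatus_d": 200, "timeTaken_d": 0.1},
]
print(classify_504(sample))
```

A large "near_60s" count would point toward issue #2, while "earlier" 504s are more likely plain request timeouts (issue #1).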

I ran a Wireshark trace on port 443 (we use SSL; it would be port 80 for you) for the backend pool's inbound IP addresses, and then ran the following display filter against the capture:

tcp.analysis.retransmission and tcp.flags.syn == 1

[Image: SYN retransmissions]

You will see a bunch of retransmissions for the same socket / TCP stream. Take one that was retransmitted at least 2-3 times and run this display filter against its client-side port:

tcp.port == <client side port>

[Image: Port reuse]

You will see a conversation ending with a FIN and ACKs, or maybe a RST and ACK. Either way, that conversation ended.

At this point the web server's socket goes into the TIME_WAIT state, usually for 60-240 seconds depending on the OS. This is usually twice the maximum segment lifetime (MSL).

However, the web gateway tries to reuse that port in under 30 seconds; in my specific example I have seen it as low as 22 seconds. It probably does not wait at all. The web server ignores the SYN packets because it is still waiting to see whether anything arrives from the previous conversation. This is part of the standard, and the gateway is ignoring it.

I could understand this if the gateway were having issues with port exhaustion. However, when I run the following:

netstat -ano | findstr /i | find "TCP" /c

I get around 300 connections. There should be tens of thousands of ports available, and yet it tries to reuse a previous port within 20 or so seconds? That does not follow any standard, not even their own Windows standard.

Finally, it keeps retrying the SYN until the web gateway gives up at an unconfigurable 60 seconds and sends the 504 error.

You can decrease the time wait delay to a minimum of 30 seconds in the registry of the IIS web server and then reboot.

https://learn.microsoft.com/en-us/biztalk/technical-guides/settings-that-can-be-modified-to-improve-network-performance

This might get rid of the 504 errors, but you will still see slowness if the gateway tries to reach out before the 30 seconds are up. For example, say the connection ends with a FIN at time 0 seconds, and at time 20 seconds the gateway begins its reuse. You will get retries at around: some milliseconds, 1 second, 3 seconds, 7 seconds.

That is a total of around 11 seconds and some change. Your user is waiting during that time for the web request to come back. If the call normally takes 100 ms, it just took 11 seconds plus that 100 ms.

If the gateway tries to reuse that port at 1-2 seconds, the next retry would be at around 15 seconds, which would give you the 504 with the default request timeout of 20 seconds or, if you have that set higher, would extend your call by about 22 seconds.
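The arithmetic of the first scenario above (FIN at 0 seconds, reuse at 20 seconds, 30-second time wait) can be sketched in a few lines of Python. This is a simplified model, not a description of how the gateway is implemented: it assumes the server silently drops every SYN until the time wait expires, and it reads the retry figures quoted above as the waits between successive SYNs, with "some milliseconds" rounded to 0. The function and parameter names are hypothetical.

```python
def added_delay(reuse_at_s, time_wait_s, gaps_s):
    """Return the extra seconds spent before a SYN lands after TIME_WAIT ends.

    reuse_at_s  -- when the first SYN on the reused port is sent (FIN is at 0)
    time_wait_s -- how long the server keeps the old socket in TIME_WAIT
    gaps_s      -- assumed waits between successive SYN attempts
    Returns None if every modelled attempt falls inside TIME_WAIT.
    """
    attempt = reuse_at_s
    if attempt >= time_wait_s:
        return 0.0  # the first SYN is already past TIME_WAIT
    for gap in gaps_s:
        attempt += gap
        if attempt >= time_wait_s:
            return attempt - reuse_at_s
    return None

# Waits between attempts as read from the answer; an assumption of this sketch.
GAPS = [0.0, 1.0, 3.0, 7.0]

print(added_delay(20, 30, GAPS))  # 11.0, matching the "around 11 seconds" above
print(added_delay(20, 2, GAPS))   # 0.0, a short time wait removes the delay
```

Under these assumptions the SYN attempts go out at 20, 20, 21, 24 and 31 seconds, so only the last one is accepted.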

Hopefully we will find a way to fix this. I have a case open on it right now.

Update for issue #2: We found that this occurs when the IIS server ends the connection with a FIN. On a busy system the gateway can reuse the socket immediately, ignoring the default time wait delay on the server. There is no configuration on the gateway that corrects this. We worked around it by setting TcpTimedWaitDelay to 2 seconds, which can be done on Windows 2012 R2 and above. I have seen articles for Windows 2022 that say you may have to set StrictTimeWaitSeqCheck as well, but we have not done that yet on these Windows 2019 servers, and we have had successful outcomes with no more SYN retransmissions when the socket is reused.
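As a hedged illustration of the workaround, the Python sketch below only builds the `reg add` command line for the TcpTimedWaitDelay value; it does not run it. The registry key shown is the standard TCP/IP parameters key on Windows, the helper name is hypothetical, and the value of 2 seconds is the one reported above for Windows 2012 R2 and later. Check the accepted range for your Windows version before applying it, and reboot afterwards as described earlier.

```python
# Standard location of the Windows TCP/IP parameters.
REG_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def time_wait_reg_command(seconds):
    """Build (but do not execute) a reg.exe command that sets TcpTimedWaitDelay."""
    if seconds < 1:
        raise ValueError("seconds must be a positive integer")
    return [
        "reg", "add", REG_KEY,
        "/v", "TcpTimedWaitDelay",   # value name
        "/t", "REG_DWORD",           # value type
        "/d", str(int(seconds)),     # delay in seconds
        "/f",                        # overwrite without prompting
    ]

print(" ".join(time_wait_reg_command(2)))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; on the IIS server it would be run from an elevated prompt.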

Ciapas answered 10/1 at 14:23 Comment(0)
G
0

I believe you will need to contact Azure support to find out what error log is generated when the load goes beyond a certain point.

Gagarin answered 5/7, 2021 at 18:8 Comment(0)
