I believe I found a bug in the Azure Web Gateways.
I see 2 ways to get the 504 error:
1. Your web page request takes longer than the configured request timeout on the back end.
2. What I believe to be a bug, where the web gateway reuses Windows sockets too quickly.
For #2, you can run the following against the web gateway's monitoring logs:
AzureDiagnostics
| where timeTaken_d > 55 and httpStatus_d in (504)
| take 400
If you see a lot of requests returning 504 at right around 60 seconds, it is probably this issue.
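To make the "right around 60 seconds" check concrete, here is a minimal Python sketch that counts how many 504 responses cluster near the 60 second mark. It assumes you exported the query results into a list of dicts keyed by the `timeTaken_d` and `httpStatus_d` columns. The function name, the sample rows and the 2 second tolerance are my own illustrative assumptions, not anything the gateway defines.

```python
def count_504_near_timeout(rows, timeout_s=60.0, tolerance_s=2.0):
    """Return (near, total): 504 rows within tolerance of the timeout, and all 504 rows."""
    total = 0
    near = 0
    for row in rows:
        if int(row["httpStatus_d"]) != 504:
            continue  # only 504 responses are of interest here
        total += 1
        if abs(float(row["timeTaken_d"]) - timeout_s) <= tolerance_s:
            near += 1
    return near, total


# Hypothetical sample rows, shaped like the AzureDiagnostics columns used above.
sample_rows = [
    {"timeTaken_d": 60.01, "httpStatus_d": 504},
    {"timeTaken_d": 59.98, "httpStatus_d": 504},
    {"timeTaken_d": 120.4, "httpStatus_d": 504},
    {"timeTaken_d": 0.12, "httpStatus_d": 200},
]

near, total = count_504_near_timeout(sample_rows)
print(f"{near} of {total} 504 responses are within 2 s of 60 s")
```

If most of your 504s land in the "near" bucket, that points at the fixed 60 second give-up rather than a slow page.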
I ran a Wireshark trace on port 443 (we use SSL; use port 80 if you do not) for the backend pool inbound IP addresses, then ran the following display filter against the capture:
tcp.analysis.retransmission and tcp.flags.syn == 1
Syn Retransmissions
You will see a bunch of retransmissions for the same Windows socket / TCP stream. Take one of the ones that retransmitted at least 2-3 times and run this display filter against the client side port:
tcp.port == <client side port>
Port Reuse
You will see a conversation ending with a FIN and ACKs, or maybe a RST and ACK. Either way, that conversation ended.
At this point the web server's socket goes into the TIME_WAIT state, usually for 60-240 seconds depending on the OS. This is usually 2 times the maximum segment lifetime (MSL).
However, the web gateway is trying to reuse that port in less than 30 seconds; in my specific example I have seen as low as 22 seconds. It probably does not wait at all. The web server ignores the SYN packets because it is still waiting to see if anything arrives from the previous conversation. This is part of the standard, and the gateway is ignoring it.
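The check I did by eye in the capture can be sketched in a few lines of Python: for each client side port, compare when the previous conversation ended with when the next SYN showed up, and flag anything that falls inside the server's TIME_WAIT window. The `PortReuse` record, the timestamps and the 60 second window below are hypothetical values for illustration; you would fill them in from your own capture.

```python
from dataclasses import dataclass


@dataclass
class PortReuse:
    client_port: int
    closed_at: float    # seconds, when the previous conversation ended (FIN seen)
    next_syn_at: float  # seconds, when the first SYN of the next conversation was sent


def reused_inside_time_wait(events, time_wait_s=60.0):
    """Return the events whose new SYN arrived before TIME_WAIT would have expired."""
    return [e for e in events if (e.next_syn_at - e.closed_at) < time_wait_s]


# Hypothetical timestamps read off a capture; the 22 s gap matches the one described above.
events = [
    PortReuse(client_port=50123, closed_at=100.0, next_syn_at=122.0),
    PortReuse(client_port=50200, closed_at=100.0, next_syn_at=400.0),
]

for e in reused_inside_time_wait(events):
    gap = e.next_syn_at - e.closed_at
    print(f"port {e.client_port} reused after {gap:.0f} s, inside TIME_WAIT")
```

Any port this flags is one where the server would be expected to ignore the new SYN until its TIME_WAIT timer runs out.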
I could understand it if the gateway were having issues with port exhaustion. However, I ran the following:
netstat -ano | findstr /i | find "TCP" /c
I get around 300 connections. There should be tens of thousands of ports available, and it is trying to reuse a previous port within 20 or so seconds? That does not follow any standard, not even their own Windows standard.
Finally, the gateway keeps retrying the SYN, gives up at an unconfigurable 60 seconds, and sends the 504 error.
You can decrease the TIME_WAIT delay to a minimum of 30 seconds in the registry of the IIS web server and reboot.
https://learn.microsoft.com/en-us/biztalk/technical-guides/settings-that-can-be-modified-to-improve-network-performance
This might get rid of the 504 errors, but you will still have slowness if the gateway tries to reach out before the 30 seconds are up. For example, say the connection ends with a FIN at time 0 seconds. At time 20 seconds the gateway begins its reuse. You will get retries at around:
some milliseconds
1 second
3 seconds
7 seconds
For a total of around 11 seconds and some change. Your user is waiting during that time for the web request to come back. If the call normally takes 100 ms, it just took 11 seconds plus those 100 ms.
If the gateway tries to use that port at 1-2 seconds, the next retry would be at around 15 seconds. That would give you the 504 with the default request timeout of 20 seconds or, if you have that set higher, would extend your call by about 22 seconds.
Hopefully we will find a way to fix this. I have a case open on it right now.
Update for issue #2:
We found that this occurs when the IIS server ends the connection with a FIN. On a busy system the gateway can reuse the socket immediately, ignoring the default TIME_WAIT delay on the server. There are no configurations on the gateways that will correct this issue. We worked around it by setting TcpTimedWaitDelay to 2 seconds, which can be done on Windows 2012 R2 and above. I have seen articles for Windows 2022 that say you may have to set StrictTimeWaitSeqCheck as well, but we have not done that yet on these Windows 2019 servers, and we have had successful outcomes and no more retransmissions of the SYN packet when that socket is reused.
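For reference, here is a small Python sketch that builds the `reg add` command for the workaround. It assumes the usual location of the value, `HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters`, with TcpTimedWaitDelay stored as a DWORD in seconds. The helper name is made up for this example, and the sketch only prints the command rather than changing anything.

```python
REG_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def build_time_wait_command(seconds=2):
    """Build the `reg add` argument list that sets TcpTimedWaitDelay (a DWORD, in seconds)."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive integer")
    return [
        "reg", "add", REG_KEY,
        "/v", "TcpTimedWaitDelay",
        "/t", "REG_DWORD",
        "/d", str(seconds),
        "/f",  # overwrite an existing value without prompting
    ]


# Print the command so it can be reviewed before anyone runs it.
print(" ".join(build_time_wait_command(2)))
```

On the IIS server you would run the printed command from an elevated prompt (or pass the list to `subprocess.run(..., check=True)`) and then reboot, as described above.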