Debugging 504 Gateway Timeout and its actual cause and solution

We are running the following stack on our web server: Varnish + Nginx + FastCGI (php-fpm) on RHEL 6.6.

It's a dynamic website that returns different result sets on every request and has around 2 million URLs indexed by Google.

  • It's running on nginx/1.5.12 and PHP 5.3.3 (will be upgraded to the latest nginx and PHP soon)
  • Nginx connects to php-fpm running locally on the same server on port 9000

We are intermittently getting 504 Gateway Timeout on some pages and have been unable to resolve it. The URLs that return 504 work fine after some time. We only learn about the 504s from our logs, and we haven't been able to reproduce them, as they happen randomly on any URL, which then works again after some time.

I have had a couple of discussions with the developer, but according to him the underlying PHP script hardly does anything and should not take this long (120 seconds); still, the request ends in a 504 Gateway Timeout.

I need to establish where exactly the issue occurs:

  • Is it a problem with Nginx?
  • Is it a problem with php-fpm?
  • Is it a problem with the underlying PHP scripts?
  • Is it possible that nginx is not able to connect to php-fpm?
  • Would it be resolved if we used a Unix socket instead of a TCP/IP connection to php-fpm?
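One way to test the "nginx cannot connect to php-fpm" hypothesis independently of nginx is to time plain TCP connects to the php-fpm listener. The following is a minimal sketch, not a definitive tool: the helper name `time_connect` and the 5 second probe timeout are assumptions made for illustration, while 127.0.0.1:9000 is the listener named above.

```python
import socket
import time


def time_connect(host, port, timeout=5.0):
    """Return (ok, seconds) for one TCP connect attempt to host:port."""
    start = time.monotonic()
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (which includes timeouts and "connection refused") on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start


if __name__ == "__main__":
    # 127.0.0.1:9000 is the php-fpm address from the question.
    ok, elapsed = time_connect("127.0.0.1", 9000)
    print(f"connected={ok} elapsed={elapsed:.3f}s")
```

Run in a loop during a period when 504s show up, slow or failed connects on loopback would suggest the listener is not accepting connections in time, rather than a slow PHP script.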

The URL times out after 120 seconds with a 504.

Below is the error seen:

2016/01/04 17:29:20 [error] 1070#0: *196333149 upstream timed out (110: Connection timed out) while connecting to upstream, client: 66.249.74.95, server: x.x.x.x, request: "GET /Some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

Earlier, with fastcgi_connect_timeout set to 150 seconds, it used to return a 502 status code after 63 seconds with the default net.ipv4.tcp_syn_retries = 5 on RHEL 6.6; afterwards we set net.ipv4.tcp_syn_retries = 6, and then it started returning 502 after 127 seconds.
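The 63 and 127 second figures are consistent with the kernel's exponential SYN backoff: the initial SYN and each retransmission wait twice as long as the previous one before the connect attempt is abandoned. A small sketch of that arithmetic follows; the 1 second initial timeout is an assumption inferred from the observed 63 seconds, and the function name is made up for illustration.

```python
def syn_give_up_seconds(syn_retries, initial_rto=1.0):
    """Total seconds before connect() fails when no SYN is ever answered.

    The initial SYN plus `syn_retries` retransmissions each wait twice as
    long as the previous one, starting from `initial_rto` seconds
    (assumed to be 1 second here).
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


for retries in (5, 6):
    print(retries, syn_give_up_seconds(retries))
# prints:
# 5 63.0
# 6 127.0
```

If this model holds, the SYNs sent to 127.0.0.1:9000 were never answered at all. It would also explain the status codes: with a 150 second nginx timer the kernel gives up first (63 or 127 seconds) and nginx reports a failed connect as 502, whereas a 120 second timer expires before the kernel's 127 seconds and nginx reports its own timeout as 504.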

Once I set fastcgi_connect_timeout to 120, it started returning a 504 status code. I understand that such a high value for fastcgi_connect_timeout is not good.

I need to find out why exactly we are getting 504 (I know it's a timeout, but the cause is unknown) and get to the root cause to fix it permanently.

How do I confirm where exactly the issue is?

Here are some of the timeouts already defined:

Under the server-wide nginx.conf:

  • keepalive_timeout 5;
  • send_timeout 150;

Under the specific vhost.conf:

  • proxy_send_timeout 100
  • proxy_read_timeout 100
  • proxy_connect_timeout 100
  • fastcgi_connect_timeout 120
  • fastcgi_send_timeout 300
  • fastcgi_read_timeout 300

Different values are used for the timeouts so that I can figure out exactly which timeout was triggered.
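The idea of using distinct values can be sketched as a lookup from an observed request duration back to the directives that could have fired. The values below are copied from the lists above; the function name and the 2 second tolerance are assumptions made for illustration.

```python
# Timeout values in seconds, copied from the nginx.conf and vhost.conf lists.
TIMEOUTS = {
    "keepalive_timeout": 5,
    "send_timeout": 150,
    "proxy_send_timeout": 100,
    "proxy_read_timeout": 100,
    "proxy_connect_timeout": 100,
    "fastcgi_connect_timeout": 120,
    "fastcgi_send_timeout": 300,
    "fastcgi_read_timeout": 300,
}


def candidate_directives(elapsed_seconds, tolerance=2.0):
    """Return the directives whose value is within `tolerance` of the duration."""
    return sorted(
        name
        for name, value in TIMEOUTS.items()
        if abs(value - elapsed_seconds) <= tolerance
    )


print(candidate_directives(120))  # ['fastcgi_connect_timeout']
print(candidate_directives(300))  # ['fastcgi_read_timeout', 'fastcgi_send_timeout']
```

Note that a 300 second failure stays ambiguous between the send and read timeouts because they share a value; the phase text in the nginx error log ("while connecting to upstream", "while reading response header from upstream") is what tells them apart.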

Below are some of the settings from sysctl.conf:

  • net.ipv4.ip_local_port_range = 1024 65500
  • net.ipv4.tcp_fin_timeout = 10
  • net.ipv4.tcp_tw_reuse = 1
  • net.ipv4.tcp_syn_retries = 6
  • net.core.netdev_max_backlog = 8192
  • net.ipv4.tcp_max_tw_buckets = 2000000
  • net.core.somaxconn = 4096
  • net.ipv4.tcp_no_metrics_save = 1
  • vm.max_map_count = 256000
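Since net.core.somaxconn caps the accept queue of a listening socket (together with php-fpm's own listen.backlog setting), one thing that may be worth checking is whether the kernel has been dropping connection attempts because a listen queue was full. On Linux, the TcpExt counters ListenOverflows and ListenDrops in /proc/net/netstat record this system-wide. The sketch below only parses those counters; the function names are made up for illustration, and rising counters do not by themselves prove that php-fpm's socket is the one overflowing.

```python
def parse_tcpext(netstat_text):
    """Parse the TcpExt section of /proc/net/netstat into a dict of ints."""
    # The section is two lines: one with counter names, one with values.
    lines = [line for line in netstat_text.splitlines() if line.startswith("TcpExt:")]
    if len(lines) < 2:
        return {}
    names = lines[0].split()[1:]
    values = lines[1].split()[1:]
    return dict(zip(names, map(int, values)))


def listen_queue_counters(netstat_text):
    """Return only the counters related to full listen queues."""
    counters = parse_tcpext(netstat_text)
    return {key: counters.get(key, 0) for key in ("ListenOverflows", "ListenDrops")}


if __name__ == "__main__":
    try:
        with open("/proc/net/netstat") as fh:
            print(listen_queue_counters(fh.read()))
    except OSError:
        print("/proc/net/netstat is not available on this system")
```

Sampling these counters before and after a burst of "while connecting to upstream" errors shows whether they increase in the same window.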

If it's poorly written code, then I need to inform the developer that the 504 is happening due to an issue in the PHP code and not due to nginx or php-fpm; if it's due to nginx or php-fpm, then we need to fix that.

Thanks in Advance!

======

Further update:

There are 2 cases:

  1. 504 @ 120 seconds, with the error below:

2016/01/05 03:50:54 [error] 1070#0: *201650845 upstream timed out (110: Connection timed out) while connecting to upstream, client: 66.249.74.99, server: x.x.x.x, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

  2. 504 @ 300 seconds, with the error below:

2016/01/05 00:51:43 [error] 1067#0: *200656359 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 115.112.161.9, server: 192.168.12.101, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

  • No errors were found in the php-fpm logs.
  • The number of php-fpm processes was also normal. The backend doesn't look overloaded, as other requests were served fine at the same time.
  • Only one php-fpm pool is being used. There is one php-fpm master (parent) process, and the number of child processes stays in its usual range even when the 5xx errors are observed. There is no significant growth in the number of php-fpm processes, and even if it grows, the server has enough capacity to fork new ones and serve the requests.
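The two cases differ in the phase nginx reports: "while connecting to upstream" means the TCP handshake with php-fpm never completed, whereas "while reading response header from upstream" means php-fpm accepted the request but sent nothing back within fastcgi_read_timeout. To see how the errors split between the two, the error log can be tallied by phase. This is a rough sketch: the regular expression is an assumption based only on the two log lines above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Captures the phase text nginx writes after "upstream timed out (...) while".
PHASE_RE = re.compile(r"upstream timed out \([^)]*\) while ([^,]+),")


def count_timeout_phases(lines):
    """Count upstream timeouts per phase in nginx error log lines."""
    counts = Counter()
    for line in lines:
        match = PHASE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two log lines quoted in the question.
SAMPLE = [
    '2016/01/05 03:50:54 [error] 1070#0: *201650845 upstream timed out (110: Connection timed out) while connecting to upstream, client: 66.249.74.99, server: x.x.x.x, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"',
    '2016/01/05 00:51:43 [error] 1067#0: *200656359 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 115.112.161.9, server: 192.168.12.101, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"',
]

for phase, count in sorted(count_timeout_phases(SAMPLE).items()):
    print(count, phase)
```

A log dominated by the connecting phase would point toward the listener (backlog, process limits, networking), while a log dominated by the reading phase would point toward the PHP scripts or something they wait on.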

Robbegrillet answered 4/1, 2016 at 14:11

It must be assumed that you are rewriting URLs or otherwise redirecting through a gateway/firewall, which is generally how a 504 error arises.

504 means that the gateway did not get a timely response from a backend service (i.e., one on the other side of the gateway/firewall, the inside), typically because that service is down or cannot be addressed (bad internal URL). It can also be caused by a backend crash, but that should show up in the logs (if debug logs are turned on).

Check the following:

(a) Check the application by accessing it on the internal network. Can it be addressed? Are the parameters right? Is it working as intended?

(b) Check the gateway. How is it redirecting (rewriting) the URL? Are the required modules installed to allow redirection/rewriting? Is the resulting address correct internally? Is the redirection written correctly (correct type, arguments, etc.)? Checking the access logs on the gateway may be useful.

There are many other ways this problem can occur, but this is the area you should be investigating first. 504 is a routing error.

Atlantis answered 28/3, 2019 at 2:8

Try increasing fastcgi_read_timeout and proxy_read_timeout in your nginx config even more. You can also add the following to the top of any PHP file that runs a long task:

// Set only one max_execution_time value; if both calls run, the later one wins.
// ini_set('max_execution_time', '0'); // 0 = no execution time limit
ini_set('max_execution_time', '300');  // 300 seconds = 5 minutes
ini_set('memory_limit', '2048M');      // use '-1' for no memory limit

Legofmutton answered 29/6, 2021 at 10:38

The long-term fix is to edit the file /etc/sysctl.conf to include the line:

fs.inotify.max_user_watches=1048576

Then run sysctl -p to reload sysctl.conf.

DONE.

Autocrat answered 18/10, 2021 at 1:43

Comments:
@Robbegrillet did you find the solution to your issue? I need advice on this. - Vail
More detail on what this does and why to do it would be helpful. - Alcaide
