MySQL has gone away: Connection_errors_peer_address with high numbers
Asked Answered
C

1

10

We have MySQL 5.7 master - slaves replications and on the slave servers side, it hapens from time to time that our application monitoring tools (Tideways and PHP7.0) are reporting

MySQL has gone away.

Checking the MYSQL side:

show global status like '%Connection%';

+-----------------------------------+----------+
| Variable_name                     | Value    |
+-----------------------------------+----------+
| Connection_errors_accept          | 0        |
| Connection_errors_internal        | 0        |
| Connection_errors_max_connections | 0        |
| Connection_errors_peer_address    | 323      |
| Connection_errors_select          | 0        |
| Connection_errors_tcpwrap         | 0        |
| Connections                       | 55210496 |
| Max_used_connections              | 387      |
| Slave_connections                 | 0        |
+-----------------------------------+----------+

The Connection_errors_peer_address shows 323. How to further investigate on what is causing this issue on both sides:

MySQL has gone away

and

Connection_errors_peer_address

EDIT:

Master Server

net_retry_count = 10 
net_read_timeout = 120 
net_write_timeout = 120 
skip_networking = OFF
Aborted_clients = 151650

Slave Server 1

net_retry_count = 10
net_read_timeout = 30 
net_write_timeout = 60 
skip_networking = OFF
Aborted_clients = 3

Slave Server 2

net_retry_count = 10
net_read_timeout = 30 
net_write_timeout = 60 
skip_networking = OFF
Aborted_clients = 3
Coherent answered 25/9, 2018 at 11:16 Comment(3)
what's the value of wait_timeout and max_allowed_packet in my.cnf on the master node?Changchun
This may occur due to Server timed out. To fix this please check that wait_timeout mysql variable in configuration file is sufficient or not.Tridimensional
@Alexey: wait_timeout = 600 and max_allowed_packet = 100MCoherent
D
7

In MySQL 5.7, when a new TCP/IP connection reaches the server, the server performs several checks, implemented in sql/sql_connect.cc in function check_connection()

One of these checks is to get the IP address of the client side connection, as in:

static int check_connection(THD *thd)
{
...
  if (!thd->m_main_security_ctx.host().length)     // If TCP/IP connection
  {
...
    peer_rc= vio_peer_addr(net->vio, ip, &thd->peer_port, NI_MAXHOST);
    if (peer_rc)
    {
      /*
        Since we can not even get the peer IP address,
        there is nothing to show in the host_cache,
        so increment the global status variable for peer address errors.
      */
      connection_errors_peer_addr++;
      my_error(ER_BAD_HOST_ERROR, MYF(0));
      return 1;
    }
...
}

Upon failure, the status variable connection_errors_peer_addr is incremented, and the connection is rejected.

vio_peer_addr() is implemented in vio/viosocket.c (code simplified to show only the important calls)

my_bool vio_peer_addr(Vio *vio, char *ip_buffer, uint16 *port,
                      size_t ip_buffer_size)
{
  if (vio->localhost)
  {
...
  }
  else
  {
    /* Get sockaddr by socked fd. */

    err_code= mysql_socket_getpeername(vio->mysql_socket, addr, &addr_length);

    if (err_code)
    {
      DBUG_PRINT("exit", ("getpeername() gave error: %d", socket_errno));
      DBUG_RETURN(TRUE);
    }

    /* Normalize IP address. */

    vio_get_normalized_ip(addr, addr_length,
                          (struct sockaddr *) &vio->remote, &vio->addrLen);

    /* Get IP address & port number. */

    err_code= vio_getnameinfo((struct sockaddr *) &vio->remote,
                              ip_buffer, ip_buffer_size,
                              port_buffer, NI_MAXSERV,
                              NI_NUMERICHOST | NI_NUMERICSERV);

    if (err_code)
    {
      DBUG_PRINT("exit", ("getnameinfo() gave error: %s",
                          gai_strerror(err_code)));
      DBUG_RETURN(TRUE);
    }
...
  }
...
}

In short, the only failure path in vio_peer_addr() happens when a call to mysql_socket_getpeername() or vio_getnameinfo() fails.

mysql_socket_getpeername() is just a wrapper on top of getpeername().

The man 2 getpeername manual lists the following possible errors:

NAME

   getpeername - get name of connected peer socket

ERRORS

   EBADF  The argument sockfd is not a valid descriptor.

   EFAULT The addr argument points to memory not in a valid part of the process address space.

   EINVAL addrlen is invalid (e.g., is negative).

   ENOBUFS
          Insufficient resources were available in the system to perform the operation.

   ENOTCONN
          The socket is not connected.

   ENOTSOCK
          The argument sockfd is a file, not a socket.

Of these errors, only ENOBUFS is plausible.

As for vio_getnameinfo(), it is just a wrapper on getnameinfo(), which also according to the man page man 3 getnameinfo can fail for the following reasons:

NAME

   getnameinfo - address-to-name translation in protocol-independent manner

RETURN VALUE

   EAI_AGAIN
          The name could not be resolved at this time.  Try again later.

   EAI_BADFLAGS
          The flags argument has an invalid value.

   EAI_FAIL
          A nonrecoverable error occurred.

   EAI_FAMILY
          The address family was not recognized, or the address length was invalid for the specified family.

   EAI_MEMORY
          Out of memory.

   EAI_NONAME
          The name does not resolve for the supplied arguments.  NI_NAMEREQD is set and the host's name cannot be located, or neither 

hostname nor service name were requested.

   EAI_OVERFLOW
          The buffer pointed to by host or serv was too small.

   EAI_SYSTEM
          A system error occurred.  The error code can be found in errno.

   The gai_strerror(3) function translates these error codes to a human readable string, suitable for error reporting.

Here many failures can happen, basically due to heavy load or the network.

To understand the process behind this code, what the MySQL server is essentially doing is a Reverse DNS lookup, to:

  • find the hostname of the client
  • find the IP address corresponding to this hostname to later convert this IP address to a hostname again (see the call to ip_to_hostname() that follows).

Overall, failures accounted with Connection_errors_peer_address can be due to system load (causing transient failures like out of memory, etc) or due to network issues affecting DNS.

Disclosure: I happen to be the person who implemented this Connection_errors_peer_address status variable in MySQL, as part of an effort to have better visibility / observability in this area of the code.

[Edit] To follow up with more details and/or guidelines:

  • When Connection_errors_peer_address is incremented, the root cause is not printed in logs. That is unfortunate for troubleshooting, but also avoid flooding logs causing even more damage, there is a tradeoff here. Keep in mind that anything that happen before logging in is very sensitive ...
  • If the server really goes out of memory, it is very likely that many other things will break, and that the server will go down very quickly. By monitoring the total memory usage of mysqld, and monitoring the uptime, it should be fairly easy to determine if the failure "only" caused connections to be closed with the server staying up, or if the server itself failed catastrophically.
  • Assuming the server stays up on failure, the more likely culprit is the second call then, to getnameinfo.
  • Using skip-name-resolve will have no effect, as this check happens later (see specialflag & SPECIAL_NO_RESOLVE in the code in check_connection())
  • When Connection_errors_peer_address fails, note that the server cleanly returns the error ER_BAD_HOST_ERROR to the client, and then closes the socket. This is different from just closing abruptly a socket (like in a crash) : the former should be reported by the client as "Can't get hostname for your address", while the later is reported as "MySQL has gone away".
  • Whether the client connector actually treat ER_BAD_HOST_ERROR and a socket closed differently is another story

Given that this failure overall seems related to DNS lookups, I would check the following items:

  • See how many rows are in the performance_schema.host_cache table.
  • Compare this with the size of the host cache, see the host_cache_size system variable.
  • If the host cache appear full, consider increasing its size: this will reduce the number of DNS calls overall, relieving pressure on DNS, in hope (admittedly, this is just a shot in the dark) that DNS transient failures will disappear.
  • 323 out of 55 million connections indeed seems transient. Assuming the monitoring client sometime do get connected properly, inspect the row in table host_cache for this client: it may contains other failures reported.

Table performance_schema.host_cache documentation:

https://dev.mysql.com/doc/refman/5.7/en/host-cache-table.html

Further readings:

http://marcalff.blogspot.com/2012/04/performance-schema-nailing-host-cache.html

[Edit 2] Based on the new data available:

The Aborted_clients status variable shows some connections forcefully closed by the server. This typically happens when a session is idle for a very long time.

A typical scenario for this to happen is:

  1. A client opens a connection, and sends some queries
  2. Then the client does nothing for an extended amount of time (greater than the net_read_timeout)
  3. Due to lack of traffic, the server closes the session, and increments Aborted_connects
  4. The client then sends another query, sees a closed connection, and reports "MySQL has gone away"

Note that a client application forgetting to cleanly close sessions will execute 1-3, this could be the case for Aborted_clients on the master. Some cleanup here to fix clients applications using the master would help to decrease resource consumption, as leaving 151650 sessions open to die on timeout has a cost.

A client application executing 1-4 can cause Aborted_clients on the server and MySQL has gone away on the client. The client application reporting "MySQL has gone away" is most likely the culprit here.

If a monitoring application, say, checks the server every N seconds, then make sure the timeouts (here 30 and 60 sec) are significantly greater that N, or the server will kill the monitoring session.

Doris answered 1/10, 2018 at 9:28 Comment(15)
Is skip-name-resolve relevant? If so, is it a 'solution'?Neuberger
Thanks a lot for the detailed description Marc. The Connection_errors_peer_address was a potential source of information I was trying to investigate on while getting the MySQL has gone away message on the PHP side. You mentioned that the Connection_errors_peer_address can be due to system load or network issues affecting DNS. How to find out what really happened. Is ther something storred in the MySQL logs when the Connection_errors_peer_address gets incremented so that I can explore and find out what really happens.Coherent
@RickJames, You mean starting MySQL with skip-name-resolve so that the function check_connection is not triggered, then eliminate this source of errors?Coherent
@Miloš - I don't know for sure; I am guessing. (Hence, I did not present it as an 'Answer'.Neuberger
Roger that, thanks. I was more expecting to get an answer to that comment from @MarcAlff.Coherent
@RickJames, Milos - Sorry for the delay. See updated answer.Doris
@MarcAlff, thank you once again. host_cache_size = 703 but there are no entries in the performance_schema.host_cache tableCoherent
@Milos - Thanks for the update. The host_cache table can appear empty when the performance schema is disabled. This hinders observability, but is not a root cause here.Doris
@MarcAlff, any idea on what else to investigate in order to find out what could be the cause of the MySQL has gone away?Coherent
@milos - "MySQL has gone away" is a socket closed seen from the client side. Check for Aborted_clients, net_retry_count, net_read_timeout and net_write_timeout. ... this all points to the network. Make sure there are no forgotten clean up scripts that KILL sessions. Check also for disconnect_on_expired_password.Doris
@MarcAlff,thanks once again. I'm not at the highest level regarding MySQ. Checked those variables: disconnect_on_expired_password = ON net_retry_count = 10 net_read_timeout = 120 net_write_timeout = 120 skip_networking = OFFCoherent
@Milos - I don't see anything odd here. The main question is whether the system still has peer address / disconnect errors currently, or only had some in the past. If there was a network glitch ages ago (I assume the server uptime is significant), finding that is going to be very challenging ... What is the value of Aborted_clients ?Doris
@MarcAlff, my appologies, I was pasting the Master MYSQL server variables, and there is a good chance the MySQL has gone away appeared on one of the two slave servers we have. I updated my question with the variables you asked for. Note that the MySQL has gone away happened 3 times in the last 2 weeks, last time 4 days ago.Coherent
@Milos - Just to make sure, when you say "MySQL has gone away", it is only an error seen from a client, but the server is still up, right ? This is visible in the server uptime.Doris
@MarcAlff, that's right, the "SQLSTATE[HY000] [2006] MySQL server has gone away" is from the client side. The server is still up and running, uptime shows large values. The "SQLSTATE[HY000] [2006] MySQL server has gone away" just happened again on the client side.Coherent

© 2022 - 2024 — McMap. All rights reserved.