Psycopg2 db connection hangs on lost network connection
Asked Answered
E

4

20

Problem description

I'm using psycopg2 to connect to my PostgreSQL database on a remote host. I open a connection and wait for requests, then for each request I run queries on the connection and return data.

But when the network connection is lost after the connection is already open the next db query hangs and I have to kill the program manually.

Details:

  • it hangs for at least 2 hours (I couldn't wait longer)
  • the "network down" situation is actually VPN going down (db host is accesible only by VPN)
  • I can't use asynchronous connection because I need transactions
  • python 2.6
  • psycopg 2.2.1
  • debian linux 6 and 7, 64bit
  • postgresql 8.4 and 9.1

What I want/need

I need some reliable way to detect a failed connection before running a query, so my program won't hang, or a way to make cursor.execute(..) raise an exception on failed connection.

Example:

import psycopg2
import time

conn = psycopg2.connect("host='dbs' dbname='foo' user='joe' password='x'")
time.sleep(10) # I manually turn VPN off during this sleep..
cu = conn.cursor()
cu.execute('SELECT 1') # <- hangs here
print cu.fetchone()
cu.commit()

What have I tried (and what didn't work):

  • setting TCP timeout "globally" - before psycopg2 import, I added:

    import socket
    socket.setdefaulttimeout(10)
    
  • setting TCP timeout on psycopg.connection's socket:

    ..
    conn = psycopg2.connect(...
    s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    ..
    
  • enabling keepalive for psycopg.connection's socket:

    ...
    conn = psycopg2.connect(...
    s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    ...
    
Eclampsia answered 4/11, 2014 at 17:19 Comment(2)
Jan, have you found a solution? We are facing the same problem here in 2016:)Redfin
@scythargon: well, sort of... I rewritten everything which needed transactions in plpgsql and switched to asynchronous connections. I wouldn't call it a solution.. more like a workaround.. it took me more time than I was comfortable with, but I saw no other way..Eclampsia
H
10

After a long and brutal struggle, I think I fixed this issue by simply doing the strategy others are talking about, but using the psycopg2 connect function itself:


from psycopg2 import connect


conn = connect(
        database=database,
        user=username,
        password=password,
        host=hostname,
        port=port,
        connect_timeout=3,
        # https://www.postgresql.org/docs/9.3/libpq-connect.html
        keepalives=1,
        keepalives_idle=5,
        keepalives_interval=2,
        keepalives_count=2)

I was seeing psycopg2 hang consistently on long-running queries, but now the issue seems to be fully resolved.

Note this may be new functionality, since this question is old.

Headache answered 21/4, 2020 at 21:13 Comment(2)
Unfortunately these setting only work on Windows: This parameter is ignored for connections made via a Unix-domain socket.Tabathatabb
@Tabathatabb no, that doesn't mean it only works on Windows. That means it doesn't work on Unix-domain sockets.Susceptible
A
1

OP's and Gabriel Salla's solutions which configure KEEPALIVE are not complete. This solution only works when the connection is idle (no data sent before network became down) and the network became down.

If some data have already sent over the network that is already down but not yet detected so by the KEEPALIVE feature there will be a hang. This happens because the RTO mechanism is used instead of KEEPALIVE when some data is send.

To set timeout for an RTO you must set TCP_USER_TIMEOUT timeout (in milliseconds) for socket.

The complete solution is (both KEEPALIVE and RTO timeouts configured to 10 seconds):

s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 6)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 2)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)
Annettaannette answered 29/12, 2020 at 13:57 Comment(0)
A
0

In order to make sure a connection is still valid, read the property connection.isolation_level. This will raise an OperationalError with pgcode == "57P01" in case the connection is dead.

try: connection.isolation_level except OperationalError as oe: conn = psycopg2.connect(dsn)

Asare answered 5/12, 2014 at 3:13 Comment(1)
Unfortunately, it doesn't - I get valid value (1) of conn.isolation_level and no exception.Eclampsia
G
0

Took a look at the socket timeout and after reading this and this, these settings worked for me

s = socket.fromfd(connection.fileno(),
                  socket.AF_INET, socket.SOCK_STREAM)
# Enable sending of keep-alive messages
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Time the connection needs to remain idle before start sending
# keepalive probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, int(ceil(time)))
# Time between individual keepalive probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)
# The maximum number of keepalive probes should send before dropping
# the connection
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
Gutsy answered 8/6, 2017 at 11:36 Comment(3)
it seems too similar to what I tried 3 years ago - the last paragraph in the question - but because I have already rewritten the whole db code, I can't test if those minor differences (5 instead of 3 in TCP_KEEPCNT, 1 instead of 3 in TCP_KEEPINTVL and quite unknown int(ceil(time)) in TCP_KEEPIDLE`) really make it work. If scythargon or someone else can confirm this works, I'll accept.Eclampsia
Did you end up finding a solution ? From my tests, i think this occurs only with a VPN connection lost and restarted, on a real "network down" solution, it seems once the network is up again, the socket is operational again - can you confirm this ?Gadmann
@Gadmann this configuration only works on idle connection. In situation where the connection lost and some data sent before KEEPIDLE interval lasts the connection will hang for some minutes or hours. To solve this a TCP_USER_TIMEOUT must be setAnnettaannette

© 2022 - 2024 — McMap. All rights reserved.