Redshift + SQLAlchemy long query hangs

I'm doing something along the lines of:

conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(conn_string, execution_options={'autocommit': True}, encoding='utf-8', isolation_level="AUTOCOMMIT")
rows = conn.execute(sql_query)

To run queries on a Redshift cluster. Lately, I've been doing maintenance tasks such as running vacuum reindex on large tables that get truncated and reloaded every day.
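
For reference, the maintenance call itself looks roughly like this (a sketch; "my_big_table" is a placeholder for the table that gets truncated and reloaded daily):

# Redshift's VACUUM cannot run inside a transaction block, which is why the
# engine above is created with isolation_level="AUTOCOMMIT".
conn.execute("vacuum reindex my_big_table;")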

The problem is that the command above takes around 7 minutes for a particular table (it's huge: 60 million rows across 15 columns), and when I run it using the method above it just never finishes and hangs. I can see in the cluster dashboard in AWS that parts of the vacuum command run for about 5 minutes and then it just stops. No Python errors, no errors on the cluster, nothing.

My guess is that the connection is lost during the command. So, how do I prove my theory? Has anybody else run into this issue? What do I change in the connection string to keep it alive longer?
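
One check I could think of (a sketch; run from a second session after the Python call hangs, using Redshift's STV_RECENTS system view):

# If the VACUUM still shows up as 'Running' after the client hangs, the
# statement survived on the cluster and only the client-side connection died.
check_sql = """
    select pid, status, duration, trim(query) as query
    from stv_recents
    where status = 'Running';
"""
for row in conn.execute(check_sql):
    print(row)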

Babbette answered 24/4, 2017 at 17:25 Comment(5)
In my experience, a connection drop like that (if it indeed is a connection drop) is due to a NAT silently dropping a connection somewhere in the middle. Incidentally, the timeout on AWS NAT gateways is 5 minutes. The tricky thing here is, psycopg2/libpq apparently has TCP keepalive turned on by default, which should fix the issue. You can try setting it explicitly: create_engine(connect_args={"keepalives": 1, "keepalives_idle": 60, "keepalives_interval": 60}); it is possible that the default keepalive interval is too long.Decentralization
@Decentralization those keepalive arguments made this work today with no issues at all. If it works tomorrow I'll comment here again to ask you to post this as an actual answer. It just makes me wonder why, if it is turned on by default, I still have to add it manually...Babbette
@Babbette Does the solution work consistently? I'm facing the same problem..Knapsack
@Knapsack Yes. Since applied it hasn't failed.Babbette
I am facing the same issue using Airflow with Redshift. @univerio's comment solved it and should be the accepted answer here.Kerf

It's most likely not a connection drop issue. To confirm this, try pushing a few million rows into a dummy table (something that takes more than 5 minutes) and see if the statement fails; a sketch of such a test is below. Once a query has been submitted to Redshift, it keeps executing in the background regardless of your connection shutting down.
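
A rough sketch of such a test, reusing the engine from the question ("source_table" and "dummy_load" are placeholder names):

# Build a statement that is known to run well past the 5-minute mark, then see
# whether the client hangs in the same way. Tripling a large existing table via
# a cross join is one cheap way to get there.
long_sql = """
    create table dummy_load as
    select t.*
    from source_table t
    cross join (select 1 union all select 2 union all select 3) m;
"""
conn.execute(long_sql)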

Now, coming to the problem itself: my guess is that you are running out of memory or disk space. Can you elaborate on your Redshift setup (how many dc1/ds2 nodes)? Also, try running some admin queries (example below) and see how much space you have left on disk. Sometimes when the cluster is loaded to the brim a disk-full error is thrown, but in your case the connection might be dropped well before the error reaches your Python shell.
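
For the disk check, a sketch of the commonly used admin query against the STV_PARTITIONS system view (used and capacity are reported in 1 MB blocks):

# part_begin = 0 avoids double counting mirrored partitions.
disk_sql = """
    select sum(used) / 1024.0     as used_gb,
           sum(capacity) / 1024.0 as capacity_gb
    from stv_partitions
    where part_begin = 0;
"""
for row in conn.execute(disk_sql):
    print(row)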

Rhatany answered 26/4, 2017 at 11:42 Comment(1)
Memory or disk space is actually not an issue: I have a 4-node ds2.8xlarge cluster, and during the command the CPU usage doesn't go above 60% and the disk usage remains steady at 30%.Babbette

It's a common issue, when you are behind a NAT proxy or a firewall, to be disconnected due to network inactivity (such as waiting for a long-running query). Because of their physical limits, NAT proxies and firewalls can only keep a finite number of connections in memory. The most common and logical policy is to keep the newest connections and to discard old/inactive connections first.

A general solution to this problem is to use TCP keepalive packets so that intermediate hosts don't close the connection. psycopg2 lets you pass keepalive parameters for a connection through to libpq, so you can add them to the connect_args dict for create_engine(), e.g.:

conn = sqlalchemy.engine.create_engine(
    conn_string,
    connect_args={
        "keepalives": 1,           # enable TCP keepalives
        "keepalives_idle": 30,     # seconds of inactivity before the first keepalive
        "keepalives_interval": 5,  # seconds between keepalive retransmissions
        "keepalives_count": 5      # lost keepalives before the connection is considered dead
    },
)
Collencollenchyma answered 5/2 at 11:18 Comment(0)