Airflow psycopg2.OperationalError: FATAL: sorry, too many clients already

I have a four-node clustered Airflow environment that's been working fine for me for a few months now.

ec2-instances

  • Server 1: Webserver, Scheduler, Redis Queue, PostgreSQL Database
  • Server 2: Webserver
  • Server 3: Worker
  • Server 4: Worker

Recently I've been working on a more complex DAG with a few dozen tasks, compared to the relatively small ones I was working on before. I'm not sure if that's why I'm only now seeing this, but I'll sporadically get this error:

In the Airflow UI, under the logs for the task:

psycopg2.OperationalError: FATAL: sorry, too many clients already

And in the webserver output (from running airflow webserver) I get the same error:

[2018-07-23 17:43:46 -0400] [8116] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2158, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 403, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 788, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 532, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1193, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1190, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 350, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 477, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 671, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 106, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 410, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  sorry, too many clients already

I can fix this by running sudo /etc/init.d/postgresql restart and restarting the DAG, but after about three runs the error starts appearing again.

I can't find anything specific to Airflow on this issue, but other posts I've found, such as this one, say it happens because the client (I guess in this case that's Airflow) tries to open more connections to PostgreSQL than PostgreSQL is configured to handle. I ran this command and found that my PostgreSQL accepts 100 connections:

[ec2-user@ip-1-2-3-4 ~]$ sudo su
root@ip-1-2-3-4
[/home/ec2-user]# psql -U postgres
psql (9.2.24)
Type "help" for help.

postgres=# show max_connections;
 max_connections
-----------------
 100
(1 row)
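
For reference, counting the open connections from the same psql session shows how close I am to that limit (standard PostgreSQL, nothing Airflow-specific):

SELECT count(*) AS current_connections,
       (SELECT setting::int FROM pg_settings
         WHERE name = 'max_connections') AS max_connections
  FROM pg_stat_activity;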

In this solution the post says I can increase PostgreSQL's max connections, but I'm wondering whether I should instead set a value in my airflow.cfg file so that the number of connections Airflow is allowed to open matches my PostgreSQL max_connections limit. Does anyone know where I can set this value in Airflow? Here are the fields I think are relevant:

# The SqlAlchemy pool size is the maximum number of database connections
# in the pool.
sql_alchemy_pool_size = 5

# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does
# not apply to sqlite.
sql_alchemy_pool_recycle = 3600

# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 32

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 32

I'm open to any suggestions for fixing this issue. Is this something related to my Airflow configuration, or is it an issue with my PostgreSQL configuration?

Also, because I'm testing a new DAG, I'll sometimes terminate the running tasks and start them over. Perhaps doing this causes some of the processes to not die correctly, so they're keeping dead connections open to PostgreSQL?
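
In case it's useful, here is one way I could check for that: look for sessions sitting idle in pg_stat_activity (column names as of PostgreSQL 9.2, which is what I'm running) and, if stale ones pile up, close them individually instead of restarting the whole server:

SELECT pid, usename, application_name, state, state_change
  FROM pg_stat_activity
 WHERE state = 'idle'
 ORDER BY state_change;

-- close a single stale session without a full restart
-- (replace 12345 with an actual pid from the query above)
SELECT pg_terminate_backend(12345);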

Volution answered 23/7, 2018 at 21:58 Comment(1)
According to this dba.stackexchange.com/questions/69438/… you could check what is still open with SELECT * FROM pg_stat_activity;. Might be interesting to investigate from there. – Curvilinear

I ran into a similar issue. I changed max_connections in Postgres to 10000 and sql_alchemy_pool_size in the Airflow config to 1000. Now I am able to run hundreds of tasks in parallel.

PS: My machine has 32 cores and 60 GB of memory, so it can take the load.
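
In case it helps, the server-side part of that change is just raising max_connections and restarting PostgreSQL. A sketch (the 10000 matches my setup above and is not a general recommendation, since every connection costs server memory):

-- PostgreSQL 9.4 and later; takes effect after a server restart
ALTER SYSTEM SET max_connections = 10000;

-- On older versions such as 9.2, edit max_connections in
-- postgresql.conf instead, then restart the server.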

Clisthenes answered 20/9, 2018 at 11:59 Comment(1)
I have max_connections=100 and sql_alchemy_pool_size=5 -- and still Airflow manages to provoke the "too many clients" error. Therefore, I'm not sure whether this answer is (still) correct. – Assimilable

Quoting the Airflow documentation:

sql_alchemy_max_overflow: The maximum overflow size of the pool. When the number of checked-out connections reaches the size set in pool_size, additional connections will be returned up to this limit. When those additional connections are returned to the pool, they are disconnected and discarded. It follows then that the total number of simultaneous connections the pool will allow is pool_size + max_overflow, and the total number of “sleeping” connections the pool will allow is pool_size. max_overflow can be set to -1 to indicate no overflow limit; no limit will be placed on the total number of concurrent connections. Defaults to 10.

It seems that the variables you'll want to set in your airflow.cfg are sql_alchemy_pool_size and sql_alchemy_max_overflow. Your PostgreSQL max_connections must be equal to or greater than the sum of those two Airflow configuration variables, since Airflow can have at most sql_alchemy_pool_size + sql_alchemy_max_overflow open connections to your database.
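
One caveat (and possibly what the comment above ran into): each Airflow process (the scheduler, every webserver gunicorn worker, and every Celery worker) holds its own SQLAlchemy pool, so the worst case multiplies across processes. As an illustrative sketch only, with a made-up process count, the budget with the defaults might look like this in airflow.cfg:

# Illustrative values, not a recommendation: with, say, four Airflow
# processes each holding its own pool, the worst case is
# 4 * (pool_size 5 + max_overflow 10) = 60 connections, which stays
# under PostgreSQL's default max_connections = 100.
sql_alchemy_pool_size = 5
sql_alchemy_max_overflow = 10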

Interweave answered 18/2, 2021 at 18:5 Comment(0)
