I have a four node clustered Airflow environment that's been working fine for me for a few months now.
ec2-instances
- Server 1: Webserver, Scheduler, Redis Queue, PostgreSQL Database
- Server 2: Webserver
- Server 3: Worker
- Server 4: Worker
Recently I've been working on a more complex DAG that has a few dozen tasks in it compared to my relatively small ones I was working on beforehand. I'm not sure if that's why I'm just now seeing this error pop up or what but I'll sporadically get this error:
On the Airflow UI under the logs for the task:
psycopg2.OperationalError: FATAL: sorry, too many clients already
And on the Webserver (output from running airflow webserver) I get the same error too:
[2018-07-23 17:43:46 -0400] [8116] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2158, in _wrap_pool_connect
return fn()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 403, in connect
return _ConnectionFairy._checkout(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 788, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 532, in checkout
rec = pool._do_get()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1193, in _do_get
self._dec_overflow()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1190, in _do_get
return self._create_connection()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 350, in _create_connection
return _ConnectionRecord(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 477, in __init__
self.__connect(first_connect_check=True)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 671, in __connect
connection = pool._invoke_creator(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 106, in connect
return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 410, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: sorry, too many clients already
I can fix this by running sudo /etc/init.d/postgresql restart
and restarting the DAG but then after about three runs I'll start seeing the error again.
I can't find any specifics on this issue in regards to Airflow but from other posts I've found such as this one they're saying it's because my client (I guess in this case that's Airflow) is trying to open up more connections to PostgreSQL than what PostgreSQL is configured to handle. I ran this command to find that my PostgreSQL can accept 100 connections:
[ec2-user@ip-1-2-3-4 ~]$ sudo su
root@ip-1-2-3-4
[/home/ec2-user]# psql -U postgres
psql (9.2.24)
Type "help" for help.
postgres=# show max_connections;
max_connections
-----------------
100
(1 row)
In this solution the post says I can increase my PostgreSQL max connections but I'm wondering if I should instead set a value in my Airflow.cfg file so that I can match the Airflow allowed connections size to my PoastgreSQL max connections size. Does anyone know where I can set this value in Airflow? Here are the fields I think are relevant:
# The SqlAlchemy pool size is the maximum number of database connections
# in the pool.
sql_alchemy_pool_size = 5
# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does
# not apply to sqlite.
sql_alchemy_pool_recycle = 3600
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 32
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 32
Open to any suggestions for fixing this issue. Is this something related to my Airflow configuration or is it an issue with my PostgreSQL configuration?
Also, because I'm testing a new DAG I'll sometimes terminate the running tasks and start them over. Perhaps doing this is causing some of the processes to not die correctly and they're keeping dead connections open to PostgreSQL?
SELECT * FROM pg_stat_activity;
. Might be interesting to investigate from there. – Curvilinear