SQLAlchemy hangs during insert while querying [INFORMATION_SCHEMA].[TABLES]
I have a Python process that uses SQLAlchemy to insert some data into a MS SQL Server DB. When the Python process runs, it hangs during the insert. I turned on SQLAlchemy logging to get more information and found that it hangs at this point, where SQLAlchemy seems to be requesting table schema info about the entire DB:

2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')

I have other "stuff" going on in the DB at this time, including some open transactions, and my guess is that querying [INFORMATION_SCHEMA].[TABLES] creates some deadlock or blocks somehow.

I've also read (here) that [INFORMATION_SCHEMA].[TABLES] is a view that cannot cause a deadlock, which would contradict my guess about what is causing this issue.

My question is: Can I alter the configuration/settings of SQLAlchemy so that it does not make this query in the first place?

UPDATE 1: The Python code for the insert is like this:

import sqlalchemy

# params is the URL-encoded ODBC connection string
with sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params).connect() as connection:
    # df is a Pandas DataFrame; my_table is the target table name
    df.to_sql(name=my_table, con=connection, if_exists='append', index=False)

Note that the code works without any problems when I run the Python script during other times of the day when I don't have those other DB transactions going on. In those cases, the log continues immediately like this, listing all the tables in the DB:

2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._init_metadata(result.py:810) DEBUG: Col ('TABLE_NAME',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table1',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table2',)
...

UPDATE 2: Apparently, when a table or other object is created in an open transaction and not yet committed, querying [INFORMATION_SCHEMA].[TABLES] will block (source). Is anyone familiar enough with the internals of SQLAlchemy to suggest how to prevent it from making this query in the first place?
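
To illustrate the blocking scenario from that source, here is a rough, untested sketch (the connection string, table name, and two-shell setup are assumptions for illustration):

import sqlalchemy

engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)

# Session 1 (e.g., one Python shell): create a table inside a transaction
# and leave it uncommitted.
conn1 = engine.connect()
trans = conn1.begin()
conn1.execute(sqlalchemy.text("CREATE TABLE t_uncommitted (id INT)"))

# Session 2 (a separate shell): under the default READ COMMITTED isolation
# level, this SELECT hangs until session 1 commits or rolls back.
conn2 = engine.connect()
rows = conn2.execute(sqlalchemy.text(
    "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_SCHEMA = 'dbo' AND TABLE_TYPE = 'BASE TABLE'"
)).fetchall()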

UPDATE 3: After posting this issue on the SQLAlchemy GitHub (issue link), the SQLAlchemy devs confirmed that the query of [INFORMATION_SCHEMA].[TABLES] is in fact being issued by the Pandas function to_sql().

So my new question is: does anyone know how to disable this behavior in the Pandas to_sql() function? I looked over the documentation and could not find anything that would help.

Summertree answered 30/10, 2020 at 14:27 Comment(14)
Possibly related: dba.stackexchange.com/q/164429/21390Achromatin
Related: #59636771Raisaraise
Discussed on GitHub here.Achromatin
@GordThompson the SQLAlchemy dev mentions using either ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON or ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON to not lock as aggressively by turning on snapshot isolation. I'm concerned about making these changes on the database level. Do you know of a setting in SQL server to make queries on [INFORMATION_SCHEMA].[TABLES] not block when an open transaction exists that has created a new table in that database?Summertree
You could try create_engine(connection_uri, isolation_level="READ_UNCOMMITTED") and see if that prevents the hang.Achromatin
Unfortunately that did not prevent the query from hanging.Summertree
Hmm, it works for me. (Hangs without isolation_level and does not hang when I add it.) Perhaps try engine = create_engine(connection_uri), then omit the with block and call to_sql with engine as the second parameter. That works for me, too.Achromatin
That is strange, it doesn't work for me. I also just tried engine = ... as you suggested and it did not work either. The code hangs on table_names = [r[0] for r in connection.execute(s)] in get_table_names of sqlalchemy.dialects.mssql.base.py. Just to be clear: when you test this, do you have an open transaction on the same database, in which you have created a new table? That's what I have, and as soon as I roll back or commit that transaction in SQL the Python code continues running again.Summertree
@Summertree - "Just to be clear, when you test this, you have an open transaction on the same database where you have created a new table in that database in the open transaction?" - Yes, exactly.Achromatin
Darn. Still doesn't work for me. Perhaps there are some database-related settings that are causing the difference. I wonder if there is some way I can confirm the isolation_level="READ_UNCOMMITTED" option is being used in general.Summertree
SELECT transaction_isolation_level FROM sys.dm_exec_sessions WHERE session_id = @@SPID; returns the numeric value. 1 means READ_UNCOMMITTED.Achromatin
Anyone ever figure out a fix for this? We've been stuck for a couple of days because of this.Huxley
I have not...just ended up monkey patching Pandas for the time being.Summertree
Encountered this too while a separate process is creating a large columnstore index. It locks information_schema and prevents to_sql from running.Boatyard

As of pandas v2.2.0, you can override the pandas method that runs the check that causes the block/deadlock. Add this before calling to_sql:

from pandas.io.sql import SQLDatabase

# No-op replacement for the post-insert case-sensitivity check, which is
# what queries INFORMATION_SCHEMA and blocks under an open transaction.
def pass_check_case_sensitive(*args, **kwargs):
    pass

SQLDatabase.check_case_sensitive = pass_check_case_sensitive
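
If I'm reading the pandas source correctly, check_case_sensitive only emits a warning when the written table name can't be found again due to case-sensitivity mismatches, so stubbing it out loses that warning but should not affect the inserted data.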
Summertree answered 30/1 at 22:46 Comment(2)
Good workaround - this should also work in v1.3+, right? github.com/pandas-dev/pandas/blob/1.3.x/pandas/io/sql.pyBoatyard
@Boatyard Yes, should work there as well. FWIW none of the other workarounds were options for me or worked for me, which is why I posted this solution.Summertree

I'm not very familiar with SQLAlchemy, but I can tell you about the Pandas side of this issue.

Pandas automatically creates a new table if the table doesn't exist. It figures out whether the table exists by calling has_table() in SQLAlchemy, and has_table() works by querying the information schema. (At least, it works that way in MySQL and MSSQL.)
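
As a quick illustration, here is a hedged sketch of that call made directly (has_table is the documented dialect-level method; the connection URI and table name are placeholders):

import sqlalchemy

engine = sqlalchemy.create_engine(connection_uri)
with engine.connect() as conn:
    # The same existence check pandas relies on. While another session holds
    # an uncommitted CREATE TABLE, this call blocks as well.
    exists = engine.dialect.has_table(conn, "my_table", schema="dbo")
    print(exists)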

Implementation details

Here's what I found tracing the logic for this in Pandas and SQLAlchemy. We start in pandas/io/sql.py, inside to_sql().

        table = SQLTable(
            name,
            self,
            frame=frame,
            index=index,
            if_exists=if_exists,
            index_label=index_label,
            schema=schema,
            dtype=dtype,
        )
        table.create()

SQLTable.create() is defined here:

class SQLTable(PandasObject):
    [...]
    def create(self):
        if self.exists():
            if self.if_exists == "fail":
                raise ValueError(f"Table '{self.name}' already exists.")
            elif self.if_exists == "replace":
                self.pd_sql.drop_table(self.name, self.schema)
                self._execute_create()
            elif self.if_exists == "append":
                pass
            else:
                raise ValueError(f"'{self.if_exists}' is not valid for if_exists")
        else:
            self._execute_create()

Notice that it calls exists() unconditionally. Inside SQLTable.exists(), you'll find this:

    def exists(self):
        return self.pd_sql.has_table(self.name, self.schema)

This eventually calls has_table() in SQLAlchemy: https://docs.sqlalchemy.org/en/13/core/internals.html#sqlalchemy.engine.default.DefaultDialect.has_table

For MSSQL, this is implemented in sqlalchemy/dialects/mssql/base.py in SQLAlchemy:

    @_db_plus_owner
    def has_table(self, connection, tablename, dbname, owner, schema):
        if tablename.startswith("#"):  # temporary table
            [...]
        else:
            tables = ischema.tables

            s = sql.select(tables.c.table_name).where(
                sql.and_(
                    tables.c.table_type == "BASE TABLE",
                    tables.c.table_name == tablename,
                )
            )

            if owner:
                s = s.where(tables.c.table_schema == owner)

            c = connection.execute(s)

            return c.first() is not None

(ischema is an abbreviation of information_schema, and this code is running a select on that view.)

How to fix this

I don't see a good, simple way to fix this. Pandas assumes that has_table() is a cheap operation. MSSQL doesn't follow that assumption. No matter what if_exists is set to, Pandas will call has_table() during to_sql().

I can think of a hacky way of doing this, though. If you were to monkey-patch pandas.io.sql.SQLTable.create() so that it's a no-op, then you could trick Pandas into thinking that the table already exists. The downside of this is that Pandas won't automatically create tables.
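
A hedged sketch of that monkey-patch (untested; it stubs out SQLTable.create, so the target table must already exist):

import pandas.io.sql

# Stub out SQLTable.create so pandas skips the exists()/has_table() path.
# Trade-off: pandas will no longer create the table if it is missing.
def _noop_create(self, *args, **kwargs):
    pass

pandas.io.sql.SQLTable.create = _noop_create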

Raisaraise answered 31/10, 2020 at 16:5 Comment(6)
I think you are basically on the right track, but the issue occurs after the insert. When Pandas to_sql does a check on whether the table might have case sensitivity issues (look for: # check for potentially case sensitivity issues (GH7815) in to_sql) it ends up calling the function get_table_names in SQLAlchemy which is ultimately what causes the block.Summertree
one relevant (albeit closed) github issue here: github.com/pandas-dev/pandas/issues/36542Boatyard
also here: github.com/pandas-dev/pandas/issues/52601Boatyard
A simple workaround is to break the transactions into two statements: 1) a small initial insert that commits the table schema DDL and 2) an append of the remaining records. Not the prettiest of solutions but it works.Wikiup
@TonyDiFranco In principle, yes, but to_sql() doesn't give you that option. If you are doing the insert yourself then you have the freedom to do this.Raisaraise
@NickODell it doesn't explicitly give you this option, but executing df.iloc[:1, :].to_sql(if_exists='replace') and then df.iloc[1:, :].to_sql(if_exists='append') is the workaround that I mention. It will prevent holding a lock on the information schema in MS SQL Server.Wikiup

I created this issue to keep track of the problem.

This only occurs if the table name has any uppercase letters. MyTable will get stuck, while mytable will finish successfully.

The issue highlights the relevant code lines.

Is this an issue that Pandas needs to correct, or SQLAlchemy? I think Pandas.

Workaround

Option 1: Use lowercase names in the table.

Option 2: Set the transaction isolation level to READ UNCOMMITTED in the SQLAlchemy engine, as in the code below.

Option 3: Another user mentioned specifying the driver version in the connection string (see the answer further down).

https://docs.sqlalchemy.org/en/20/dialects/mssql.html#transaction-isolation-level

import sqlalchemy as sa

engine = sa.create_engine(
    "mssql+pyodbc:///?odbc_connect={}".format(params),
    fast_executemany=True,
    isolation_level="READ UNCOMMITTED",
)
Boatyard answered 31/8, 2023 at 20:28 Comment(0)

Execute SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before calling to_sql().
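
For example, a minimal sketch of this approach (reusing the engine, params, df, and my_table setup from the question):

import sqlalchemy

engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
with engine.connect() as connection:
    # Loosen the isolation level for this session so the INFORMATION_SCHEMA
    # query issued inside to_sql() does not wait on uncommitted DDL.
    connection.execute(sqlalchemy.text(
        "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED"))
    df.to_sql(name=my_table, con=connection, if_exists="append", index=False)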

Urbina answered 10/8, 2022 at 18:45 Comment(0)

Specifying the proper version of the ODBC driver in the connection string solved this for me. I don't know why it helped, but the hang is gone.

Sample code

import pandas as pd
from sqlalchemy import create_engine

db_connection_str = (
    'mssql+pyodbc://test_user:test_password@localhost/test_database'
    '?driver=ODBC+Driver+17+for+SQL+Server&TrustServerCertificate=yes'
)
engine = create_engine(db_connection_str)

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['John', 'Mary', 'Peter']})
df.to_sql("my_empty_table", engine, if_exists="append", index=False)
Maxentia answered 15/12, 2023 at 13:14 Comment(2)
What is the db_connection_str that failed?Summertree
It's the same string without the driver version.Maxentia
