SQLAlchemy hangs during insert while querying [INFORMATION_SCHEMA].[TABLES]
I have a Python process that uses SQLAlchemy to insert some data into a MS SQL Server DB. When the Python process runs, it hangs during the insert. I turned on SQLAlchemy logging to get more information and found that it hangs at this point, where SQLAlchemy seems to be requesting table schema info about the entire DB:

2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')

I have other "stuff" going on in the DB at this time, including some open transactions, and my guess is that querying [INFORMATION_SCHEMA].[TABLES] creates some deadlock or blocks somehow.

I've also read (here) that [INFORMATION_SCHEMA].[TABLES] is a view that cannot cause a deadlock, which would contradict my guess about what is causing this issue.

My question is: Can I alter the configuration/settings of SQLAlchemy so that it does not make this query in the first place?

UPDATE 1: The Python code for the insert is like this:

import sqlalchemy

# params is the URL-encoded ODBC connection string
with sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params).connect() as connection:
    # df is a Pandas DataFrame; my_table is the target table name
    df.to_sql(name=my_table, con=connection, if_exists='append', index=False)

Note that the code works without any problems when I run the Python script during other times of the day when I don't have those other DB transactions going on. In those cases, the log continues immediately like this, listing all the tables in the DB:

2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._init_metadata(result.py:810) DEBUG: Col ('TABLE_NAME',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table1',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table2',)
...

UPDATE 2: Apparently, when a table or other object is created in an open transaction and not yet committed, querying [INFORMATION_SCHEMA].[TABLES] will block (source). Is anyone familiar enough with the internals of SQLAlchemy to suggest how to prevent it from making this query in the first place?
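
To illustrate the blocking scenario from that source, here is a rough, untested sketch (the connection string, table name, and two-shell setup are assumptions for illustration):

import sqlalchemy

engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)

# Session 1 (e.g., one Python shell): create a table inside a transaction
# and leave it uncommitted.
conn1 = engine.connect()
trans = conn1.begin()
conn1.execute(sqlalchemy.text("CREATE TABLE t_uncommitted (id INT)"))

# Session 2 (a separate shell): under the default READ COMMITTED isolation
# level, this SELECT hangs until session 1 commits or rolls back.
conn2 = engine.connect()
rows = conn2.execute(sqlalchemy.text(
    "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_SCHEMA = 'dbo' AND TABLE_TYPE = 'BASE TABLE'"
)).fetchall()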

UPDATE 3: After posting this issue on the SQLAlchemy GitHub (issue link), the SQLAlchemy devs confirmed that the query of [INFORMATION_SCHEMA].[TABLES] is in fact being issued by the Pandas function to_sql().

So my new question is: does anyone know how to disable this behavior in the Pandas to_sql() function? I looked over the documentation and could not find anything that would help.

Summertree answered 30/10, 2020 at 14:27 Comment(14)
Possibly related: dba.stackexchange.com/q/164429/21390Achromatin
Related: #59636771Raisaraise
Discussed on GitHub here.Achromatin
@GordThompson the SQLAlchemy dev mentions using either ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON or ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON to not lock as aggressively by turning on snapshot isolation. I'm concerned about making these changes on the database level. Do you know of a setting in SQL server to make queries on [INFORMATION_SCHEMA].[TABLES] not block when an open transaction exists that has created a new table in that database?Summertree
You could try create_engine(connection_uri, isolation_level="READ_UNCOMMITTED") and see if that prevents the hang.Achromatin
Unfortunately that did not prevent the query from hanging.Summertree
Hmm, it works for me. (Hangs without isolation_level and does not hang when I add it.) Perhaps try engine = create_engine(connection_uri), then omit the with block and call to_sql with engine as the second parameter. That works for me, too.Achromatin
That is strange, it doesn't work for me. I also just tried engine = ... as you suggested and it did not work either. The code hangs on table_names = [r[0] for r in connection.execute(s)] in get_table_names of sqlalchemy.dialects.mssql.base.py. Just to be clear: when you test this, do you have an open transaction on the same database, in which you have created a new table? That's what I have, and as soon as I roll back or commit that transaction in SQL the Python code continues running again.Summertree
@Summertree - "Just to be clear, when you test this, you have an open transaction on the same database where you have created a new table in that database in the open transaction?" - Yes, exactly.Achromatin
Darn. Still doesn't work for me. Perhaps there are some database-related settings that are causing the difference. I wonder if there is some way I can confirm the isolation_level="READ_UNCOMMITTED" option is being used in general.Summertree
SELECT transaction_isolation_level FROM sys.dm_exec_sessions WHERE session_id = @@SPID; returns the numeric value. 1 means READ_UNCOMMITTED.Achromatin
Anyone ever figure out a fix for this? We've been stuck for a couple of days because of this.Huxley
I have not...just ended up monkey patching Pandas for the time being.Summertree
Encountered this too while a separate process is creating a large columnstore index. It locks information_schema and prevents to_sql from running.Boatyard

As of pandas v2.2.0, you can override the pandas method that runs the check that causes the block/deadlock. Add this before calling to_sql:

from pandas.io.sql import SQLDatabase

# No-op replacement for the post-insert case-sensitivity check, which is
# what queries INFORMATION_SCHEMA and blocks under an open transaction.
def pass_check_case_sensitive(*args, **kwargs):
    pass

SQLDatabase.check_case_sensitive = pass_check_case_sensitive
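
If I'm reading the pandas source correctly, check_case_sensitive only emits a warning when the written table name can't be found again due to case-sensitivity mismatches, so stubbing it out loses that warning but should not affect the inserted data.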
Summertree answered 30/1 at 22:46 Comment(2)
Good workaround - this should also work in v1.3+, right? github.com/pandas-dev/pandas/blob/1.3.x/pandas/io/sql.pyBoatyard
@Boatyard Yes, should work there as well. FWIW none of the other workarounds were options for me or worked for me, which is why I posted this solution.Summertree

I'm not very familiar with SQLAlchemy, but I can tell you about the Pandas side of this issue.

Pandas automatically creates a new table if the table doesn't exist. It figures out whether the table exists by calling has_table() in SQLAlchemy, and has_table() works by querying the information schema. (At least, it works that way in MySQL and MSSQL.)
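
As a quick illustration, here is a hedged sketch of that call made directly (has_table is the documented dialect-level method; the connection URI and table name are placeholders):

import sqlalchemy

engine = sqlalchemy.create_engine(connection_uri)
with engine.connect() as conn:
    # The same existence check pandas relies on. While another session holds
    # an uncommitted CREATE TABLE, this call blocks as well.
    exists = engine.dialect.has_table(conn, "my_table", schema="dbo")
    print(exists)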

Implementation details

Here's what I found tracing the logic for this in Pandas and SQLAlchemy. We start in pandas/io/sql.py, inside to_sql().

        table = SQLTable(
            name,
            self,
            frame=frame,
            index=index,
            if_exists=if_exists,
            index_label=index_label,
            schema=schema,
            dtype=dtype,
        )
        table.create()

SQLTable.create() is defined here:

class SQLTable(PandasObject):
    [...]
    def create(self):
        if self.exists():
            if self.if_exists == "fail":
                raise ValueError(f"Table '{self.name}' already exists.")
            elif self.if_exists == "replace":
                self.pd_sql.drop_table(self.name, self.schema)
                self._execute_create()
            elif self.if_exists == "append":
                pass
            else:
                raise ValueError(f"'{self.if_exists}' is not valid for if_exists")
        else:
            self._execute_create()

Notice that it calls exists() unconditionally. Inside SQLTable.exists(), you'll find this:

    def exists(self):
        return self.pd_sql.has_table(self.name, self.schema)

This eventually calls has_table() in SQLAlchemy: https://docs.sqlalchemy.org/en/13/core/internals.html#sqlalchemy.engine.default.DefaultDialect.has_table

For MSSQL, this is implemented in sqlalchemy/dialects/mssql/base.py in SQLAlchemy:

    @_db_plus_owner
    def has_table(self, connection, tablename, dbname, owner, schema):
        if tablename.startswith("#"):  # temporary table
            [...]
        else:
            tables = ischema.tables

            s = sql.select(tables.c.table_name).where(
                sql.and_(
                    tables.c.table_type == "BASE TABLE",
                    tables.c.table_name == tablename,
                )
            )

            if owner:
                s = s.where(tables.c.table_schema == owner)

            c = connection.execute(s)

            return c.first() is not None

(ischema is an abbreviation of information_schema, and this code is running a select on that view.)

How to fix this

I don't see a good, simple way to fix this. Pandas assumes that has_table() is a cheap operation. MSSQL doesn't follow that assumption. No matter what if_exists is set to, Pandas will call has_table() during to_sql().

I can think of a hacky way of doing this, though. If you were to monkey-patch pandas.io.sql.SQLTable.create() so that it's a no-op, then you could trick Pandas into thinking that the table already exists. The downside of this is that Pandas won't automatically create tables.
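
A hedged sketch of that monkey-patch (untested; it stubs out SQLTable.create, so the target table must already exist):

import pandas.io.sql

# Stub out SQLTable.create so pandas skips the exists()/has_table() path.
# Trade-off: pandas will no longer create the table if it is missing.
def _noop_create(self, *args, **kwargs):
    pass

pandas.io.sql.SQLTable.create = _noop_create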

Raisaraise answered 31/10, 2020 at 16:5 Comment(6)
I think you are basically on the right track, but the issue occurs after the insert. When Pandas to_sql does a check on whether the table might have case sensitivity issues (look for: # check for potentially case sensitivity issues (GH7815) in to_sql) it ends up calling the function get_table_names in SQLAlchemy which is ultimately what causes the block.Summertree
one relevant (albeit closed) github issue here: github.com/pandas-dev/pandas/issues/36542Boatyard
also here: github.com/pandas-dev/pandas/issues/52601Boatyard
A simple workaround is to break the transactions into two statements: 1) a small initial insert that commits the table schema DDL and 2) an append of the remaining records. Not the prettiest of solutions but it works.Wikiup
@TonyDiFranco In principle, yes, but to_sql() doesn't give you that option. If you are doing the insert yourself then you have the freedom to do this.Raisaraise
@NickODell it doesn't explicitly give you this option, but executing df.iloc[:1, :].to_sql(if_exists='replace') and then df.iloc[1:, :].to_sql(if_exists='append') is the workaround that I mention. It will prevent holding a lock on the information schema in MS SQL Server.Wikiup

I created this issue to keep track of the problem.

This only occurs if the table name has any uppercase letters. MyTable will get stuck, while mytable will finish successfully.

The issue highlights the relevant code lines.

Is this an issue that Pandas needs to correct, or SQLAlchemy? I think Pandas.

Workaround

Option 1: Use lowercase names in the table.

Option 2: Set the transaction isolation level to READ UNCOMMITTED in the SQLAlchemy engine, as in the code below.

Option 3: Another user mentioned specifying the driver version in the connection string (see the answer further down).

https://docs.sqlalchemy.org/en/20/dialects/mssql.html#transaction-isolation-level

import sqlalchemy as sa

engine = sa.create_engine(
    "mssql+pyodbc:///?odbc_connect={}".format(params),
    fast_executemany=True,
    isolation_level="READ UNCOMMITTED",
)
Boatyard answered 31/8, 2023 at 20:28 Comment(0)

Execute SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before calling to_sql().
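
For example, a minimal sketch of this approach (reusing the engine, params, df, and my_table setup from the question):

import sqlalchemy

engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
with engine.connect() as connection:
    # Loosen the isolation level for this session so the INFORMATION_SCHEMA
    # query issued inside to_sql() does not wait on uncommitted DDL.
    connection.execute(sqlalchemy.text(
        "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED"))
    df.to_sql(name=my_table, con=connection, if_exists="append", index=False)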

Urbina answered 10/8, 2022 at 18:45 Comment(0)

Specifying the proper version of the ODBC driver in the connection string solved this for me. I don't know why it helped, but the hang is gone.

Sample code

import pandas as pd
from sqlalchemy import create_engine

db_connection_str = (
    'mssql+pyodbc://test_user:test_password@localhost/test_database'
    '?driver=ODBC+Driver+17+for+SQL+Server&TrustServerCertificate=yes'
)
engine = create_engine(db_connection_str)

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['John', 'Mary', 'Peter']})
df.to_sql("my_empty_table", engine, if_exists="append", index=False)
Maxentia answered 15/12, 2023 at 13:14 Comment(2)
What is the db_connection_str that failed?Summertree
It's the same string without the driver version.Maxentia
