How to properly insert pandas NaT datetime values into my PostgreSQL table

I am trying to bulk insert a dataframe into my Postgres DB. Some columns in my dataframe are datetime types with NaT as a null value, which is not supported by PostgreSQL. I've tried to replace NaT (using pandas) with other NULL-type identifiers, but that did not work during my inserts.

I used df = df.where(pd.notnull(df), 'None') to replace all the NaTs. Here is an example of the errors that keep coming up due to the datatype issues:

Error: invalid input syntax for type date: "None"
LINE 1: ...0,1.68757,'2022-11-30T00:29:59.679000'::timestamp,'None','20...

My driver and insert statement for the PostgreSQL DB:

import psycopg2
import psycopg2.extras as extras

def execute_values(conn, df, table):
    """
    Using psycopg2.extras.execute_values() to insert the dataframe
    """
    # Create a list of tuples from the dataframe values
    tuples = [tuple(x) for x in df.to_numpy()]
    # Comma-separated dataframe columns
    cols = ','.join(list(df.columns))
    # SQL query to execute
    query = "INSERT INTO %s(%s) VALUES %%s" % (table, cols)
    cursor = conn.cursor()
    try:
        extras.execute_values(cursor, query, tuples)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    print("execute_values() done")
    cursor.close()

Info about my dataframe: in this case the culprits are the datetime columns only.


How is this commonly solved?

Raddled answered 19/10, 2020 at 21:47 Comment(4)
Not that it solves your exact problem, but I would suggest writing the dataframe to CSV, then using copy_expert to get the data into the table rather than INSERT #46715854. The NaTs in the dataframe should be written to the CSV as empty, so there should be no problem inserting into the table. – Astragal
@macaw_9227, I would like to use my INSERT statement because my pk auto increments by default. – Raddled
That isn't a problem, you can still copy, just don't include the pk in the CSV you are copying from (see the sketch after these comments). – Astragal
How can I tweak my COPY statement to fit the table when PostgreSQL is expecting an int for the pkey column? I can remove it from pandas, but the COPY statement expects the same table shape; it will be expecting an integer in place of my primary key. – Raddled
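
To illustrate the COPY approach discussed in the comments above, here is a minimal sketch. It assumes psycopg2, a hypothetical table named my_table with columns txt and dt, and hypothetical connection details; the pk column is simply left out of the COPY column list so the database fills it in. NaT values are written to the CSV as empty fields, which COPY reads as NULL.

import io

import pandas as pd
import psycopg2

# Hypothetical connection details; adjust for your database
conn = psycopg2.connect("dbname=test user=scott password=tiger host=localhost")

df = pd.DataFrame({
    "txt": ["foo", "bar"],
    "dt": [pd.Timestamp("2020-01-01"), pd.NaT],
})

# Write the dataframe to an in-memory CSV; NaT/NaN become empty fields
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# List only the columns present in the dataframe, so the pk keeps auto-incrementing
copy_sql = "COPY my_table (txt, dt) FROM STDIN WITH (FORMAT CSV)"
with conn.cursor() as cur:
    cur.copy_expert(copy_sql, buf)
conn.commit()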

You're re-inventing the wheel. Just use pandas' to_sql method and it will

  • match up the column names, and
  • take care of the NaT values.

Use method="multi" to give you the same effect as psycopg2's execute_values.

from pprint import pprint

import pandas as pd
import sqlalchemy as sa

table_name = "so64435497"
engine = sa.create_engine("postgresql://scott:[email protected]/test")
with engine.begin() as conn:
    # set up test environment
    conn.exec_driver_sql(f"DROP TABLE IF EXISTS {table_name}")
    conn.exec_driver_sql(
        f"CREATE TABLE {table_name} ("
        "id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY, "
        "txt varchar(50), "
        "txt2 varchar(50), "
        "dt timestamp)"
    )
    df = pd.read_csv(r"C:\Users\Gord\Desktop\so64435497.csv")
    df["dt"] = pd.to_datetime(df["dt"])
    print(df)
    """console output:
                       dt  txt2  txt
    0 2020-01-01 00:00:00  foo2  foo
    1                 NaT  bar2  bar
    2 2020-01-02 03:04:05  baz2  baz
    """

    # run test
    df.to_sql(
        table_name, conn, index=False, if_exists="append", method="multi"
    )
    pprint(
        conn.exec_driver_sql(
            f"SELECT id, txt, txt2, dt FROM {table_name}"
        ).all()
    )
    """console output:
    [(1, 'foo', 'foo2', datetime.datetime(2020, 1, 1, 0, 0)),
     (2, 'baz', 'baz2', None),
     (3, 'bar', 'bar2', datetime.datetime(2020, 1, 2, 3, 4, 5))]
    """
Abidjan answered 20/10, 2020 at 15:52 Comment(4)
This is exactly the type of answer I was looking for :). Thank you for sending this my way. This is the first suggestion that feels right (after searching for answers on StackOverflow). Using df.to_sql solved this issue. – Raddled
Adding another comment because of something I noticed with df.to_sql: I don't think this command is meant to be used on a frequent basis for inserts, since it allows for duplicates (see the sketch after these comments). – Raddled
Suppose you can't use to_sql for some reason... how does pandas internally convert NaT to psycopg2-valid types? – Bother
@Bother see my answer here. – Giddy
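
Regarding the duplicates concern in the comments above: to_sql itself just appends, but its method parameter also accepts a callable, so you can route the rows through an INSERT ... ON CONFLICT DO NOTHING statement. A minimal sketch, reusing df, conn, and table_name from the example above and assuming a hypothetical unique constraint on the txt column:

from sqlalchemy.dialects.postgresql import insert

def insert_do_nothing(table, conn, keys, data_iter):
    # pandas passes its SQLTable wrapper; table.table is the underlying SQLAlchemy Table
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = (
        insert(table.table)
        .values(rows)
        .on_conflict_do_nothing(index_elements=["txt"])  # hypothetical unique column
    )
    result = conn.execute(stmt)
    return result.rowcount

# Rows that would violate the (assumed) unique constraint are silently skipped
df.to_sql(table_name, conn, index=False, if_exists="append", method=insert_do_nothing)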

Regarding your original update statement:
df = df.where(pd.notnull(df), 'None')

What's happening here is you are replacing the values with the STRING 'None' and not the special Python object None. Then in the following insert statement, it tries to insert the string 'None' into a timestamp field and throws an error.

What's funny is that the version of this you'd expect to work:
df = df.where(pd.notnull(df), None)
does not actually seem to work as expected for NaT values, presumably because pandas coerces None back to NaT as long as the column keeps its datetime64 dtype. (See example below.)

But what DOES seem to work is this statement (assuming you have numpy imported as np):
df = df.replace({np.nan: None})
So if you do THAT, then the NaN and NaT values all convert to Python None and then psycopg2 (or probably any other db connector) will correctly treat those values as SQL NULLs on inserts.

Here's some example code to illustrate:

import datetime as dt
import pandas as pd
import numpy as np
data = [
    ['one', 1.0, pd.NaT],
    ['two', np.nan, dt.datetime(2019, 2, 2)],
    [None, 3.0, dt.datetime(2019, 3, 3)]
    ]
df = pd.DataFrame(data, columns=["Name", "Value", "Event_date"])

Got our basic dataframe:

>>> df
   Name  Value Event_date
0   one    1.0        NaT
1   two    NaN 2019-02-02
2  None    3.0 2019-03-03

As mentioned above, this update leaves the NaT in it for some reason:

>>> df.where(pd.notnull(df), None)
   Name Value Event_date
0   one   1.0        NaT
1   two  None 2019-02-02
2  None   3.0 2019-03-03

But this version gets both the NaNs and NaTs and leaves the expected Nones:

>>> df.replace({np.nan: None})
   Name Value           Event_date
0   one   1.0                 None
1   two  None  2019-02-02 00:00:00
2  None   3.0  2019-03-03 00:00:00
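
To tie this back to the question: a minimal sketch of the full round trip, assuming an open psycopg2 connection conn, the execute_values helper from the question, and a hypothetical target table my_table.

import numpy as np

# Convert NaN/NaT to Python None so psycopg2 sends SQL NULLs, then bulk insert
df_clean = df.replace({np.nan: None})
execute_values(conn, df_clean, "my_table")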

The accepted answer is probably the "better" way if you can use sqlalchemy for what you want to do, but if you gotta do it the hard way, this worked for me.

H/T to the discussion in this pandas issue for much of the details of this answer.

Moy answered 29/1, 2021 at 16:20 Comment(1)
This is actually great, thank you for the input. This seems like a more reliable way to solve my original problem. – Raddled

If you can't use pandas' to_sql method, you can register an adapter with psycopg2 instead:

import pandas as pd
from psycopg2.extensions import register_adapter, AsIs

# Register adapter for pandas NA type (e.g. null datetime or integer values)
# NOTE: register_adapter() expects a class, so pass the NAType class rather than the pd.NA singleton
register_adapter(pd._libs.missing.NAType, lambda i: AsIs('NULL'))

When you then call any of psycopg's execute methods, it will automatically convert any pd.NA values into PostgreSQL NULL values.

Note that the same principle can also be used for numpy NaN values:

import numpy as np
from psycopg2.extensions import register_adapter, AsIs, Float

# Register adapter for np.nan
register_adapter(float, lambda f: AsIs('NULL') if np.isnan(f) else Float(f))
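
The same trick should also cover the NaT values from the original question. A minimal sketch, assuming that type(pd.NaT) resolves to pandas' NaTType, which is the class psycopg2 looks up when adapting the value:

import pandas as pd
from psycopg2.extensions import register_adapter, AsIs

# Register adapter for pandas NaT so null datetimes are sent as SQL NULL
register_adapter(type(pd.NaT), lambda _: AsIs('NULL'))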
Giddy answered 13/4, 2022 at 8:43 Comment(0)
