Snowflake write_pandas is not inserting dates correctly

I have a pandas dataframe named "df" that I'm creating with the result of a SQL query against a Netezza database. I'm working in a Jupyter notebook. The dataframe has two rows, and two of its columns (CREATEDDATE and STAGEDATE) contain datetime values. When I run print(df), the result looks like this:

                   ID ISDELETED            PARENTID         CREATEDBYID  \
0  017o000003tQRftAAG     false  a0no000000Hrv1IAAR  005o0000001w8wgAAA   
1  017o000003jl52cAAA     false  a0no000000GszDUAAZ  005o0000001w2pTAAQ   

          CREATEDDATE    FIELD OLDVALUE NEWVALUE  STAGEDATE  
0 2015-07-30 14:51:41  created     None     None 2016-06-06  
1 2015-07-16 14:48:37  created     None     None 2016-06-06

If I run print(df.dtypes), the result is this:

ID                     object
ISDELETED              object
PARENTID               object
CREATEDBYID            object
CREATEDDATE    datetime64[ns]
FIELD                  object
OLDVALUE               object
NEWVALUE               object
STAGEDATE      datetime64[ns]
dtype: object

So, as far as I can tell, my datetime columns are correctly formatted to write to Snowflake using write_pandas(). However, after I do that, the dates are wildly different in Snowflake:

[screenshot of the Snowflake table showing the shifted dates]

The STAGEDATE value of 2016-06-06, for example, is now 48399-06-06. Does anyone know how to fix this? I'm pulling from Netezza using pyodbc and populating the dataframe with df = cs.execute(). I'm importing and using snowflake.connector for my connection to Snowflake, and I'm running the following to get write_pandas:

from snowflake.connector.pandas_tools import write_pandas
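
For reference, the end-to-end flow looks roughly like this (the DSN, credentials, query, and table name below are placeholders, and pd.read_sql stands in for my actual cursor handling):

import pandas as pd
import pyodbc
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Pull from Netezza (DSN and query are placeholders).
nz = pyodbc.connect('DSN=NETEZZA_DSN')
df = pd.read_sql('SELECT * FROM ACCT_HST_SRC', nz)

# Write to Snowflake (connection details are placeholders).
conn = snowflake.connector.connect(
    account='my_account', user='my_user', password='my_password',
    warehouse='my_wh', database='my_db', schema='my_schema',
)
success, nchunks, nrows, _ = write_pandas(conn, df, 'ACCT_HST')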
Demodulator answered 16/3, 2021 at 22:34 Comment(4)
Can you share the structure of the destination table in Snowflake? In particular, the data types of the two date columns, CREATEDDATE and STAGEDATE. Also, it might be worth checking the TIMESTAMP_OUTPUT_FORMAT parameter (see the SHOW PARAMETERS command: docs.snowflake.com/en/sql-reference/sql/show-parameters.html)Birdlime
create or replace TRANSIENT TABLE ACCT_HST ( ID VARCHAR(18), ISDELETED VARCHAR(5), ACCOUNTID VARCHAR(18), CREATEDBYID VARCHAR(18), CREATEDDATE TIMESTAMP_NTZ(9), FIELD VARCHAR(255), OLDVALUE VARCHAR(255), NEWVALUE VARCHAR(255), STAGEDATE TIMESTAMP_NTZ(9) );Demodulator
My hunch is that your session/account setting for TIMESTAMP_OUTPUT_FORMAT is responsible. You could test that by converting your dates to UTC with CONVERT_TIMEZONE('UTC', CREATEDDATE), or by checking the TIMESTAMP_OUTPUT_FORMAT and TIMESTAMP_NTZ_OUTPUT_FORMAT parameters by running SHOW PARAMETERS (a Python version of this check is sketched after these comments)Birdlime
@FrancescoQuaratino, Snowflake is the destination. The source is a Netezza appliance (IBM Pure Data). Datetimes loaded into Snowflake through other methods are fine. It's only the ones I'm loading from my Python program that have this issue.Demodulator
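
A quick way to run the parameter check Birdlime suggests from Python (a sketch; conn is assumed to be an already-open snowflake.connector connection):

cur = conn.cursor()
# Each SHOW PARAMETERS result row starts with (key, value, ...).
for row in cur.execute("SHOW PARAMETERS LIKE 'TIMESTAMP%OUTPUT_FORMAT'"):
    print(row[0], '=', row[1])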

I found a solution with the code that pday wrote here. This function adds a timezone to each datetime column (the default in the function is UTC).

def fix_date_cols(df, tz='UTC'):
    # Localize every naive datetime64[ns] column to the given timezone, in place.
    cols = df.select_dtypes(include=['datetime64[ns]']).columns
    for col in cols:
        df[col] = df[col].dt.tz_localize(tz)

So I would suggest applying this function before passing the dataframe to write_pandas(), as in the sketch below.
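
For example, a minimal sketch (conn is assumed to be an open snowflake.connector connection, and the table name is a placeholder):

from snowflake.connector.pandas_tools import write_pandas

fix_date_cols(df)                   # datetime64[ns] columns become datetime64[ns, UTC]
write_pandas(conn, df, 'ACCT_HST')  # placeholder table name; dates now load intact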

From what I understand, the issue is that the datetime values are misinterpreted because they are underspecified. Adding timezone information forces them to be interpreted unambiguously as datetimes.

But I really recommend reading this conversation, where the issue is explained well; it really helped me.

Teteak answered 24/1, 2022 at 13:16 Comment(0)

I'm hoping this will help others that are in a similar position to mine.

My dataframe column had a data type of date, and I kept getting this error: Python Date Upload Error - Data Type Conversion

Eventually I found the solution was to convert the date to an ISO 8601 datetime string format.

dataDF["DATE_COL"] = dataDF["DATE_COL"].astype('datetime64')
dataDF["DATE_COL"] = dataDF["DATE_COL"].dt.strftime('%Y-%m-%d') #Note full format can be strftime('%Y-%m-%dT%H:%M:%SZ')

Hope this helps others as I spent AGES trying to figure this out!

Constructionist answered 15/8, 2023 at 9:15 Comment(2)
This helps me a lot. ThanksRaby
It helped me, too! Basically, if the Snowflake column's data type is 'date' and the pandas data is a string such as '2024-03-26', then write_pandas knows how to convert it.Landlord

The fix_date_cols(df, tz='UTC') approach of localizing the datetime columns to UTC did not work for me.

However, write_pandas() now issues a UserWarning suggesting the use_logical_type=True parameter when your dataframe has datetime columns. And that worked!

write_pandas(conn=conn, df=df_clean, table_name='VENTITY_STG', use_logical_type=True)
Curson answered 26/3 at 21:28 Comment(0)
Consider how Snowflake parses the same timestamp at different epoch scales:

SELECT '2016-06-06'::timestamp as a
    ,to_timestamp_ntz(date_part('epoch_second', a),0)::date as a_s
    ,to_timestamp_ntz(date_part('epoch_millisecond', a),0)::date as a_ms
    ,to_timestamp_ntz(date_part('epoch_millisecond', a),3)::date as b_ms;

gives

A                        A_S          A_MS         B_MS
2016-06-06 00:00:00.000  2016-06-06   48399-06-06  2016-06-06

which is to say, your date, turned into epoch milliseconds and parsed as seconds, gives the date you're seeing.

So you either throw away your milliseconds or alter how your dates are getting parsed. The sketch below shows the scale of the mismatch in plain Python.
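
You can reproduce that magnitude with simple arithmetic (a back-of-the-envelope sketch, not the connector's actual code path):

from datetime import datetime, timezone

d = datetime(2016, 6, 6, tzinfo=timezone.utc)
epoch_ms = int(d.timestamp() * 1000)        # 1465171200000

# Interpret those milliseconds as if they were seconds since 1970:
avg_secs_per_year = 365.2425 * 24 * 3600    # Gregorian average year, in seconds
print(1970 + epoch_ms / avg_secs_per_year)  # ~48399, the year from the question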

Dercy answered 17/3, 2021 at 2:32 Comment(1)
According to the Snowflake documentation for the write_pandas function, the Snowflake to Pandas data mapping of TIMESTAMP_NTZ, TIMESTAMP_LTZ, and TIMESTAMP_TZ in Snowflake is to pandas.Timestamp(np.datetime64[ns]). My datetime values are already datetime64[ns], so I thought they would work, but I'll see if an implicit cast in Python corrects the issue.Demodulator

The solution from @Lorenzo Vitali works well, and I've added it to my Snowflake helper class. You'll want to remember to return df. While this makes a “partial copy”, it fixes the above issue.

def fix_date_cols(df, tz='UTC'):
    # Same helper as above, but returning df lets you chain or reassign it.
    cols = df.select_dtypes(include=['datetime64[ns]']).columns
    for col in cols:
        df[col] = df[col].dt.tz_localize(tz)
    return df
Blast answered 5/1, 2023 at 14:52 Comment(2)
Please, do not copy other answers (especially with indentation error, which, in Python, is fatal).Safety
Please don't add "thank you" as an answer. Once you have sufficient reputation, you will be able to vote up questions and answers that you found helpful. - From ReviewSafety

The solution presented by Lorenzo sufficed for local development but failed once I containerized the script. I assumed a missing Python library was to blame, but found a better solution offered by Snowflake.

The problem was fixed by upgrading to snowflake-connector-python==3.4.0 and passing the use_logical_type=True parameter to write_pandas, which inserts the timezone-naive (NTZ) pandas datetime columns correctly. This maps to the corresponding file format parameter of the underlying COPY INTO for Parquet that the connector uses.

success, nchunks, nrows, output = write_pandas(conn, df,
                                               table_name=df.name,  # assumes df.name was set earlier; DataFrames have no .name by default
                                               auto_create_table=True,
                                               overwrite=True,
                                               chunk_size=150000,
                                               compression='gzip',
                                               on_error="ABORT_STATEMENT",
                                               parallel=4,
                                               quote_identifiers=True,
                                               use_logical_type=True)
Tref answered 26/4 at 23:20 Comment(0)
