How to fix timestamp interpretation of Parquet files in Python pandas
I have some Spark (Scala) DataFrames/tables with timestamps coming from our DWH, and these sometimes contain high-watermark values (e.g. 2500-01-01).

I want to work with this data in Python with pandas, so I write the tables out as Parquet files from Spark and read them again with pandas. The problem is that pandas/pyarrow cannot deal with these timestamps: they get converted to datetime64[ns], which can only hold a limited date range. Some timestamps (in particular all high watermarks) therefore end up with wrong values.
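For reference, the bounds of the nanosecond type can be checked directly in pandas; everything outside roughly 1677 to 2262 cannot be represented:

import pandas as pd

# datetime64[ns] only covers roughly the years 1677 to 2262
print(pd.Timestamp.min)
print(pd.Timestamp.max)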

How can I force pandas to interpret the timestamps as, for example, datetime64[us]? Or how can I set the high (and low) watermarks to NaT instead of ending up with wrongly converted values?

Here is a minimum code example:

Spark:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._  // needed for .toDF on a local Seq (already in scope in spark-shell)


val df_spark = Seq(
  (1, "1050-01-01 23:00:01"), 
  (2, "2014-11-30 12:40:32"), 
  (3, "2016-12-29 09:54:00"), 
  (4, "2500-01-01 00:00:00")).toDF("id", "date-string")
  .withColumn("valid_to", to_timestamp(col("date-string")))

df_spark.repartition(1).write.parquet("timestamptest")
df_spark.show
+---+-------------------+-------------------+
| id|        date-string|           valid_to|
+---+-------------------+-------------------+
|  1|1050-01-01 23:00:01|1050-01-01 23:00:01|
|  2|2014-11-30 12:40:32|2014-11-30 12:40:32|
|  3|2016-12-29 09:54:00|2016-12-29 09:54:00|
|  4|2500-01-01 00:00:00|2500-01-01 00:00:00|
+---+-------------------+-------------------+

When read back in Python:

import pandas as pd
df_pandas = pd.read_parquet("timestamptest")
df_pandas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           4 non-null      int32         
 1   date-string  4 non-null      object        
 2   valid_to     4 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int32(1), object(1)
memory usage: 208.0+ bytes


df_pandas

   id          date-string                      valid_to
0   1  1050-01-01 23:00:01 2219-02-16 22:09:08.419103232
1   2  2014-11-30 12:40:32 2014-11-30 12:40:32.000000000
2   3  2016-12-29 09:54:00 2016-12-29 09:54:00.000000000
3   4  2500-01-01 00:00:00 1915-06-14 00:25:26.290448384

You can see that the first and the last timestamps are now wrong. If I read the Parquet file again with Spark, everything is fine.
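One direction I have been experimenting with (I am not sure it is the intended way) is to skip pd.read_parquet and load the file with pyarrow directly; the timestamp_as_object flag of Table.to_pandas appears to keep the out-of-range values as plain Python datetime objects instead of overflowing, at the cost of an object column:

import pyarrow.parquet as pq

table = pq.read_table("timestamptest")
# keep timestamps as plain datetime.datetime objects instead of forcing
# them into datetime64[ns]; the "valid_to" column dtype becomes "object"
df_objects = table.to_pandas(timestamp_as_object=True)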

Augmenter asked on 9/10, 2020 at 8:55 (5 comments)
both "wrong" timestamps exceed the min/max of pd.Timestamp (Timestamp('1677-09-21 00:12:43.145225')/Timestamp('2262-04-11 23:47:16.854775807')). you might want to have a look at pd.Period.Aggappera
Yes this I figured out. my problem is when i read the parquet file, the values are directly converted to dateTimes and all values out of range get wrong values. It would be already helpful, if there these values out of range get set to NaT instead. I just don't know how to stop pandas from instantly (wrongly) converting the values.Augmenter
to me this seems to be related to the parquet library used by pd.read_parquet, not pandas directly (e.g. pd.to_datetime would raise OutOfBoundsDatetime error). Looks like the "out-of-bouds-dates" silently cause overflow somewhere, and whatever value remains is converted to pandas' datetime.Aggappera
convert your timestamp to date before exporting and you should be able to read in with pandas, otherwise, keep your timestamps within the bounds that @Mr.Fuppes mentioned.Stillman
Or cleanup the high watermarks in spark prior to writing the parquet file.Lowestoft
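Following up on the comments, here is a minimal sketch of a pandas-side cleanup, assuming the out-of-range watermarks can simply be treated as missing (NaT); it builds on the pyarrow read shown in the question:

import pandas as pd
import pyarrow.parquet as pq

# read the timestamps as plain Python objects first (no silent overflow),
# then coerce everything outside the datetime64[ns] range to NaT
df = pq.read_table("timestamptest").to_pandas(timestamp_as_object=True)
df["valid_to"] = pd.to_datetime(df["valid_to"], errors="coerce")
# "valid_to" is datetime64[ns] again, with the watermark rows set to NaT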
