I have some Spark (Scala) DataFrames/tables with timestamps that come from our DWH and that sometimes contain high-watermark values.
I want to work with this data in Python with pandas, so I write the tables as Parquet files from Spark and read them again with pandas.
The problem is that pandas/pyarrow cannot handle these timestamps: they get converted to datetime64[ns], which can only hold a limited date range, so some timestamps (in particular all high watermarks) end up with wrong values.
How can I force pandas to interpret the timestamps as, for example, datetime64[us], or to set the high (and low) watermarks to NaN/NaT instead of silently using wrongly converted values?
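For the NaN option, the behaviour I have in mind is roughly what pd.to_datetime with errors="coerce" does for strings, where out-of-range values become NaT (just a sketch of the desired result; it does not help once the timestamp column has already been converted wrongly):

import pandas as pd

# Desired behaviour, sketched on strings: values outside the
# datetime64[ns] range become NaT instead of silently wrong dates.
s = pd.Series(["1050-01-01 23:00:01", "2014-11-30 12:40:32", "2500-01-01 00:00:00"])
print(pd.to_datetime(s, errors="coerce"))  # NaT, 2014-11-30 12:40:32, NaT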
Here is a minimal code example:
Spark:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val df_spark = Seq(
(1, "1050-01-01 23:00:01"),
(2, "2014-11-30 12:40:32"),
(3, "2016-12-29 09:54:00"),
(4, "2500-01-01 00:00:00")).toDF("id", "date-string")
.withColumn("valid_to", to_timestamp(col("date-string")))
df_spark.repartition(1).write.parquet("timestamptest")
df_spark.show
+---+-------------------+-------------------+
| id| date-string| valid_to|
+---+-------------------+-------------------+
| 1|1050-01-01 23:00:01|1050-01-01 23:00:01|
| 2|2014-11-30 12:40:32|2014-11-30 12:40:32|
| 3|2016-12-29 09:54:00|2016-12-29 09:54:00|
| 4|2500-01-01 00:00:00|2500-01-01 00:00:00|
+---+-------------------+-------------------+
When read in Python:
import pandas as pd
df_pandas = pd.read_parquet("timestamptest")
df_pandas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   id           4 non-null      int32
 1   date-string  4 non-null      object
 2   valid_to     4 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int32(1), object(1)
memory usage: 208.0+ bytes
df_pandas
id date-string valid_to
0 1 1050-01-01 23:00:01 2219-02-16 22:09:08.419103232
1 2 2014-11-30 12:40:32 2014-11-30 12:40:32.000000000
2 3 2016-12-29 09:54:00 2016-12-29 09:54:00.000000000
3 4 2500-01-01 00:00:00 1915-06-14 00:25:26.290448384
You can see that the first and the last timestamp are now wrong. If I read the Parquet file again with Spark, everything is fine.
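One direction that might work around this (I have not verified it across pyarrow versions) is to skip pd.read_parquet and read the file with pyarrow directly; coerce_int96_timestamp_unit and timestamp_as_object are pyarrow options, and the sketch assumes Spark wrote its default INT96 timestamps:

import pyarrow.parquet as pq

# Read the file with pyarrow directly instead of pd.read_parquet.
# coerce_int96_timestamp_unit: interpret Spark's INT96 timestamps at
#   microsecond resolution, so the 1050/2500 dates do not overflow int64.
# timestamp_as_object: hand the values to pandas as datetime.datetime
#   objects (dtype 'object') instead of forcing datetime64[ns].
table = pq.read_table("timestamptest", coerce_int96_timestamp_unit="us")
df_obj = table.to_pandas(timestamp_as_object=True)
print(df_obj.dtypes)
print(df_obj)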
Comments:
This is due to the limited range of pd.Timestamp (Timestamp('1677-09-21 00:12:43.145225') to Timestamp('2262-04-11 23:47:16.854775807')); you might want to have a look at pd.Period. – Aggappera
The conversion happens in pd.read_parquet, not in pandas directly (e.g. pd.to_datetime would raise an OutOfBoundsDatetime error). It looks like the out-of-bounds dates silently cause an overflow somewhere, and whatever value remains is converted to pandas' datetime. – Aggappera
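To make the range from the first comment concrete, the datetime64[ns] bounds and the error pandas itself raises for values outside them can be checked directly (a quick sketch; exact messages differ between pandas versions):

import pandas as pd
from pandas.errors import OutOfBoundsDatetime

# The representable range of pandas' nanosecond timestamps:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# pandas itself raises instead of silently overflowing:
try:
    pd.to_datetime("2500-01-01 00:00:00")
except OutOfBoundsDatetime as exc:
    print("OutOfBoundsDatetime:", exc)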