to_date fails to parse date in Spark 3.0

I am trying to parse a date using to_date(), but I get the following exception.

SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '12/1/2010 8:26' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

The exception suggests I should use the legacy time parser; for starters, I don't know how to set it to LEGACY.

Here is my implementation:

dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy"))

My dates are in the following format:

+--------------+
|   InvoiceDate|
+--------------+
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
+--------------+
Pergolesi answered 16/7, 2020 at 21:44 Comment(0)
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy")).show()


+--------------+----------+
|   InvoiceDate|      date|
+--------------+----------+
|12/1/2010 8:26|2010-12-01|
+--------------+----------+

# in the above code, spark refers to the SparkSession
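
If you prefer not to issue a SQL statement, the same option can, as far as I know, also be set through the runtime config; a minimal equivalent sketch:

from pyspark.sql.functions import col, to_date

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy")).show()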
Vesicatory answered 16/7, 2020 at 22:0 Comment(2)
Can the same date be parsed without using the legacy parser policy? – Dhumma
Going back to LEGACY is not a solution at all. Why so many upvotes, I do not know. – Unsound

You can keep using the new Spark 3 implementation by parsing the string into a timestamp first, then casting it to a date:

from pyspark.sql import functions as F

dfWithDate = df.withColumn("date", F.to_date(F.to_timestamp(F.col("InvoiceDate"), "M/d/yyyy H:mm")))

dfWithDate.show()
#+--------------+----------+
#|   InvoiceDate|      date|
#+--------------+----------+
#| 2/1/2010 8:26|2010-02-01|
#| 2/1/2010 8:26|2010-02-01|
#| 2/1/2010 8:26|2010-02-01|
#| 2/1/2010 8:26|2010-02-01|
#|12/1/2010 8:26|2010-12-01|
#+--------------+----------+
Huston answered 17/3, 2021 at 15:13 Comment(0)

According to this, in Spark 3 you should use the pattern "M/d/y". It works for me.
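
A minimal sketch of that pattern on date-only strings (the toy frame below is made up; spark is the SparkSession, and with the Spark 3 parser M, d and y each accept a variable number of digits):

from pyspark.sql import functions as F

df = spark.createDataFrame([("12/1/2010",), ("2/1/2010",)], ["InvoiceDate"])
df.withColumn("date", F.to_date(F.col("InvoiceDate"), "M/d/y")).show()
# 12/1/2010 -> 2010-12-01, 2/1/2010 -> 2010-02-01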

Deficit answered 20/9, 2021 at 11:9 Comment(1)
It worked. You just highlighted the important thing. Excellent – Unsound

In case you want to keep using the Spark 3.0 parser (not the legacy version of time conversion), you can just use a single d in "MM/d/yyyy":

dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy"))
Vociferance answered 18/2, 2021 at 8:58 Comment(0)

Instead of using the legacy parser, you could also update the date format from MM/dd/yyyy to MM-dd-yyyy.

This is not a solution, as it returns NULL values.

Moia answered 14/2, 2021 at 11:54 Comment(0)

One can use:

from pyspark.sql.functions import to_date, unix_timestamp
data = data.withColumn("Date", to_date(unix_timestamp("InvoiceDate", "MM/dd/yyyy").cast("timestamp")))

#result
#+--------------+----------+
#|   InvoiceDate|      Date|
#+--------------+----------+
#|12/1/2010 8:26|2010-12-01|
#+--------------+----------+

This works for me.

Blitzkrieg answered 7/2, 2022 at 6:43 Comment(0)

Instead of using the legacy parser, you could also update the date format from MM/dd/yyyy to M-d-yy. It worked for me.

Spirogyra answered 28/12, 2023 at 5:0 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Germanic
