PySpark dataframe convert unusual string format to Timestamp
I am using PySpark through Spark 1.5.0. I have an unusual string format for the datetime values in one column. It looks like this:

[Row(datetime='2016_08_21 11_31_08')]

Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp? Something along the lines of

df = df.withColumn("date_time",df.datetime.astype('Timestamp'))

I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace _ with - in the date half and _ with : in the time part.

I was thinking I could split the column in two using substring, counting backward from the end of the string, then run regexp_replace separately on each half and concatenate. But that seems like too many operations. Is there an easier way?
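For what it's worth, the two-substitution idea can be sketched outside Spark in plain Python, just to illustrate the string manipulation (the sample value is from the question; splitting on the single space is an assumption about the data):

```python
from datetime import datetime

# Sketch of the "replace _ differently in each half" idea, assuming
# the date and time halves are separated by exactly one space.
raw = "2016_08_21 11_31_08"
date_part, time_part = raw.split(" ")
standard = date_part.replace("_", "-") + " " + time_part.replace("_", ":")
# standard is now "2016-08-21 11:31:08", which parses with a stock format
parsed = datetime.strptime(standard, "%Y-%m-%d %H:%M:%S")
```

In Spark this would still mean chaining several column expressions, which is why the unix_timestamp/to_timestamp approaches in the answers avoid the substitution entirely by parsing the underscores directly.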

Wolcott answered 22/8, 2016 at 20:47 Comment(0)
Spark >= 2.2

from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+

Spark < 2.2

It is nothing that unix_timestamp cannot handle:

from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
    # For Spark <= 1.5
    # See issues.apache.org/jira/browse/SPARK-11724 
    .cast("double")
    .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+

In both cases the format string should be compatible with Java SimpleDateFormat.
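As a side note on the pattern letters: in SimpleDateFormat, MM is month while mm is minutes, and HH is the 24-hour clock. A quick way to sanity-check a pattern is to parse the same sample locally with Python's strptime equivalent (a plain-Python sketch, not Spark code):

```python
from datetime import datetime

# "yyyy_MM_dd HH_mm_ss" corresponds to the strptime directives below
# (%m = month, %M = minutes, %H = 24-hour clock).
dt = datetime.strptime("2016_08_21 11_31_08", "%Y_%m_%d %H_%M_%S")
# dt == datetime(2016, 8, 21, 11, 31, 8)
```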

Evelyn answered 22/8, 2016 at 21:35 Comment(2)
Great. You can check issues.apache.org/jira/browse/SPARK-11724 for some details. – Evelyn
Can I use the unix_timestamp function if the timestamp string has timezones? That is, the pattern is yyyy_MM_dd HH_mm_ss z, like 1995_05_20 20_30_11 -400? – Inviolate
zero323's answer addresses the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly into timestamp type:

from pyspark.sql.functions import col

df.withColumn('datetime', col('datetime_str').cast('timestamp'))

It has the advantage of handling milliseconds, while unix_timestamp has only second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2, as zero323 stated). I tested it on Spark 2.3.0 using the format '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
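The precision point can be illustrated outside Spark with a plain-Python sketch (%f covers fractional seconds up to microseconds):

```python
from datetime import datetime

# The same millisecond-bearing sample as above, parsed locally;
# the .979 fraction survives as 979000 microseconds.
ts = datetime.strptime("2016-07-13 14:33:53.979", "%Y-%m-%d %H:%M:%S.%f")
# ts.microsecond == 979000
```

A second-precision parse of the same string would simply drop the .979, which is the loss the cast-to-timestamp approach avoids.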

Archine answered 17/9, 2018 at 14:18 Comment(0)
I am adding a few more lines to Florent F's answer for better understanding, and so the snippet can be run on a local machine:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()

# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key":"a", "date":"2016-02-01"}, 
    {"key":"b", "date":"2016-02-02"}]).toDF()
df1.show()

df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()
Countable answered 30/3, 2021 at 3:48 Comment(0)
Just want to add more resources and an example to this discussion. The supported patterns are documented at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html. For example, if your timestamp string is "22 Dec 2022 19:06:36 EST", then the format is "dd MMM yyyy HH:mm:ss zzz".
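Abbreviated zone names like EST can be ambiguous, so numeric offsets are often safer. As a local illustration in plain Python (not Spark; the numeric-offset form of the same wall time is my own example, since the pattern page covers several zone styles):

```python
from datetime import datetime, timedelta

# The same wall time with a numeric UTC offset instead of the "EST"
# abbreviation; %z parses "-0500" into an aware datetime.
dt = datetime.strptime("22 Dec 2022 19:06:36 -0500",
                       "%d %b %Y %H:%M:%S %z")
# dt.utcoffset() == timedelta(hours=-5)
```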

Stamin answered 11/1, 2023 at 23:46 Comment(0)
Spark 2.4

to_timestamp() was not able to handle datetime strings affected by daylight saving time, whereas unix_timestamp() handled them correctly. I was using it together with from_utc_timestamp() for timezone conversion.

Passing the datetime string directly to from_utc_timestamp() gave a result that was off by one hour; using from_utc_timestamp(unix_timestamp(...).cast(TimestampType()), ...) gave the correct conversion.
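The daylight-saving effect itself is easy to reproduce outside Spark. A minimal sketch using the standard-library zoneinfo module (Python >= 3.9 assumed; the zone name and dates are illustrative, not from the answer above):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The same UTC wall time converted in July (EDT, UTC-4) vs January
# (EST, UTC-5): the local result differs by an hour, which is exactly
# the kind of shift a fixed-offset conversion gets wrong.
summer = datetime(2023, 7, 1, 12, 0, tzinfo=ZoneInfo("UTC"))
winter = datetime(2023, 1, 1, 12, 0, tzinfo=ZoneInfo("UTC"))
ny = ZoneInfo("America/New_York")
# summer.astimezone(ny).hour == 8, winter.astimezone(ny).hour == 7
```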

See answered 22/8, 2023 at 13:8 Comment(0)
