Convert date to end of month in Spark
Asked Answered
G

3

7

I have a Spark DataFrame as shown below:

#Create DataFrame    
df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
      dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
df <- createDataFrame(df)

#Make sure df$dates column is in 'date' format    
df <- withColumn(df, 'dates', cast(df$dates, 'date'))

name    | dates
--------------------
Thomas  |2017-01-05
William |2017-02-23
Bill    |2017-03-16
John    |2017-04-08

I want to change dates to the end of month date, so they would look like shown below. How do I do this? Either SparkR or PySpark code is fine.

name    | dates
--------------------
Thomas  |2017-01-31
William |2017-02-28
Bill    |2017-03-31
John    |2017-04-30
Glisten answered 21/6, 2017 at 21:38 Comment(0)
P
16

You may use the following (PySpark):

from pyspark.sql.functions import last_day

df.select('name', last_day(df.dates).alias('dates')).show()

To clarify, last_day(date) returns the last day of the month of which date belongs to.

I'm pretty sure there is a similar function in sparkR https://spark.apache.org/docs/1.6.2/api/R/last_day.html

Polypeptide answered 21/6, 2017 at 21:58 Comment(0)
A
3

last_day is a poorly named function and should be wrapped in something more descriptive to make the code easier to read.

endOfMonth is a better function name. Here's how to use this function with the Scala API. Suppose you have the following data:

+----------+
| some_date|
+----------+
|2016-09-10|
|2020-01-01|
|2016-01-10|
|      null|
+----------+

Run the endOfMonth function that's part of spark-daria:

import com.github.mrpowers.spark.daria.sql.functions._

df.withColumn("res", endOfMonth(col("some_date"))).show()

Here are the results:

+----------+----------+
| some_date|       res|
+----------+----------+
|2016-09-10|2016-09-30|
|2020-01-01|2020-01-31|
|2016-01-10|2016-01-31|
|      null|      null|
+----------+----------+

I'll try to add this function to quinn as well so there is an easily accessible function for PySpark users as well.

Apps answered 30/12, 2020 at 4:38 Comment(0)
G
1

For completeness, here is the SparkR code:

df <- withColumn(df, 'dates', last_day(df$dates))
Glisten answered 21/6, 2017 at 22:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.