How to get the weekday from day of month using pyspark

5

12

I have a dataframe log_df: [screenshot of log_df]

I generate a new dataframe based on the following code:

from pyspark.sql.functions import split, regexp_extract 
split_log_df = log_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_log_df.show(10, truncate=False)

The new dataframe looks like this: [screenshot of split_log_df]

I need another column showing the day of the week. What would be the most elegant way to create it? Ideally I would just add a UDF-like field in the select.

Thank you very much.

Update: my question differs from the one linked in the comments. I need to compute the weekday from a string in log_df, not from a timestamp column as in that question, so this is not a duplicate. Thanks.

Hellgrammite answered 13/8, 2016 at 3:31 Comment(6)
Write a Python UDF that uses the datetime module to parse the timestamp column. - Berners
Possible duplicate of "How to get day of week in SparkSQL?" - Garotte
@cricket_007 that's exactly what I am asking for help with here, thanks. - Hellgrammite
You could reformat / cast the timestamp column into a date format that Spark accepts... Then this question practically is a duplicate. And you don't need to regex-extract the date string; it has a standard format that you can parse with datetime.strptime. - Berners
@cricket_007 Thanks. Can you provide your full script here? I am really not satisfied with my own solution posted below. - Hellgrammite
If you put your current Python UDF code in the question, I'm sure someone might provide a solution. It's just Python code you need for the given timestamp field; Spark isn't really required for your question. - Berners
H
-5

I finally resolved the question myself, here is the complete solution:

  1. Import date_format, datetime, and DateType.
  2. Modify the regexp so that it extracts only the date part, e.g. 01/Jul/1995.
  3. Convert 01/Jul/1995 to DateType using a function.
  4. Create a UDF, dayOfWeek, that returns the weekday in abbreviated form (Mon, Tue, ...).
  5. Use the UDF to convert the DateType value 01/Jul/1995 to its weekday, which is Sat.
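
The original code for these steps was posted only as a screenshot, so the following is a minimal sketch of what the core of such a UDF might look like in plain Python, not the author's actual code. The function name day_of_week is hypothetical, and the sketch assumes an English locale, since %b and %a are locale-dependent.

```python
from datetime import datetime


def day_of_week(date_str):
    """Return the abbreviated weekday (Mon, Tue, ...) for a 'dd/Mon/yyyy' string."""
    # %d/%b/%Y matches strings such as '01/Jul/1995'.
    # %b (month abbreviation) and %a (weekday abbreviation) follow the
    # current locale; English names are assumed here.
    return datetime.strptime(date_str, "%d/%b/%Y").strftime("%a")


print(day_of_week("01/Jul/1995"))  # Sat
```

In PySpark, a function like this could be wrapped with udf(day_of_week, StringType()) and applied to the extracted date column, at the cost of the usual Python UDF overhead.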

I am not satisfied with my solution, as it seems rather roundabout. I would appreciate it if anyone could come up with a more elegant one. Thank you in advance.

Hellgrammite answered 13/8, 2016 at 10:57 Comment(0)
J
45

I suggest a slightly different method:

from pyspark.sql.functions import date_format
df3 = df.select('capturetime', date_format('capturetime', 'u').alias('dow_number'), date_format('capturetime', 'E').alias('dow_string'))
df3.show()

It gives ...

+--------------------+----------+----------+
|         capturetime|dow_number|dow_string|
+--------------------+----------+----------+
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
Jacelynjacenta answered 7/6, 2017 at 18:13 Comment(4)
Is the 'u' option gone? - Tiffany
It looks like it: in Spark 3.0 'u' is not there, see spark.apache.org/docs/latest/sql-ref-datetime-pattern.html. Spark 3.0 suggests setting spark.sql.legacy.timeParserPolicy to LEGACY to get the old behavior. - Incredible
Please feel free to update the answer and publish a current solution without 'u'. I no longer work with pyspark. Thanks! - Jacelynjacenta
You can use "E" to get the string version of day-of-week, see spark.apache.org/docs/latest/sql-ref-datetime-pattern.html - Strade
R
17

Since Spark 2.3 you can use the dayofweek function https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.dayofweek.html

from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', dayofweek('my_timestamp'))

However, this numbers the days with the week starting on Sunday = 1.

If you need Monday = 1 instead, you can apply an inelegant fudge: either subtract one day before calling dayofweek, or remap the result like this:

from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', ((dayofweek('my_timestamp')+5)%7)+1)
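
As a quick sanity check of that remapping, here is a small plain-Python sketch that needs no Spark session. The helper names sunday_first and monday_first are made up for illustration; sunday_first only imitates the Sunday = 1 numbering that dayofweek is described as using above.

```python
import datetime


def sunday_first(d):
    # Imitates the Sunday = 1 ... Saturday = 7 numbering.
    # isoweekday() is Monday = 1 ... Sunday = 7.
    return d.isoweekday() % 7 + 1


def monday_first(dow):
    # Same arithmetic as the Spark expression: ((dow + 5) % 7) + 1
    return (dow + 5) % 7 + 1


start = datetime.date(2017, 6, 5)  # a Monday
for offset in range(7):
    day = start + datetime.timedelta(days=offset)
    # After remapping, the number should match ISO numbering (Monday = 1).
    assert monday_first(sunday_first(day)) == day.isoweekday()
    print(day, sunday_first(day), monday_first(sunday_first(day)))
```

The loop covers one full week, so every value from 1 to 7 is exercised once.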
Rabjohn answered 17/6, 2021 at 11:56 Comment(0)
A
4

I did this to get weekdays from date:

def get_weekday(date):
    import datetime
    import calendar
    month, day, year = (int(x) for x in date.split('/'))    
    weekday = datetime.date(year, month, day)
    return calendar.day_name[weekday.weekday()]

spark.udf.register('get_weekday', get_weekday)

Example of usage:

df.createOrReplaceTempView("weekdays")
df = spark.sql("select DateTime, PlayersCount, get_weekday(Date) as Weekday from weekdays")


Annal answered 12/2, 2018 at 8:55 Comment(0)
M
0
## Here is a potential solution using a UDF.

# UDFs are a black box to PySpark: it cannot optimize them, so you lose the
# optimizations PySpark applies to DataFrame operations. Prefer Spark SQL
# built-in functions where possible, and use a UDF only when no built-in
# function does what you need.


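from dateutil.parser import parse
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def findWeekday(dt):
    # Parse a date string such as '2/24/2003 0:00' and return the full weekday name
    dt = parse(dt)
    return dt.strftime('%A')

weekDayUDF = udf(findWeekday, StringType())

df.withColumn('weekday', weekDayUDF('ORDERDATE')).show()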


+-------+---------------+--------+---------+
|  SALES|      ORDERDATE|MONTH_ID|  weekday|
+-------+---------------+--------+---------+
| 2871.0| 2/24/2003 0:00|       2|   Monday|
| 2765.9|  5/7/2003 0:00|       5|Wednesday|
|3884.34|  7/1/2003 0:00|       7|  Tuesday|
| 3746.7| 8/25/2003 0:00|       8|   Monday|
|5205.27|10/10/2003 0:00|      10|   Friday|
|3479.76|10/28/2003 0:00|      10|  Tuesday|
|2497.77|11/11/2003 0:00|      11|  Tuesday|
|5512.32|11/18/2003 0:00|      11|  Tuesday|
|2168.54| 12/1/2003 0:00|      12|   Monday|
|4708.44| 1/15/2004 0:00|       1| Thursday|
|3965.66| 2/20/2004 0:00|       2|   Friday|
 
Metempsychosis answered 8/5, 2022 at 6:12 Comment(0)