Converting string/chr to date using sparklyr

I've brought a table into Hue which has a column of dates, and I'm trying to play with it using sparklyr in RStudio.

I'd like to convert a character column into a date column like so:

Weather_data = mutate(Weather_data, date2 = as.Date(date, "%m/%d/%Y"))

and this runs fine, but when I check:

head(Weather_data) 

(screenshot of the head(Weather_data) output; date2 is still not shown as a proper date column)

How do I properly convert the chr column to dates?

Thanks!!!!

Redfaced answered 27/9, 2017 at 17:13 Comment(5)
Can you provide the data in dput(Weather_data) format before you applied the conversion? – Arlettaarlette
Thank you for your answer. I'm sorry, I'm not the best with R or with using Hue, so I'm not quite sure what you mean. But when I brought the data into Hue, I tried to set the column type to date; the output was all NAs, so I used string instead. – Redfaced
sdf_mutate(Weather_data, date2 = as.Date(date, "%m/%d/%Y")) – Hoggard
Thank you for your message. This gave me the error: – Redfaced
Error in as.Date.default(list(src = list(con = list(master = "yarn-client", : c("do not know how to convert 'structure(list(src = structure(list(con = structure(list(master = \"yarn-client\", ' to class “Date”", "do not know how to convert ' method = \"shell\", app_name = \"sparklyr\", config....... – Redfaced

The problem is that sparklyr doesn't correctly support the Spark DateType. It is possible to parse dates and correct the format, but not to represent them as proper DateType columns. If that's enough for your use case, follow the instructions below.

In Spark 2.2 or later, use to_date with a Java SimpleDateFormat-compatible string:

df <- copy_to(sc, data.frame(date=c("01/01/2010")))
parsed <- df %>% mutate(date_parsed = to_date(date, "MM/dd/yyyy"))
parsed
# Source:   lazy query [?? x 2]
# Database: spark_connection
        date date_parsed
       <chr>       <chr>
1 01/01/2010  2010-01-01

Interestingly, the internal Spark object still uses DateType columns:

parsed %>% spark_dataframe
<jobj[120]>
  class org.apache.spark.sql.Dataset
  [date: string, date_parsed: date]
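
If you need actual Date objects on the R side, one workaround (a sketch, not part of the original answer; it assumes the parsed column is an ISO yyyy-mm-dd string as shown above) is to collect the result into R and convert locally with base as.Date:

# Workaround sketch: bring the parsed strings into the R session,
# then convert them locally. as.Date() handles "yyyy-mm-dd" by default.
local_df <- parsed %>% collect()
local_df$date_parsed <- as.Date(local_df$date_parsed)
class(local_df$date_parsed)
# [1] "Date"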

For earlier versions, use unix_timestamp and a cast (but watch for possible timezone problems):

df %>%
  mutate(date_parsed = sql(
    "CAST(CAST(unix_timestamp(date, 'MM/dd/yyyy') AS timestamp) AS date)"))
# Source:   lazy query [?? x 2]
# Database: spark_connection
        date date_parsed
       <chr>       <chr>
1 01/01/2010  2010-01-01

Edit:

It looks like this problem has been resolved on the current master (sparklyr_0.7.0-9105):

# Source:   lazy query [?? x 2]
# Database: spark_connection
        date date_parsed
       <chr>      <date>
1 01/01/2010  2009-12-31
Lionfish answered 28/9, 2017 at 21:29 Comment(4)
Beware that the timezone default in the JVM will play dirty tricks here! #49355577 – Religion
@ℕʘʘḆḽḘ In practice the timezone should always be configured explicitly for the Spark session - otherwise things can get ugly pretty fast (with much more serious outcomes than the one linked, which is actually intended and documented :). So you either need extraJavaOptions or the respective Spark SQL config. – Lionfish
Great! Do you know what the correct command with extraJavaOptions would be to set UTC? – Religion
Old way: --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC" with spark-submit, or the equivalent in the Spark configuration. New way (2.2+): --conf spark.sql.session.timeZone=UTC. A few answers (stackoverflow.com/…) if you want to give credit to some user :) The possible gotcha is that things get weird if, for some crazy reason, different nodes have different settings :) – Lionfish
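
For reference, a minimal sparklyr sketch of both approaches from the comment above (assuming a yarn-client master, as in the question's error message; the connection details are placeholders, and the config keys are the standard Spark settings named above):

library(sparklyr)

conf <- spark_config()

# Old way (any Spark version): pin the JVM default timezone
# on both the driver and the executors.
conf$`spark.driver.extraJavaOptions`   <- "-Duser.timezone=UTC"
conf$`spark.executor.extraJavaOptions` <- "-Duser.timezone=UTC"

# New way (Spark 2.2+): set the SQL session timezone.
conf$`spark.sql.session.timeZone` <- "UTC"

sc <- spark_connect(master = "yarn-client", config = conf)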
