how to convert a timestamp into string (without changing timezone)?
Asked Answered
D

3

5

I have some unix times that I convert to timestamps in sparklyr and for some reasons I also need to convert them into strings.

Unfortunately, it seems that during the conversion to string hive converts to EST (my locale).

df_new <- spark_read_parquet(sc, "/mypath/parquet_*",
                             overwrite = TRUE,
                             name = "df_new",
                             memory = FALSE,
                             options = list(mergeSchema = "true"))
> df_new %>%  
           mutate(unix_t = from_utc_timestamp(timestamp(t) ,'UTC'),
           date_str = date_format(unix_t, 'yyyy-MM-dd HH:mm:ss z'),
           date_alt = to_date(from_utc_timestamp(timestamp(t) ,'UTC'))) %>% 
    select(t, unix_t, date_str, date_alt) %>% head(5)
# Source:   lazy query [?? x 4]
# Database: spark_connection
            t unix_t              date_str                date_alt  
        <dbl> <dttm>              <chr>                   <date>    
1 1419547405. 2014-12-25 22:43:25 2014-12-25 17:43:25 EST 2014-12-25
2 1418469714. 2014-12-13 11:21:54 2014-12-13 06:21:54 EST 2014-12-13
3 1419126103. 2014-12-21 01:41:43 2014-12-20 20:41:43 EST 2014-12-20
4 1419389856. 2014-12-24 02:57:36 2014-12-23 21:57:36 EST 2014-12-23
5 1418271811. 2014-12-11 04:23:31 2014-12-10 23:23:31 EST 2014-12-10

As you can see both date_str and date_alt use the EST timezone. I need UTC here. How can I do that?

Thanks!

Donnelldonnelly answered 19/3, 2018 at 2:41 Comment(8)
Look up POSIXct. Also, I'm pretty sure the format strings for months and minutes are the other way around. You may want to double checkTelstar
we are talking abour spark here, right?Feriga
This could be of interest; stackoverflow.com/a/11237811Bohman
thanks but are you sure this works in sparklyr? i am not talking about base R hereFeriga
are you really using sparklyr and not SparkR here?Depress
this is sparklyrFeriga
the latest versionFeriga
How about posting output of dput(df_new)?Bemis
O
7

From the Hive function reference, date_format uses Java's SimpleDateFormat, which I believe always defaults to the JVM time zone, this explains why this gets you a character string converted to your time zone.

One option is to detect the time zone and manually add the hours to get UTC.

Another option would be to use lubridate with spark_apply():

sdf_len(sc, 1) %>%
  mutate(unix_t = from_utc_timestamp(timestamp(1522371003) , 'UDT')) %>%
  spark_apply(
    function(e) {
      dplyr::mutate(
        e,
        time_str = as.character(
          lubridate::with_tz(
            as.POSIXct(unix_t, origin="1970-01-01"),
            "GMT"
          )
        )
      )
    },
    columns = c("id", "unix_t", "time_str"))
Opportunism answered 30/3, 2018 at 1:19 Comment(2)
wonderful!!! it works! funnily enough, after the spark_apply the column unix_t is now a double (whereas it was a dttm before)Feriga
what is perhaps puzzling is that also to_date seems to use the timezones locale...Feriga
P
0

It's possible that sparklyr is doing some weird translation of timezones into the hive functions. I'd try registering the dataframe as a table and doing the manipulation with pure HQL:

createOrReplaceTempView(df_new, "df_new")
result <- sql("select from_utc_timestamp(timestamp(t) ,'UTC'),
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING),
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE)
   from df_new")
head(result)

edit

If you're unfamiliar with SQL-languages, you can add any of the variables from df_new as a comma separated list like so (and rename your selections with as)

select var1, var2, t,
   from_utc_timestamp(timestamp(t) ,'UTC') as unix_t,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING) as date_str,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE) as date_alt
   from df_new

You can also use * to represent all variables from the data frame:

select *,
   from_utc_timestamp(timestamp(t) ,'UTC') as unix_t,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING) as date_str,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE) as date_alt
   from df_new
Pillage answered 28/3, 2018 at 17:40 Comment(4)
do you know how can I use these SQL cast directly in the mutate above?Feriga
I believe all of the casting in mutate has to be done through R functions, and that those are the source of the issue, which is why I suggest using HQL to do this particular manipulation. Is there a reason it needs to be in the mutate?Pillage
thanks, i dont need the mutate, but i do need all the other variable as well in the same table.Feriga
you can add them to the select. If you're unfamiliar with SQL, I'll edit to show how.Pillage
H
-1

Try using as.POSIXct() ?

format(as.POSIXct(unix_t, origin = unix_t, tz = "UTC", usetz=TRUE),"%Y-%m-%d %H:mm:ss") 

This will first convert unix timestamp to UTC and then formatted to desired string.

Hairdresser answered 29/3, 2018 at 10:31 Comment(1)
does not work. this is not regular R, this is Sparklyr!Feriga

© 2022 - 2024 — McMap. All rights reserved.