Using the result of summarise (dplyr) to mutate the original dataframe
Asked Answered
A

3

5

I have a rather big dataframe with a column of POSIXct datetimes (~10yr of hourly data). I would flag all the rows in which the day falls in a Daylight saving period. For example if the Daylight shift starts on '2000-04-02 03:00:00' (DOY=93) i would like that the two previous hours of DOY=93 could be flagged. Although I am a newbie of dplyr I would use this package as much as possible and avoid for-loops as much as possible

For example:

library(lubridate)
sd = ymd('2000-01-01',tz="America/Denver")
ed = ymd('2005-12-31',tz="America/Denver")
span = data.frame(date=seq(from=sd,to=ed, by="hour"))
span$YEAR = year(span$date)
span$DOY = yday(span$date)
span$DLS = dst(span$date)

To find the different days of the year in which the daylight saving is applied I use dplyr

library(dplyr)
limits = span %.% group_by(YEAR) %.% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]))

That gives

      YEAR minDOY maxDOY
    1 2000     93    303
    2 2001     91    301
    3 2002     97    300
    4 2003     96    299
    5 2004     95    305
    6 2005     93    303

Now I would 'pipe' the above results in the span dataframe without using a inefficient for-loop.

SOLUTION 1

with the help of @aosmith the problem can be tackled with just two commands (and avoiding the inner_join as in 'solution 2'):

 limits = span %>% group_by(YEAR) %>% mutate(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]),CHECK=FALSE)

 limits$CHECK[(limits2$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE      

SOLUTION 2

With the help of @beetroot and @matthew-plourde, the problem has been solved: an inner-join between was missing:

limits = span %>% group_by(YEAR) %>% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>% inner_join(span, by='YEAR')

Then I just added a new column (CHECK) to fill with the right values for the Daylight-savings days

limits$CHECK = FALSE
limits$CHECK[(limits$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE
Anta answered 12/8, 2014 at 14:6 Comment(7)
So you want two new columns in span with minDOY and maxDOY and their values repeated for each row and respective years? That could probably done with left_join but can't test without example data.Crossfertilization
Thank you @beetroot for the hint. I will try it now! For the example data... the code above does not work for you?Anta
Ah I just don't have lubridate installed and am too lazy to do so for testing this right now ;)Crossfertilization
No problem, beetroot! All has been solved with your help and with the help of matthew-plourde. And now I have a new tool thanks to oropendola. See the code above.Anta
Adding new columns to an existing dataset seems like a job for mutate instead of summarise with joining.Mastin
yes, @aosmith... but the column to add contains a calculculation that can be accomplished (so far as I know) with the result of a joined summarise.Anta
@Anta Have you tried using mutate(minDOY = min(DOY[DLS]), maxDOY = max(DOY[DLS])) in place of where you use summarise in your original code? That adds columns of group specific values to the original dataset, which is what it looks like you were trying to do.Mastin
P
14

As @beetroot points out in the comments, you can accomplish this with a join:

limits = span %>% 
   group_by(YEAR) %>% 
   summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>%
   inner_join(span, by='YEAR')
#    YEAR minDOY maxDOY                date DOY   DLS
# 1  2000     93    303 2000-01-01 00:00:00   1 FALSE
# 2  2000     93    303 2000-01-01 01:00:00   1 FALSE
# 3  2000     93    303 2000-01-01 02:00:00   1 FALSE
# 4  2000     93    303 2000-01-01 03:00:00   1 FALSE
# 5  2000     93    303 2000-01-01 04:00:00   1 FALSE
# 6  2000     93    303 2000-01-01 05:00:00   1 FALSE
# 7  2000     93    303 2000-01-01 06:00:00   1 FALSE
# 8  2000     93    303 2000-01-01 07:00:00   1 FALSE
# 9  2000     93    303 2000-01-01 08:00:00   1 FALSE
# 10 2000     93    303 2000-01-01 09:00:00   1 FALSE
Propulsion answered 12/8, 2014 at 14:24 Comment(1)
The join works as an intermediate step to accomplish my task.Anta
R
4

dplyr is a great tool, but in this case I'm not sure it's the best for the job. This accomplishes your task:

span$CHECK <- ave(dst(span$date), as.Date(span$date, tz = tz(span$date)), FUN = any)

I think ave is a terrible name for this function, but if you can remember it exists, it's often quite useful when you want to join a summary back to the data.frame it came from.

Reginiaregiomontanus answered 12/8, 2014 at 15:38 Comment(3)
well... like often, in R there is always room for learning new functions and new tricks. I Never used the ave function before. I will try it immediately!Anta
This doesn't give the same results as CHECK in the OP. You need to group by both YEAR and DOY. I like your use of any, which makes the creation of CHECK so simple. I think the dplyr equivalent is span %>% group_by(YEAR, DOY) %>% mutate(CHECK = any(DLS))Mastin
Thanks @aosmith, that was a bug. I fixed the issue and made it so that CHECK could be calculated in one line using only span$date.Reginiaregiomontanus
A
1

The best solution to get the job done, as suggested by @aosmith, is.

limits = span %>% group_by(YEAR) %>% mutate(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]),CHECK=FALSE)

limits$CHECK[(limits2$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE

The use of the ave function is a good choice, but I personally prefer to stick to the 'dplyr' package.

Anta answered 16/8, 2014 at 11:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.