Using the result of summarise (dplyr) to mutate the original dataframe

Asked 12/8, 2014 at 14:6 Answered 16/8, 2014 at 11:4

I have a rather big dataframe with a column of POSIXct datetimes (~10yr of hourly data). I would flag all the rows in which the day falls in a Daylight saving period. For example if the Daylight shift starts on '2000-04-02 03:00:00' (DOY=93) i would like that the two previous hours of DOY=93 could be flagged. Although I am a newbie of dplyr I would use this package as much as possible and avoid for-loops as much as possible

For example:

library(lubridate)
sd = ymd('2000-01-01',tz="America/Denver")
ed = ymd('2005-12-31',tz="America/Denver")
span = data.frame(date=seq(from=sd,to=ed, by="hour"))
span$YEAR = year(span$date)
span$DOY = yday(span$date)
span$DLS = dst(span$date)

To find the different days of the year in which the daylight saving is applied I use dplyr

library(dplyr)
limits = span %.% group_by(YEAR) %.% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]))

That gives

      YEAR minDOY maxDOY
    1 2000     93    303
    2 2001     91    301
    3 2002     97    300
    4 2003     96    299
    5 2004     95    305
    6 2005     93    303

Now I would 'pipe' the above results in the span dataframe without using a inefficient for-loop.

SOLUTION 1

with the help of @aosmith the problem can be tackled with just two commands (and avoiding the inner_join as in 'solution 2'):

 limits = span %>% group_by(YEAR) %>% mutate(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]),CHECK=FALSE)

 limits$CHECK[(limits2$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE

SOLUTION 2

With the help of @beetroot and @matthew-plourde, the problem has been solved: an inner-join between was missing:

limits = span %>% group_by(YEAR) %>% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>% inner_join(span, by='YEAR')

Then I just added a new column (CHECK) to fill with the right values for the Daylight-savings days

limits$CHECK = FALSE
limits$CHECK[(limits$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE

Anta answered 12/8, 2014 at 14:6 Comment(7)

So you want two new columns in span with minDOY and maxDOY and their values repeated for each row and respective years? That could probably done with left_join but can't test without example data. – Crossfertilization 12/8, 2014 at 14:21

Thank you @beetroot for the hint. I will try it now! For the example data... the code above does not work for you? – Anta 12/8, 2014 at 14:32

Ah I just don't have lubridate installed and am too lazy to do so for testing this right now ;) – Crossfertilization 12/8, 2014 at 14:37

No problem, beetroot! All has been solved with your help and with the help of matthew-plourde. And now I have a new tool thanks to oropendola. See the code above. – Anta 12/8, 2014 at 15:54

Adding new columns to an existing dataset seems like a job for mutate instead of summarise with joining. – Mastin 13/8, 2014 at 3:18

yes, @aosmith... but the column to add contains a calculculation that can be accomplished (so far as I know) with the result of a joined summarise. – Anta 13/8, 2014 at 9:39

@Anta Have you tried using mutate(minDOY = min(DOY[DLS]), maxDOY = max(DOY[DLS])) in place of where you use summarise in your original code? That adds columns of group specific values to the original dataset, which is what it looks like you were trying to do. – Mastin 13/8, 2014 at 14:47

As @beetroot points out in the comments, you can accomplish this with a join:

limits = span %>% 
   group_by(YEAR) %>% 
   summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>%
   inner_join(span, by='YEAR')
#    YEAR minDOY maxDOY                date DOY   DLS
# 1  2000     93    303 2000-01-01 00:00:00   1 FALSE
# 2  2000     93    303 2000-01-01 01:00:00   1 FALSE
# 3  2000     93    303 2000-01-01 02:00:00   1 FALSE
# 4  2000     93    303 2000-01-01 03:00:00   1 FALSE
# 5  2000     93    303 2000-01-01 04:00:00   1 FALSE
# 6  2000     93    303 2000-01-01 05:00:00   1 FALSE
# 7  2000     93    303 2000-01-01 06:00:00   1 FALSE
# 8  2000     93    303 2000-01-01 07:00:00   1 FALSE
# 9  2000     93    303 2000-01-01 08:00:00   1 FALSE
# 10 2000     93    303 2000-01-01 09:00:00   1 FALSE

Propulsion answered 12/8, 2014 at 14:24 Comment(1)

The join works as an intermediate step to accomplish my task. – Anta 12/8, 2014 at 15:5

dplyr is a great tool, but in this case I'm not sure it's the best for the job. This accomplishes your task:

span$CHECK <- ave(dst(span$date), as.Date(span$date, tz = tz(span$date)), FUN = any)

I think ave is a terrible name for this function, but if you can remember it exists, it's often quite useful when you want to join a summary back to the data.frame it came from.

Reginiaregiomontanus answered 12/8, 2014 at 15:38 Comment(3)

well... like often, in R there is always room for learning new functions and new tricks. I Never used the ave function before. I will try it immediately! – Anta 12/8, 2014 at 15:49

This doesn't give the same results as CHECK in the OP. You need to group by both YEAR and DOY. I like your use of any, which makes the creation of CHECK so simple. I think the dplyr equivalent is span %>% group_by(YEAR, DOY) %>% mutate(CHECK = any(DLS)) – Mastin 13/8, 2014 at 15:18

Thanks @aosmith, that was a bug. I fixed the issue and made it so that CHECK could be calculated in one line using only span$date. – Reginiaregiomontanus 13/8, 2014 at 15:47

The best solution to get the job done, as suggested by @aosmith, is.

limits = span %>% group_by(YEAR) %>% mutate(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]),CHECK=FALSE)

limits$CHECK[(limits2$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE

The use of the ave function is a good choice, but I personally prefer to stick to the 'dplyr' package.

Anta answered 16/8, 2014 at 11:4 Comment(0)

SOLUTION 1

SOLUTION 2

Recommended topics

Hot tags