Calculate Difference between dates by group in R
Asked Answered
A

2

6

I'm using a logistic exposure to calculate hatching success for bird nests. My data set is quite extensive and I have ~2,000 nests, each with a unique ID ("ClutchID). I need to calculate the number of days a given nest was exposed ("Exposure"), or more simply, the difference between the 1st and last day. I used the following code:

HS_Hatch$Exposure=NA    
for(i in 2:nrow(HS_Hatch)){HS_Hatch$Exposure[i]=HS_Hatch$DateVisit[i]- HS_Hatch$DateVisit[i-1]}

where HS_Hatch is my dataset and DateVisit is the actual date. The only problem is R is calculating an exposure value for the 1st date (which doesn't make sense).

What I really need is to calculate the difference between the 1st and last date for a given clutch. I've also looked into the following:

Exposure=ddply(HS_Hatch, "ClutchID", summarize, 
                     orderfrequency = as.numeric(diff.Date(DateVisit)))


df %>%
  mutate(Exposure =  as.Date(HS_Hatch$DateVisit, "%Y-%m-%d")) %>%
  group_by(ClutchID) %>%
  arrange(Exposure) %>%
  mutate(lag=lag(DateVisit), difference=DateVisit-lag)

I'm still learning R so any help would be greatly appreciated.

Edit: Below is a sample of the data I'm using

HS_Hatch <- structure(list(ClutchID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                                        2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L
), DateVisit = c("3/15/2012", "3/18/2012", "3/20/2012", "4/1/2012", 
                 "4/3/2012", "3/18/2012", "3/20/2012", "3/22/2012", "4/3/2012", 
                 "4/4/2012", "3/22/2012", "4/3/2012", "4/4/2012", "3/18/2012", 
                 "3/20/2012", "3/22/2012", "4/2/2012", "4/3/2012", "4/4/2012", 
                 "3/20/2012", "3/22/2012", "3/25/2012", "3/27/2012", "4/4/2012", 
                 "4/5/2012"), Year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 
                                       2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 
                                       2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 
                                       2012L), Survive = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                                           1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("tbl_df", 
                                                                                                                               "tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("ClutchID", 
                                                                                                                                                                                         "DateVisit", "Year", "Survive"), spec = structure(list(cols = structure(list(
                                                                                                                                                                                             ClutchID = structure(list(), class = c("collector_integer", 
                                                                                                                                                                                                                                    "collector")), DateVisit = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                                                           "collector")), Year = structure(list(), class = c("collector_integer", 
                                                                                                                                                                                                                                                                                                                                             "collector")), Survive = structure(list(), class = c("collector_integer", 
                                                                                                                                                                                                                                                                                                                                                                                                  "collector"))), .Names = c("ClutchID", "DateVisit", "Year", 
                                                                                                                                                                                                                                                                                                                                                                                                                             "Survive")), default = structure(list(), class = c("collector_guess", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "collector"))), .Names = c("cols", "default"), class = "col_spec"))
Anthropolatry answered 13/11, 2016 at 3:19 Comment(9)
Welcome to Stack Overflow! Can you please include data that will provide us with a reproducible example ?Whimsicality
maybe summarise(exposure=diff(range(DateVisit))) ?Whimsicality
what @BenBolker said, just to add that his summarise line should go after your group_by line. And depending on the class of DateVisit you can either leave off your first mutate line, or change the summarise line to reference Exposure instead of DateVisit.Tadio
Please, edit your question and add the data properly formatted there, not in a comment and not as a link. Using dput is preferred. Thank you.Boohoo
@Tadio I modified the code per your Ben Boke suggestions, but I'm getting the following error message: "Error in UseMethod("mutate") : no applicable method for 'mutate' applied to an object of class "function" I tried removing the 1st mutate function but then I get this message: " no applicable method for 'regroup' applied to an object of class "Date" Like I mentioned I'm not super familiar with R and coding so I don't understand these error messagesAnthropolatry
First, please heed the dput suggestion above. Have you removed the last few lines of code in your question off? I'll leave putting all this together in an answer to @BenBolker if he wants to.Tadio
@Tadio 1)I'm sorry, I don't understand how to properly format data with dput. Where can I find out more about that? 2)Which lines of code are you referring to?Anthropolatry
One of your proximal problems is that plyr and dplyr both have mutate functions. Try specifying dplyr::mutate.Whimsicality
@BenBolker good point, except I think that the problematic mask might be with summarise should your first comment be implementedTadio
M
8

Collecting some of the comments...

Load dplyr

We need only the dplyr package for this problem. If we load other packages, e.g. plyr, it can cause conflicts if both packages have functions with the same name. Let's load only dplyr.

library(dplyr)

In the future, you may wish to load tidyverse instead -- it includes dplyr and other related packages, for graphics, etc.

Converting dates

Let's convert the DateVisit variable from character strings to something R can interpret as a date. Once we do this, it allows R to calculate differences in days by subtracting two dates from each other.

HS_Hatch <- HS_Hatch %>%
 mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y"))

The date format %m/%d/%Y is different from your original code. This date format needs to match how dates look in your data. DateVisit has dates as month/day/year, so we use %m/%d/%Y.

Also, you don't need to specify the dataset for DateVisit inside mutate, as in HS_Hatch$DateVisit, because it's already looking in HS_Hatch. The code HS_Hatch %>% ... says 'use HS_Hatch for the following steps'.

Calculating exposures

To calculate exposure, we need to find the first date, last date, and then the difference between the two, for each set of rows by ClutchID. We use summarize, which collapses the data to one row per ClutchID.

exposure <- HS_Hatch %>% 
    group_by(ClutchID) %>%
    summarize(first_visit = min(date_visit), 
              last_visit = max(date_visit), 
              exposure = last_visit - first_visit)

first_visit = min(date_visit) will find the minimum date_visit for each ClutchID separately, since we are using group_by(ClutchID).

exposure = last_visit - first_visit takes the newly-calculated first_visit and last_visit and finds the difference in days.

This creates the following result:

  ClutchID first_visit last_visit exposure
     <int>      <date>     <date>    <dbl>
1        1  2012-03-15 2012-04-03       19
2        2  2012-03-18 2012-04-04       17
3        3  2012-03-22 2012-04-04       13
4        4  2012-03-18 2012-04-04       17
5        5  2012-03-20 2012-04-05       16

If you want to keep all the original rows, you can use mutate in place of summarize.

Man answered 13/11, 2016 at 13:59 Comment(4)
Thank you so much! I've been tearing my hair out trying to figure this out. As someone who is still learning R coding, I appreciate you explaining the code step-by-step. I was able to understand it and follow along.Anthropolatry
Click "accept" if the answer solves your problem. This will help saying thank you to the responder, and will help others find the solution in the future.Tadio
I am facing a similar problem. Upon replicating this example, for exposure, I only get returned one line. The grouping factor is ignored. Do you experience the same issue? Or is it because I'm using a newer version perhaps?Broucek
I'm not sure. I was able to replicate it with R 3.4.3 and dplyr 0.7.4. What does your one row result look like?Man
B
1

Here is a similar solutions if you look for a difftime results in days, from a vector date, without NA values produce in the new column, and if you expect to group by several conditions/groups.

make sure that your vector of date as been converting in the good format as previously explained.

dat2 <- dat %>% 
select(group1, group2, date) %>% 
arrange(group1, group2, date) %>% 
group_by(group1, group2) %>% 
mutate(diff_date = c(0,diff(date)))
Bannon answered 21/11, 2020 at 13:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.