Earliest Date for each id in R

Asked 11/8, 2016 at 10:24 Answered 17/7, 2019 at 20:3

I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value. I've use the aggregate function to find the minimum values, I've created a new variable combining the date and the id and last I've subset the original dataset based on the one containing the minimums using the new variable created. I've come to this:

new <- aggregate(e_date ~ id, data_full, min)

data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)

new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)

data_fixed <- data_full[which(new$comb %in% data_full$comb),]

The first thing is that the aggregate function doesn't seems to work at all, it reduces the number of rows but viewing the data I can clearly see that some ids appear more than once with different e_date. Plus, the code gives me different results when I use the as.Date format instead of its original format for the date (integer). I think the answer is simple but I'm struck on this one.

Hep answered 11/8, 2016 at 10:24 Comment(2)

It would be good to provide reproducible data especially when working with dates. I created my own example and aggregate with min worked fine, so you really need to dput the data. stackoverflow.com/help/mcve See also the guidelines for sharing data in the R tag description. – Fortuity 11/8, 2016 at 10:27

What is the format of e_date column ? Also have a look at this – Regime 11/8, 2016 at 10:29

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data_full)), grouped by 'id', we get the 1st row (head(.SD, 1L)).

library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]

Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.

library(dplyr)
data_full %>%
    group_by(id) %>%
    arrange(e_date) %>%
    slice(1L)

If we need a base R option, ave can be used

data_full[with(data_full, ave(e_date, id, FUN = function(x) rank(x)==1)),]

Peloria answered 11/8, 2016 at 10:43 Comment(2)

@LeroyTyrone Wouldn't it be better to ask the OP for that – Peloria 25/8, 2017 at 5:15

in the second example, will arrange not override group by, without the .by_group argument? I'm asking for clarification, I'm quite new to coding and I learn a lot from responses – Sharell 20/1 at 5:58

Another answer that uses dplyr's filter command:

dta %>% 
  group_by(id) %>%
  filter(date == min(date))

Laticialaticiferous answered 17/7, 2019 at 20:3 Comment(0)

You may use library(sqldf) to get the minimum date as follows:

data1<-data.frame(id=c("789","123","456","123","123","456","789"),
                  e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))  

library(sqldf)
data2 = sqldf("SELECT id,
                    min(e_date) as 'earliest_date'
                    FROM data1 GROUP BY 1", method = "name__class")    

head(data2)   

id   earliest_date   
123    2014-03-01      
456    2015-07-08   
789    2015-12-11

Chimb answered 11/8, 2016 at 11:13 Comment(0)

I made a reproducible example, supposing that you grouped some dates by which quarter they were in.

library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)

df %>%
  group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])

# A tibble: 4 x 2
  which_quarter sort(rand_weeks)[1]
          <dbl>              <time>
1             1 2017-01-05 05:46:32
2             2 2017-04-06 05:46:32
3             3 2016-08-18 05:46:32
4             4 2016-10-06 05:46:32

Freightage answered 11/8, 2016 at 10:47 Comment(0)

Recommended topics

Hot tags