Return closest date to a given date in R
Asked Answered
A

5

8

My data frame consists of individual observations of individual animals. Each animal has a birthdate, that I would like to associate to the closest field season date from a date vector.

Here is a very basic reproducible example:

ID <- c("a", "b", "c", "d", "a") # individual "a" is measured twice here
birthdate <- as.Date(c("2012-06-12", "2014-06-14", "2015-11-11", "2016-09-30", "2012-06-12"))    
df <- data.frame(ID, birthdate)

# This is the date vector
season_enddates <- as.Date(c("2011-11-10", "2012-11-28", "2013-11-29", "2014-11-26", "2015-11-16", "2016-11-22", "2012-06-21", "2013-06-23", "2014-06-25", "2015-06-08", "2016-06-14"))

With the following code, I can get the difference between the birthdate and the closest season enddate.

for(i in 1:length(df$birthdate)){
  df$birthseason[i] <- which(abs(season_enddates-df$birthdate[i]) == min(abs(season_enddates-df$birthdate[i])))
}

However, what I want is the actual date, and not the difference. For example, the first value of birthseason should be 2012-06-21.

Alarum answered 13/7, 2017 at 13:9 Comment(3)
Are you using the lubridate package?Tontine
Would you mind editing the post so the object names in the second part (for loop) use the names in the first part of the example. It's unclear where younger$HatchCalendarYear1 comes from, for example.Colophon
@Colophon I suggested the edits you mention, have a lookMayapple
V
4

It's a bit confusing since you use variables which you didn't include in your examples.

But I think this is what you want:

for (ii in 1:nrow(df))  df$birthseason[ii] <-as.character(season_enddates[which.min(abs(df$birthdate[ii] - season_enddates))])

Alternatively using lapply:

df$birthseason <- unlist(lapply(df$birthdate,function(x) as.character(season_enddates[which.min(abs(x - season_enddates))])))

Result:

> df
  ID  birthdate birthseason
1  a 2012-06-12  2012-06-21
2  b 2014-06-14  2014-06-25
3  c 2015-11-11  2015-11-16
4  d 2016-09-30  2016-11-22
5  a 2012-06-12  2012-06-21
Vainglorious answered 13/7, 2017 at 13:32 Comment(0)
M
2

You are looking for which season_enddate is the closest to birthdate[1], and birthdate[2], etc.

To get the data straight, I will create an actual reproducible example:

birthdate <- as.Date(c("2012-06-12", "2014-06-14", 
                       "2015-11-11", "2016-09-30", 
                       "2012-06-12"))

season_enddates <- as.Date(c("2011-11-10", "2012-11-28", 
                             "2013-11-29", "2014-11-26",
                             "2015-11-16", "2016-11-22", 
                             "2012-06-21", "2013-06-23", 
                             "2014-06-25", "2015-06-08", 
                             "2016-06-14"))

Basically I use the function you also used, except that I decided to break it down a bit, so it's easier to follow what you're trying to do:

new.vector <- rep(0, length(birthdate))
for(i in 1:length(birthdate)){
    diffs <- abs(birthdate[i] - season_enddates)
    inds  <- which.min(diffs)
    new.vector[i] <- season_enddates[inds]
}

# new.vector now contains some dates that have been converted to numbers:
as.Date(new.vector, origin = "1970-01-01")
# [1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22"
# [5] "2012-06-21"
Murtagh answered 13/7, 2017 at 13:39 Comment(2)
code looks a lot cleaner here. Maybe edit the question to use the same layout? Makes it much easier to follow.Mayapple
Sorry, I must have pasted a part of the code I used on my real dataset by accident... It's fixed now.Alarum
J
2

All solutions here are essentially the same. If you want to have an optimized function doing this operation for you, this is how I'd do it:

match_season <- function(x,y){
  nx <- length(x)
  ind <- numeric(nx)
  for(i in seq_len(nx)){
    ind[i] <- which.min(abs(x[i] - y))
  }
  y[ind]
}

Then you can simply do:

younger$birthseason <- match_season(younger$HatchDate, season_enddates)

Looks cleaner and gives you the desired output in the correct Date format.

Benchmarking:

start <- as.Date("1990-07-01")
end <- as.Date("2017-06-30")

birthdate <- sample(seq(start, end, by = "1 day"), 1000)

season_enddates <- seq(as.Date("1990-12-21"),
                       as.Date("2017-6-21"),
                       by = "3 months")

library(rbenchmark)

benchmark(match_season(birthdate, season_enddates),
          columns = c("test","elapsed"))

gives a timing of 7.62 seconds for 100 replications.

Justus answered 13/7, 2017 at 14:22 Comment(0)
M
1

I have suggested some edits to your question, so that your example code produces all variables required to reproduce your problem. Please have a look and check that I understood your problem.

To solve it, i suggest to use which.min (keeps your code a bit simpler and faster), in combination with subsetting of your season_enddates vector, as shown below:

for(i in 1:length(younger$HatchCalendarYear)){
  df$birthseasonDate[i] <- season_enddates[which.min(abs(season_enddates - df$birthdate[i]))]
}
Mayapple answered 13/7, 2017 at 13:38 Comment(0)
D
1

findInterval is useful in such cases. Locating the nearest season_enddates for each df$birthdate:

vec = sort(season_enddates)
int = findInterval(df$birthdate, vec, all.inside = TRUE)
int
#[1]  1  5  8 10  1

we compare the distance from each of the surrounding dates of the interval and select the minimum:

ans = vec[int]
i = abs(df$birthdate - vec[int]) > abs(df$birthdate - vec[int + 1])
ans[i] = vec[int[i] + 1]
ans
#[1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22" "2012-06-21"
Dump answered 13/7, 2017 at 20:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.