R: calculate number of distinct categories in the specified time frame
Asked Answered
I

3

3

here's some dummy data:

  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple

I'd like to calculate for each user_id the number of distinct categories in the specified time period (e.g. in the past 7, 14 days), including the current order

The solution would look like this:

 user_id       date category distinct_7 distinct_14
      27 2016-01-01    apple          1           1
      27 2016-01-03    apple          1           1
      27 2016-01-05     pear          2           2
      27 2016-01-07     plum          3           3
      27 2016-01-10    apple          3           3
      27 2016-01-14     pear          3           3
      27 2016-01-16     plum          3           3
      11 2016-01-01    apple          1           1
      11 2016-01-03     pear          2           2
      11 2016-01-05     pear          2           2
      11 2016-01-07     pear          2           2
      11 2016-01-10    apple          2           2
      11 2016-01-14    apple          2           2
      11 2016-01-16    apple          1           2

I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!

Imaginable answered 17/1, 2017 at 9:12 Comment(13)
Why does it start with 0?Limassol
that was my typo, now corrected, thanks!Imaginable
Are you sure the values in distinct_7 are correct? If I look at 2016-01-10, should it start as a new group. Also, if you look at the value of distinct_7 for user_id 11, it starts at 0.Limassol
in distinct_7, between 2016-01-10 and 2016-01-03 there were in total 3 categories for user 27 and 2 for user 11. Does it make sense now?Imaginable
You can iterate over date, i.e. library(tidyverse); df %>% group_by(user_id) %>% mutate(distinct_7 = map_int(date, ~n_distinct(category[date >= .x - 7 & date <= .x])), distinct_14 = map_int(date, ~n_distinct(category[date >= .x - 14 & date <= .x]))), though I think there's probably a more elegant way to do this with zoo::rollapply or the like.Wily
should 11 2016-01-01 not be 1 for both output cases?Kindly
relatedIgnorance
@alistaire: this works very well for me, although it takes time to process. Still, if you write an answer, I'll happily accept it.Imaginable
@Aramis7d: thanks for spotting this, it's now correctedImaginable
@alistaire: I just tried to use your solution to calculate cumulative sums of numerical values, e.g.: order_value_minus7 = map_int(order_date, ~cumsum(order_value[order_date >= .x - 7 & order_date <= .x])), but it returns Error: object 'order_value' not found. Why can't the same approach work universally, or can it? Thanks again for your helpImaginable
Fwiw, the data.table analogue of n_distinct (in alistaire's comment-answer) is uniqueN.Bossuet
You don't have a variable called order_value in what you show above. I'm not posting an answer because the data doesn't match your desired result, I believe because the desired result is miscalculated, if I understand what you want. Also, I'm pretty sure there's a better way to do it, anyway.Wily
@alistaire: sure, that's absolutely understandable. I asked about order_value because your solution looked pretty universal and I thought it would require only a small tweak to use it for numerical variable. Anyway, why do you think that the desired solution is miscalculated?Imaginable
W
1

In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.

library(tidyverse)

df %>% group_by(user_id) %>% 
    mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
           distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
## 
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2
Wily answered 20/1, 2017 at 0:53 Comment(0)
T
4

I recommend using runner package. You can use any R function on running windows with runner function. Code below obtains desided output, which is past 7-days + current and past 14-days + current (current 8 and 15 days):

df <- read.table(
  text = "  user_id       date category
       27 2016-01-01    apple
  27 2016-01-03    apple
  27 2016-01-05     pear
  27 2016-01-07     plum
  27 2016-01-10    apple
  27 2016-01-14     pear
  27 2016-01-16     plum
  11 2016-01-01    apple
  11 2016-01-03     pear
  11 2016-01-05     pear
  11 2016-01-07     pear
  11 2016-01-10    apple
  11 2016-01-14    apple
  11 2016-01-16    apple", header = TRUE, colClasses = c("integer", "Date", "character"))



library(dplyr)
library(runner)
df %>%
  group_by(user_id) %>%
  mutate(distinct_7  = runner(category, k = 7 + 1, idx = date, 
                              f = function(x) length(unique(x))),
         distinct_14 = runner(category, k = 14 + 1, idx = date, 
                              f = function(x) length(unique(x))))

More informations in package and function documentation.

Taker answered 27/10, 2019 at 11:58 Comment(0)
M
3

Here are two data.table solutions, one with two nested lapplyand the other using non-equi joins.

The first one is a rather clumsy data.table solution but it reproduces the expected answer. And it would work for an arbitrary number of time frames. (Although @alistaire's concise tidyverse solution he had suggested in his comment could be modified as well).

It uses two nested lapply. The first one loops over the time frames, the second one over the dates. The tempory result is joined with the original data and then reshaped from long to wide format so that we will end with a separate column for each of the time frames.

library(data.table)
tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) rbindlist(
           lapply(unique(dt$date), 
                  function(ldate) {
                    dt[between(date, ldate - ldays, ldate), 
                       .(distinct = sprintf("distinct_%02i", ldays), 
                         date = ldate, 
                         N = uniqueN(category)), 
                       by = .(user_id)]
                  })
         )
  )
)
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 
#          date user_id category distinct_07 distinct_14
# 1: 2016-01-01      27    apple           1           1
# 2: 2016-01-03      27    apple           1           1
# 3: 2016-01-05      27     pear           2           2
# 4: 2016-01-07      27     plum           3           3
# 5: 2016-01-10      27    apple           3           3
# 6: 2016-01-14      27     pear           3           3
# 7: 2016-01-16      27     plum           3           3
# 8: 2016-01-01      11    apple           1           1
# 9: 2016-01-03      11     pear           2           2
#10: 2016-01-05      11     pear           2           2
#11: 2016-01-07      11     pear           2           2
#12: 2016-01-10      11    apple           2           2
#13: 2016-01-14      11    apple           2           2
#14: 2016-01-16      11    apple           1           2

Here is a variant following a suggestion by @Frank which uses data.table's non-equi joins instead of the second lapply:

tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) {
           dt[.(user_id = user_id, dago = date - ldays, d = date), 
              on=.(user_id, date >= dago, date <= d), 
              .(distinct = sprintf("distinct_%02i", ldays),
                N = uniqueN(category)), 
              by = .EACHI]
         }
  )
)[, date := NULL]
# 
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 

Data:

dt <- fread("user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple")
dt[, date := as.IDate(date)]

BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.

Memory answered 18/1, 2017 at 9:17 Comment(0)
W
1

In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.

library(tidyverse)

df %>% group_by(user_id) %>% 
    mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
           distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
## 
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2
Wily answered 20/1, 2017 at 0:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.