Aggregating sequential and grouped data in R

I have a dataset that looks like this toy example. The data describes the location a person has moved to and the time since this relocation happened. For example, person 1 started out in a rural area, but moved to a city 463 days ago (2nd row), and 415 days ago he moved from this city to a town (3rd row), etc.

set.seed(123)
df <- as.data.frame(sample.int(1000, 10))
colnames(df) <- "time"
df$destination <- as.factor(sample(c("city", "town", "rural"), size = 10, replace = TRUE, prob = c(.50, .25, .25)))
df$user <- sample.int(3, 10, replace = TRUE)
# Sort by user, then by descending time, and keep the sorted result
df <- df[order(df[,"user"], -df[,"time"]), ]

The data:

time destination user
 526       rural    1
 463        city    1
 415        town    1
 299        city    1
 179       rural    1
 938        town    2
 229        town    2
 118        city    2
 818        city    3
 195        city    3

I wish to aggregate this data to the format below. That is, I want to count each type of relocation (from one destination to the next) within each user, and sum the counts across users into one table. How do I achieve this (preferably without writing loops)?

from  to     count
city  city   1
city  town   1
city  rural  1
town  city   2
town  town   1
town  rural  0
rural city   1
rural town   0
rural rural  0
Cephalochordate answered 22/7, 2021 at 19:49 Comment(0)

One possible way, based on the data.table package:

library(data.table)

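# All destinations, used to build every from/to combination
cases <- unique(df$destination)

# Pair each destination with the next one per user, then join on all
# combinations; by=.EACHI counts matches per combination (0 if none)
setDT(df)[, .(from=destination, to=shift(destination, -1)), by=user
          ][CJ(from=cases, to=cases), .(count=.N), by=.EACHI, on=c("from", "to")]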


#      from     to count
#    <char> <char> <int>
# 1:   city   city     1
# 2:   city  rural     1
# 3:   city   town     1
# 4:  rural   city     1
# 5:  rural  rural     0
# 6:  rural   town     0
# 7:   town   city     2
# 8:   town  rural     0
# 9:   town   town     1
Perfume answered 22/7, 2021 at 21:0 Comment(1)
Thanks, always interesting to learn new tricks with data.table. Just a side note for future visitors: this solution works great, but it relies on the data being ordered by user and time, as in the example. - Cephalochordate
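Following up on the ordering caveat in the comment above, here is a minimal sketch that sorts the data before pairing the moves. It assumes the toy data from the question, typed in by hand and deliberately shuffled; the names dt, moves and result are only illustrative, and shift(..., type = "lead") is used in place of shift(..., -1).

```r
library(data.table)

# Toy data from the question, in shuffled row order (typed in by hand)
dt <- data.table(
  time = c(415, 463, 179, 526, 195, 938, 818, 118, 299, 229),
  destination = c("town", "city", "rural", "rural", "city",
                  "town", "city", "city", "city", "town"),
  user = c(1, 1, 1, 1, 3, 2, 3, 2, 1, 2)
)

# Sort by user, then by descending time (oldest move first), by reference
setorder(dt, user, -time)

cases <- c("city", "town", "rural")

# Pair each destination with the next one within the same user
moves <- dt[, .(from = destination, to = shift(destination, type = "lead")),
            by = user]

# Join on every from/to combination; unobserved pairs get a count of 0
result <- moves[CJ(from = cases, to = cases), .(count = .N),
                by = .EACHI, on = c("from", "to")]
print(result)
```

With the sort in place, the counts no longer depend on the order in which the rows arrive.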

Here is a tidyverse solution:

library(dplyr)
library(purrr)

df %>%
  group_split(user) %>%
  map_dfr(~ bind_cols(as.character(.x[["destination"]][-nrow(.x)]), 
                  as.character(.x[["destination"]][-1])) %>%
        set_names("from", "to")) %>%
  group_by(from, to) %>%
  count()

# A tibble: 6 x 3
# Groups:   from, to [6]
  from  to        n
  <chr> <chr> <int>
1 city  city      1
2 city  rural     1
3 city  town      1
4 rural city      1
5 town  city      2
6 town  town      1
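The output above lists only the six observed combinations, while the question also asks for the zero-count pairs. A minimal sketch of filling them in, assuming the tidyr package is available: the counts data frame below is copied by hand from the printed result, and places is an illustrative name for the set of destinations. The ungroup() call matters when starting from the grouped tibble above, because complete() works within groups.

```r
library(dplyr)
library(tidyr)

# Observed counts, copied from the printed tibble
counts <- data.frame(
  from = c("city", "city", "city", "rural", "town", "town"),
  to   = c("city", "rural", "town", "city", "city", "town"),
  n    = c(1L, 1L, 1L, 1L, 2L, 1L),
  stringsAsFactors = FALSE
)

places <- c("city", "town", "rural")

# Add every missing from/to combination with a count of 0
result <- counts %>%
  ungroup() %>%
  complete(from = places, to = places, fill = list(n = 0L))

print(result)
```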
Lauzon answered 22/7, 2021 at 22:40 Comment(0)

Here is a dplyr-only solution:

  1. Identify from and to with the lead function and combine them with paste0 into a helper column.
  2. Remove the NA rows caused by lead.
  3. Use add_count to add an n column.

library(dplyr)

df %>% 
  group_by(user) %>% 
  rename(from = destination) %>% 
  mutate(to = lead(from), .before=3) %>% 
  mutate(helper = paste0(from, to)) %>% 
  filter(!is.na(to)) %>% 
  group_by(helper) %>% 
  add_count(helper, from, to) %>% 
  ungroup() %>% 
  select(user, from, to, n)

output:

   user from  to        n
  <int> <fct> <fct> <int>
1     1 rural city      1
2     1 city  town      1
3     1 town  city      2
4     1 city  rural     1
5     2 town  town      1
6     2 town  city      2
7     3 city  city      1
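The output above keeps one row per move, so a pair such as town to city appears twice. A minimal sketch of collapsing it into one row per pair, zero counts included, assuming destination is a factor whose levels cover every place: count() with .drop = FALSE keeps empty factor combinations. The data is typed in by hand from the question, and result is an illustrative name.

```r
library(dplyr)

# Toy data from the question; the factor levels define the full set of places
df <- data.frame(
  time = c(526, 463, 415, 299, 179, 938, 229, 118, 818, 195),
  destination = factor(
    c("rural", "city", "town", "city", "rural",
      "town", "town", "city", "city", "city"),
    levels = c("city", "town", "rural")
  ),
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)
)

result <- df %>%
  arrange(user, desc(time)) %>%           # oldest move first within each user
  group_by(user) %>%
  mutate(from = destination, to = lead(destination)) %>%
  ungroup() %>%
  filter(!is.na(to)) %>%                  # drop each user's last location
  count(from, to, name = "count", .drop = FALSE)

print(result)
```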
Interception answered 23/7, 2021 at 8:41 Comment(0)

Here is a data.table option:

library(data.table)

setDT(df)[
    ,
    setNames(
        rev(data.frame(embed(as.character(destination), 2))),
        c("from", "to")
    ), user
][, count := .N, .(from, to)][]

which gives

   user  from    to count
1:    1 rural  city     1
2:    1  city  town     1
3:    1  town  city     2
4:    1  city rural     1
5:    2  town  town     1
6:    2  town  city     2
7:    3  city  city     1
Death answered 22/7, 2021 at 20:2 Comment(0)
