identify groups of linked episodes which chain together

Asked 27/8, 2012 at 3:38 Answered 16/8, 2020 at 0:8

Take this simple data frame of linked ids:

test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

> test
  id1 id2
1  10   1
2  10  36
3   1  24
4   1  45
5  24 300
6   8  11

I now want to group together all the ids which link. By 'link', I mean follow through the chain of links so that all ids in one group are labelled together. A kind of branching structure. i.e:

Group 1
10 --> 1,   1 --> (24,45)
                   24 --> 300
                          300 --> NULL
                   45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300

Group 2
8 --> 11
      11 --> NULL
Final group members: 8,11

Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match or %in% to go down each branch, but am truly stumped this time.

The final result I would be chasing is:

result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))

> result
  group  id
1     1  10
2     1   1
3     1  24
4     1  36
5     1  45
6     1 300
7     2   8
8     2  11

Washday answered 27/8, 2012 at 3:38 Comment(3)

I wish SO and this question had been available 25 years ago when I was banging my head against the wall with SAS trying to solve this question. – Dry 27/3, 2015 at 21:13

@bondeddust - coincidentally this question arose as a result of trying to replace an ugly and inefficient piece of SAS code that did something similar. – Washday 27/3, 2015 at 23:40

And it now turns out that I forgot this question and answer, but am being reminded of it by @Henrik – Dry 10/6, 2021 at 0:51

The Bioconductor package RBGL (an R interface to the BOOST graph library) contains a function, connectedComp(), which identifies the connected components in a graph -- just what you are wanting.

(To use the function, you will first need to install the graph and RBGL packages, available here and here.)

library(RBGL)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

## Convert your 'from-to' data to a 'node and edge-list' representation  
## used by the 'graph' & 'RBGL' packages 
g <- ftM2graphNEL(as.matrix(test))

## Extract the connected components
cc <- connectedComp(g)

## Massage results into the format you're after 
ld <- lapply(seq_along(cc), 
             function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
do.call(rbind, ld)
#   group  id
# 1     1  10
# 2     1   1
# 3     1  24
# 4     1  36
# 5     1  45
# 6     1 300
# 7     2   8
# 8     2  11

Rhinarium answered 27/8, 2012 at 5:17 Comment(3)

Thanks for the answer. I also now have a term in 'connected components' to use when searching for more info. – Washday 27/8, 2012 at 5:43

Glad to be able to point you down a helpful path. Cheers, and happy trails to you. – Retain 27/8, 2012 at 5:47

I just tested this answer and the links and packages do succeed as they did nine years ago. – Dry 10/6, 2021 at 0:53

Here's an alternative answer that I have discovered myself after the nudging in the right direction by Josh. This answer uses the igraph package. For those that are searching and come across this answer, my test dataset is referred to as an "edge list" or "adjacency list" in graph theory (http://en.wikipedia.org/wiki/Graph_theory)

library(igraph)
test <- data.frame(id1=c(10,10,1,1,24,8 ),id2=c(1,36,24,45,300,11))
gr.test <- graph_from_data_frame(test)
links <- data.frame(id=unique(unlist(test)),group=components(gr.test)$membership)
links[order(links$group),]

#   id group
#1  10     1
#2   1     1
#3  24     1
#5  36     1
#6  45     1
#7 300     1
#4   8     2
#8  11     2

Washday answered 29/8, 2012 at 4:4 Comment(0)

(To use the function, you will first need to install the graph and RBGL packages, available here and here.)

library(RBGL)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

## Convert your 'from-to' data to a 'node and edge-list' representation  
## used by the 'graph' & 'RBGL' packages 
g <- ftM2graphNEL(as.matrix(test))

## Extract the connected components
cc <- connectedComp(g)

## Massage results into the format you're after 
ld <- lapply(seq_along(cc), 
             function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
do.call(rbind, ld)
#   group  id
# 1     1  10
# 2     1   1
# 3     1  24
# 4     1  36
# 5     1  45
# 6     1 300
# 7     2   8
# 8     2  11

Rhinarium answered 27/8, 2012 at 5:17 Comment(3)

Thanks for the answer. I also now have a term in 'connected components' to use when searching for more info. – Washday 27/8, 2012 at 5:43

Glad to be able to point you down a helpful path. Cheers, and happy trails to you. – Retain 27/8, 2012 at 5:47

I just tested this answer and the links and packages do succeed as they did nine years ago. – Dry 10/6, 2021 at 0:53

Without using packages:

# 2 sets of test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

grouppairs <- function(df){

  # from wide to long format; assumes df is 2 columns of related id's
  test <- data.frame(group = 1:nrow(df),val = unlist(df))

  # keep moving to next pair until all same values have same group
  i <- 0
  while(any(duplicated(unique(test)$val))){
    i <- i+1

    # get group of matching values
    matches <- test[test$val == test$val[i],'group']

    # change all groups with matching values to same group
    test[test$group %in% matches,'group'] <- test$group[i]
  }

  # renumber starting from 1 and show only unique values in group order
  test$group <- match(test$group, sort(unique(test$group)))
  unique(test)[order(unique(test)$group), ]
}

# test
grouppairs(test)
grouppairs(mytest)

Swearword answered 10/11, 2014 at 1:53 Comment(0)

You said recursive... and I thought I'd be super terse while I'm at it.

Test data

mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

Recursive function to get the groupings

aveminrec <- function(v1,v2){
  v2 <- ave(v1,by = v2,FUN = min)
  if(identical(v1,v2)){
    as.numeric(as.factor(v2))
  }else{
    aveminrec(v2,v1)
  }
}

Prep data and simplify after

groupvalues <- function(valuepairs){
  val <- unlist(valuepairs)
  grp <- aveminrec(val,1:nrow(valuepairs))
  unique(data.frame(grp,val)[order(grp,val), ])
}

Get results

groupvalues(test)
groupvalues(mytest)

aveminrec() is probably along the lines of what you were thinking, though I bet there's a way to be more direct about going down each branch instead of repeating ave() which is essentially split() and lapply(). Maybe recursively split and lapply? As it is, it's like repeated partial branching, or alternately simplifying 2 vectors slightly without group information loss.

Maybe parts of this would be used on a real problem, but groupvalues() is way too dense to read without some comments at least. I also haven't checked how performance compares to a for loop with ave and flipping the groups that way.

Swearword answered 16/8, 2020 at 0:8 Comment(2)

There is a function in R named Recall that supposedly improves recursion-based code. And ave can be thought of as a simplistic bundling of lapply and split, but it's not really capable of handling operations across multiple columns within the groupings. – Dry 10/6, 2021 at 0:55

@Dry Recall should probably be used in there, really good to know! – Swearword 12/7, 2021 at 15:30

Recommended topics

Hot tags