Grouping Ids based on at least one common values
Asked Answered
M

3

5

I have a list whose elements are integers and I would like to accumulate these elements if only they share at least one value. With regard to those elements that don't share any values with the rest I would like them to stay as they are. Here is my sample date:

x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8), c(6, 9, 7), 7, c(5, 8), 10, 11)

And here is my desired output:

desired_reult <- list(c(1, 2, 3, 4, 5, 8), 
                      c(6, 9, 7), 
                      10, 
                      11)

I would like to do it first with reduce or accumulate functions from purrr but any other tidyverse solution would be welcomed. I have tried this solution so far but it only gave me one union and apparently abandons the rest:

x %>% 
  reduce(~ if(any(.x %in% .y)) union(.x, .y) else .x)

[1] 1 2 3 4 5 8

In general I am looking for a way of grouping integers (ids) with common values like a sort of clustering but so far my efforts have been in vain unfortunately.

Thank you very much indeed for your help in advance.

Matthewmatthews answered 14/6, 2021 at 13:8 Comment(5)
What'd your desired output be if there's one more element in the x say c(5,6) in the end i.e. after 11?Narcosynthesis
@Narcosynthesis , In that case it will link the first and second elements of the desired result.Matthewmatthews
I finally figured out the specific search approach to find the duplicate: Union of intersecting vectors in a list in RJohan
Thank you dear Ian, that looks great! I will have a look into this in particular the second solution which solely relies on base R. I actually failed to take into consideration that I could've thought about writing a custom function for this. Using a for loop will always make things easier in my opinion.Matthewmatthews
Dear @IanCampbell I have posted my own custom function for this question, would you mind checking it whenever you have time. It's ugly and inelegant but quite effective I suppose.Matthewmatthews
J
3

I suspect there's a set covering solution to be had, but in the interim here's a graph approach:

First, let's convert the integer vectors to an edge list so it can be made into a graph. We can use expand.grid.

library(igraph)
edgelist <- do.call(rbind,lapply(x,\(x)expand.grid(x,x))) #R version >= 4.1.0

Now we have a two column data.frame showing the connections between all the integers (a set of edges).

igraph::graph.data.frame can conveniently make a graph from this.

From there we can use igraph::components to extract the connected components.

g <- graph.data.frame(edgelist)
split(names(components(g)$membership),components(g)$membership)
#$`1`
#[1] "1" "2" "3" "4" "5" "8"
#$`2`
#[1] "6" "9" "7"
#$`3`
#[1] "10"
#$`4`
#[1] "11"

Or with Tidyverse:

library(dplyr); library(purrr)
map_dfr(x, ~expand.grid(.x,.x)) %>%
  graph.data.frame() %>%
  components() %>% 
  pluck(membership) %>%
  stack() %>%
  {split(as.numeric(as.character(.[,2])),.[,1])}

$`1`
[1] 1 2 3 4 5 8

$`2`
[1] 6 9 7

$`3`
[1] 10

$`4`
[1] 11
Johan answered 14/6, 2021 at 13:31 Comment(5)
Thank you dear Ian for this detailed answer and also explanations. I got it totally except for the last split function which I would be grateful if you could explain it to me.Matthewmatthews
why split behaves differently when wrapped in {}Narcosynthesis
If you pipe the left hand side in without curly brackets, the whole data frame will become the first argument. This way, with {}, the left hand side is assigned as ., but you can put it anywhere you want.Johan
@IanCampbell, upvoted already. However, if you ever came across the set please post a link for referenceNarcosynthesis
@IanCampbell, directly converting a factor variable was not giving results as intended, so converted it to character first. :)Narcosynthesis
V
3

One way of doing it:

i is adjacent to j iff intersect(i, j) != empty set. We want to find connected components of matrix that on position (i,j) has 1 iff set i is adjecent to set j, 0 otherwise. First 4 rows build adjacency matrix, 5th and 6th row finds connected components and rest is splitting list based on that membership and taking unique values.

library(tidyverse)
library(igraph)

map(x, function(a) map_int(x, ~length(base::intersect(a, .x)) > 0) * 1L) %>% 
  reduce(rbind) %>%
  graph.adjacency() %>%
  as.undirected() %>%
  components() %>%
  pluck("membership") %>%
  split(seq_along(.), .) %>%
  map(~unique(unlist(x[.x])))
Veronikaveronike answered 14/6, 2021 at 13:31 Comment(4)
@IanCampbell one big problem with great minds is you don't know which one to choose lol.Matthewmatthews
Thank you very much dear @Veronikaveronike for your additional details. It was really helpful. Do you have any suggestion on where to start learning igraph since I also need it for my future network analysis course at university. In the end I appreciate your help.Matthewmatthews
I haven't really used igraph that much. I think you can find a lot of useful information in book https://www.springer.com/gp/book/9783030441289Veronikaveronike
Thank you very much dear @det, the book is also great and I will soon delve into it. You are terrific R coder and it was great pleasure talking to you and getting to know you. Best wishes.Matthewmatthews
M
2

Thanks to a very informative post introduced by my dear friend @Ian Canmpbell, I thought to challenge myself to write a custom function for this purpose. It is still the first version, though not very elegant and can certainly be improved greatly but for now it is stable as I tried it on some inputs and it did not disappoint.

anoush <- function(x) {
# First we check whether x is a list

  stopifnot(is.list(x)) 

# Then we take every element of the input and calculate the intersect between
# that element & others. In case there were some we would store the indices 
# in `vec` vector. So in the end we have a list called `ind` whose elements 
# are all the indices connected with the corresponding elements of the original 
# list for example first element of `ind` is `1`, `2`, `3` which means in 
# the original list these elements have common values.
  
  ind <- lapply(1:length(x), function(a) {
    vec <- c()
    for(i in 1:length(x)) {
      if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
        vec <- c(vec, i)
      }
    }
    vec 
    })

# Then we go on to again compare each element of `ind` with other elements
# in case there were any intersect, we will calculate the `union` of them.
# for each element we will end up with a list of accumulated values but
# but in the end we use `Reduce` to capture only the last one. So for each
# element of `ind` we end up having a collection of indices that also 
# result in duplicated values. For example elements `1` through `5` of 
# `dup_ind` contains the same value cause in the original list these 
# elements have common values.

  dup_ind <- lapply(1:length(ind), function(a) {
    out <- c()
    for(i in 1:length(ind)) {
      if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
        out[[i]] <- union(ind[[a]], ind[[i]])
      }
      vec2 <- Reduce("union", out)
    }
    vec2
  }) 

# Here we get rid of the duplicated elements of the list by means of 
# `relist` funciton and since in this process all the duplicated elements
# will turn to `integer(0)` I have filtered those out.
  
  un <- unlist(dup_ind)
  res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
  res2 <- Filter(length, res)
  
  sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))
  
}

Output

> anoush(x)

[[1]]
[1] 1 2 3 4 5 8

[[2]]
[1] 6 9 7

[[3]]
[1] 10

[[4]]
[1] 11
Matthewmatthews answered 16/6, 2021 at 21:50 Comment(2)
I've never used the relist function. That's pretty slick.Johan
That's magical actually, In particular the concepts of flesh and skeleton as the docs says Skeleton is a sample object that has the right shape but the wrong content. flesh is a vector with the right content but the wrong shape. Here I first unlist my list then I applied my desired transformation on it and then restructure it like the original one. I myself learned that from one of @akrun's solutions.Matthewmatthews

© 2022 - 2025 — McMap. All rights reserved.