How to summarize a list of combination
Asked Answered
S

5

8

I have a list of 2 elements' combination like below.

cbnl <- list(
  c("A", "B"), c("B", "A"), c("C", "D"), c("E", "D"), c("F", "G"), c("H", "I"),
  c("J", "K"), c("I", "H"), c("K", "J"), c("G", "F"), c("D", "C"), c("E", "C"),
  c("D", "E"), c("C", "E")
)

I'd like to summarize above list. Expected result is like below list. Order of element in a vector doesn't matter here.

[[1]]
[1] "A" "B"

[[2]]
[1] "C" "D" "E"

[[3]]
[1] "F" "G"

[[4]]
[1] "H" "I"

[[5]]
[1] "J" "K"

(Rule 1) {A, B} is equivalent to {B, A}. To correspond this I think I can do this.

cbnl <- unique(lapply(cbnl, function(i) { sort(i) }))

(Rule 2) {A, B}, {B, C} (One of element is common) then take a union of two sets. It results {A, B, C}. I don't have clear nice idea to do this.

Any efficient way to do this?

Superaltar answered 15/12, 2021 at 12:44 Comment(2)
Yes, it's essentially the same. From graph theory view point this can be thought as connections of nodes. Thank you for your point.Superaltar
I thought Merging Listed Vectors that share Elements in R is essentially the same, but R code doesn't work for this case. The answer here may not work for a list of numeric vectors.....Superaltar
W
7

I know this answer is more like a traditional programming rather than "R like" but it solves the issue.

cbnl <- unique(lapply(cbnl, sort))

i <- 1
count <- 1
out <- list()

while (i <= length(cbnl) - 1) {
  if (sum(cbnl[[i]] %in% cbnl[[i + 1]]) == 0) {
    out[[count]] <- cbnl[[i]]
    } else {
      out[[count]] <- sort(unique(c(cbnl[[i]], cbnl[[i + 1]])))
      i <- i + 1        
    }
  count <- count + 1
  i <- i + 1 
}

out

gives,

[[1]]
[1] "A" "B"

[[2]]
[1] "C" "D" "E"

[[3]]
[1] "F" "G"

[[4]]
[1] "H" "I"

[[5]]
[1] "J" "K"
Wanderlust answered 15/12, 2021 at 12:58 Comment(1)
Great. Thank you for the answer. I got a expected result.Superaltar
D
6

You can try the following igraph option

library(igraph)

graph_from_data_frame(do.call(rbind, cbnl)) %>%
  components() %>%
  membership() %>%
  stack() %>%
  with(., split(as.character(ind), values))

which gives

$`1`
[1] "A" "B"

$`2`
[1] "C" "E" "D"

$`3`
[1] "F" "G"

$`4`
[1] "H" "I"

$`5`
[1] "J" "K"

A shorter one

graph_from_data_frame(do.call(rbind, cbnl)) %>%
  decompose() %>%
  Map(function(x) names(V(x)), .)

which gives

[[1]]
[1] "A" "B"

[[2]]
[1] "C" "E" "D"

[[3]]
[1] "F" "G"

[[4]]
[1] "H" "I"

[[5]]
[1] "J" "K"
Dilettante answered 15/12, 2021 at 12:59 Comment(5)
Oh, this is shorter command. Expected result comes out.Superaltar
@Superaltar you can find a shorter one in my updateDilettante
Thank you. Very simple code!!Superaltar
The funny thing is that few months ago I was searching for an algorithm / function to find records linked in this way and I remember I couldn't find an appropriate words to use in google (but I also didn't ask on SO) and now I see I should look around graph topic :D (I was sure that something like this must exists already!).Hyehyena
It sounds a good info. If you can provide what a exact name of topic is, please let me know.Superaltar
S
5

Base R: sorting union as FUN= in combn, then partly filling a matrix based on unique elements u and removing duplicated rows, and finally coercing as.list.

u <- Reduce(union, cbnl)  ## get unique elements

res <- combn(cbnl, 2, \(x) {
  if (length(intersect(x[[1]], x[[2]])) > 0) {
    union(x[[1]], x[[2]])
  } else {
    el(x)
  }
}, simplify=FALSE) |>
  unique() |>
  (\(x) sapply(x, \(i) replace(rep(NA, length(u)), match(i, u), i)))() |>
  (\(x) x[, !colSums(duplicated(x, MARGIN=1:2)) == nrow(x)])() |>
  (\(x) unname(lapply(as.list(as.data.frame(x)), \(x) x[!is.na(x)])))()

res
# [[1]]
# [1] "A" "B"
# 
# [[2]]
# [1] "C" "D" "E"
# 
# [[3]]
# [1] "F" "G"
# 
# [[4]]
# [1] "H" "I"
# 
# [[5]]
# [1] "J" "K"

Note:

> R.version.string
[1] "R version 4.1.2 (2021-11-01)"
Seanseana answered 15/12, 2021 at 13:7 Comment(7)
Thank you for your answer. But expected result is not shown. I need to think how to get expected list in your way.Superaltar
@Superaltar Actually I missed a case handling, check update. I'm not yet sure, though, which rule exactly leads to your expected result?Seanseana
Thank you very much. Almost completed. Just need to delete {C, D}, {D, E}, {C, E} because {C, D, E} includes them.Superaltar
@Superaltar Got ya, see update!Seanseana
Thank you. Nearly completed but still {D, E} remains...Superaltar
@Superaltar Now it should work!Seanseana
Finally, yes I could get expected result. Thank you very much.Superaltar
H
4

I took a one line of code from @ThomasIsCoding and would like to show that we can achieve this using my package dedupewider.

library(dedupewider)
library(purrr)
library(magrittr)

cbnl <- list(
  c("A", "B"), c("B", "A"), c("C", "D"), c("E", "D"), c("F", "G"), c("H", "I"),
  c("J", "K"), c("I", "H"), c("K", "J"), c("G", "F"), c("D", "C"), c("E", "C"),
  c("D", "E"), c("C", "E")
)

cbnl_df <- data.frame(do.call(rbind, cbnl))

result <- dedupe_wide(cbnl_df, names(cbnl_df)) # it performs deduplication by connecting elements which are linked by transitive relation

result_list <- as.list(as.data.frame(t(result)))

result_list <- map(result_list, ~ .x[!is.na(.x)]) # remove NA
result_list
#> $V1
#> [1] "A" "B"
#> 
#> $V2
#> [1] "C" "E" "D"
#> 
#> $V3
#> [1] "F" "G"
#> 
#> $V4
#> [1] "H" "I"
#> 
#> $V5
#> [1] "J" "K"

A lot of steps are necessary, because list is an input and output, so with data.frame we would have less code than above.

Hyehyena answered 15/12, 2021 at 13:26 Comment(2)
Thank you for your answer. My original data is data.table so your data.frame way is more preferable for me. I could get my expected result.Superaltar
That's even better, because dedupe_wide internally uses setDT (at the end removes data.table class if wasn't present at the beginning) if needed, so you should get data.table object as returned value when data.table is an input.Hyehyena
S
0

Thank you for all supporters' wonderful answers.

Let me post my own solution by base R as below;

cbnl <- list(
  c("A", "B"), c("B", "A"), c("C", "D"), c("E", "D"), c("F", "G"), c("H", "I"),
  c("J", "K"), c("I", "H"), c("K", "J"), c("G", "F"), c("D", "C"), c("E", "C"),
  c("D", "E"), c("C", "E")
)

repeat {
  # Get A Count Table
  tbl <- table(unlist(cbnl))
  # No Duplicated Items Then break Out
  if (length(tbl[tbl > 1]) == 0) { break }
  # Take A First Duplicated Item And Get the Index
  idx <- which(sapply(seq_len(length(cbnl)), function(i) {
    any(cbnl[[i]] == names(tbl[tbl > 1])[1])
  }))
  # Create New vector By Taking Union
  newvec <- sort(unique(unlist(cbnl[idx])))
  # Append newvec To cbnl And Remove Original vectors
  cbnl <- c(cbnl, list(newvec))[-idx]
}

cbnl

The result is

[[1]]
[1] "A" "B"

[[2]]
[1] "C" "D" "E"

[[3]]
[1] "F" "G"

[[4]]
[1] "H" "I"

[[5]]
[1] "J" "K"

Here is data.table version.

cbn <- data.table(
  item1 = c("A", "B", "C", "E", "F", "H", "J", "I", "K", "G", "D", "E", "D", "C"),
  item2 = c("B", "A", "D", "D", "G", "I", "K", "H", "J", "F", "C", "C", "E", "E")
)

repeat {
  # Get A Count Table
  tbl <- table(as.vector(as.matrix(cbn)))
  # No Duplicated Items Then break Out
  if (length(tbl[tbl > 1]) == 0) { break }
  # Take A First Duplicated Item And Get Row Numbers Where The Item Is Located
  idx <- which(cbn == names(tbl[tbl > 1])[1], arr.ind = TRUE)[, 1]
  # Create New Row By Taking Union
  newrow <- setDT(as.list(sort(unique(as.vector(as.matrix(cbn[idx]))))))
  names(newrow) <- paste0("item", seq_len(ncol(newrow)))
  # Append newrow To cbn And Remove Original Rows
  cbn <- rbindlist(l = list(cbn, newrow), use.names = TRUE, fill = TRUE)[-idx]
}

cbn

This result is as below.

   item1 item2 item3
1:     A     B  <NA>
2:     C     D     E
3:     F     G  <NA>
4:     H     I  <NA>
5:     J     K  <NA>
Superaltar answered 21/12, 2021 at 12:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.