How to efficiently divide pairs into clusters such that each cluster contains all entries of a given set
Asked Answered
R

3

15

Assuming we have a set v of even cardinality, e.g., v <- 1:6, and a data.frame df consisting of entries of v, which is defined by an fixed occurrence of each element from v in each column, namely, k, for example

k <- 2
x <- rep(v, each = k)
df <- data.frame(A = x, B = c(tail(x, -(k + 1)), head(x, k + 1)))

shown as

> df
   A B
1  1 2
2  1 3
3  2 3
4  2 4
5  3 4
6  3 5
7  4 5
8  4 6
9  5 6
10 5 1
11 6 1
12 6 2

where the occurrences of 1:6 on both columns are 2

> table(df$A)

1 2 3 4 5 6
2 2 2 2 2 2

> table(df$B)

1 2 3 4 5 6
2 2 2 2 2 2 

Objective and Expected Output

In df, each row represents a "pair", and there are no duplicated "pairs". I am wondering, how to divide the pairs into clusters, such that each cluster is minimal and complete, i.e., each cluster contains all values from v, without any duplicated entries.

Since the cardinality of v is length(v), and the occurrence of each entry in df is actually 2*k, the number of clusters via a "ideal" split of df should be 2*k*length(v)/length(v) == 2*k. In other words, the number of clusters is defined by k only, say, 2*k.

For example, df can be divided into 4 clusters like below, where the "completeness" property can be achieved

[[1]]
  A B
1 1 2
5 3 4
9 5 6

[[2]]
   A B
2  1 3
7  4 5
12 6 2

[[3]]
   A B
3  2 3
8  4 6
10 5 1

[[4]]
   A B
4  2 4
6  3 5
11 6 1

Note that, output above is just one of the valid instances, and there should be other candidates for clustering.

Question

One possible solution is using Monte Carlo simulation and keep the valid clustering outcomes, iteratively, if the randomized cluster satisfy all constraints. The code may look like below

out <- c()
repeat {
  if (nrow(df) == 0) {
    break
  }
  repeat {
    k <- sample.int(nrow(df), length(v) / 2)
    if (!length(setdiff(v, unlist(df[k, ])))) {
      out <- c(out, list(df[k, ]))
      df <- df[-k, ]
      break
    }
  }
}

which sometimes can give the desired output like

> out
[[1]]
   A B
6  3 5
11 6 1
4  2 4

[[2]]
   A B
2  1 3
7  4 5
12 6 2

[[3]]
   A B
8  4 6
3  2 3
10 5 1

[[4]]
  A B
1 1 2
9 5 6
5 3 4

However, this approach has a major issue, say, Inefficiency: If the set v has a large cardinality, the space for Monte Carlo simulation exponentially grows, which dramatically slows down the procedure of finding a valid solution.


I am wondering, if there is a stable and more efficient to solve such kind of problem. I think backtracking should work, but I believe there must be other approaches that can make it in a more elegant manner.

Looking forward to diverse and interesting solutions. Appreciated in advance!

Rarotonga answered 14/12, 2023 at 23:41 Comment(12)
Don't the constraints here mean that each member is just a 3x2 permutation of the numbers 1:6? Would it not be easier to approach the problem as generating 4 groups of permutations in which no row is repeated? Also, I think I'm right in saying that any valid group's mirror image would be another valid group?Jeaniejeanine
@AllanCameron Thanks for the comment. Your proposal of starting with 3-by-2 permutations is the inverse process of "clustering", but it still needs to consider the constraint that: in the resulting df (if we want to recover df from the clusters), besides the conditions of non-repeated pairs, the occurrence of each elements in each column is the same. And of course, I am more curious about the "dividing" strategy instead of the other way around.Rarotonga
This feels similar to Sudoku generating algorithm - for example https://mcmap.net/q/224249/-how-to-generate-sudoku-boards-with-unique-solutions/10276092Dambro
@Dambro yes, sort of like Sudoku, and that's why I was saying backtracking should work. I myself ever implemented a Sudoku solver using backtracking https://mcmap.net/q/826009/-solving-a-sudoku-by-hand, but I am looking forward to other possibly simpler algorithm for this problemRarotonga
How do you choose between all possible clusterings? Are you trying to minimize the size of each cluster/maximize the number of clusters? I am assuming you don't just want the clustering containing a single cluster consisting of all the pairs in df.Prandial
@Prandial Thanks for your comment. With your feedback, I clarify a bit more in the question: "each cluster contains all values from v, without any duplicated entries.". So, yes, I am trying to minimize the size of each cluster and make sure each cluster is complete as well.Rarotonga
Can you please provide a readable example where the choice of some clusters prevents completing to a valid solution?Rinderpest
@גלעדברקן I think the difficulty is that the previous choices will affect what follows. For instance, it is probably infeasible to generate the last cluster since the the remaining values are not able to be clustered, since the previous 3 clusters has used up all "resources". I guess this is a global optimization problem, where sequential choices (based on local optimality) cannot not always guarantee the global optimality.Rarotonga
Yes, I think I understand the intention behind the assumption, but can you please provide a readable example where the choice of some clusters prevents completing to a valid solution?Rinderpest
@גלעדברקן aha, now I got your point. Sorry that I mixed it up with another solution where the pairs in each cluster are added iteratively, then there would be the problem that the previous choices hinders the following options. And you are right that in the Monte Carlo code as I wrote, it doesn't have such kind of issue since each cluster is generated as a whole. The only issue is about the efficiency.Rarotonga
Should this work for a df with more than two columns?Stomatology
@Stomatology no, just 2 columnsRarotonga
S
5

I'm not confident that I completely follow what the desired behavior is, so I recommend further testing of this solution. The idea is to:

  1. Create a graph where each vertex is a row of df, and two vertices are connected if they have no elements in common.
  2. Find the maximal cliques (largest collections of rows that can go together without resulting in a duplicate).
  3. Solve the set cover problem from the results of step 2 so that each row is represented in the output.

library(Rfast) # for `rowTabulate` and `rowMaxs`
library(adagio) # for `setcover`
library(igraph) # for `max_cliques`

f <- function(df) {
  v <- unique(unlist(df))
  pairs <- combn(nrow(df), 2)
  n <- choose(nrow(df), 2)
  y <- matrix(match(unlist(df[combn(nrow(df), 2),]), v), 2*n, 2, 1)
  y <- rowTabulate(cbind(y[1:n,], y[(n + 1):(2*n),]), length(v))
  mode(y) <- "numeric"
  g <- graph_from_data_frame(as.data.frame(t(pairs[,rowMaxs(y, TRUE) == 1])),
                             FALSE)
  cl <- lapply(max_cliques(g, length(v)/2), \(x) as.integer(names(x)))
  m <- matrix(0L, length(cl), nrow(df))
  m[cbind(rep(1:length(cl), each = length(v)/2), unlist(cl))] <- 1L
  lapply(cl[setcover(m)$sets], \(x) df[x,])
}

Testing on df from the question:

f(df)
#> [[1]]
#>    A B
#> 11 6 1
#> 6  3 5
#> 4  2 4
#> 
#> [[2]]
#>    A B
#> 2  1 3
#> 12 6 2
#> 7  4 5
#> 
#> [[3]]
#>    A B
#> 3  2 3
#> 8  4 6
#> 10 5 1
#> 
#> [[4]]
#>   A B
#> 5 3 4
#> 1 1 2
#> 9 5 6
Subantarctic answered 18/12, 2023 at 18:9 Comment(3)
Interesting interpretation in terms of graph, +1!. I am wondering how to solve it without calling external packages such that I can see more strategies under the hood.Rarotonga
adagio::setcover just calls lpSolve::lp, which solves a linear programming problem. The algorithm used in igraph::max_cliques is referenced in the documentation.Subantarctic
Yes, it is a coloring problem. My early attempt was using backtracking and pruning tree iteratively, but seems not that efficient https://mcmap.net/q/786235/-how-to-efficiently-divide-pairs-into-clusters-such-that-each-cluster-contains-all-entries-of-a-given-setRarotonga
F
2

Here is my data.table/igraph approach

updated answer the partitioning part is heavily inspired by this answer

v <- 1:6
k <- 2
x <- rep(v, each = k)
df <- data.frame(A = x, B = c(tail(x, -(k + 1)), head(x, k + 1)))

library(data.table)
library(igraph)
# set of tiles to work with
n = 3
setDT(df)
# set a rownumber
df[, from := .I]
# set rownumber as key
setkey(df, from)
# get all possible combinations with different A and B values
df[df, to := paste0(
  df[!A %in% c(i.A, i.B) & !B %in% c(i.A, i.B), ]$from,
  collapse = ";"), by = .EACHI]
# split to rows
nodes <- df[, .(to = unlist(tstrsplit(to, ";"))), by = .(from)]
# build graph
g <- graph_from_data_frame(d = nodes, directed = FALSE, vertices = df$from)
# get closed triangles (here n = 3) / subsets of n vertices, present in a matrix
allCliques <- cliques(g, min = n,max = n)
# create matrix of subgraphs
cliqueMatrix  <- matrix(c(rep(paste0("cluster", seq_along(allCliques)), 
                          sapply(allCliques, length)), names(unlist(allCliques))), ncol=2)
# make graph of subgraphs
g2 <- graph_from_edgelist(cliqueMatrix, directed = FALSE)
V(g2)$type <- grepl("cluster", V(g2)$name)
plot(g2)

# get icidence matrix to get what cluster contains which vertices
g2.ind <- t(as_incidence_matrix(g2))
# create an adjacency matrix by multiplcate the incidennce matrix by its transpose
g2.adj <- g2.ind %*% t(g2.ind)
# create a graph from the adjacency matrix
g2.adj.g <- graph_from_adjacency_matrix(g2.adj, mode = "undirected")
#plot(g2.adj.g)
# get the independant vertex sets
g2.adj.mis <- ivs(g2.adj.g)
sets <- lapply(g2.adj.mis, function(x) cliqueMatrix[cliqueMatrix[,1] %in% as_ids(x), 2])
# we want the sets with maximum sets size (12 in this case, so all tiles are used) 
lapply(sets[which(sapply(sets, length) == max(sapply(sets, length)))], 
       matrix, nrow = n)
# [[1]]
#     [,1] [,2] [,3] [,4]
# [1,] "5"  "3"  "2"  "1" 
# [2,] "10" "7"  "4"  "6" 
# [3,] "12" "11" "9"  "8" 
# 
# [[2]]
#     [,1] [,2] [,3] [,4]
# [1,] "4"  "3"  "2"  "1" 
# [2,] "6"  "8"  "7"  "5" 
# [3,] "11" "10" "12" "9" 


lapply(sets[which(sapply(sets, length) == max(sapply(sets, length)))], 
       function(m) {
         lapply(as.data.table(matrix(m, nrow = n)), function(x) df[from %in% as.numeric(x), .(rn = from, A, B)])
       })

# [[1]]
# [[1]]$V1
#    rn A B
# 1:  5 3 4
# 2: 10 5 1
# 3: 12 6 2
# 
# [[1]]$V2
#    rn A B
# 1:  3 2 3
# 2:  7 4 5
# 3: 11 6 1
# 
# [[1]]$V3
#    rn A B
# 1:  2 1 3
# 2:  4 2 4
# 3:  9 5 6
# 
# [[1]]$V4
#    rn A B
# 1:  1 1 2
# 2:  6 3 5
# 3:  8 4 6
# 
# 
# [[2]]
# [[2]]$V1
#    rn A B
# 1:  4 2 4
# 2:  6 3 5
# 3: 11 6 1
# 
# [[2]]$V2
#    rn A B
# 1:  3 2 3
# 2:  8 4 6
# 3: 10 5 1
# 
# [[2]]$V3
#    rn A B
# 1:  2 1 3
# 2:  7 4 5
# 3: 12 6 2
# 
# [[2]]$V4
#    rn A B
# 1:  1 1 2
# 2:  5 3 4
# 3:  9 5 6

old answer

note: I lack some mathematic skills, so it could be that the solution only works in this specific usecase. so I would appreciate of someone could take a look at that

I believe this is a way to select n rows (here 3) with all different values in A and B (so you end up with 6 diffetent values).

library(data.table)
library(igraph)
# set of tiles to work with
n = 3
setDT(df)
# set a rownumber
df[, from := .I]
# set rownumber as key
setkey(df, from)
# get all possible combinations with different A and B values
df[df, to := paste0(
  df[!A %in% c(i.A, i.B) & !B %in% c(i.A, i.B), ]$from,
  collapse = ";"), by = .EACHI]
# split to rows
nodes <- df[, .(to = unlist(tstrsplit(to, ";"))), by = .(from)]
# build graph
g <- graph_from_data_frame(d = nodes, directed = FALSE, vertices = df$from)
# get triangles/subsets of 3 vertices, present in a matrix
t(sapply(cliques(g, min = n,max = n), names))
#      [,1] [,2] [,3]
# [1,] "5"  "10" "12"
# [2,] "4"  "6"  "11"
# [3,] "3"  "7"  "11"
# [4,] "3"  "8"  "10"
# [5,] "2"  "4"  "9" 
# [6,] "2"  "7"  "12"
# [7,] "1"  "5"  "9" 
# [8,] "1"  "6"  "8" 

or, as a last line

lapply(cliques(g, min = 3,max = 3), function(x) df[as.numeric(x), .(rn = from, A,B)])

which results in

# [[1]]
#    rn A B
# 1:  5 3 4
# 2: 10 5 1
# 3: 12 6 2
# 
# [[2]]
#    rn A B
# 1:  4 2 4
# 2:  6 3 5
# 3: 11 6 1
# 
# [[3]]
#    rn A B
# 1:  3 2 3
# 2:  7 4 5
# 3: 11 6 1
# 
# [[4]]
#    rn A B
# 1:  3 2 3
# 2:  8 4 6
# 3: 10 5 1
# 
# [[5]]
#    rn A B
# 1:  2 1 3
# 2:  4 2 4
# 3:  9 5 6
# 
# [[6]]
#    rn A B
# 1:  2 1 3
# 2:  7 4 5
# 3: 12 6 2
# 
# [[7]]
#    rn A B
# 1:  1 1 2
# 2:  5 3 4
# 3:  9 5 6
# 
# [[8]]
#    rn A B
# 1:  1 1 2
# 2:  6 3 5
# 3:  8 4 6
Funiculate answered 16/1, 2024 at 11:5 Comment(7)
This is an interesting approach. which is able to find the exclusive A--B pairs, but how do you accomplish the splitting task, i.e., splitting the entire df (12 rows) into 4 distinct groups (within each group there are 3 A--B pairs, as shown in your output)?Rarotonga
You can refer to output from my own attempt https://stackoverflow.com/a/77806059/12158757, and you will see that there are two distinct ways for the partitioning, i.e., out[[1]] and out[[2]], and each partition consists of 4 groups that covers all rows in df, and each group has 3 exclusive A--B pairs.Rarotonga
Not sure what you mean? I get 8 groups of 3 exclusieve A-B pairs... Perhaps (read: probable) I misunderstand your comment? Why partition? I use igraph::clique() with to find closed subsets/subgraphs of size n = 3.Funiculate
For example, In your output, the 11th row of df (A--B pair is 6 -- 1) is in both the 2nd and 3rd partitions.Rarotonga
If you see the output in my posted question, you will see that, 12 rows of df can be divided into 4 groups, exclusive (no pair appears in two groups), and each group should contain all values of df, say, 1:6. Hope my explanation makes sense.Rarotonga
i edited in an undated anwer...Funiculate
yes, that gives the desired output, +1!Rarotonga
R
0

Disclaimer: This answer below is just to log my ever attempts, and I am still looking forward to other interesting solutions.


My attempt is bit of using backtracking algorithm, which keeps searching for available pairs that can be added to existing clusters. The the procedure cannot go on, then the path will be obsoleted and will be proceeded with in the following steps, like pruning a tree.

For simplification, I just assume all found pairs within each cluster (from top to bottom) should be sorted in an ascending order (in terms of the first column A), the pros and cons are:

  • pros: return multiple valid anti-isomorphisms
  • cons: slow

Code

f <- function(df) {
    # occurrence of each value, and unique values applied in df
    k <- unique(rowSums(table(df)))
    v <- unique(unlist(df))

    # helper1: grouping rows of df, say, row-index groups, such that the resulting pairs of values in each group are mutually exclusive
    helper1 <- function(S, idx) {
        d <- df[S, ]
        v <- df[idx, ]
        if (v[1] > d[nrow(d), 1] && all(!v %in% unlist(d))) {
            c(S, idx)
        }
    }

    rs <- seq_len(nrow(df))
    lst <- as.list(rs)
    cnt1 <- 1
    while (cnt1 < length(v) / k) { # each group consists of length(v)/k row indices
        lst <- Filter(
            length,
            unlist(lapply(lst, \(p) {
                q <- rs[rs > p[length(p)]]
                if (length(q)) {
                    lapply(q, helper1, S = p)
                }
            }), recursive = FALSE)
        )
        cnt1 <- cnt1 + 1
    }

    # helper 2: clustering the found list of row-index groups, say, group-index clusters i.e., `lst` from previous steps, such that the row-index groups in each cluster are multually exclusive
    helper2 <- function(S, idx) {
        u <- lst[S]
        v <- lst[idx]
        if (idx > S[length(S)] && all(!unlist(v) %in% unlist(u))) {
            c(S, idx)
        }
    }
    l <- seq_along(lst)
    g <- as.list(l)
    cnt2 <- 1
    while (cnt2 < 2 * k) { # each cluster consists of 2*k*length(v)/length(v) == 2*k row-index groups
        g <- Filter(
            length,
            unlist(
                lapply(g, \(p) {
                    q <- l[l > p[length(p)]]
                    if (length(q)) {
                        lapply(q, helper2, S = p)
                    }
                }),
                recursive = FALSE
            )
        )
        cnt2 <- cnt2 + 1
    }

    # output
    lapply(g, \(i) lapply(lst[i], \(j) df[j, ]))
}

Output Example

> out
[[1]]
[[1]][[1]]
  A B
1 1 2
5 3 4
9 5 6

[[1]][[2]]
   A B
2  1 3
7  4 5
12 6 2

[[1]][[3]]
   A B
3  2 3
8  4 6
10 5 1

[[1]][[4]]
   A B
4  2 4
6  3 5
11 6 1


[[2]]
[[2]][[1]]
  A B
1 1 2
6 3 5
8 4 6

[[2]][[2]]
  A B
2 1 3
4 2 4
9 5 6

[[2]][[3]]
   A B
3  2 3
7  4 5
11 6 1

[[2]][[4]]
   A B
5  3 4
10 5 1
12 6 2
Rarotonga answered 12/1, 2024 at 10:56 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.