Assuming we have a set v
of even cardinality, e.g., v <- 1:6
, and a data.frame df
consisting of entries of v
, which is defined by an fixed occurrence of each element from v
in each column, namely, k
, for example
k <- 2
x <- rep(v, each = k)
df <- data.frame(A = x, B = c(tail(x, -(k + 1)), head(x, k + 1)))
shown as
> df
A B
1 1 2
2 1 3
3 2 3
4 2 4
5 3 4
6 3 5
7 4 5
8 4 6
9 5 6
10 5 1
11 6 1
12 6 2
where the occurrences of 1:6
on both columns are 2
> table(df$A)
1 2 3 4 5 6
2 2 2 2 2 2
> table(df$B)
1 2 3 4 5 6
2 2 2 2 2 2
Objective and Expected Output
In df
, each row represents a "pair", and there are no duplicated "pairs". I am wondering, how to divide the pairs into clusters, such that each cluster is minimal and complete, i.e., each cluster contains all values from v
, without any duplicated entries.
Since the cardinality of v
is length(v)
, and the occurrence of each entry in df
is actually 2*k
, the number of clusters via a "ideal" split of df
should be 2*k*length(v)/length(v) == 2*k
. In other words, the number of clusters is defined by k
only, say, 2*k
.
For example, df
can be divided into 4
clusters like below, where the "completeness" property can be achieved
[[1]]
A B
1 1 2
5 3 4
9 5 6
[[2]]
A B
2 1 3
7 4 5
12 6 2
[[3]]
A B
3 2 3
8 4 6
10 5 1
[[4]]
A B
4 2 4
6 3 5
11 6 1
Note that, output above is just one of the valid instances, and there should be other candidates for clustering.
Question
One possible solution is using Monte Carlo simulation and keep the valid clustering outcomes, iteratively, if the randomized cluster satisfy all constraints. The code may look like below
out <- c()
repeat {
if (nrow(df) == 0) {
break
}
repeat {
k <- sample.int(nrow(df), length(v) / 2)
if (!length(setdiff(v, unlist(df[k, ])))) {
out <- c(out, list(df[k, ]))
df <- df[-k, ]
break
}
}
}
which sometimes can give the desired output like
> out
[[1]]
A B
6 3 5
11 6 1
4 2 4
[[2]]
A B
2 1 3
7 4 5
12 6 2
[[3]]
A B
8 4 6
3 2 3
10 5 1
[[4]]
A B
1 1 2
9 5 6
5 3 4
However, this approach has a major issue, say, Inefficiency: If the set v
has a large cardinality, the space for Monte Carlo simulation exponentially grows, which dramatically slows down the procedure of finding a valid solution.
I am wondering, if there is a stable and more efficient to solve such kind of problem. I think backtracking should work, but I believe there must be other approaches that can make it in a more elegant manner.
Looking forward to diverse and interesting solutions. Appreciated in advance!
3-by-2
permutations is the inverse process of "clustering", but it still needs to consider the constraint that: in the resultingdf
(if we want to recoverdf
from the clusters), besides the conditions of non-repeated pairs, the occurrence of each elements in each column is the same. And of course, I am more curious about the "dividing" strategy instead of the other way around. – Rarotongadf
. – Prandialdf
with more than two columns? – Stomatology