Algorithm for ordering data so that neighbor elements are as identical as possible
Asked Answered
M

11

47

I have a (potentially large) list data of 3-tuples of small non-negative integers, like

data = [
    (1, 0, 5),
    (2, 4, 2),
    (3, 2, 1),
    (4, 3, 4),
    (3, 3, 1),
    (1, 2, 2),
    (4, 0, 3),
    (0, 3, 5),
    (1, 5, 1),
    (1, 5, 2),
]

I want to order the tuples within data so that neighboring tuples (data[i] and data[i+1]) are "as similar as possible".

Define the dissimilarity of two 3-tuples as the number of positions at which their elements differ. E.g.

  • (0, 1, 2) vs. (0, 1, 2): Dissimilarity 0.
  • (0, 1, 2) vs. (0, 1, 3): Dissimilarity 1.
  • (0, 1, 2) vs. (0, 2, 1): Dissimilarity 2.
  • (0, 1, 2) vs. (3, 4, 5): Dissimilarity 3.
  • (0, 1, 2) vs. (2, 0, 1): Dissimilarity 3.

Question: What is a good algorithm for finding the ordering of data which minimizes the sum of dissimilarities between all neighboring 3-tuples?

Some code

Here's a function which computes the dissimilarity between two 3-tuples:

def dissimilar(t1, t2):
    return sum(int(a != b) for a, b in zip(t1, t2))

Here's a function which computes the summed total dissimilarity of data, i.e. the number which I seek to minimize:

def score(data):
    return sum(dissimilar(t1, t2) for t1, t2 in zip(data, data[1:]))

The problem can be solved by simply running score() over every permutation of data:

import itertools
n_min = 3*len(data)  # some large number
for perm in itertools.permutations(data):
    n = score(perm)
    if n < n_min:
        n_min = n
        data_sorted = list(perm)
print(data_sorted, n_min)

Though the above works, it's very slow as we explicitly check each and every permutation (resulting in O(N!) complexity). On my machine the above takes about 20 seconds when data has 10 elements.

For completeness, here's the result of running the above given the example data:

data_sorted = [
    (1, 0, 5),
    (4, 0, 3),
    (4, 3, 4),
    (0, 3, 5),
    (3, 3, 1),
    (3, 2, 1),
    (1, 5, 1),
    (1, 5, 2),
    (1, 2, 2),
    (2, 4, 2),
]

with n_min = 15. Note that several other orderings (10 in total) with a score of 15 exist. For my purposes these are all equivalent and I just want one of them.

Final remarks

In practice the size of data may be as large as say 10000.

The sought-after algorithm should beat O(N!), i.e. probably be polynomial in time (and space).

If no such algorithm exists, I would be interested in "near-solutions", i.e. a fast algorithm which gives an ordering of data with a small but not necessarily minimal total score. One such algorithm would be lexicographic sorting, i.e.

sorted(data)  # score 18

though I hope to be able to do better than this.

Edit (comments on accepted solution)

I have tried all of the below heuristic solutions given as code (I have not tried e.g. Google OR-tools). For large len(data), I find that the solution of Andrej Kesely is both quick and gives the best results.

The idea behind this method is quite simple. The sorted list of data elements (3-tuples) is built up one by one. Given some data element, the next element is chosen to be the most similar one out of the remaining (not yet placed) data.

Essentially this solves a localized version of the problem where we only "look one ahead", rather than optimizing globally over the entire data set. We can imagine a hierarchy of algorithms looking n ahead, each successively delivering better (or at least as good) results but at the cost of being much more expensive. The solution of Andrej Kesely then sits lowest in this hierarchy. The algorithm at the highest spot, looking len(data) ahead, solves the problem exactly.

Let's settle for "looking 1 ahead", i.e. the answer by Andrej Kesely. This leaves room for a) the choice of initial element, and b) what to do when several elements are equally good candidates (same dissimilarity) for use as the next one. By choosing the first element in data as the initial element and taking the first occurrence of an element with minimal dissimilarity, both a) and b) are determined by the original order of elements within data. As Andrej Kesely points out, it then helps to (lex)sort data in advance.

In the end I went with this solution, but refined in a few ways:

  • I try out the algorithm for 6 initial sortings of data; lex sort for columns (0, 1, 2), (2, 0, 1), (1, 2, 0), all in ascending as well as descending order.
  • For large len(data), the algorithm becomes too slow for me. I suspect it scales like O(n²). I thus process chunks of the data of size n_max independently, with the final result being the different sorted chunks concatenated. Transitioning from one chunk to the next we expect a dissimilarity of 3, but this is unimportant if we keep n_max large. I go with n_max = 1000.

As an implementation note, the performance can be improved by not using data.pop(idx) as this itself is O(n). Instead, either leave the original data as is and use another data structure for keeping track of which elements/indices have been used, or replace data[idx] with some marker value upon use.
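
For reference, here is a rough sketch of how this refined pipeline could look. It is only an illustration under my own assumptions (function name, tie handling): heuristic() is the greedy function from Andrej Kesely's answer below, score() is defined above, and the six initial sortings are the three column orders listed above, each ascending and descending.

def refined_sort(data, n_max=1000):
    column_orders = [(0, 1, 2), (2, 0, 1), (1, 2, 0)]
    out = []
    for start in range(0, len(data), n_max):      # process chunks independently
        chunk = data[start:start + n_max]
        best = None
        for cols in column_orders:                # 3 column orders ...
            for reverse in (False, True):         # ... ascending and descending
                ordered = sorted(chunk,
                                 key=lambda t: tuple(t[c] for c in cols),
                                 reverse=reverse)
                candidate = heuristic(ordered)
                if best is None or score(candidate) < score(best):
                    best = candidate
        out.extend(best)                          # concatenate the sorted chunks
    return out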

Moe answered 15/4, 2022 at 16:34 Comment(13)
If I understand you correctly, I think this problem could be approached as finding a minimum weight Hamiltonian path in a complete graph where all tuples are treated as unique nodes (even if they are equal) and the edge weights are the "dissimilarity" scores. Unfortunately, that problem is NP-complete. This is essentially the TSP without the requirement that the path is a cycle.Tannie
@KellyBundy Something like this: import random; f = lambda n: random.randint(0, n); n = 1000; data = [(f(n//30), f(n//20), f(n//10)) for _ in range(n)].Moe
@Brian Special cases might be special enough to not be NP-hard, though.Romine
Using sum((a != x) + (b != y) + (c != z) for (a, b, c), (x, y, z) in zip(data, data[1:])) made it ~4 times faster for me. (Just saying since I'd rather wait 9 seconds than 36 for a test to complete.)Romine
If they're small integers, the complexity may be O(n), as it'd be dominated by simply counting the instances of each tuple.Ring
@Ring How so? Something like collections.Counter(data) is O(n), but it doesn't help much, as non-unique tuples are rare (indeed all tuples in the example data are unique). Identical tuples should certainly follow each other in the optimal ordering, so we may count them up and solve the problem for the unique case, then add in the counting information in the end. Still, the core problem remains. Or am I missing some insight?Moe
@jmd_dk: Gotcha. O()-notation is about the complexity as stuff goes to infinity, so if you're working with finite-spaces (such as the space of 3-tuples of a small amount of possible integer-values), then we'd assume that there're arbitrarily many of each possible tuple-value, as that'd tend to be the case as n->infty. But if you're working in a range where n isn't sufficiently large to hit that limit, the O(n)-nature of the problem might not be appreciable.Ring
@Ring OK. By "small" here I mean relative to n = length(data), i.e. no fixed upper value. So even in the n → ∞ limit one can expect the majority of the tuples to be unique.Moe
@jmd_dk: Could probably simplify the problem by doing a pre-processing step to remove duplicates, then a post-processing step that'd expand to account for removed-duplicates. Since those are relatively low-complexity tasks, the pre-processing and post-processing ought to be essentially free operations, while removing duplicates might help a lot (e.g., if O(4^n), then each duplicate removed might speed it up 4 times). Then dense-cases could still be handled well while even all-unique-cases would remain essentially the same.Ring
On the practical side, a TSP solver should give the optimum on n ~ 10000 in a few minutes, and an approximation with error below .5% in a few seconds. Without more info on your input distribution, this seems like the easiest available solution.Embassy
@kcsquared: Isn't that the real answer to the question? Specifically the question is, "what is a good algorithm", rather than "why is the best algorithm so slow in complexity terms?" :-)Headwaters
I think you could approach it as an extension of the shortest path problem, which has (Dijkstra) O(n²) complexity. Instead of the 2D shortest path, you would have to adapt it to the 3D shortest path, i.e., change the calculation of distance. It's not a direct adaptation but may be a good approachBrogdon
I am not a pythonista, but it seems clear from the problem structure that your last algorithm would greatly benefit from dynamic programming.Sommerville
G
10

This isn't an exact algorithm, just a heuristic, but it should be better than naive sorting:

# you can sort first the data for lower total average score:
# data = sorted(data)

out = [data.pop(0)]
while data:
    idx, t = min(enumerate(data), key=lambda k: dissimilar(out[-1], k[1]))
    out.append(data.pop(idx))


print(score(out))

Testing (100 repeats with len(data)=1000):

import random
from functools import lru_cache


def get_data(n=1000):
    f = lambda n: random.randint(0, n)
    return [(f(n // 30), f(n // 20), f(n // 10)) for _ in range(n)]


@lru_cache(maxsize=None)
def dissimilar(t1, t2):
    a, b, c = t1
    x, y, z = t2
    return (a != x) + (b != y) + (c != z)


def score(data):
    return sum(dissimilar(t1, t2) for t1, t2 in zip(data, data[1:]))


def lexsort(data):
    return sorted(data)


def heuristic(data, sort_data=False):
    data = sorted(data) if sort_data else data[:]
    out = [data.pop(0)]
    while data:
        idx, t = min(enumerate(data), key=lambda k: dissimilar(out[-1], k[1]))
        out.append(data.pop(idx))
    return out


N, total, total_lexsort, total_heuristic, total_heuristic2 = 100, 0, 0, 0, 0
for i in range(N):
    data = get_data()
    r0 = score(data)
    r1 = score(lexsort(data))
    r2 = score(heuristic(data))
    r3 = score(heuristic(data, True))
    print("original data", r0)
    print("lexsort", r1)
    print("heuristic", r2)
    print("heuristic with sorted", r3)

    total += r0
    total_lexsort += r1
    total_heuristic += r2
    total_heuristic2 += r3

print("total original data score", total)
print("total score lexsort", total_lexsort)
print("total score heuristic", total_heuristic)
print("total score heuristic(with sorted)", total_heuristic2)

Prints:

...

total original data score 293682
total score lexsort 178240
total score heuristic 162722
total score heuristic(with sorted) 160384
Greengrocery answered 16/4, 2022 at 1:8 Comment(0)
E
26

Unfortunately, this problem is NP-complete. You can show this by a reduction from the Hamiltonian path problem on planar 3-regular bipartite graphs, which is also NP-complete.

As an overview of the proof: we will create a 3-Tuple for each vertex of our graph, such that pairs of corresponding 3-Tuples have dissimilarity equal to 2 if the vertices are adjacent, and dissimilarity equal to 3 if the vertices are not adjacent. The members of each vertex's 3-Tuples will correspond uniquely to the vertex's incident edges.

Proof:
Suppose we're given as input an undirected planar cubic bipartite graph G = (V, E) on which we're trying to find a Hamiltonian path. We can find a 3-coloring of the edges in linear time. Suppose our three edge-colors are 0, 1, 2. For each edge e in E, give it a unique natural number label L(e) such that L(e) mod 3 is equal to e's color.

For example, consider this cubic planar bipartite graph: [image: base graph]

We can color the edges with colors 0, 1 and 2:

[image: colored edges]

And then label the edges with minimal natural numbers L(e) that respect the colors mod 3:

[image: labeled edges]

For each vertex v in V, create a 3-Tuple T = (t0, t1, t2), where t0, t1, t2 are the labels of edges incident to v with remainders equal to 0, 1, 2 respectively. Note that each edge label appears at the same index of each 3-Tuple it belongs to. In the example above, the top left vertex will get a 3-Tuple (0, 1, 29) and the left-most vertex will get a 3-Tuple (0, 16, 32).

Now, there is a Hamiltonian path in G if and only if there is an ordering of the 3-Tuples with dissimilarity sum 2 * (|V| - 1). This is because an ordering of 3-Tuples has that dissimilarity sum if and only if the ordering corresponds to a Hamiltonian path in G.
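
To make the construction concrete, here is a small illustrative sketch (my own addition, not part of the proof). It uses K_{3,3} instead of the pictured graph, since K_{3,3} is cubic and bipartite and admits the explicit proper 3-edge-coloring where edge (a_i, b_j) gets color (i + j) mod 3; planarity is only needed for the NP-hardness of Hamiltonian path, not for building the tuples.

from collections import defaultdict

left = [f"a{i}" for i in range(3)]
right = [f"b{j}" for j in range(3)]
edges = [(a, b) for a in left for b in right]              # K_{3,3}

# Proper 3-edge-coloring: edge (a_i, b_j) gets color (i + j) % 3.
color = {(a, b): (int(a[1]) + int(b[1])) % 3 for a, b in edges}

# Unique natural-number labels L(e) with L(e) % 3 == color(e).
seen = defaultdict(int)
label = {}
for e in edges:
    c = color[e]
    label[e] = 3 * seen[c] + c
    seen[c] += 1

# Each vertex v gets the 3-tuple (t0, t1, t2): the labels of its incident
# edges with remainders 0, 1, 2 respectively.
tuples = {}
for v in left + right:
    t = [None, None, None]
    for e in edges:
        if v in e:
            t[color[e]] = label[e]
    tuples[v] = tuple(t)

print(tuples)
# Adjacent vertices share exactly one label, at the same index (dissimilarity 2);
# non-adjacent vertices share nothing (dissimilarity 3).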


References and addenda

The reduction used in the proof comes from an extremely specific version of the Hamiltonian path problem. The only properties of this class of graphs (i.e. the planar, cubic, bipartite class) that are used in the proof are:

  1. The graph must have chromatic index (edge coloring number) at most 3. By a result called König's line coloring theorem, the chromatic index of bipartite graphs equals the maximum vertex degree, so a cubic bipartite graph is sufficient.
  2. The 3-coloring of edges must be computable in polynomial time. In general, finding a 3-coloring of the edges of an arbitrary 3-edge-colorable graph is NP-complete. (In fact, by trying to color the vertices of the line graph, we can show it's NP-hard to find a 4-edge-coloring of a 3-edge-colorable cubic graph with this result by Khanna et al.). To fix this, I used a result by Cole et al on Edge-Coloring Bipartite Multigraphs in O(E log D) Time. This gives a way to construct the 3-coloring on edges in polynomial (linear, actually) time on our bipartite graph.
  3. The Hamiltonian path problem must be NP-hard for this class. There is a large patchwork of results on NP-completeness for Hamiltonian cycles or paths for graphs which are planar, and or cubic, and or k-connected, and or bipartite, etc. The first direct proof of the NP-completeness of Hamiltonian path on planar cubic bipartite graphs that I can cite is theorem 23 from Munaro's paper, On line graphs of subcubic triangle-free graphs. It's possible that an earlier reference showed this; the NP-completeness of the Hamiltonian cycle problem on that class was proven four decades earlier.
Embassy answered 15/4, 2022 at 21:19 Comment(13)
Hmm, I'm not convinced your reduction fulfills their constraints. They say their numbers are "small", and when I asked for code to generate realistic large inputs, they said n = 1000; data = [(f(n//30), f(n//20), f(n//10)) for _ in range(n)]. Which means the first value in each of the 1000 tuples is an int from 0 to 33. For that n, I think your values would go up to about 1500.Romine
@KellyBundy Thank you very much; I missed that edge in Paint. You're right that my reduction will have 1.5x as many unique elements as 3-tuples. Whenever the range of values for each slot can be at least a constant fraction of the number of 3-tuples, the problem remains NP-hard. I have no idea about the hardness when there are different limits on the number of unique values.Embassy
@KellyBundy For example, you could add c*n copies of some 'dummy tuple' (x,y,z), where none of x,y,z appear in any other tuple. Then the input size, n, grows by a constant factor of (c+1), while there are only 1/(c+1) fraction of unique values to 3-tuples for any constant c that you choose. The proof of the reduction gets a bit longer, too.Embassy
For the sake of "small" entries, every triple (t0, t1, t2) = (3·s0, 3·s1 + 1, 3·s2 + 2) can be replaced with (s0, s1, s2)Puckery
Is it really NP-complete? I agree that the OP's problem is definitely NP-hard, but I don't see any way to check in polynomial time whether a given solution is optimal.Iatrochemistry
@IlmariKaronen I was being a bit imprecise with terminology: Being in NP and NP-hardness are properties of decision problems, not optimization problems. What I really mean is that the corresponding decision problem is NP-complete (the problem of whether an ordering with cost at most k exists). The question of whether you can efficiently check the optimality of a solution is quite interesting; I don't know the answer to that.Embassy
Do all planar cubic bipartite graphs have chromatic index 3? I haven’t encountered that result before.Postal
Your link links to a fairly large section, where in that section can we find what you're referring to?Romine
@Postal Apparently the fact that chromatic index = max degree in bipartite graphs is called Konigs line coloring theorem. It's definitely less well known than Vizing's theorem, and the source paper from 1916 is in German.Embassy
@KellyBundy Thanks for pointing that out; it seems that Wikipedia might not have a proper reference for the Hamiltonian-path-NP-completeness result I used from there. I've added all the references I used and what results I used from each reference.Embassy
Very clever insight!Postal
However, being NP-complete does not mean that you cannot write faster, and more efficient algorithms to solve such optimization problems in a reasonable time across a broad range of practical parameter sets. For instance, often there are pseudo-polynomial solution to such problems.Sommerville
@Sommerville I totally agree with you, this answer is mostly a response to the second part of the sentence "The sought-after algorithm should beat O(N!), i.e. probably be polynomial in time (and space)". This specific problem definitely does not have a pseudopolynomial solution (unless P=NP), as this proof works even when the input tuple elements are written in unary.Embassy
R
12

Here's a small improvement, not sure about the complexity, seems to be about O(4ⁿ). Solves N=10 in less than a second. It's too slow for your large cases, but might be useful for testing other solutions, by more quickly computing the expected results for test inputs.

The idea is that of the N triples, one of them has to be the center. So try each as center. Let half = N // 2. Then half of the other triples must be on the left of the center and N - 1 - half on the right of the center. Try all splits into left and right and solve them recursively independently.

My helper function takes not just the triples in data but also the context in which it belongs: the data must be enclosed between head triple and tail triple (both can be None), and the score must be computed accordingly.

import itertools

data = [
    (1, 0, 5),
    (2, 4, 2),
    (3, 2, 1),
    (4, 3, 4),
    (3, 3, 1),
    (1, 2, 2),
    (4, 0, 3),
    (0, 3, 5),
    (1, 5, 1),
    (1, 5, 2),
]

def dissimilar(t1, t2):
    if t1 and t2:
        a, b, c = t1
        x, y, z = t2
        return (a != x) + (b != y) + (c != z)
    return 0

def score(data):
    return sum(dissimilar(t1, t2) for t1, t2 in zip(data, data[1:]))

def solve(data):
    def solve(head, data, tail):
        if len(data) <= 3:
           perm = min(itertools.permutations(data),
                      key=lambda perm: score([head, *perm, tail]))
           return list(perm), score([head, *perm, tail])
        half = len(data) // 2
        result = result_score = None
        for center in list(data):
            data.remove(center)
            for left in itertools.combinations(data, half):
                left = set(left)
                right = data - left
                left, score_left = solve(head, left, center)
                right, score_right = solve(center, right, tail)
                if result_score is None or score_left + score_right < result_score:
                    result = [*left, center, *right]
                    result_score = score_left + score_right
            data.add(center)
        return result, result_score
    return solve(None, set(data), None)

result, result_score = solve(data)
print(result, result_score, score(result), sorted(result) == sorted(data))
Romine answered 15/4, 2022 at 18:5 Comment(2)
Great answer; I think that solve(None, set(data), None) will give an ordering that's missing any duplicate elements. For a valid ordering, you probably need to count the frequencies of tuples in a postprocessing step.Embassy
@Embassy Yes, I'm assuming there aren't any duplicates in their actual data, as they'd be boring. I'm just making it a set so I have set operations. And like I said it's still slow, really the only value I see in it is the score number (for validating the next faster solution), which isn't affected by removing duplicates.Romine
R
5

Perhaps also useful: A lower bound. Any ordering is a chain of the triples. A chain is a spanning tree. So the score of a minimum spanning tree is a lower bound.

Andrej Kesely ran their solution on 100 random inputs and got a total score of 160037. I ran minimum spanning tree on 100 random inputs. Did it three times, the total scores were 153866, 154040 and 153949. So Andrej's 160037 looks fairly close (4.0% above my average lower bound). And much closer than their 178096 from lexsort (15.7% above).

Code (Try it online!):

import random

def get_data(n=1000):
    f = lambda n: random.randint(0, n)
    return [(f(n // 30), f(n // 20), f(n // 10)) for _ in range(n)]

def dissimilar(t1, t2):
    a, b, c = t1
    x, y, z = t2
    return (a != x) + (b != y) + (c != z)

def mst_score(data):
    dist = dict.fromkeys(data, 3)
    dist[data[0]] = 0
    score = 0
    while dist:
        one = min(dist, key=dist.get)
        score += dist.pop(one)
        for other in dist:
            dist[other] = min(dist[other], dissimilar(one, other))
    return score

total = 0
for i in range(100):
    data = get_data()
    score = mst_score(data)
    total += score
    print(score, total)
Romine answered 16/4, 2022 at 5:41 Comment(0)
R
5

A drop-in replacement of Andrej Kesely's method that's about 30x faster in the same test. That method builds the output list by always appending the remaining triple most similar to the last one in the output. Most of the time, there is a remaining triple that matches that last one in at least one spot. So instead of checking all remaining triples for similarity, I first check only the remaining triples that match in at least one spot. Only when that fails do I check all remaining triples. I use indexes and break similarity ties with them, so I get the same order as Andrej's.

I call a pair (spot, number) a "mark". For example the triple (17, 3, 8) has the marks (0, 17), (1, 3) and (2, 8). That allows me to then look up triples matching those numbers at those spots.

from collections import defaultdict

def heuristic(data, sort_data=False):
    data = sorted(data) if sort_data else data[:]
    out = [data.pop(0)]
    indexes = defaultdict(set)
    for i, triple in enumerate(data):
        for mark in enumerate(triple):
            indexes[mark].add(i)
    remain = set(range(len(data)))
    while remain:
        a, b, c = out[-1]
        best = 4, None
        for mark in enumerate(out[-1]):
            for i in indexes[mark]:
                x, y, z = data[i]
                candidate = (a != x) + (b != y) + (c != z), i
                if candidate < best:
                    best = candidate
        i = best[1]
        if i is None:
            i = min(remain)
        remain.remove(i)
        t = data[i]
        for mark in enumerate(t):
            indexes[mark].remove(i)
        out.append(t)
    return out

And it gets about 35x faster if I replace the for mark in enumerate(out[-1]) loop with this, which doesn't compare the spot that I already know matches:

        for i in indexes[0, a]:
            _, y, z = data[i]
            candidate = (b != y) + (c != z), i
            if candidate < best:
                best = candidate
        for i in indexes[1, b]:
            x, _, z = data[i]
            candidate = (a != x) + (c != z), i
            if candidate < best:
                best = candidate
        for i in indexes[2, c]:
            x, y, _ = data[i]
            candidate = (a != x) + (b != y), i
            if candidate < best:
                best = candidate
Romine answered 19/4, 2022 at 12:10 Comment(1)
I'm impressed with how much thought and energy you've put into this problem :-)Moe
H
2

You can do it in O(2ⁿ·n²): for each pair (subset, last element), compute the best way to order all the elements of the subset so that the given element comes last. You can compute this recursively (iterating over the previous element), taking care not to evaluate the same state more than once (memoization).
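
For completeness, here is a rough sketch (my own, not the answer's code) of that subset DP in Python, reusing dissimilar() from the question. It is only practical for small inputs, roughly n ≤ 20, since time and memory grow exponentially; on the question's 10-element example it should reproduce the optimal score of 15 almost instantly.

def exact_order(data):
    n = len(data)
    # dp[(mask, last)] = (best score over orderings of the tuples whose indices
    #                     are in `mask` and that end with index `last`, predecessor)
    dp = {(1 << i, i): (0, None) for i in range(n)}
    for mask in range(1, 1 << n):                 # masks in increasing order
        for last in range(n):
            if (mask, last) not in dp:
                continue
            cost, _ = dp[(mask, last)]
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue
                key = (mask | (1 << nxt), nxt)
                cand = cost + dissimilar(data[last], data[nxt])
                if key not in dp or cand < dp[key][0]:
                    dp[key] = (cand, last)
    full = (1 << n) - 1
    last = min(range(n), key=lambda i: dp[(full, i)][0])
    order, mask = [], full                        # walk predecessors backwards
    while last is not None:
        order.append(data[last])
        mask, last = mask ^ (1 << last), dp[(mask, last)][1]
    return order[::-1]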

Hoskins answered 16/4, 2022 at 8:51 Comment(0)
E
1

Your code for generating test data is this:

def f(n):
    return random.randint(0, n)

n = 10_000
data = [(f(n//30), f(n//20), f(n//10)) for _ in range(n)]

I had the idea to simply find the best order of the columns (or fields) in each item to sort on, and see how that went.

I theorised that the standard deviation of the numbers in each column might have a beneficial effect on overall similarity, i.e. if there is not much of a spread of numbers in the third column then maybe sort on the third column values first to keep them together, with similar considerations for which column to sort on to break ties when successive third-column values are the same, etc...

I found that a random ordering of 10_000 items gave very low similarity. Ordering the columns from least to greatest standard deviation and lex-sorting on them in that order gave much better similarity. Sorting by any column order looks like it will be better than random.

The code

# -*- coding: utf-8 -*-
"""
Created on Sat Apr 16 10:33:41 2022

@author: paddy3118
"""

import random
from typing import List, Tuple


def f(n):
    return random.randint(0, n)

def stddev_mean(lst: List[int]) -> Tuple[float, float]:
    "Standard deviation of numbers in lst, mean."
    sm  = sum(lst)
    sm2 = sum(i**2 for i in lst)
    n = len(lst)
    return ((sm2 - sm**2 / n) / n)**0.5, sm / n

def similarity(d) -> int:
    "Compare successive 3-tuples for similarity - higher is better"
    return sum(sum((field0 == field1) for field0, field1 in zip(d0, d1))
               for d0, d1 in zip(d, d[1:]))

#%% Test data
data = [       # SIMILARITY in fields of successive records
    (0, 1, 2),
    (2, 4, 2), # 1
    (3, 2, 1), # 0
    (4, 3, 4), # 0
    (3, 3, 1), # 1
    (1, 2, 2), # 0
    (4, 0, 3), # 0
    (0, 3, 5), # 0
    (1, 5, 1), # 0
    (1, 5, 2), # 2
]              # Sum = 4
assert similarity(data) == 4

#%% Randomised data
print("NOTE: Similarity figures are better when higher.\n")

n = 10_000
data = [(f(n//30), f(n//20), f(n//10)) for _ in range(n)]

print(f"Random generation of {n} items has similarity of  = {similarity(data)}")

field_stddev = [(i, stddev_mean([datum[i] for datum in data])[0])
                   for i in range(len(data[0]))]
print(f"{field_stddev = }")


#%% Order by increasing stddeviation field
index_by_inc_dev = [ind for ind, _ in
                         sorted(field_stddev, key=lambda x: x[1])]

print(f"\nField indices by INcreasing stddev of tuple field values = {index_by_inc_dev}")
sort_by_field_of_inc_dev = sorted(data,
                                       key=lambda d:[d[i]
                                                     for i in index_by_inc_dev])
print(f"{similarity(sort_by_field_of_inc_dev) = :,}")
assert set(sort_by_field_of_inc_dev) == set(data), "Whoops, data lost in sort??"


#%% Order by DEcreasing stddeviation field
print(f"\nField indices by DEcreasing stddev of tuple field values = {index_by_inc_dev[::-1]}")
sort_by_field_of_dec_dev = sorted(data,
                                       key=lambda d:[d[i]
                                                     for i in index_by_inc_dev[::-1]])
print(f"{similarity(sort_by_field_of_dec_dev) = :,}")
assert set(sort_by_field_of_dec_dev) == set(data), "Whoops, data lost in sort??"

#%%
print("\n Same data, random sorts, gives these similarity values:")
dcopy = data.copy()
for i in range(10):
    print(f"  Random similarity({i}) = {similarity(dcopy):,}")
    random.shuffle(dcopy)

Output

NOTE: Similarity figures are better when higher.

Random generation of 10000 items has similarity of  = 55
field_stddev = [(0, 96.04927579341764), (1, 145.8145251033312), (2, 288.84656085582884)]

Field indices by INcreasing stddev of tuple field values = [0, 1, 2]
similarity(sort_by_field_of_inc_dev) = 9,949

Field indices by DEcreasing stddev of tuple field values = [2, 1, 0]
similarity(sort_by_field_of_dec_dev) = 9,135

 Same data, random sorts, gives these similarity values:
  Random similarity(0) = 55
  Random similarity(1) = 61
  Random similarity(2) = 54
  Random similarity(3) = 60
  Random similarity(4) = 68
  Random similarity(5) = 55
  Random similarity(6) = 56
  Random similarity(7) = 58
  Random similarity(8) = 56
  Random similarity(9) = 63
Elishaelision answered 16/4, 2022 at 11:44 Comment(4)
If I understand correctly, you are using std as the proxy for the column order to use for lex sort. As we only have 3 columns, we can simply try out all 3! = 6 column orderings. Something like for x, y, z in itertools.permutations(range(3)): data_sorted = sorted(data, key=(lambda t: (t[x], t[y], t[z]))).Moe
If you have the time, then yes :-)Elishaelision
I don't understand these numbers and how they compare to other solutions. Could you include the scores as used by the question and other answers?Romine
See # test data in the code. The scores are equivalent.Elishaelision
E
1

This refines my previous answer and also shows that sorting by columns, with the column order chosen by increasing standard deviation of the numbers in each column, does give the better result.

An extra pass takes that best sort and swaps adjacent rows if that increases overall similarity, until no more adjacent swaps improve the result.

Code

# -*- coding: utf-8 -*-
"""
Created on Sun Apr 17 07:30:23 2022

@author: paddy
"""

import random
from itertools import permutations
from typing import List, Tuple

DType = List[Tuple[int, int, int]]


def randomised_data_gen(n: int) -> DType:
    def f(n):
        return random.randint(0, n)

    return [(f(n // 30), f(n // 20), f(n // 10)) for _ in range(n)]


def stddev_mean(lst: List[int]) -> Tuple[float, float]:
    "Standard deviation of numbers in lst, mean."
    sm = sum(lst)
    sm2 = sum(i**2 for i in lst)
    n = len(lst)

    return ((sm2 - sm**2 / n) / n)**0.5, sm / n


def similarity(d: DType) -> int:
    "Compare successive 3-tuples for similarity - higher is better"

    return sum(sum((col0 == col1) for col0, col1 in zip(d0, d1)) for d0, d1 in zip(d, d[1:]))


def sort_by_column(data: DType, column_index: Tuple[int]) -> DType:

    return sorted(data, key=lambda d: [d[i] for i in column_index])


def similarity_when_randomized(data: DType, n: int = 10) -> None:
    dcopy = data.copy()
    sims = ", ".join([f"{similarity(dcopy):_}", random.shuffle(dcopy)][0] for _ in range(n))
    print(f"  Same data random sorts gives similarity values: {sims}")


#%% Test data
def test1():
    data = [  # SIMILARITY in columns of successive records
        (0, 1, 2),
        (2, 4, 2),  # 1
        (3, 2, 1),  # 0
        (4, 3, 4),  # 0
        (3, 3, 1),  # 1
        (1, 2, 2),  # 0
        (4, 0, 3),  # 0
        (0, 3, 5),  # 0
        (1, 5, 1),  # 0
        (1, 5, 2),  # 2
    ]  # Sum = 4
    assert similarity(data) == 4


test1()
#%% Randomised data

print("NOTE: Similarity figures are better when higher.\n")

n = 10_000
data = randomised_data_gen(n)

print(f"Random generation of {n} items has similarity of  = {similarity(data)}")
similarity_when_randomized(data, 10)

col_stddev = [(i, stddev_mean([datum[i] for datum in data])[0]) for i in range(len(data[0]))]

print(f"\n(Column_index, Standard_deviation) of each column of numbers: {col_stddev}")

#%% Order by increasing stddeviation column

index_by_inc_dev = [ind for ind, _ in sorted(col_stddev, key=lambda x: x[1])]

#%% Sort by columns

def score_all_column_orders(data: DType) -> Tuple[List[Tuple[int, Tuple[int, int, int]]], DType]: 
    "Print and return similarity scores for sorting data by all column orders, best first AND the best sort of data"
    sim_by_column_sort = [(similarity(sort_by_column(data, column_order)), column_order)
                        for column_order in permutations(range(len(data[0])))]
    sim_by_column_sort.sort(reverse=True)  # Highest similarity first

    print("\nSimilarity when sorting by all column orders:")
    print("  SIMILARITY  SORT_COLUMN_ORDER")
    for sim, col in sim_by_column_sort:
        col = list(col)
        if col == index_by_inc_dev[::-1]:
            comment = "  (by DEcreasing std-dev)"
        elif col == index_by_inc_dev:
            comment = "  (by INcreasing std-dev)"
        else:
            comment = ""
        print(f"  {sim:>10_}  {col}{comment}")

    return sim_by_column_sort, sort_by_column(data, sim_by_column_sort[0][1])

_, best_by_col_sort = score_all_column_orders(data)
score = best_score = similarity(best_by_col_sort)

#%% Swap with next row

print("\nNeighbourly swaps:")
twizzled = best_by_col_sort.copy()
while True:
    swaps = 0
    for i in range(1, len(twizzled) - 2):
        #breakpoint()
        i_window = [ twizzled[x] for x in [i-1, i, i+1, i+2]] 
        i_swap   = [ twizzled[x] for x in [i-1, i+1, i, i+2]] 
        if similarity(i_swap) > similarity(i_window):
            twizzled[i], twizzled[i+1] = twizzled[i+1], twizzled[i]
            swaps += 1
    if swaps == 0:
        break
    print(f"  Neighbour {swaps = } New similarity = {similarity(twizzled)}")

Sample output

NOTE: Similarity figures are better when higher.

Random generation of 10000 items has similarity of  = 70
  Same data random sorts gives similarity values: 70, 57, 65, 57, 51, 61, 64, 56, 63, 50

(Column_index, Standard_deviation) of each column of numbers: [(0, 96.5443987139596), (1, 145.326873357098), (2, 288.69842362782305)]

Similarity when sorting by all column orders:
  SIMILARITY  SORT_COLUMN_ORDER
       9_978  [0, 1, 2]  (by INcreasing std-dev)
       9_829  [0, 2, 1]
       9_802  [1, 0, 2]
       9_638  [1, 2, 0]
       9_160  [2, 0, 1]
       9_142  [2, 1, 0]  (by DEcreasing std-dev)

Neighbourly swaps:
  Neighbour swaps = 12 New similarity = 9990
Elishaelision answered 17/4, 2022 at 8:49 Comment(0)
B
1

Adopt a stochastic combinatorial optimization algorithm, such as tabu search, a genetic algorithm, or simulated annealing. I have ordered the alternatives in decreasing efficiency, according to my experience.

All involve starting with one or more random solutions and then improving the result over successive iterations through random perturbations and a rule that decides which candidates to keep. An objective function (in your case the total dissimilarity) guides you toward improvements. For the numbers you mention, a few thousand iterations down the road you will have a near-optimal solution.
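
As an illustration only (not the answer's own code), here is a minimal simulated-annealing sketch under my own assumptions: geometric cooling, random pair swaps, and full rescoring per step, with score() taken from the question. A practical version would compute the score change of a swap locally (as the next answer describes) instead of rescoring the whole list.

import math
import random

def anneal(data, iterations=20000, t_start=2.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    order = data[:]
    rng.shuffle(order)                            # random starting permutation
    current = score(order)
    best, best_score = order[:], current
    for k in range(iterations):
        t = t_start * (t_end / t_start) ** (k / iterations)  # geometric cooling
        i, j = rng.sample(range(len(order)), 2)
        candidate = order[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        cand_score = score(candidate)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if cand_score < current or rng.random() < math.exp((current - cand_score) / t):
            order, current = candidate, cand_score
            if current < best_score:
                best, best_score = order[:], current
    return best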

Bushwhack answered 17/4, 2022 at 10:12 Comment(0)
M
0

A reasonably good approximate (simulated annealing style) solution:

Start with an arbitrary (or random) permutation of the items. Optional: choose N random permutations (say, 10 or 100) and start with the one that has the lowest total error, discarding the others.

Choose two distinct random indices and swap the items at those indices if it would decrease the total error (you could do this the straightforward way, by computing the total error of the "before" and "after" lists, but you could also do it just by examining items i-1, i, i+1 and j-1, j, j+1 with and without the swap; all of the other contributions to the error remain the same; a sketch of this local check follows the list below).

Repeat the previous step until some stopping condition you choose, which could be one or more of:

  • You reach some predefined limit on iterations or execution time,
  • The total error falls below some threshold,
  • The percentage of "successful swaps" over the past N iterations falls below a threshold,
  • You go N iterations without a successful swap.
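
A possible sketch of the local check from the swap step above (my own illustration, reusing dissimilar() from the question): only the neighbouring pairs that touch positions i and j can change, so each proposed swap is evaluated in O(1).

def swap_delta(order, i, j):
    """Change in total score if order[i] and order[j] were swapped."""
    n = len(order)
    affected = set()                              # neighbouring pairs that may change
    for k in (i, j):
        if k > 0:
            affected.add((k - 1, k))
        if k + 1 < n:
            affected.add((k, k + 1))

    def val(k):                                   # element at position k after the swap
        return order[j] if k == i else order[i] if k == j else order[k]

    before = sum(dissimilar(order[a], order[b]) for a, b in affected)
    after = sum(dissimilar(val(a), val(b)) for a, b in affected)
    return after - before

# Perform the swap only when swap_delta(order, i, j) < 0, i.e. it lowers the total error.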
Mulct answered 16/4, 2022 at 20:35 Comment(0)
B
0

Maybe this can help you. It's an approach that you must validate; I haven't had time to develop it, but I think it's feasible.

If the list of elements were already ordered, sorting it would be trivial and linear: it would be enough to compare each element with the next one.

If only some elements were misplaced but close to their correct location, the sorting would be somewhat slower, but still very fast. Most of the time, each element would be compared with the next one, and some elements would need a few extra comparisons to be placed in position.

Based on the above, my approach is to cheaply obtain a fairly well-ordered list. For this it would be enough (if I'm not mistaken) to sort based on a linear regression of the elements. I'm going to explain it in 2 dimensions (with elements of 2 values instead of 3) because it's simpler, but extending it to 3 dimensions, as your question requires, isn't complex.

You make an initial run of all your elements and calculate the regression line. Then, you calculate the rotation needed for the regression line to become Y=0 and apply that rotation to your points. This allows you to run through the points in order, from left to right. And that order is what you use as the first sort order.

Obviously, two contiguous points may not be the closest to each other but there will be few that are closer so the final sorting will always be localized and require few moves.

That is, we use the regression line as an approximation to the final location that the elements should have. It's clear that two elements that are far apart on the regression line (running from one end to the other) are also far apart and this allows us to arrange them in a way that is close to the optimum.

This forces us to make a couple of passes with some calculations that are not very complex, but still more complex than the comparisons between individual pairs of elements. That is, for small inputs this approach is more expensive, but beyond a certain size it beats making a very large number of pairwise comparisons.

To move it to 3 dimensions (and use the 3 components you need) you have to use a 3D regression line and get the rotation matrix also in 3 dimensions. The rest is similar.
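
As a rough illustration of the idea (my own sketch, and just one way to read this answer): take the first principal component of the 3D points as the "regression line" and sort by each point's projection onto it, which avoids computing an explicit rotation matrix.

import numpy as np

def sort_along_principal_axis(data):
    pts = np.asarray(data, dtype=float)
    centered = pts - pts.mean(axis=0)
    # First right-singular vector = direction of greatest variance (the "line").
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    projections = centered @ direction            # position of each point along the line
    return [data[i] for i in np.argsort(projections)]

# score(sort_along_principal_axis(data)) then gives a quick approximate ordering,
# which could be refined further by local swaps as other answers suggest.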

Brogdon answered 17/4, 2022 at 22:12 Comment(0)
