Merging some sorted lists with unknown order sequence
Asked Answered
Q

4

8

I've some lists with variable number of elements. Each list is sorted, but the sorting algorithm is not known. I would like to merge the lists into one big list which contains all lists in same order, without duplicates.

Example Input:

  1. XS,M,L,XL
  2. S,M,XXL
  3. XXS,XS,S,L

Expected Result:

  • XXS,XS,S,M,L,XL,XXL

The expected result is obtained by matching up the input sequences in order to obtain a merged result that contains the elements of each input sequence in the correct order, like this:

    XS   M L XL
       S M      XXL
XXS XS S   L
-------------------
XXS XS S M L XL XXL

The function should notify, if there are elements which have ambiguous positions. Here, it would be XXL (it could stay after M,L or XL) and I need to specify its position manually after XL (because here I know the sorting algorithm and can help). I thought about defining pairs of every two elements, each pair in order as in original list. From this one could build the complete list.

Quasar answered 10/1, 2011 at 10:37 Comment(5)
I don't understand how you can merge the lists if the precedence rules are unknown.Anomalistic
@TylerDurden I've edited the question to hopefully make it a bit clearer. Does that help?Traweek
No, how is it that you decide where XXL is relative to L and XL?Anomalistic
@TylerDurden That's where the ambiguity comes in that's mentioned at the end of the question - in this example, there can be multiple valid results. Most algorithms will just pick any valid result, but this question was specifically asking to detect that condition and return an error instead. I included a $detectAmbiguity flag in my implementation below to handle that either way.Traweek
Related Question: #2739892Unsaid
T
14

This can be solved with a Topological Sort algorithm.

If you consider each of your input sequences to be a path through a directed graph, a topological sort will order your set of nodes from left to right in such a way that each directed edge points to the right. Diagram of a directed graph after topological sorting

The wikipedia page on Topological Sorting includes this algorithm, first described by Arthur Kahn in 1962:

L ← Empty list that will contain the sorted elements
S ← Set of all nodes with no incoming edges
while S is non-empty do
    remove a node n from S
    insert n into L
    for each node m with an edge e from n to m do
        remove edge e from the graph
        if m has no other incoming edges then
            insert m into S
if graph has edges then
    return error (graph has at least one cycle)
else 
    return L (a topologically sorted order)

This algorithm, as written, doesn't actually fail if it finds ambiguous sequences, but that's easy to add by inserting a check at the beginning of the loop, like this:

...
while S is non-empty do
    if S contains more than 1 item
        return error (inputs are ambiguous)
    remove a node n from S
    ...

I don't know what language you're working in, but I've thrown together this PHP implementation as a proof of concept:

function mergeSequences($sequences, $detectAmbiguity = false) {

    // build a list of nodes, with each node recording a list of all incoming edges
    $nodes = array();
    foreach ($sequences as $seq) {
        foreach ($seq as $i => $item) {
            if (!isset($nodes[$item])) $nodes[$item] = array();
            if ($i !== 0) {
                $nodes[$item][] = $seq[$i-1];
            }
        }
    }

    // build a list of all nodes with no incoming edges
    $avail = array();
    foreach ($nodes as $item => $edges) {
        if (count($edges) == 0) {
            $avail[] = $item;
            unset($nodes[$item]);
        }
    }

    $sorted = array();
    $curr = '(start)';
    while (count($avail) > 0) {

        // optional: check for ambiguous sequence
        if ($detectAmbiguity && count($avail) > 1) {
           throw new Exception("Ambiguous sequence: {$curr} can be followed by " . join(' or ', $avail));
        }

        // get the next item and add it to the sorted list
        $curr = array_pop($avail);
        $sorted[] = $curr;

        // remove all edges from the currently selected items to all others
        foreach ($nodes as $item => $edges) {
            $nodes[$item] = array_diff($edges, array($curr));                
            if (count($nodes[$item]) == 0) {
                $avail[] = $item;
                unset($nodes[$item]);
            }
        }

    }

    if (count($nodes) > 0) {
        throw new Exception('Sequences contain conflicting information. Cannot continue after: ' . join(', ', $sorted));
    }

    return $sorted;
}

You can call the function like this:

$input = array(
    array('XS', 'M', 'L', 'XL'),
    array('S', 'M', 'XXL'),
    array('XXS', 'XS', 'S', 'L'),
);
echo(join(', ', mergeSequences($input)));
echo(join(', ', mergeSequences($input, true)));

To get the following output:

XXS, XS, S, M, XXL, L, XL
Uncaught exception 'Exception' with message 'Ambiguous sequence: M can be followed by L or XXL'
Traweek answered 19/6, 2013 at 23:23 Comment(1)
Topological sorting is also described well in TAOCP Volume 1.Proffitt
D
6

You are trying to merge partially ordered sets, or posets. The ambiguous parts of the merge are called antichains. So, you want an algorithm that merges posets and tells you what the antichains are.

Here is a paper describing an algorithm for merging posets and detecting antichains, as well as a link to the first author's homepage in case you want to contact him to see if there is any source code available.

Doenitz answered 20/6, 2013 at 16:26 Comment(1)
Thanks. Often the hardest part about finding a solution is knowing the right terms to search for.Traweek
T
3

Here's what I would do:

  1. Preprocess the lists: figuring out that XXS is smaller than XS is smaller than S is smaller than ... XXL is a [constraint satisfaction problem](http://en.wikipedia.org/wiki/Constraint_satisfaction_problem). This type of problem involves searching for a correct ordering among all the elements given the constraints defined in the original lists.
  2. Create a bidirectional mapping from the set {XXS, ..., XXL} to the set {1, ..., 6}, after you have done step 1.
  3. For each list, create another list by using the mapping defined in 2.
  4. Use a modified [merge sort](http://en.wikipedia.org/wiki/Merge_sort) to combine two lists. Modify the merge algorithm so that it reports if two items being compared are identical (and disregards one of the items being merged).
  5. Do step 4 for each pair of lists until there is one list.
  6. Using the mapping defined in 2, create the text-version of the list.
Torchier answered 10/1, 2011 at 21:3 Comment(3)
I think you assume the order of elements (in 1.), but it is not known. From first list I know only order of XS,M,L and XL. From second list must be S < M, but stays S before or after XS? If we stop here, we don't know it and must set it manually.Quasar
I apologize, I must've misunderstood. Your first sentence said that the lists were sorted, which made me think that you know what the lists are (and therefore know all of the elements). It might be good to reword the first sentence to say: "I have sorted lists whose contents I do not know."Torchier
Also, I updated my answer to account for the misunderstanding.Torchier
R
-1

For sorting part, I think Merge Sort is enough according to your description. One thing need to modify is during merging, we should skip elements at the input array if the first element of the input array is the same as the result array.

If I understand correctly, you want to build a total order of the whole possible input elements. Some partial order is already defined in the input arraies(since they are already sorted), while others need to specified by users. For example in the question, order

'S'<'M'<'XXL'

'XS'<'M'<'L'<'XL'

'XXS'<'XS'<'S'<'L'

is well defined. But algorithm still doesn't know whether 'XXL' is bigger or smaller than 'XL', 'L'.
Well since the three input arraies is sorted, there must exist a total order of the input elements. So my suggestion is to ask your data provider for a ordered list of all possible data elements. Sounds stupid, but it's a easy way.

If this list is not available, a easy way to deal is to prompt for a pair sort for the user, then check whether this conflict with existing input sequence and remember it, when the algorithm encounter into an ambiguous pair. I think topology sorting is more powerful than this application. Since we deal with single data elements, there must exit a total order. While topology sorting is to deal with partial order.

Ronnieronny answered 26/6, 2013 at 21:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.