Python: sorting a dependency list
Asked Answered
C

6

15

I'm trying to work out if my problem is solvable using the builtin sorted() function or if I need to do myself - old school using cmp would have been relatively easy.

My data-set looks like:

x = [
('business', Set('fleet','address'))
('device', Set('business','model','status','pack'))
('txn', Set('device','business','operator'))
....

The sort rule should be basically for all value of N & Y where Y > N, x[N][0] not in x[Y][1]

Although I'm using Python 2.6 where the cmp argument is still available I'm trying to make this Python 3 safe.

So, can this be done using some lambda magic and the key argument?

-== UPDATE ==-

Thanks Eli & Winston! I didn't really think using key would work, or if it could I suspected it would be a shoe horn solution which isn't ideal.

Because my problem was for database table dependencies I had to make a minor addition to Eli's code to remove an item from its list of dependencies (in a well designed database this wouldn't happen, but who lives in that magical perfect world?)

My Solution:

def topological_sort(source):
    """perform topo sort on elements.

    :arg source: list of ``(name, set(names of dependancies))`` pairs
    :returns: list of names, with dependancies listed first
    """
    pending = [(name, set(deps)) for name, deps in source]        
    emitted = []
    while pending:
        next_pending = []
        next_emitted = []
        for entry in pending:
            name, deps = entry
            deps.difference_update(set((name,)), emitted) # <-- pop self from dep, req Py2.6
            if deps:
                next_pending.append(entry)
            else:
                yield name
                emitted.append(name) # <-- not required, but preserves original order
                next_emitted.append(name)
        if not next_emitted:
            raise ValueError("cyclic dependancy detected: %s %r" % (name, (next_pending,)))
        pending = next_pending
        emitted = next_emitted
Coastal answered 19/7, 2012 at 8:56 Comment(3)
en.wikipedia.org/wiki/Topological_sortingCocaine
@JonClements I would assume he means sets.Set, though that was deprecated even in the Python 2.6 that he says he is using. However, if that is what he means then he needs to provide a single iterable to the constructor rather than multiple arguments.Shane
Yes, sets.Set(). I'm supporting Python 2.3 - 3.1 environments and "Set('item1', 'item2')" is what the python interpreter prints out for a list of tuples containing a string and a set (copied output, not code for how this was created). Using sets as they ignore duplicate items if they're added which helps since the input used to create my data set is just ugly ...Coastal
S
20

What you want is called a topological sort. While it's possible to implement using the builtin sort(), it's rather awkward, and it's better to implement a topological sort directly in python.

Why is it going to be awkward? If you study the two algorithms on the wiki page, they both rely on a running set of "marked nodes", a concept that's hard to contort into a form sort() can use, since key=xxx (or even cmp=xxx) works best with stateless comparison functions, particularly because timsort doesn't guarantee the order the elements will be examined in. I'm (pretty) sure any solution which does use sort() is going to end up redundantly calculating some information for each call to the key/cmp function, in order to get around the statelessness issue.

The following is the alg I've been using (to sort some javascript library dependancies):

edit: reworked this greatly, based on Winston Ewert's solution

def topological_sort(source):
    """perform topo sort on elements.

    :arg source: list of ``(name, [list of dependancies])`` pairs
    :returns: list of names, with dependancies listed first
    """
    pending = [(name, set(deps)) for name, deps in source] # copy deps so we can modify set in-place       
    emitted = []        
    while pending:
        next_pending = []
        next_emitted = []
        for entry in pending:
            name, deps = entry
            deps.difference_update(emitted) # remove deps we emitted last pass
            if deps: # still has deps? recheck during next pass
                next_pending.append(entry) 
            else: # no more deps? time to emit
                yield name 
                emitted.append(name) # <-- not required, but helps preserve original ordering
                next_emitted.append(name) # remember what we emitted for difference_update() in next pass
        if not next_emitted: # all entries have unmet deps, one of two things is wrong...
            raise ValueError("cyclic or missing dependancy detected: %r" % (next_pending,))
        pending = next_pending
        emitted = next_emitted

Sidenote: it is possible to shoe-horn a cmp() function into key=xxx, as outlined in this python bug tracker message.

Subpoena answered 19/7, 2012 at 15:37 Comment(0)
N
8

I do a topological sort something like this:

def topological_sort(items):
    provided = set()
    while items:
         remaining_items = []
         emitted = False

         for item, dependencies in items:
             if dependencies.issubset(provided):
                   yield item
                   provided.add(item)
                   emitted = True
             else:
                   remaining_items.append( (item, dependencies) )

         if not emitted:
             raise TopologicalSortFailure()

         items = remaining_items

I think its a little more straightforward than Eli's version, I don't know about efficiency.

Nigh answered 19/7, 2012 at 16:0 Comment(1)
That's much more straightforward than my version. I think your main efficiency sink is the issubset() call, but that would only be an issue for large datasets -- my version is hampered by an initial setup-cost that's slower for small datasets, but that lets it modify the sets in place to avoid issubset when there's lots of dependancies. All that said, the basic structure of yours is still better, I hope you don't mind that I've reworked my implementation to borrow a bunch of your solution :)Subpoena
Z
6

Over looking bad formatting and this strange Set type... (I've kept them as tuples and delimited the list items correctly...) ... and using the networkx library to make things convenient...

x = [
    ('business', ('fleet','address')),
    ('device', ('business','model','status','pack')),
    ('txn', ('device','business','operator'))
]

import networkx as nx

g = nx.DiGraph()
for key, vals in x:
    for val in vals:
        g.add_edge(key, val)

print nx.topological_sort(g)
Zoltai answered 19/7, 2012 at 9:8 Comment(2)
It's the only solution which works for me... I don't like the dependency to other libs but it's worth the small codeDimity
One warning about this solution; it only works if your dependencies form a fully connected graph. If there are nodes that do not have any dependencies (and hence do not have any edges to other nodes), they will not be included in the output of topological_sort().Suzette
I
2

This is Winston's suggestion, with a docstring and a tiny tweak, reversing dependencies.issubset(provided) with provided.issuperset(dependencies). That change permits you to pass the dependencies in each input pair as an arbitrary iterable rather than necessarily a set.

My use case involves a dict whose keys are the item strings, with the value for each key being a list of the item names on which that key depends. Once I've established that the dict is non-empty, I can pass its iteritems() to the modified algorithm.

Thanks again to Winston.

def topological_sort(items):
    """
    'items' is an iterable of (item, dependencies) pairs, where 'dependencies'
    is an iterable of the same type as 'items'.

    If 'items' is a generator rather than a data structure, it should not be
    empty. Passing an empty generator for 'items' (zero yields before return)
    will cause topological_sort() to raise TopologicalSortFailure.

    An empty iterable (e.g. list, tuple, set, ...) produces no items but
    raises no exception.
    """
    provided = set()
    while items:
         remaining_items = []
         emitted = False

         for item, dependencies in items:
             if provided.issuperset(dependencies):
                   yield item
                   provided.add(item)
                   emitted = True
             else:
                   remaining_items.append( (item, dependencies) )

         if not emitted:
             raise TopologicalSortFailure()

         items = remaining_items
Intoxicate answered 26/1, 2016 at 15:56 Comment(0)
I
0

Propose one more sample based on Jon's answer using directed graphs, but it uses dict structure on input and looks more strightforward, also not all items should have dependencies.

import networkx as nx

dig = nx.DiGraph()

items= [
    {'name': 'first', 'depends': ['second', 'forth']},
    {'name': 'second', 'depends': []},
    {'name': 'third', 'depends': ['first']},
    {'name': 'forth', 'depends': []},
    {'name': 'fifth', 'depends': []}
]

for item in items:
    if not item['depends']:
        dig.add_node(item['name'])
    for dep in item['depends']:
        dig.add_edge(dep, item['name'])

print(list(nx.topological_sort(dig)))
# output: ['second', 'forth', 'fifth', 'first', 'third']
Icarian answered 28/4, 2023 at 16:8 Comment(0)
K
0

An example using the pypi package, toposort:

>>> from toposort import toposort
>>> 
>>> x = {
...  'business': {'fleet','address'},
...  'device': {'business','model','status','pack'},
...  'txn': {'device','business','operator'},
...  }

>>> list(toposort(x))
[{'pack', 'model', 'operator', 'status', 'address', 'fleet'}, {'business'}, {'device'}, {'txn'}]
Karelia answered 23/10, 2023 at 8:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.