How to aggregate matching pairs into "connected components" in Python
Real-world problem:

I have data on directors across many firms, but sometimes "John Smith, director of XYZ" and "John Smith, director of ABC" are the same person, sometimes they're not. Also "John J. Smith, director of XYZ" and "John Smith, director of ABC" might be the same person, or might not be. Often examination of additional information (e.g., comparison of biographical data on "John Smith, director of XYZ" and "John Smith, director of ABC") makes it possible to resolve whether two observations are the same person or not.

Conceptual version of the problem:

In that spirit, I am collecting data that will identify matching pairs. For example, suppose I have the following matching pairs: {(a, b), (b, c), (c, d), (d, e), (f, g)}. I want to use the transitivity of the relation "is the same person as" to generate the "connected components" {{a, b, c, d, e}, {f, g}}. That is, {a, b, c, d, e} is one person and {f, g} is another. (An earlier version of the question referred to "cliques", which are apparently something else; this would explain why find_cliques in networkx was giving the "wrong" results for my purposes.)

The following Python code does the job. But I wonder: is there a better (less computationally costly) approach (e.g., using standard or available libraries)?

There are examples here and there that seem related (e.g., Cliques in python), but these are incomplete, so I am not sure what libraries they are referring to or how to set up my data to use them.

Sample Python 2 code:

def get_cliques(pairs):
    from sets import Set  # the sets module exists only in Python 2

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched = False
        for a_set in set_list:  # renamed from "set" to avoid shadowing the built-in
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched = True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))

This produces the desired output: [Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])].

Sample Python 3 code:

This produces the same two components; note that in Python 3 sets print as {'a', 'b', ...} and element order may vary:

def get_cliques(pairs):
    set_list = [set(pairs[0])]

    for pair in pairs[1:]:
        matched = False
        for a_set in set_list:
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched = True
                break
        if not matched:
            set_list.append(set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))
Dunsinane answered 15/1, 2015 at 15:36 Comment(8)
Examples are probably referring to the networkX lib: networkx.github.io/documentation/latest/reference/… It is a computationally costly algorithm (NP-complete to find the largest clique), but you have to check the complexity of your code compared to the NX functions. – Burette
Thanks. I am not a computer scientist in any way, but it seems this problem, which I have encountered twice in one week, is well known: en.wikipedia.org/wiki/Clique_problem. I couldn't get networkx to do what I wanted it to do. My solution takes about 20 minutes on 78,000 pairs (applied from PostgreSQL using PL/Python), so fine for the current problems. – Dunsinane
As a side note, you can replace if not matched with else (else in this case running when the for loop finishes without hitting break). – Rubidium
Unless I'm missing something, your problem isn't the clique problem: it's instead variously called set consolidation, or finding the connected components in a graph (see also Union-Find). – Embonpoint
This does not look like a clique; a and c are not connected. – Rubidium
Yes, you're right DSM, it's more the problem of finding connected components in a graph, and networkx can also find that easily; it's less costly than finding cliques. – Burette
So maybe the question should now be reformulated with "connected components" and "consolidation" tags instead of cliques? – Burette
Apply the union-find algorithm: en.wikipedia.org/wiki/Disjoint-set_data_structure – Revolutionary
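
For reference, a minimal union-find (disjoint-set) sketch of what that last comment suggests; the names here are illustrative, not from any particular library:

from collections import defaultdict

def connected_components(pairs):
    parent = {}

    def find(x):
        # Walk to the root of x's tree, halving the path as we go
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    # Group elements by their root
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]
print(connected_components(pairs))
# e.g. [{'a', 'b', 'c', 'd', 'e'}, {'f', 'g'}] (order may vary)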
Score: 13

With networkX:

import networkx as nx
G1 = nx.Graph()
G1.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("f", "g")])
sorted(nx.connected_components(G1), key=len, reverse=True)

giving (with networkx 1.x; in networkx 2+, connected_components yields sets rather than lists):

[['a', 'd', 'e', 'b', 'c'], ['f', 'g']]

You would still have to check which approach is fastest for your data ...

OP:

This works great! I have this in my PostgreSQL database now: just organize the pairs into a two-column table, then use array_agg() to pass them to the PL/Python function get_connected() (a sample query follows the function below). Thanks.

CREATE OR REPLACE FUNCTION get_connected(
    lhs text[],
    rhs text[])
  RETURNS SETOF text[] AS
$BODY$
    pairs = zip(lhs, rhs)

    import networkx as nx
    G = nx.Graph()
    G.add_edges_from(pairs)
    return sorted(nx.connected_components(G), key=len, reverse=True)

$BODY$ LANGUAGE plpythonu;
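
A call from SQL might then look roughly like this (the table and column names below are hypothetical):

-- Hypothetical two-column table of matched pairs
SELECT get_connected(array_agg(lhs), array_agg(rhs))
FROM director_pairs;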

(Note: I edited answer, as I thought showing this step might be helpful addendum, but too long for a comment.)

Burette answered 15/1, 2015 at 16:44 Comment(4)
This seems to work very nicely. For 10,000 pairs, my code took 34 seconds; this took less than a second. (This is an imperfect benchmark, as I am running via PostgreSQL.) – Dunsinane
Fine! So networkX seems to do the job pretty fast, with a simple API. It has optimized algorithms. – Burette
Regarding Python graph libraries, also consider Python iGraph (igraph.org/python), which has a very similar API and a big performance improvement compared to networkX. If you work with large networks, it is definitely worth it. – Dominance
How do you scale this to billions of nodes? I have a similar issue. – Adlay
Score: 3

I don't believe (correct me if I'm wrong) that this is directly related to the largest clique problem. The definition of a clique (Wikipedia) says that a clique "in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge". In this case, we want to find which nodes can reach each other (even indirectly).

I made a little sample. It builds a graph and traverses it looking for neighbors. This should be pretty efficient since each node is only traversed once when groups are formed.

from collections import defaultdict

def get_cliques(pairs):
    # Build an adjacency map from the pairs
    nodes = defaultdict(list)
    for a, b in pairs:
        if b is not None:
            nodes[a].append((b, nodes[b]))
            nodes[b].append((a, nodes[a]))
        else:
            nodes[a]  # touch the key so isolated nodes get an empty list

    # Depth-first traversal: add all reachable neighbors to the same group
    # (note: recursion depth limits apply to very long chains)
    visited = set()
    def _build_group(key, group):
        if key in visited:
            return
        visited.add(key)
        group.add(key)
        for neighbor, _ in nodes[key]:
            _build_group(neighbor, group)

    groups = []
    for key in nodes.keys():
        if key in visited: continue
        groups.append(set())
        _build_group(key, groups[-1])

    return groups

if __name__ == '__main__':
    pairs = [
        ('a', 'b'), ('b', 'c'), ('b', 'd'), # a "tree"
        ('f', None),                        # no relations
        ('h', 'i'), ('i', 'j'), ('j', 'h')  # circular
    ]
    print get_cliques(pairs)  # Python 2 print statement
    # Output: [set(['a', 'c', 'b', 'd']), set(['f']), set(['i', 'h', 'j'])]

If your data set is best modeled like a graph and really big, maybe a graph database such as Neo4j is appropriate?

Lynettelynn answered 15/1, 2015 at 16:29 Comment(4)
I think if b: would better be if b is not None: -- otherwise you're ruling out 0 as an element for no reason, which happened to break my tests. :-) – Embonpoint
You're not benchmarking all the suggested solutions, are you? :D – Hargrave
@AndréLaszlo Thanks for your answer. First, you are right: this isn't the clique problem. I picked up this term from a casual usage I read while researching this quickly. I haven't yet tried your approach, as the networkX way works so well. Thanks also for directing me to the idea of graph databases. In their book "Graph Databases", Ian Robinson, Jim Webber, and Emil Eifrem make a distinction between the underlying storage and the processing engine. I guess I'm using PostgreSQL for the former, but (I guess) by using networkX, some kind of graph "database" for the latter. – Dunsinane
@IanGow I haven't read "Graph Databases", but the idea of separating storage and processing is interesting. I'm guessing the "facts" that you are storing are pretty flat or fit well in the relational world? Otherwise I think Neo4j, for example, might have storage that is also optimized for generic graph operations -- I'm no expert. I'm glad that you found a solution that works well! – Hargrave
Score: 2

DSM's comment made me look for set consolidation algorithms in Python. Rosetta Code has two versions of the same algorithm. Example use (the non-recursive version):

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

# Copied from Rosetta Code
def consolidate(sets):
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

print consolidate([set(pair) for pair in pairs])  # Python 2 print statement
# Output: [set(['a', 'b', 'c', 'd', 'e']), set(['f', 'g'])] (element order may vary)
Lynettelynn answered 15/1, 2015 at 16:45 Comment(2)
It's quite likely this will fail for the OP's case, as s is very long and Python's recursion limits are very low. – Embonpoint
True, should have picked the other one :) Changed to the iterative version. – Hargrave
Score: 1

I tried an alternate implementation using dictionaries as lookups and may have gotten a small reduction in computational latency.

# Modified to use a dictionary
from collections import defaultdict

def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a, b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)  # note: shadows the built-in reversed()
  for (k, v) in clique.iteritems(): reversed[v].append(k)  # iteritems is Python 2 only
  return reversed.values()

And just to convince myself that it returns the right result (get_cliques1 here is your original Python 2 solution):

>>> from cliques import *
>>> get_cliques1(pairs) # Original Python 2 solution
[Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])]
>>> get_cliques2(pairs) # Dictionary-based alternative
[['a', 'c', 'b', 'e', 'd'], ['g', 'f']]

Timing info in seconds (with 10 million repetitions):

$ python get_times.py 
get_cliques: 75.1285209656
get_cliques2: 69.9816100597

For the sake of completeness and reference, this is the full listing of both cliques.py and the get_times.py timing script:

# cliques.py
# Python 2.7
from collections import defaultdict
from sets import Set  # I moved your import out of the function to try to get closer to apples-apples

# Original Python 2 solution
def get_cliques1(pairs):

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched = False
        for a_set in set_list:  # renamed from "set" to avoid shadowing the built-in
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched = True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

# Modified to use a dictionary
def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a,b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)  # note: shadows the built-in reversed()
  for (k, v) in clique.iteritems(): reversed[v].append(k)  # iteritems is Python 2 only
  return reversed.values()

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]


# get_times.py
# Python 2.7
from timeit import timeit

REPS = 10000000

print "get_cliques: " + str(timeit(
  stmt='get_cliques1(pairs)', setup='from cliques import get_cliques1, pairs',
  number=REPS
))
print "get_cliques2: " + str(timeit(
  stmt='get_cliques2(pairs)', setup='from cliques import get_cliques2, pairs',
  number=REPS
))

So at least in this contrived scenario, there is a measurable speedup. It's admittedly not groundbreaking, and I'm sure I left some performance bits on the table in my implementation, but maybe it will help get you thinking about other alternatives?

Thrower answered 15/1, 2015 at 16:24 Comment(1)
Unfortunately I don't think your get_cliques2 resolves all the links; consider get_cliques2([[0,2],[3,1],[2,1]]). – Embonpoint
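
One way to patch that, sketched below: keep the dictionary approach but also track each group's members, so that a pair bridging two existing groups relabels one group wholesale. The function name and structure here are illustrative, not a definitive fix:

from collections import defaultdict

def get_cliques2_fixed(pairs):
    label = {}                  # element -> group label
    members = defaultdict(set)  # group label -> elements
    next_label = 1
    for a, b in pairs:
        la, lb = label.get(a), label.get(b)
        if la is None and lb is None:
            la = next_label     # brand-new group
            next_label += 1
        elif la is None:
            la = lb
        elif lb is not None and la != lb:
            # The pair bridges two groups: relabel the smaller one
            if len(members[la]) < len(members[lb]):
                la, lb = lb, la
            for x in members[lb]:
                label[x] = la
            members[la] |= members.pop(lb)
        for x in (a, b):
            label[x] = la
            members[la].add(x)
    return list(members.values())

print(get_cliques2_fixed([(0, 2), (3, 1), (2, 1)]))
# e.g. [{0, 1, 2, 3}]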
