Python: simple list merging based on intersections

Consider some lists of integers:

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

The question is how to merge lists that have at least one common element. So the result for just the part shown above will be as follows:

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

What is the most efficient way to do this on large data (the elements are just numbers)? Is a tree structure something to think about? I do the job now by converting the lists to sets and iterating over intersections, but it is slow! Furthermore, I have a feeling it is elementary! In addition, the implementation lacks something (unknown), because some lists sometimes remain unmerged! Having said that, if you propose a self-implementation, please be generous and provide simple sample code [apparently Python is my favorite :)] or pseudo-code.
Update 1: Here is the code I was using:

#--------------------------------------
lsts = [[0,1,3],
        [1,0,3,4,5,10,11],
        [2,8],
        [3,1,0,16]];
#--------------------------------------

The function is (buggy!!):

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i+1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else: j += 1                        #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------

The result is:

#--------------------------------------
>>> merge(lsts)
[[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------

Update 2: In my experience, the code given by Niklas Baumstark below proved to be a bit faster for the simple cases. I have not tested the method given by "Hooked" yet, since it is a completely different approach (by the way, it seems interesting). The testing procedure for all of these could be really hard, or even impossible, to verify the results with. The real data set I will use is so large and complex that it is impossible to trace any error just by repetition. That is, I need to be 100% satisfied with the reliability of the method before pushing it into its place within a large code as a module. For now, Niklas's method is faster, and the answer for simple sets is of course correct.
However, how can I be sure that it works well for a really large data set, since I will not be able to trace the errors visually?
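
One possible sanity check for that (a sketch of my own, echoing the testing suggestions in the comments and answers below): run two independent implementations on randomly generated inputs and require that they produce identical partitions. Empty groups are ignored, so implementations that keep or drop empty sets still compare equal; the function names in the usage line are hypothetical.

from copy import deepcopy
import random

def canonical(merged):
    """Normalize a merge result (list of sets or lists) for comparison."""
    return sorted(tuple(sorted(group)) for group in merged if group)

def cross_check(merge_a, merge_b, trials=100):
    """Raise if the two merge implementations ever disagree on random data."""
    for _ in range(trials):
        data = [random.sample(range(200), random.randint(0, 10))
                for _ in range(50)]
        if canonical(merge_a(deepcopy(data))) != canonical(merge_b(deepcopy(data))):
            raise AssertionError("implementations disagree on %r" % (data,))

# e.g. cross_check(merge_niklas, merge_rik)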

Update 3: Note that the reliability of the method is much more important than speed for this problem. Ultimately, I hope to translate the Python code to Fortran for maximum performance.

Update 4:
There are many interesting points in this post, generously given answers, and constructive comments. I would recommend reading all of them thoroughly. Please accept my appreciation for the development of the question, the amazing answers, and the constructive comments and discussion.

Rackety answered 2/2, 2012 at 10:36 Comment(21)
Should the merge be recursive? I.e., if you have (1) [0, 1, 2], (2) [1, 3, 4] and (3) [4, 5, 6], do you expect the result to be one list since the union of (1) and (2) will share the 4 with (3), or do you expect two lists since (1) and (3) are disjoint?Orr
If you show us the code you have now, we may be able to help you find your bug.Sulcus
"large data" means many many lists or very long lists? Maybe a smart multithreading can buy you some time.Herwin
@FerdinandBeyer according to your explanation it should be recursive. So at the end the remaining lists have no common element.Rackety
@RikPoggi almost both: many lists which each of them could be long.Rackety
The code sample seems incomplete, specifically the while statement.Springclean
Janne is right. If you could provide the correct code, I'd write a small benchmark to compare the performance of our solutions.Anthropoid
@JanneKarila The code was complete but there was a problem with SO rendering. So I put them completely as html as you can see now.Rackety
@NiklasBaumstark As I wrote above it was from the beginning complete but there is a strange difficulty in SO rendering. You could check if you tried to see the source of the page. Anyway I replaced all with HTML version and it should be fine now for all.Rackety
Sounds like the sort of algorithm used where given assorted chunks of a graph--e.g. each node and its edges--you need to sort out the connected subgraphs.Finella
Your requirements are not complete: what should the result be for a list like [[1, 2], [2, 3], [3, 4]] - [[1,2,3,4]] or [[1,2,3], [2,3,4]]? What about traversal?Federalize
To test the answers you got: Generate a good testset. a) Figure out how big you want it to be. b) Generate enough unique numbers to populate it. (E.g., keep adding a random small increment to the last number you used). c) Randomize their order and build enough lists. You now have a bunch of lists with NO overlap. d) Now choose some lists that should be merged, and insert a number (or more) in common. Presto, you have a test set.Disherison
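
A minimal sketch of that recipe (function and parameter names are mine, purely for illustration):

import random

def make_test_set(num_lists=1000, list_size=20, merges=200):
    # b) generate enough unique numbers by adding small random increments
    pool, x = [], 0
    for _ in range(num_lists * list_size):
        x += random.randint(1, 10)
        pool.append(x)
    # c) randomize their order and cut them into disjoint lists
    random.shuffle(pool)
    lists = [pool[i * list_size:(i + 1) * list_size] for i in range(num_lists)]
    # d) force some pairs of lists to overlap by inserting a shared number
    for _ in range(merges):
        a, b = random.sample(range(num_lists), 2)
        lists[a].append(lists[b][0])
    return lists
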
The code in my answer is slightly faster, and also simpler, than the other attempts. To me at least, it is simple enough that I can verify it is logically correct much more easily than the other implementations. It is the same (obvious) algorithm, and also implemented using Python sets.Schober
It is still not specified how to merge: please specify whether [[1, 2], [2, 3], [3, 4]] merges to [[1,2,3,4]] (with traversal) or to [[1,2,3], [2,3,4]] (only pairwise intersections, no chaining). A more complex question is how to merge [[1, 2], [2, 3], [3, 4], [1, 2, 3], [4, 5]] - no idea, since the many variants depend on the use case!Federalize
It looks like you might still not know why your starting code is buggy, so here's why: If you come to a list that overlaps with two lists you've already seen, you'll only merge it with one and move on. By the way if "reliability is much more important than speed", your goal should be a good testing routine (see my earlier comment), not "simple" code. You'll never know if you thought of all cases just by eyeballing the code. Fun problem anyway, thanks!Disherison
OK, I think my new function is both well fitted to adoption to Fortran and more efficient than the other solutions, especially if most of the sets are distinct.Schober
Developer: One of the advantages of having so many nice solutions is that you can compare the answers! From my test it seems that as of yesterday, all the solutions that @Niklas compared are correct.Disherison
@agf: Can't argue about the Fortran part, but I still can't reproduce the speed advantage of the Python code you're speaking about all the time. Would you please comment on my answer how the parameters for my benchmark (number of lists, number of classes, list sizes, etc.) would have to be to actually show that advantage?Anthropoid
@agf: Made the benchmark more adaptable, your Pythonic solution outperforms the other solutions in certain cases (but not by much).Anthropoid
I just came across the following slides: www.eecs.wsu.edu/~ananth/CptS223/Lectures/UnionFind.pdf They describe your exact problem and discuss the pros and cons of different approaches and data structures to tackle it.Anthropoid
excellent question. Has somebody implemented this using (Py)Spark and/or can point me in the right direction on how to do it?Pentobarbital

My attempt:

def merge(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = True
                    common |= x
            results.append(common)
        sets = results
    return sets

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print merge(lst)

Benchmark:

import random

# adapt parameters to your own usage scenario
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5

if False:  # change to True to generate the test data file (takes a while)
    with open("/tmp/test.txt", "w") as f:
        lists = []
        classes = [
            range(class_size * i, class_size * (i + 1)) for i in range(class_count)
        ]
        for c in classes:
            # distribute each class across list_count_per_class lists
            for i in xrange(list_count_per_class):
                lst = []
                if random.random() < large_list_probability:
                    size = random.choice(large_list_sizes)
                else:
                    size = random.choice(small_list_sizes)
                nums = set(c)
                for j in xrange(size):
                    x = random.choice(list(nums))
                    lst.append(x)
                    nums.remove(x)
                random.shuffle(lst)
                lists.append(lst)
        random.shuffle(lists)
        for lst in lists:
            f.write(" ".join(str(x) for x in lst) + "\n")

setup = """
# Niklas'
def merge_niklas(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

# Rik's
def merge_rik(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set
    return results

# katrielalex's
def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

import networkx

def merge_katrielalex(lsts):
    g = networkx.Graph()
    for lst in lsts:
        for edge in pairs(lst):
            g.add_edge(*edge)
    return networkx.connected_components(g)

# agf's (optimized)
from collections import deque

def merge_agf_optimized(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

# agf's (simple)
def merge_agf_simple(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

# alexis'
def merge_alexis(data):
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin

                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin

                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest

    # Filter out the empty bins
    have = [m for m in data if m]
    return have

def locatebin(bins, n):
    while bins[n] != n:
        n = bins[n]
    return n

lsts = []
size = 0
num = 0
max = 0
for line in open("/tmp/test.txt", "r"):
    lst = [int(x) for x in line.split()]
    size += len(lst)
    if len(lst) > max:
        max = len(lst)
    num += 1
    lsts.append(lst)
"""

setup += """
print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
""".format(class_count=class_count)

import timeit
print "niklas"
print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
print "rik"
print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
print "katrielalex"
print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
print "agf (1)"
print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
print "agf (2)"
print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
print "alexis"
print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)

These timings are obviously dependent on the specific parameters of the benchmark, like the number of classes, number of lists, list size, etc. Adapt those parameters to your needs to get more helpful results.

Below are some example outputs on my machine for different parameters. They show that all the algorithms have their strengths and weaknesses, depending on the kind of input they get:

=====================
# many disjoint classes, large lists
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
=====================

niklas
5000 lists, 50 equally distributed classes, average size 298, max size 999
4.80084705353
rik
5000 lists, 50 equally distributed classes, average size 298, max size 999
9.49251699448
katrielalex
5000 lists, 50 equally distributed classes, average size 298, max size 999
21.5317108631
agf (1)
5000 lists, 50 equally distributed classes, average size 298, max size 999
8.61671280861
agf (2)
5000 lists, 50 equally distributed classes, average size 298, max size 999
5.18117713928
=> alexis
=> 5000 lists, 50 equally distributed classes, average size 298, max size 999
=> 3.73504281044

===================
# fewer classes, large lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
===================

niklas
4500 lists, 15 equally distributed classes, average size 296, max size 999
1.79993700981
rik
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.58237695694
katrielalex
4500 lists, 15 equally distributed classes, average size 296, max size 999
19.5465381145
agf (1)
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.75445604324
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 296, max size 999
=> 1.77850699425
alexis
4500 lists, 15 equally distributed classes, average size 296, max size 999
3.23530197144

===================
# fewer classes, smaller lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.1
===================

niklas
4500 lists, 15 equally distributed classes, average size 95, max size 997
0.773697137833
rik
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.0523750782
katrielalex
4500 lists, 15 equally distributed classes, average size 95, max size 997
6.04466891289
agf (1)
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.20285701752
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 95, max size 997
=> 0.714507102966
alexis
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.1286110878
Anthropoid answered 2/2, 2012 at 10:36 Comment(18)
You could use not x.isdisjoint(common) instead of x & common to avoid building the full intersection.Springclean
lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.Rackety
@Developer: Yeah, you are right. My thinking was faulty, disjointness is not an equivalency relation!Anthropoid
If you try the given lists you will see 9 exists in two sets of outputs. Therefore the code suffers from the problem originally mentioned in the question, not reliability!Rackety
@Developer: I fixed it. It's slower now, obviously, because one pass through the list of lists doesn't seem to be sufficient.Anthropoid
@Nik: I improved my code, it does much better under your timing test, could you please update the results?Herwin
@developer: Sorry, I can't give any guarantees.Anthropoid
I posted a simplified speed test based on the one you used in your answer in mine; they seem to give equivalent results. Any particular reason you made yours so complicated?Schober
@agf: I wanted to make certain parameters configurable, like the number of lists, the number of classes, average/min/max list length etc., to test a few different scenarios. Also, I wanted to save it to a file for reproducibility and to avoid having to recreate the test data dynamically every time. I don't feel that it's particularly complicated, though, it's just not very DRY. I think I just wrote it down in one burst and because it worked, I didn't bother to check if it could be shortened. By the way, I added your code and it actually seems to be a bit faster than Rik's or mine :) Nice.Anthropoid
@Niklas, your timing code is buggy because timeit only runs setup once per call. Do from copy import deepcopy and call timeit like this: timeit.timeit("merge1(deepcopy(lsts))", setup=setup, number=10)Disherison
@alexis: Thanks alot, I didn't know that. Adding the copying into the measured code doesn't appeal to me very much, is there another alternative to run setup code once per run?Anthropoid
I don't know of a way to run code per trial but exclude it from the timing. But I was only half right: As @Schober pointed out, (most?) routines immediately make a list of sets and leave the original data alone. With this caveat, you can leave the tests alone. I modified my code to steer clear of the problem.Disherison
I'm curious what you all think about my solution. As @robert pointed out, the test code has very few distinct bins. This favors algorithms that leave all the work to python's built-in operations, but it doesn't scale well for bigger datasets.Disherison
@alexis: I can integrate your code into the benchmark if you want. Also, what test code do you speak of?Anthropoid
Yes, please add my code. I meant the code that generates /tmp/test.txt. Could you tweak it to generate data with more distinct bins (20-100 distinct results instead of 5) and benchmark that too? It makes an unbelievable difference.Disherison
@alexis: Yes, I made it more adaptable :) Your code really performs best so far if there are many classes.Anthropoid
This is such a fun problem, anyway :-) Thanks for providing the test setup.Disherison
@NiklasB. I was using your solution for the same kind of problem statement and it worked like a charm. But I want to merge lists based on common elements, except for 2 values. For example, for 63 and 65 I don't want to merge sets if those are the common elements; for the rest of them we can. Any help would be appreciated.Infold

I tried to summarize everything that's been said and done about this topic in this question and in the duplicate one.

I tried to test and time every solution (all the code here).

Testing

This is the TestCase from the testing module:

class MergeTestCase(unittest.TestCase):

    def setUp(self):
        with open('./lists/test_list.txt') as f:
            self.lsts = json.loads(f.read())
        self.merged = self.merge_func(deepcopy(self.lsts))

    def test_disjoint(self):
        """Check disjoint-ness of merged results"""
        from itertools import combinations
        for a,b in combinations(self.merged, 2):
            self.assertTrue(a.isdisjoint(b))

    def test_coverage(self):    # Credit to katrielalex
        """Check coverage original data"""
        merged_flat = set()
        for s in self.merged:
            merged_flat |= s

        original_flat = set()
        for lst in self.lsts:
            original_flat |= set(lst)

        self.assertTrue(merged_flat == original_flat)

    def test_subset(self):      # Credit to WolframH
        """Check that every original data is a subset"""
        for lst in self.lsts:
            self.assertTrue(any(set(lst) <= e for e in self.merged))

This test assumes a list of sets as the result, so I couldn't test a couple of solutions that worked with lists (a possible workaround is sketched below the list).

I couldn't test the following:

katrielalex
steabert
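
A hypothetical wrapper (my own sketch, not part of the original test suite) that would let such list-returning solutions run through the same TestCase by normalizing their output to sets:

def as_sets(merge_func):
    """Wrap a merge function so that its result is a list of sets."""
    def wrapper(lsts):
        return [set(group) for group in merge_func(lsts)]
    return wrapper

# e.g. self.merge_func = as_sets(merge_katrielalex)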

Among the ones I could test, two failed:

  -- Going to test: agf (optimized) --
Check disjoint-ness of merged results ... FAIL

  -- Going to test: robert king --
Check disjoint-ness of merged results ... FAIL

Timing

The performances are strongly related with the data test employed.

So far three answers tried to time theirs and others solution. Since they used different testing data they had different results.

  1. Niklas' benchmark is very tweakable. With his benchmark one could do different tests by changing some parameters.

    I've used the same three sets of parameters he used in his own answer, and I put them in three different files:

    filename = './lists/timing_1.txt'
    class_count = 50,
    class_size = 1000,
    list_count_per_class = 100,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,
    
    filename = './lists/timing_2.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,
    
    filename = './lists/timing_3.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.1,
    

    These are the results that I got:

    From file: timing_1.txt

    Timing with: >> Niklas << Benchmark
    Info: 5000 lists, average size 305, max size 999
    
    Timing Results:
    10.434  -- alexis
    11.476  -- agf
    11.555  -- Niklas B.
    13.622  -- Rik. Poggi
    14.016  -- agf (optimized)
    14.057  -- ChessMaster
    20.208  -- katrielalex
    21.697  -- steabert
    25.101  -- robert king
    76.870  -- Sven Marnach
    133.399  -- hochl
    

    From file: timing_2.txt

    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 305, max size 999
    
    Timing Results:
    8.247  -- Niklas B.
    8.286  -- agf
    8.637  -- Rik. Poggi
    8.967  -- alexis
    9.090  -- ChessMaster
    9.091  -- agf (optimized)
    18.186  -- katrielalex
    19.543  -- steabert
    22.852  -- robert king
    70.486  -- Sven Marnach
    104.405  -- hochl
    

    From file: timing_3.txt

    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 98, max size 999
    
    Timing Results:
    2.746  -- agf
    2.850  -- Niklas B.
    2.887  -- Rik. Poggi
    2.972  -- alexis
    3.077  -- ChessMaster
    3.174  -- agf (optimized)
    5.811  -- katrielalex
    7.208  -- robert king
    9.193  -- steabert
    23.536  -- Sven Marnach
    37.436  -- hochl
    
  2. With Sven's testing data I got the following results:

    Timing with: >> Sven << Benchmark
    Info: 200 lists, average size 10, max size 10
    
    Timing Results:
    2.053  -- alexis
    2.199  -- ChessMaster
    2.410  -- agf (optimized)
    3.394  -- agf
    3.398  -- Rik. Poggi
    3.640  -- robert king
    3.719  -- steabert
    3.776  -- Niklas B.
    3.888  -- hochl
    4.610  -- Sven Marnach
    5.018  -- katrielalex
    
  3. And finally with Agf's benchmark I got:

    Timing with: >> Agf << Benchmark
    Info: 2000 lists, average size 246, max size 500
    
    Timing Results:
    3.446  -- Rik. Poggi
    3.500  -- ChessMaster
    3.520  -- agf (optimized)
    3.527  -- Niklas B.
    3.527  -- agf
    3.902  -- hochl
    5.080  -- alexis
    15.997  -- steabert
    16.422  -- katrielalex
    18.317  -- robert king
    1257.152  -- Sven Marnach
    

As I said at the beginning, all the code is available at this git repository. All the merging functions are in a file called core.py, and every function there whose name ends with _merge will be auto-loaded during the tests, so it shouldn't be hard to add/test/improve your own solution.

Let me also know if there's something wrong, it's been a lot of coding and I could use a couple of fresh eyes :)

Herwin answered 26/2, 2012 at 16:38 Comment(5)
how about my rewrite of Niklas B. answer. just wondering if the timings of that are relevant.Seroka
@ChessMaster: Sometimes does slightly better, others slightly worse, this is why I didn't put it among the results. If you're interested you can try it yourself, at the link there's a git repository with all the merge functions in a file called core.py. Every function there with the name ending with _merge will be auto loaded. I just pushed yours, so you'll find it already there in skip mode. :)Herwin
Thanks for your great effort.Rackety
Sometimes I'm really surprised how much high-quality effort and knowledge is put into the answers on this site. Really nice job putting this compilation together!Anthropoid
Where is the answer by niklas-b? None of the 14 answers on this page are by niklas-b...Erickaericksen

Using Matrix Manipulations

Let me preface this answer with the following comment:

THIS IS THE WRONG WAY TO DO THIS. IT IS PRONE TO NUMERICAL INSTABILITY AND IS MUCH SLOWER THAN THE OTHER METHODS PRESENTED, USE AT YOUR OWN RISK.

That being said, I couldn't resist solving the problem from a dynamical point of view (and I hope you'll get a fresh perspective on the problem). In theory this should work all the time, but eigenvalue calculations can often fail. The idea is to think of your list as a flow from rows to columns. If two rows share a common value there is a connecting flow between them. If we were to think of these flows as water, we would see that the flows cluster into little pools when there is a connecting path between them. For simplicity, I'm going to use a smaller set, though it works with your data set as well:

from numpy import where, newaxis
from scipy import linalg, array, zeros

X = [[0,1,3],[2],[3,1]]

We need to convert the data into a flow graph. If row i flows into value j we put it in the matrix. Here we have 3 rows and 4 unique values:

A = zeros((4,len(X)), dtype=float)
for i,row in enumerate(X):
    for val in row: A[val,i] = 1

In general, you'll need to change the 4 to capture the number of unique values you have. If the set is a list of integers starting from 0, as we have, you can simply make this the largest number plus one. We now perform an eigenvalue decomposition, an SVD to be exact, since our matrix is not square.

S  = linalg.svd(A)

We want to keep only the 3x3 portion of this decomposition, since it will represent the flow of the pools. In fact we only want the absolute values of this matrix; we only care whether there is a flow in this cluster space.

M  = abs(S[2])

We can think of this matrix M as a Markov matrix and make it explicit by row normalizing. Once we have this we compute the (left) eigenvalue decomp. of this matrix.

M /=  M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)
V = abs(V)

Now a disconnected (non-ergodic) Markov matrix has the nice property that, for each non-connected cluster, there is an eigenvalue of unity. The eigenvectors associated with these unity values are the ones we want:

idx = where(U > .999)[0]
C = V.T[idx] > 0

I have to use .999 due to the aforementioned numerical instability. At this point, we are done! Each independent cluster can now pull the corresponding rows out:

for cluster in C:
    print where(A[:,cluster].sum(axis=1))[0]

Which gives, as intended:

[0 1 3]
[2]

Change X to your lst and you'll get: [ 0 1 3 4 5 10 11 16] [2 8].
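
If your values are not contiguous integers starting at 0, one option (my own sketch, along the lines of the remapping discussed in the comments below) is to map them to indices before building A:

from numpy import zeros

X = [[65, 17, 5], [6, 97], [17, 24]]          # arbitrary, possibly sparse labels

# map each distinct value to a row index 0..K-1
unique_vals = sorted(set(v for row in X for v in row))
index_of = dict((v, k) for k, v in enumerate(unique_vals))

A = zeros((len(unique_vals), len(X)), dtype=float)
for i, row in enumerate(X):
    for v in row:
        A[index_of[v], i] = 1
# the rest of the pipeline (SVD, Markov matrix, eigenvectors) is unchanged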


Addendum

Why might this be useful? I don't know where your underlying data comes from, but what happens when the connections are not absolute? Say row 1 has entry 3 80% of the time - how would you generalize the problem? The flow method above would work just fine, and would be completely parametrized by that .999 value: the further away from unity it is, the looser the association.


Visual Representation

Since a picture is worth 1K words, here are the plots of the matrices A and V for my example and your lst respectively. Notice how V splits into two clusters (it is a block-diagonal matrix with two blocks after permutation), since for each example there were only two unique lists!

(Figures: the matrix plots for my example and for your sample data.)


Faster Implementation

In hindsight, I realized that you can skip the SVD step and compute only a single decomp:

from numpy import dot

M = dot(A.T, A)
M /=  M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)

The advantage with this method (besides speed) is that M is now symmetric, hence the computation can be faster and more accurate (no imaginary values to worry about).

Petrolic answered 2/2, 2012 at 16:1 Comment(12)
Thanks for the answer. It requires a bit deeper reading. So I am interested in testing it and see what happens.Rackety
How did you generate these really nice images?Larger
@ZacharyYoung check MatPlotLib for graphing, see Gallery.Rackety
@Developer: Hah, that's great! I've seen tons of SO questions about MatPlotLib but never bothered to check it out. Looks like I just found my new favorite SO tag :)Larger
At the stage A = zeros((4,len(X)), dtype=float) we then need to know how many unique values exist in the entire mega-list! For the example given it was '4'; what about many unknown large lists of lists? How do we find the unique values among them?Rackety
It fails if you change X to [[0,1,3],[20],[3,1]]: the method fails since obviously there is no index 20 while the dimension is 4, at the code line: for i,row in enumerate(X): for val in row: A[val,i] = 1.Rackety
Maybe I got lost, what is B in your faster implementation?Herwin
@Developer, let me take those questions in order. @Zachary Young is right about matplotlib, it's a very fast and easy library to generate the images. You're right about needing to know the exact number of unique values, that's where the 4 comes from. In many problems you might already know this, or even an upper bound. As long as it is larger you'll be ok. You also aren't limited to an integer range as you can always map your values onto an order set of the integers. When I tested your 'idx` I changed the 4 to a 16, I'll make that more explicit in an edit.Petrolic
@Rackety I just want to be crystal clear that this isn't the most computationally efficient way - but a unique solution that has generalizations far beyond the original scope of the problem. In that, I feel the answer is useful.Petrolic
@RikPoggi B a typo, should be M, I will fix.Petrolic
@Petrolic I always appreciate different approaches to solve the problem. That is, I want to get the most from your answer. My understanding is that I will need to map all numbers in the lists to a series of continuous indices (to be used in A[val,i]). I have already a function to flatten the lists and the uniq function from scipy does the rest. However replacing values in the lists with indices, yet I don't know an efficient way. After that your method should work.Rackety
@Rackety let C be that unique list you have. len(C) gives the number of unique elements (the 4 value from the comments above). From here make the mapping dict(zip(C, range(len(C)))). Then create a new array; every value in the old array is replaced by the one from the mapping. When complete, use the inverse mapping dict(zip(range(len(C)), C)) to restore.Petrolic

Here's my answer. I haven't checked it against today's batch of answers.

The intersection-based algorithms are O(N^2) since they check each new set against all the existing ones, so I used an approach that indexes each number and runs in close to O(N) (if we accept that dictionary lookups are O(1)). Then I ran the benchmarks and felt like a complete idiot because it ran slower, but on closer inspection it turned out that the test data ends up with only a handful of distinct result sets, so the quadratic algorithms don't have a lot of work to do. Test it with more than 10-15 distinct bins and my algorithm is much faster. Try test data with more than 50 distinct bins, and it is enormously faster.

(Edit: There was also a problem with the way the benchmark is run, but I was wrong in my diagnosis. I altered my code to work with the way the repeated tests are run).

def mergelists5(data):
    """Check each number in our arrays only once, merging when we find
    a number we have seen before.
    """

    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data ]  # Convert to sets    
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue # already in the same bin

                if dest > r:
                    dest, r = r, dest   # always merge into the smallest bin

                data[dest].update(data[r]) 
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest 

    # Filter out the empty bins
    have = [ m for m in data if m ]
    print len(have), "groups in result"
    return have


def locatebin(bins, n):
    """
    Find the bin where list n has ended up: Follow bin references until
    we find a bin that has not moved.
    """
    while bins[n] != n:
        n = bins[n]
    return n
Disherison answered 26/2, 2012 at 12:56 Comment(4)
This code uses sets to get around the way timeit does repetition, but it works just as well (slightly faster, in fact) if you just append lists to each other and discard duplicates only when have is constructed. So it may have the benefit of being more Fortran-like (I never thought I would say this in a positive sense! :-)Disherison
(+1) Interesting analysis, and it's also one of the fastest solution. I've test it among all the others in various situation in my summary :)Herwin
Thanks. It should do better if you don't have zillions of collisions. I actually took out the structure that optimizes merge tracking, because the data I tested on only have a couple of thousand collisions and it didn't seem worth the complexity. It all depends on what your data really looks like, I guess.Disherison
For use in Python 3+, the behaviour of the range function is different and may not be shuffled. Defining bins as list(range(len(data))) will fix this issue. See this stackoverflow for additional information: #20484695Melodie

EDIT: OK, the other questions has been closed, posting here.

Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.

def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx
g = networkx.Graph()
for sub_list in lists:
    for edge in pairs(sub_list):
            g.add_edge(*edge)

networkx.connected_components(g)
[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

Explanation

We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.

Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.
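
For reference, a sketch of the same idea adapted to Python 3 and current networkx releases, where connected_components yields sets and iterators no longer have a .next() method (this is my adaptation, not the original answer's code):

import networkx as nx

def pairs(lst):
    it = iter(lst)
    first = prev = next(it)
    for item in it:
        yield prev, item
        prev = item
    yield prev, first                 # close the cycle

def merge(lists):
    g = nx.Graph()
    for lst in lists:
        if not lst:
            continue
        g.add_nodes_from(lst)         # keeps single-element lists as nodes
        for edge in pairs(lst):
            g.add_edge(*edge)
    return [set(c) for c in nx.connected_components(g)]

print(merge([[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]))
# e.g. [{1, 2, 3, 5, 6}, {8, 9, 10}, {11, 12, 13}]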

Hardhearted answered 19/2, 2012 at 22:46 Comment(6)
Actually I was under the impression that this is already a solved problem, so I don't see much sense in reviving this old thread. However, because I also played with the idea of using a graph library for that, I integrated this solution into my benchmark. It doesn't seem to compete too well, unfortunately, looks like the Python hackers did a nice job at implementing and optimizing sets :)Anthropoid
Thanks for your different approach. It is really good to bring networkx to work for this purpose. It is very good package.Rackety
@NiklasB. cheers =) when I was looking at the question it seemed that nobody had actually written a working solution, but I guess I read it wrong.Hardhearted
I love networkx. btw.. this could have been more simple if for each list you just added an edge between the first element in the list and all the other elements in the list. It could have been about 4 lines of code.Haskell
@robertking feel free to edit =)... I just wrote what came to mind!Hardhearted
@katrielalex heh, I added my own answer. Hopefully it should be faster since each node in a set is only connected to one node from the set, so maximum length path is 2, so it won't take many operations to work out connectedness.Haskell

This new function only does the minimum necessary number of disjointness tests, something the other similar solutions fail to do. It also uses a deque to avoid as many linear time operations as possible, like list slicing and deletion from early in the list.

from collections import deque

def merge(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

The less overlap between the sets in a given set of data, the better this will do compared to the other functions.

Here is an example case. If you have 4 sets, you need to compare:

1, 2
1, 3
1, 4
2, 3
2, 4
3, 4

If 1 overlaps with 3 (so 3 is merged into 1), then 2 needs to be re-tested to see if it now overlaps with the merged 1; only then can testing 2 against 3 safely be skipped.

There are two ways to deal with this. The first is to restart the testing of set 1 against the other sets after each overlap and merge. The second is to continue with the testing by comparing 1 with 4, then going back and re-testing. The latter results in fewer disjointness tests, as more merges happen in a single pass, so on the re-test pass, there are fewer sets left to test against.

The problem is to track which sets have to be re-tested. In the above example, 1 needs to be re-tested against 2 but not against 4, since 1 was already in its current state before 4 was tested the first time.

The disjoint counter allows this to be tracked.


My answer doesn't help with the main problem of finding an improved algorithm for recoding into FORTRAN; it is just what appears to me to be the simplest and most elegant way to implement the algorithm in Python.

According to my testing (or the test in the accepted answer), it's slightly (up to 10%) faster than the next fastest solution.

def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

No need for the un-Pythonic counters (i, range) or complicated mutation (del, pop, insert) used in the other implementations. It uses only simple iteration, merges overlapping sets in the simplest manner, and builds a single new list on each pass through the data.

My (faster and simpler) version of the test code:

import random

tenk = range(10000)
lsts = [random.sample(tenk, random.randint(0, 500)) for _ in range(2000)]

setup = """
def merge0(lists):
  newsets, sets = [set(lst) for lst in lists if lst], []
  while len(sets) != len(newsets):
    sets, newsets = newsets, []
    for aset in sets:
      for eachset in newsets:
        if not aset.isdisjoint(eachset):
          eachset.update(aset)
          break
      else:
        newsets.append(aset)
  return newsets

def merge1(lsts):
  sets = [set(lst) for lst in lsts if lst]
  merged = 1
  while merged:
    merged = 0
    results = []
    while sets:
      common, rest = sets[0], sets[1:]
      sets = []
      for x in rest:
        if x.isdisjoint(common):
          sets.append(x)
        else:
          merged = 1
          common |= x
      results.append(common)
    sets = results
  return sets

lsts = """ + repr(lsts)

import timeit
print timeit.timeit("merge0(lsts)", setup=setup, number=10)
print timeit.timeit("merge1(lsts)", setup=setup, number=10)
Schober answered 22/2, 2012 at 18:14 Comment(8)
thank you again for pointing out my mistake, i think my code is fine now.Seroka
Good to see another method. Thanks. It is elegant.Rackety
What's wrong with pop and insert? It's the obvious way to treat a list as a stack, and del just removes elements from a list. Also the next fastest solution happens to be mine, so you should timeit against that. With your test speed I got that mine is still faster, with Niklas test sometimes yours is faster, sometimes it's the other way around, but the difference seems to be very little, nothing that a local declaration can't shadow. Sir, there's no need to brag, if you had questions about my code you could have just asked there instead of mocking it here.Herwin
@RikPoggi pop from the end of a list or deque is constant time, while insert / del from elsewhere in the list is O(n) because all the items after that point in the list have to be moved.Schober
Have you actually looked at my code? Or are you just throwing random Big-O? I didn't guess, I've timed and profiled my code, and it also seems that I've done a good job since according to your own test mine is faster than yours. There's one insert(0,i) that's going to cost exactly like an append(i) (and I use insert to avoid to reverse the list later). del doesn't really cost much, and anyway that's how my code works: it does a minimum amount of disjoint check at the price of keeping the result table updated.Herwin
The benchmarking code is buggy: timeit() does not rerun the setup code for each iteration, and these mergers use list destructively. Change the invocation to print timeit.timeit("merge0(deepcopy(lsts))", setup=setup, number=10), and your times will go up by a factor of 10. (Add from copy import deepcopy to the setup)Disherison
@Disherison Sorry, they don't. The list comprehension creates a new list of sets without changing the original list of lists.Schober
Hmm, you're right that if only the sets are modified, the list should stay the same. Some algorithms modify the list in place, though, and it's not right to penalize them since the task only needs to run once.Disherison

Here's an implementation using a disjoint-set data structure (specifically a disjoint forest), thanks to comingstorm's hint at merging sets which have even one element in common. I'm using path compression for a slight (~5%) speed improvement; it's not entirely necessary (and it prevents find being tail recursive, which could slow things down). Note that I'm using a dict to represent the disjoint forest; given that the data are ints, an array would also work although it might not be much faster.

def merge(data):
    parents = {}
    def find(i):
        j = parents.get(i, i)
        if j == i:
            return i
        k = find(j)
        if k != j:
            parents[i] = k
        return k
    for l in filter(None, data):
        parents.update(dict.fromkeys(map(find, l), find(l[0])))
    merged = {}
    for k, v in parents.items():
        merged.setdefault(find(v), []).append(k)
    return merged.values()

This approach is comparable to the other best algorithms on Rik's benchmarks.
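
Since the question mentions an eventual Fortran translation, here is a rough array-based, non-recursive variant of the same union-find idea (my own sketch, assuming the elements are small non-negative integers so they can index an array directly):

def merge_array(data, max_val):
    parent = list(range(max_val + 1))

    def find(i):                       # iterative find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for lst in data:
        if not lst:
            continue
        root = find(lst[0])
        for x in lst[1:]:
            r = find(x)
            if r != root:
                parent[r] = root       # union the two trees

    groups = {}
    for lst in data:
        for x in lst:
            groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# e.g. merge_array([[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]], 16)
# -> [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}]  (order of the groups may vary)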

Tenebrae answered 22/8, 2012 at 0:15 Comment(1)
FWIW I seem to find the opposite, that this takes roughly 3-4 times longer. Guess (as noted before) it depends strongly on what sets you're testing it on..Pet

This would be my updated approach:

def merge(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i,res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0,i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set

    return results

Note: During the merging empty lists will be removed.

Update: Reliability.

You need two tests for 100% reliability:

  • Check that all the resulting sets are mutually disjoint:

    merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]
    
    from itertools import combinations
    for a,b in combinations(merged,2):
        if not a.isdisjoint(b):
            raise Exception(a,b)    # just an example
    
  • Check that the merged sets cover the original data (as suggested by katrielalex):
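
    A minimal sketch of this check (my own, assuming merged holds the result and lsts the original input):

    from itertools import chain

    merged_flat = set(chain(*merged))
    original_flat = set(chain(*lsts))
    if merged_flat != original_flat:
        raise Exception("merged result does not cover the original data")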

I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.

Herwin answered 2/2, 2012 at 14:55 Comment(16)
lst = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]] the result [set([1, 3, 5, **9**, 17, 21, 24, 28, 29, 30, 35, 43, 48, 49, 56, 57, 62, 63, 65, 68, 79, 89, 94, 97]), set([6, **9**, 14, 19, 32, 34, 37, 55, 64, 70, 75, 83, 93, 97]), set([43, 71]), set(), set([16])] is incorrect.Rackety
If you try the given lists you will see 9 exists in two sets of outputs. Therefore the code suffers from the problem originally mentioned in the question, not reliability!Rackety
@Developer: I see.. that's because there's a list that have 2 different numbers each one in common with 2 disjoined sets. I'll take a look at it.Herwin
@Rik: I can't reproduce your timings. Even with my fixed version, the difference is only 10% or so (I added the benchmark to my answer). Please add the test code, otherwise this isn't of much use.Anthropoid
@NiklasBaumstark: I fixed my code and posted my timing code, if it has problem let me know. It also seems that sometimes our solutions are different, because yours doesn't add the empty set, but I'm not sure, maybe is my fault.Herwin
@Developer: update answer, it should work now. Let me know if you do other tests.Herwin
@Rik: I remove the empty list by purpose in the very first line of my code. I don't think it's sensible as an outputAnthropoid
@Rik: I don't think your timings are really saying anything... You should use timeit and more real-world sample data. In my benchmark, your solution is about 5 times slower than mine?Anthropoid
@NiklasBaumstark: I could agree with you, just noticing since it there was in the checked output given by the OP, but maybe he's not interested either.Herwin
@Nik: as you wish, I just wanted to know why we had so different results.Herwin
@Developer: Me and Niklas have different benchmark results, that's because we're using different test-data, so which one will be faster I guess it'll depend of what data you're going to use. Please let us now how this will work out for you.Herwin
@RikPoggi to me at the first stage the reliability of the method is important. As I put in my comments in the update sections of the question, the highest speed could be achieved by implementing the intensive loops or array work in Fortran. Your method and Niklas's work well. How can I ensure they will work for all cases?Rackety
@Developer: I've updated my code, now it's a little bit faster but mainly I wanted to remove empty lists during merges (as Nik's solution already does), so if you want you can test mine and Nik's with random inputs and see if they both return the same result. The two solutions seem a bit different to me, so I guess this could count as a reliability check. Otherwise I'm afraid you'll have to manually provide known input and output.Herwin
@Developer: One way to see if the merging was successful could be to check if the resulting sets are mutually disjoint; this will take some time, but you'll be 100% sure. I updated my answer.Herwin
Thanks, that looks like a good approach. If it is fast, it could be included in the merge function, of course with a bit friendlier exception handling. I will try to get some experience using your idea.Rackety
@Rik: not only do you have to test mutual disjointness, you also have to show that they cover the original set (otherwise, you could return an empty list).Hardhearted

Just for fun...

def merge(mylists):
    results, sets = [], [set(lst) for lst in mylists if lst]
    upd, isd, pop = set.update, set.isdisjoint, sets.pop
    while sets:
        if not [upd(sets[0],pop(i)) for i in xrange(len(sets)-1,0,-1) if not isd(sets[0],sets[i])]:
            results.append(pop(0))
    return results

and my rewrite of the best answer

def merge(lsts):
  sets = map(set,lsts)
  results = []
  while sets:
    first, rest = sets[0], sets[1:]
    merged = False
    sets = []
    for s in rest:
      if s and s.isdisjoint(first):
        sets.append(s)
      else:
        first |= s
        merged = True
    if merged: sets.append(first)
    else: results.append(first)
  return results
Seroka answered 21/2, 2012 at 18:52 Comment(3)
Your algorithm is incorrect. You only ever update the first set. Try merge([[1], [2, 3], [3, 4]]).Schober
@Schober algorithm is ok i think but i tried to make the code smaller by using update = sets[0].update and isdisjoint = sets[0].isdisjoint but that does not work well in this case, thank you.Seroka
Yep, that fixes it. The one you added performs terribly though in my tests.Schober

This is slower than the solution offered by Niklas (I got 3.9s on the test.txt instead of 0.5s for his solution), but it yields the same result and might be easier to implement in e.g. Fortran, since it doesn't use sets, only a sort of all the elements followed by a single pass through them.

It returns a list with the ids of the merged lists, so it also keeps track of empty lists; they stay unmerged.

def merge(lsts):
        # this is an index list that stores the joined id for each list
        joined = range(len(lsts))
        # create an ordered list with indices
        indexed_list = sorted((el,index) for index,lst in enumerate(lsts) for el in lst)
        # loop through the ordered list, and if two elements are the same and
        # the lists are not yet joined, alter the list with joined id
        el_0,idx_0 = None,None
        for el,idx in indexed_list:
                if el == el_0 and joined[idx] != joined[idx_0]:
                        old = joined[idx]
                        rep = joined[idx_0]
                        joined = [rep if id == old else id for id in joined]
                el_0, idx_0 = el, idx
        return joined
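
A small usage sketch (mine, not from the answer) showing how the returned ids can be turned back into merged groups:

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
joined = merge(lsts)                  # e.g. [0, 0, 2, 0]; the exact ids may differ
groups = {}
for idx, gid in enumerate(joined):
    groups.setdefault(gid, set()).update(lsts[idx])
print(groups.values())                # the two merged groups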
Coddle answered 7/2, 2012 at 8:58 Comment(1)
Thanks for sharing your idea. The good point is that you use labeling (i.e., ID) which can be used for further developments.Rackety

Here's a function (Python 3.1) to check if the result of a merge function is OK. It checks:

  • Are the result sets disjoint? (number of elements of union == sum of numbers of elements)
  • Are the elements of the result sets the same as of the input lists?
  • Is every input list a subset of a result set?
  • Is every result set minimal, i.e. is it impossible to split it into two smaller sets?
  • It does not check if there are empty result sets - I don't know if you want them or not...


from itertools import chain

def check(lsts, result):
    lsts = [set(s) for s in lsts]
    all_items = set(chain(*lsts))
    all_result_items = set(chain(*result))
    num_result_items = sum(len(s) for s in result)
    if num_result_items != len(all_result_items):
        print("Error: result sets overlap!")
        print(num_result_items, len(all_result_items))
        print(sorted(map(len, result)), sorted(map(len, lsts)))
    if all_items != all_result_items:
        print("Error: result doesn't match input lists!")
    if not all(any(set(s).issubset(t) for t in result) for s in lsts):
        print("Error: not all input lists are contained in a result set!")

    seen = set()
    todo = list(filter(bool, lsts))
    done = False
    while not done:
        deletes = []
        for i, s in enumerate(todo): # intersection with seen, or with unseen result set, is OK
            if not s.isdisjoint(seen) or any(t.isdisjoint(seen) for t in result if not s.isdisjoint(t)):
                seen.update(s)
                deletes.append(i)
        for i in reversed(deletes):
            del todo[i]
        done = not deletes
    if todo:
        print("Error: A result set should be split into two or more parts!")
        print(todo)
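
Usage would be along these lines (my example; merge here stands for whichever implementation you want to validate):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
check(lsts, merge(lsts))   # prints nothing if all checks pass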
Matutinal answered 20/2, 2012 at 3:21 Comment(1)
It would be neat if you could write this in unit test language =)Hardhearted
lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx as nx
g = nx.Graph()

for sub_list in lists:
    for i in range(1,len(sub_list)):
        g.add_edge(sub_list[0],sub_list[i])

print nx.connected_components(g)
#[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

Performance:

5000 lists, 5 classes, average size 74, max size 1000
15.2264976415

Performance of merge1:

print timeit.timeit("merge1(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26998780571

So it is 11x slower than the fastest... but the code is much simpler and more readable!

Haskell answered 21/2, 2012 at 23:40 Comment(0)

Firstly I'm not exactly sure if the benchmarks are fair:

Adding the following code to the start of my function:

from collections import Counter
from itertools import chain

c = Counter(chain(*lists))
print c[1]
# prints 88

This means that out of all the values in all the lists, there are only 88 distinct values. Usually in the real world duplicates are rare, and you would expect a lot more distinct values. (Of course I don't know where your data is from, so I can't make assumptions.)

Because duplicates are more common, the sets are less likely to be disjoint. This means the set.isdisjoint() method will be much faster, because after only a few tests it will find that the sets aren't disjoint.

Having said all that, I do believe the methods presented that use isdisjoint are the fastest anyway; I'm just saying that with different benchmark data they might only be 10x faster than the other methods instead of 20x.

Anyway, I thought I would try a slightly different technique to solve this; however, the merge sorting was too slow. This method is about 20x slower than the two fastest methods on the benchmark:

I thought I would sort everything first:

import heapq
from itertools import chain
def merge6(lists):
    for l in lists:
        l.sort()
    one_list = heapq.merge(*[zip(l,[i]*len(l)) for i,l in enumerate(lists)]) #iterating through one_list takes 25 seconds!!
    previous = one_list.next()

    d = {i:i for i in range(len(lists))}
    for current in one_list:
        if current[0]==previous[0]:
            d[current[1]] = d[previous[1]]
        previous=current

    groups=[[] for i in range(len(lists))]
    for k in d:
        groups[d[k]].append(lists[k]) #add a each list to its group

    return [set(chain(*g)) for g in groups if g] # since each subgroup in each g is sorted, it would be faster to merge these subgroups, removing duplicates along the way (see the sketch after the timings below)


lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
print merge6(lists)
"[set([1, 2, 3, 5, 6]), set([8, 9, 10]), set([11, 12, 13])]""



import timeit
print timeit.timeit("merge1(lsts)", setup=setup, number=10)
print timeit.timeit("merge4(lsts)", setup=setup, number=10)
print timeit.timeit("merge6(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26732238315
5000 lists, 5 classes, average size 74, max size 1000
1.16062907437
5000 lists, 5 classes, average size 74, max size 1000
30.7257182826
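
As a side note, the final comment in merge6 (merging the already-sorted sublists while dropping duplicates instead of building sets) could look roughly like this, a sketch of that last step only, not benchmarked:

import heapq
from itertools import groupby

def merge_sorted_group(g):
    # g is a list of sorted lists that belong to the same group;
    # heapq.merge keeps the combined stream sorted and groupby drops duplicates
    return [key for key, _ in groupby(heapq.merge(*g))]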
Haskell answered 23/2, 2012 at 22:16 Comment(1)
Your point is correct for the data being used. The data could have any form of relationship, i.e., 0 to n joined lists. Statistically, however, between 10% and 30% are of interest.Rackety
S
0

I found @Niklas B.'s answer really helpful... but it took me a while to read through it and understand the logic. This is a re-write of exactly the same code with new variable names and more explanation... to help the other N00bs out there!

def mergeUntilOnlyDisjointSetsRemain(_listsOfLists):

    """Function for merging lists until there are only disjoint sets"""
    
    """Imagine this algorithm as if it were processing train cars filled with
    integers. It takes the first car of the train, separates it from the rest, 
    and then compares the first car to each subsequent car.
    Start by renaming the first car to 'common'
    If the two train cars have a common integer, you merge the two cars into
    common, and continue down the line until you reach the end of the train.
    Once you reach the end of the train, place the common car in results, (which
    is essentially a collection of train cars that have already been compared 
    to all other cars).
    You can exit the loop as soon as you get to the end of the train without
    merging any of the cars. This is controlled by the continueMerge variable.
    This variable is only reset to True after a merge operation.
    """

    # Start by creating a trainCar(i.e. a set) from each list in our listOfLists
    freightTrain = [set(trainCar) for trainCar in _listsOfLists if trainCar]

    # This continueMerge means that we have not yet compared all cars in the train.
    continueMerge = True
    while continueMerge:
        # Reset the while loop trigger.
        continueMerge = False
        # Create a fresh empty list of cars that have already been cross checked
        crossCheckedCars = []
        # While there are still elements in freightTrain
        while freightTrain:
            # Split the freightTrain up, into first car vs all the remaining cars
            commonFirstTrainCar = freightTrain[0]
            remainingCars = freightTrain[1:]
            # The freightTrain is now empty
            freightTrain = []
            # Iterate over all the remaining traincars
            for currentTrainCar in remainingCars:
                # If there are not any common integers with the first car...
                if currentTrainCar.isdisjoint(commonFirstTrainCar):
                    # Add each train car back onto the freightTrain
                    freightTrain.append(currentTrainCar)
                # But if they share a common integer...
                else:
                    # Trigger the reset switch to continueMerging cars
                    continueMerge = True
                    # and join the cars together
                    commonFirstTrainCar |= currentTrainCar
            # Once we have checked commonFirstTrainCar, remove it from the 
            # freightTrain and place it in crossCheckedCars
            crossCheckedCars.append(commonFirstTrainCar)

        # Now we have compared the first car to all subsequent cars
        # (... but we aren't finished because the 5th and 7th cars might have
        # had a common integer with each other... but only 1st and 5th cars
        # shared an integer the first time around... so the 1st and 5th cars
        # were merged, but the 7th car is still alone!)

        # Reset the system by creating a new freightTrain
        freightTrain = crossCheckedCars

    # Post-process the freight train to turn it into lists instead of sets
    listsForReturnTripHome = []
    for processedTraincar in freightTrain:
        listsForReturnTripHome.append(list(processedTraincar))
    return listsForReturnTripHome
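
A quick usage sketch on the question's sample data (my addition, not part of the original answer; the order of elements inside each returned list may vary since they come out of sets):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
print(mergeUntilOnlyDisjointSetsRemain(lsts))
# e.g. [[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]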
Salomon answered 2/12, 2021 at 18:50 Comment(0)
S
0

I needed this function for myself today and I came upon both this SO thread and the duplicate. First off, a huge thank you to all who participated, especially @alexis, whose answer heavily influenced mine, and @Rik Poggi, whose GitHub repository proved extremely useful for testing and benchmarking my version. This is the kind of thread that makes SO invaluable!

A few notes:

  • I reworked the repo to move the 'deepcopy(lsts)' from the benched execution to the setup. Using timeit.repeat, we have the exact same behavior and safeties, but the deepcopy doesn't impact timings. This explains why the timings reported below are so low.
  • I added a profiler script to the repo, so if you add your function, you can profile it easily against any of the lists.
  • My fork of @Rik Poggi's repo is here.
  • As said above, my version is a rework of @alexis answer. We both derive from the union-find algorithm, but the main difference is that my version updates all the nodes' parent (bin_ref) upon merging, not just the main node. This allows me to avoid the recursive tree-traversal in @alexis' answer (locatebin) and I think it makes the code clearer.
  • This is a rather opaque way of doing what we want (unless you're a mathematician, which I'm not), so I tried to make the code as readable as possible. I believe that "readable" -> "maintainable" in this case.
  • My version doesn't make too many assumptions about the input data. My requirement was list[tuple[Path]], so I made a version that doesn't care too much about the data inside. As long as you pass an Iterable[Iterable[T]] where T is hashable, you'll be fine.

About the results:

  • My version seems to perform better with large datasets. It under-performs slightly (2.5x slower than the best, in the worst case scenario) as the datasets become smaller. For mid-size datasets, prefer @Rik Poggi's version. For small-size datasets, prefer @alexis' version.
from itertools import chain
from typing import Iterable, TypeVar

T = TypeVar('T')


def merge_intersections(lists: Iterable[Iterable[T]]) -> list[set[T]]:
    """
    Merges lists based on whether or not they intersect. Returns a list
    of mutually disjoint sets.

    How it works:
    The general idea is to iterate over the sequences and merge them in
    their "intersection-bins". To do so, for each sequence, we look for
    items that were already encountered during previous iteration. If we
    don't find any, this means the current sequence might be disjoint
    from the rest, so we put it in its own bin. If, on the other hand,
    we find previously encountered items, this means one or more bins
    already exist for these items. If more than one bin exists we merge
    them (updating the reference of all contained items). Finally we put
    the current sequence's items in the bin.

    Usage:
    ::
        >>> result = merge_intersections([[1, 2], [2, 4], [7, 8], [3, 2], [4, 5]])
        >>> sorted(sorted(s) for s in result)
        [[1, 2, 3, 4, 5], [7, 8]]
    """
    bins: dict[T, set[T]] = dict()
    bin_refs: dict[T, T] = dict()
    for lst in lists:
        if not lst:
            continue

        # Gather the bin refs of all items in the list that we have
        # already seen.
        encountered_items_bin_refs = {
            bin_refs[item]
            for item in lst
            if item in bin_refs
        }
        if len(encountered_items_bin_refs) >= 1:
            # Some of the items in `lst` have already been seen in a
            # previous iteration. They are therefore already attached
            # to a bin. Select any of their corresponding bin ref.
            bin_ref = encountered_items_bin_refs.pop()
            # If the previously-seen items were not all attached to the
            # same bin, their respective bins need to be merged into
            # the selected one.
            if len(encountered_items_bin_refs) > 0:
                to_merge_bins = [bins.pop(ref) for ref in encountered_items_bin_refs]
                bins[bin_ref].update(chain(*to_merge_bins))
                for item in chain(*to_merge_bins):
                    bin_refs[item] = bin_ref
            bins[bin_ref].update(lst)
        else:
            # None of the items in `lst` have already been seen in a
            # previous iteration. Therefore, we can safely pick any
            # item as our new bin ref and create the corresponding bin.
            bin_ref = next(iter(lst))
            bins[bin_ref] = set(lst)
        for item in lst:
            bin_refs[item] = bin_ref
    return list(bins.values())

Benchmark results:

Timing with: >> Niklas << Benchmark
Info: 5000 lists, average size 305, max size 999
(from file: ./lists/timing_1.txt)
0.200 (1x) -- takeshi
0.328 (1.6x) -- alexis
0.473 (2.4x) -- agf
0.746 (3.7x) -- Rik. Poggi
0.776 (3.9x) -- Niks rewrite
0.792 (4x) -- Niklas B.
1.419 (7.1x) -- agf (optimized)
1.480 (7.4x) -- ChessMaster
1.579 (7.9x) -- katrielalex
1.627 (8.2x) -- steabert
1.938 (9.7x) -- robert king
9.273 (46x) -- Sven Marnach
33.779 (169x) -- hochl

Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 305, max size 999
(from file: ./lists/timing_2.txt)
0.158 (1x) -- takeshi
0.166 (1x) -- agf
0.231 (1.5x) -- Niks rewrite
0.232 (1.5x) -- Rik. Poggi
0.233 (1.5x) -- Niklas B.
0.268 (1.7x) -- alexis
0.408 (2.6x) -- ChessMaster
0.422 (2.7x) -- agf (optimized)
1.408 (8.9x) -- katrielalex
1.483 (9.4x) -- steabert
1.924 (12x) -- robert king
8.622 (55x) -- Sven Marnach
25.952 (164x) -- hochl

Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 98, max size 999
(from file: ./lists/timing_3.txt)
0.059 (1x) -- takeshi
0.068 (1.1x) -- agf
0.094 (1.6x) -- alexis
0.095 (1.6x) -- Rik. Poggi
0.095 (1.6x) -- Niklas B.
0.097 (1.6x) -- Niks rewrite
0.183 (3.1x) -- ChessMaster
0.192 (3.2x) -- agf (optimized)
0.425 (7.2x) -- katrielalex
0.586 (9.9x) -- robert king
0.787 (13x) -- steabert
2.827 (48x) -- Sven Marnach
9.162 (155x) -- hochl

Timing with: >> Sven << Benchmark
Info: 200 lists, average size 10, max size 10
0.000 (1x) -- alexis
0.001 (1.3x) -- ChessMaster
0.001 (1.8x) -- agf (optimized)
0.001 (1.9x) -- takeshi
0.002 (3.7x) -- steabert
0.002 (3.7x) -- robert king
0.002 (4.2x) -- katrielalex
0.002 (4.6x) -- Rik. Poggi
0.002 (4.8x) -- agf
0.002 (5.1x) -- Niklas B.
0.002 (5.5x) -- hochl
0.003 (6.1x) -- Niks rewrite
0.003 (7.7x) -- Sven Marnach

Timing with: >> Agf << Benchmark
Info: 2000 lists, average size 253, max size 500
0.030 (1x) -- Rik. Poggi
0.035 (1.2x) -- ChessMaster
0.036 (1.2x) -- agf
0.036 (1.2x) -- agf (optimized)
0.037 (1.2x) -- Niks rewrite
0.038 (1.3x) -- Niklas B.
0.060 (2x) -- hochl
0.074 (2.5x) -- takeshi
0.118 (3.9x) -- alexis
0.520 (17x) -- katrielalex
0.528 (18x) -- steabert
0.694 (23x) -- robert king
34.746 (1158x) -- Sven Marnach

Test results:

### Going to test: takeshi ###

test_coverage (__main__.MergeTestCase)
Check coverage original data ... ok
test_disjoint (__main__.MergeTestCase)
Check disjoint-ness of merged results ... ok
test_subset (__main__.MergeTestCase)
Check that every original data is a subset ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.008s

OK
Sass answered 5/6, 2023 at 3:52 Comment(0)
P
-1

My solution works well on small lists and is quite readable, with no dependencies.

def merge_list(starting_list):
    final_list = []
    for i,v in enumerate(starting_list[:-1]):
        if set(v)&set(starting_list[i+1]):
            starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
        else:
            final_list.append(v)
    final_list.append(starting_list[-1])
    return final_list

Benchmarking it:

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

%timeit merge_list(lists)

100000 loops, best of 3: 4.9 µs per loop

Porett answered 23/2, 2015 at 22:11 Comment(0)
F
-1

This can be solved in O(n) by using the union-find algorithm. Given the first two rows of your data, edges to use in the union-find are the following pairs: (0,1),(1,3),(1,0),(0,3),(3,4),(4,5),(5,10)
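
A minimal union-find sketch of that idea (my addition; the function name is mine, elements are assumed hashable, and this uses path compression but no union by rank):

def merge_via_union_find(lists):
    parent = {}

    def find(x):
        # walk up to the root, compressing the path along the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # link every element of a list to that list's first element
    for lst in lists:
        for item in lst[1:]:
            union(lst[0], item)

    # group all elements by their root
    groups = {}
    for lst in lists:
        for item in lst:
            groups.setdefault(find(item), set()).add(item)
    return list(groups.values())

For the question's sample data this returns something like [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}].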

Favoritism answered 2/9, 2016 at 18:33 Comment(0)
M
-1

Use a flag to ensure you get the final mutually exclusive results:

def merge(lists):
    while True:
        flag = 0
        for i in range(len(lists)):
            for j in range(i+1, len(lists)):
                # merge as soon as two lists share at least one element
                if set(lists[i]) & set(lists[j]):
                    lists[i] = list(set(lists[i]) | set(lists[j]))
                    del lists[j]
                    flag += 1
                    break
        if flag == 0:
            break
    return lists
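
A quick check on the question's sample data (my addition; element order inside the first merged list may vary since it comes from a set):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
print(merge(lsts))
# e.g. [[0, 1, 3, 4, 5, 10, 11, 16], [2, 8]]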
Mining answered 21/2, 2020 at 23:3 Comment(0)
D
-1
from itertools import combinations


def merge(elements_list):
    d = {index: set(elements) for index, elements in enumerate(elements_list)}

    while any(not set.isdisjoint(d[i], d[j]) for i, j in combinations(d.keys(), 2)):
        merged = set()
        for i, j in combinations(d.keys(), 2):
            if not set.isdisjoint(d[i], d[j]):
                d[i] = set.union(d[i], d[j])
                merged.add(j)

        for k in merged:
            d.pop(k)    

    return [v for v in d.values() if v]

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]

print(merge(lst))
Durman answered 30/5, 2020 at 8:36 Comment(0)
