How do I make my implementation of greedy set cover faster?

After much discussion on my original question here, I encoded my problem as a Greedy Set Cover, and after receiving some more help here I came up with the implementation below. I am thankful to everyone for helping me out with this. The implementation works fine, but I want to make it scalable/faster.

By scalable/faster, I mean to say that:

  • My dataset contains about 50K-100K sets in S
  • The number of elements in U itself is very small, on the order of 100-500
  • The size of each set in S could be anywhere from 0 to 40

And here goes my attempt:

U = set([1,2,3,4])
R = U
S = [set([1,2]), 
     set([1]), 
     set([1,2,3]), 
     set([1]), 
     set([3,4]), 
     set([4]), 
     set([1,2]), 
     set([3,4]), 
     set([1,2,3,4])]
w = [1, 1, 2, 2, 2, 3, 3, 4, 4]

C = []
costs = []

def findMin(S, R):
    minCost = float('inf')
    minElement = -1
    for i, s in enumerate(S):
        try:
            # Cost effectiveness: weight per newly covered element.
            # float() avoids integer division under Python 2.
            cost = float(w[i]) / len(s.intersection(R))
            if cost < minCost:
                minCost = cost
                minElement = i
        except ZeroDivisionError:
            # s covers no remaining elements; skip it
            pass
    return S[minElement], w[minElement]

while len(R) != 0:
    S_i, cost = findMin(S, R)
    C.append(S_i)
    R = R.difference(S_i)
    costs.append(cost)

print "Cover: ", C
print "Total Cost: ", sum(costs), costs

I am not an expert in Python but any Python-specific optimizations to this code would be really nice.

Mesocarp answered 29/10, 2011 at 23:22 Comment(0)

What sorts of times are you getting vs what you need? Surely most of the execution time is spent in C-level code finding set intersections, so there's not much optimization you can do? With some random data (results may vary with your data of course, not sure if these are good values) of 100000 sets, 40 elements in each set, 500 unique elements, and weights random from 1 to 100:

import random

print 'generating test data'
num_sets = 100000
set_size = 40
elements = range(500)
U = set(elements)
R = U
S = []
for i in range(num_sets):
    random.shuffle(elements)
    S.append(set(elements[:set_size]))
w = [random.randint(1, 100) for i in xrange(num_sets)]

C = []
costs = []

I got performance like this with cProfile:

         8200209 function calls in 14.391 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.391   14.391 <string>:1(<module>)
       41    4.802    0.117   14.389    0.351 test.py:23(findMin)
        1    0.001    0.001   14.391   14.391 test.py:40(func)
  4100042    0.428    0.000    0.428    0.000 {len}
       82    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       41    0.001    0.000    0.001    0.000 {method 'difference' of 'set' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4100000    9.160    0.000    9.160    0.000 {method 'intersection' of 'set' objects}

Hm, so apparently about 1/3 of the time isn't spent in set intersections, but I personally wouldn't optimize any further, especially at the cost of clarity. There's not going to be much you can do about the other 2/3 (the C-level intersection calls), so why bother?
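
For reference, a profile like the one above can be produced with cProfile roughly as follows; wrapping the greedy loop in a function named func is my assumption, suggested by the test.py:40(func) line in the output:

import cProfile

def func():
    # The greedy loop from the question, wrapped so it can be profiled
    global R
    while len(R) != 0:
        S_i, cost = findMin(S, R)
        C.append(S_i)
        R = R.difference(S_i)
        costs.append(cost)

cProfile.run('func()')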

Elbe answered 30/10, 2011 at 0:2 Comment(1)
+1 Thank you. You are right. I was looking for unnecessary optimizations. In my case, it was running in about 15 seconds which is good for me. Thank you once again. – Mesocarp

I used a trick when I implemented the famous greedy algorithm for set cover (no weights) in Matlab. It is possible that you could extend this trick to the weighted case somehow, using the ratio of set cardinality to set weight instead of plain cardinality. Moreover, if you use the NumPy library, porting Matlab code to Python should be very easy.

Here is the trick:

  1. (optional) I sorted the sets in descending order with respect to cardinality (i.e. the number of elements they contain), and I also stored their cardinalities.
  2. I select a set S (in my implementation, the largest one, i.e. the first set in the list) and count how many uncovered elements it contains. Let's say it contains n uncovered elements.
  3. Since I now know there is a set S with n uncovered elements, I don't need to process any set with cardinality lower than n, because it cannot be better than S. So I only need to search for the optimal set among the sets with cardinality at least n; thanks to the sorting, we can focus on those easily. A Python sketch of this pruning loop follows the list.
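
For the unweighted case this answer describes, a minimal Python sketch of the pruning idea might look like the following (the function and variable names are mine, not a port of the original Matlab code; S is a list of sets and U the universe, as in the question):

def greedy_cover_pruned(U, S):
    # Sort set indices by cardinality, largest first.
    order = sorted(range(len(S)), key=lambda i: len(S[i]), reverse=True)
    uncovered = set(U)
    cover = []
    while uncovered:
        best_i, best_gain = None, 0
        for i in order:
            # Pruning: sets are visited in decreasing cardinality, and
            # |S[i] & uncovered| <= |S[i]|, so once the cardinality drops
            # to best_gain or below, no remaining set can do better.
            if len(S[i]) <= best_gain:
                break
            gain = len(S[i] & uncovered)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_gain == 0:
            break  # the remaining elements cannot be covered
        cover.append(best_i)
        uncovered -= S[best_i]
    return cover
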
Barouche answered 26/11, 2011 at 19:54 Comment(0)

Years later: if you're willing to use NumPy and convert your list of sets to a sparse CSC matrix, there are savings to be gained.

Assume you've converted your list of sets into row indices and collected those sets column-wise into a SciPy CSC array subsets. The size of the universe U is then just the number of rows, and the number of sets is the number of columns.
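
For concreteness, one way this conversion might look (a sketch assuming U, S and w as in the question; not necessarily the exact code behind the timings below):

import numpy as np
from scipy.sparse import csc_array

# Map each element of the universe to a row index 0..n-1.
elem_index = {e: i for i, e in enumerate(sorted(U))}
rows, cols = [], []
for j, s in enumerate(S):
    for e in s:
        rows.append(elem_index[e])
        cols.append(j)
# Entry (i, j) is 1 iff element i belongs to set S[j].
data = np.ones(len(rows))
subsets = csc_array((data, (rows, cols)), shape=(len(U), len(S)))
weights = np.asarray(w, dtype=float)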

To find out how many uncovered elements each set still contains, it suffices to multiply subsets from the left by an indicator vector of the elements not yet covered. The resulting vector set_counts gives the number of elements of each set not yet included in the cover; by dividing the (positive) weights by this vector and taking np.argmin, we obtain the greedy pick.

Below is simplified code (taken from here) that does just this:

import numpy as np
from scipy.sparse import csc_array

def greedy_set_cover(subsets: csc_array, weights: np.ndarray):
    n, J = subsets.shape

    ## Extracts the row indices (elements) of the set stored in column j
    slice_col = lambda j: subsets.indices[slice(*subsets.indptr[[j, j+1]])]

    ## Vectorized version of the set-based code, using matrix multiplication;
    ## counts are floored at 1/n to avoid division by zero for empty sets
    set_counts = np.maximum(subsets.sum(axis=0), 1 / n)
    point_cover, soln = np.zeros(n, dtype=bool), np.zeros(J, dtype=bool)
    while not np.all(point_cover):
        opt_s = np.argmin(weights / set_counts)
        point_cover[slice_col(opt_s)] = True
        set_counts = np.maximum((1 - point_cover) @ subsets, 1 / n)
        soln[opt_s] = True
    return np.flatnonzero(soln), weights[soln].sum()

On my machine, using timeit, the code you have takes on average ~5 seconds with Thomas's test case.

timeit.timeit(lambda: greedy_set_cover_simple(U, S, w), number=15)/15
# 4.851530780732476 seconds 

In contrast, the numpy solution takes just under 0.4 seconds on average:

timeit.timeit(lambda: greedy_set_cover(subsets, w), number=15)/15
# 0.376905655732844 seconds 

There is overhead in converting the sets S into a sparse array (about 0.6 seconds on my machine).

Marras answered 25/7, 2024 at 16:25 Comment(0)
