Percentage Overlap of Two Lists
P

4

18

This is more of a math problem than anything else. Let's assume I have two lists of different sizes in Python:

listA = ["Alice", "Bob", "Joe"]
listB = ["Joe", "Bob", "Alice", "Ken"]

I want to find out what percentage overlap these two lists have. Order is not important within the lists. Finding the overlap is easy; I've seen other posts on how to do that, but I can't quite extend it in my mind to finding what percentage they overlap. If I compared the lists in a different order, would the result come out differently? What would be the best way of doing this?

Pedi answered 28/4, 2015 at 20:15 Comment(4)
The order doesn't matter much here. However, you first need to define the formula for the percentage. It could be something like 2 * number of matches / (len(listA) + len(listB)), or something else. - Horrible
What if the lists are [1,1,1] and [1]? Would the overlap be 100% or 33%? - Pun
What is the expected output for these two lists? - Ruffo
Do you care about the percentage of distinct elements that the two lists have in common (in which case set() will be very helpful), or about the percentage of ALL elements, including repeats, that the two lists share? - Mealtime
T
16

In principle, I'd say there are three sensible questions you might be asking:

  1. What percentage is the overlap compared to the first list? I.e. how big is the common part in comparison to the first list?
  2. The same thing for the second list.
  3. What percentage is the overlap compared to the "universe" (i.e. the union of both lists)?

Other definitions are certainly possible, and there are many of them. In the end, you need to know which problem you're trying to solve.

From a programming point of view, the solution is easy:

listA = ["Alice", "Bob", "Joe"]
listB = ["Joe", "Bob", "Alice", "Ken"]

setA = set(listA)
setB = set(listB)

overlap = setA & setB
universe = setA | setB

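result1 = float(len(overlap)) / len(setA) * 100      # share of listA's distinct items: 100.0
result2 = float(len(overlap)) / len(setB) * 100      # share of listB's distinct items: 75.0
result3 = float(len(overlap)) / len(universe) * 100  # share of the union: 75.0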
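
The three ratios above can be wrapped in a small helper. This is only a sketch: the function name `overlap_percentages` and the choice to return 0.0 when a denominator is empty are assumptions for illustration, not part of the answer itself.

```python
def overlap_percentages(list_a, list_b):
    """Return (% of A's distinct items found in B,
               % of B's distinct items found in A,
               % of the union that is shared)."""
    set_a, set_b = set(list_a), set(list_b)
    common = len(set_a & set_b)

    def pct(part, whole):
        # Assumption: an empty denominator counts as 0% instead of
        # raising ZeroDivisionError.
        return 100.0 * part / whole if whole else 0.0

    return (pct(common, len(set_a)),
            pct(common, len(set_b)),
            pct(common, len(set_a | set_b)))


print(overlap_percentages(["Alice", "Bob", "Joe"],
                          ["Joe", "Bob", "Alice", "Ken"]))
# (100.0, 75.0, 75.0)
```

Because everything goes through set(), repeated elements are ignored: [1, 1, 1] and [1] would come out as a 100% overlap on all three measures.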
Terriss answered 28/4, 2015 at 20:45 Comment(0)
R
9
>>> len(set(listA)&set(listB)) / float(len(set(listA) | set(listB))) * 100
75.0

I would calculate the common items out of the total distinct items.

len(set(listA)&set(listB)) returns the number of common items (3 in your example).

len(set(listA) | set(listB)) returns the total number of distinct items (4).

Multiply by 100 and you get the percentage.
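
This intersection-over-union ratio is commonly called the Jaccard index. It depends neither on the order of the elements nor on which list comes first, which speaks to the question about comparing the lists in a different order. A quick check, with `jaccard_percent` as a hypothetical helper name and an assumed result of 0.0 for two empty lists:

```python
import random


def jaccard_percent(list_a, list_b):
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    # Assumption: two empty lists are treated as 0% overlap.
    return 100.0 * len(set_a & set_b) / len(union) if union else 0.0


listA = ["Alice", "Bob", "Joe"]
listB = ["Joe", "Bob", "Alice", "Ken"]

# Shuffle a copy of listB to show that element order is irrelevant.
shuffled = listB[:]
random.shuffle(shuffled)

print(jaccard_percent(listA, listB))     # 75.0
print(jaccard_percent(listB, listA))     # 75.0 (arguments swapped)
print(jaccard_percent(listA, shuffled))  # 75.0 (elements reordered)
```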

Ruffo answered 28/4, 2015 at 20:21 Comment(1)
Note that this answer and @JuniorCompressor's answer are different. Both are correct; which one to use depends on the specific requirement. - Ruffo
K
8

The maximum difference occurs when the two lists have completely different elements. In that case there are at most n + m distinct elements, where n is the size of the first list and m is the size of the second list. One measure can be:

2 * c / (n + m)

where c is the number of common elements. As a percentage, this can be calculated like this:

200.0 * len(set(listA) & set(listB)) / (len(listA) + len(listB))
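
The expression above counts c over distinct elements but n and m over all elements, so lists containing repeats can score below 100% even when they are identical. One possible variant, sketched here with collections.Counter, counts the common elements with multiplicity. The name `dice_percent` and the 0.0 result for two empty lists are assumptions for illustration.

```python
from collections import Counter


def dice_percent(list_a, list_b):
    total = len(list_a) + len(list_b)
    if total == 0:
        # Assumption: two empty lists are treated as 0% overlap.
        return 0.0
    # Counter & Counter keeps each element with the minimum of its two
    # counts, so repeated elements are matched one-to-one.
    common = sum((Counter(list_a) & Counter(list_b)).values())
    return 200.0 * common / total


print(dice_percent(["Alice", "Alice"], ["Alice", "Alice"]))  # 100.0
print(dice_percent(["Alice", "Bob", "Joe"],
                   ["Joe", "Bob", "Alice", "Ken"]))          # 2 * 3 / 7, as a percentage
```

When neither list contains repeats, this gives the same value as the set-based expression.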
Kinslow answered 28/4, 2015 at 20:21 Comment(1)
This fails for the following example: listA = ["Alice", "Alice"], listB = ["Alice", "Alice"]. - Tadeas
B
1
def computeOverlap(L1, L2):
    d1, d2 = {}, {}
    for e in L1:
        if e not in d1:
            d1[e] = 0
        d1[e] += 1

    for e in L2:
        if e not in d2:
            d2[e] = 0
        d2[e] += 1

    o1, o2 = 0, 0
    for k in d1:
        o1 += min(d1[k], d2.get(k,0))
    for k in d2:
        o2 += min(d1.get(k,0), d2[k])

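    # Express the shared counts as a percentage of each list's length.
    p1 = 100.0 * o1 / len(L1) if L1 else 0
    p2 = 100.0 * o2 / len(L2) if L2 else 0
    print("%s%% of the first list overlaps with the second list" % p1)
    print("%s%% of the second list overlaps with the first list" % p2)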

Of course, you could just do this with a defaultdict and a Counter, to make things a little easier:

from collections import defaultdict, Counter

def computeOverlap(L1, L2):
    d1 = defaultdict(int, Counter(L1))
    d2 = defaultdict(int, Counter(L2))

    o1, o2 = 0, 0
    for k in d1:
        o1 += min(d1[k], d2[k])
    for k in d2:
        o2 += min(d1[k], d2[k])

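    # Express the shared counts as a percentage of each list's length.
    p1 = 100.0 * o1 / len(L1) if L1 else 0
    p2 = 100.0 * o2 / len(L2) if L2 else 0
    print("%s%% of the first list overlaps with the second list" % p1)
    print("%s%% of the second list overlaps with the first list" % p2)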
Bedraggle answered 28/4, 2015 at 20:51 Comment(0)
