Return a 'similarity score' based on two dictionaries' similarity in Python?
I know it's possible to return how similar two strings are by using the following function:

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

In [37]: similar("Hey, this is a test!","Hey, man, this is a test, man.")
Out[37]: 0.76
In [38]: similar("This should be one.","This should be one.")
Out[38]: 1.0

But is it possible to score two dictionaries based on the similarity of their keys and corresponding values? Not the number of keys they have in common, or which items match, but a score from 0 to 1, like the string example above.

I'm trying to find the similarity score between ratings['Shane'] and ratings['Joe'] in this dictionary:

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

I am using Python 2.7.10

Guessrope answered 14/3, 2016 at 6:32 Comment(11)
And what's the expected output? The number of common keys (like the keys are the same except Taken 3)? Or the actual values? What about multi-level dictionaries?Ruffi
Check out the Jaccard index: en.m.wikipedia.org/wiki/Jaccard_index. Pretty easy to implement from a suitable set.Bartolemo
Result will depend on your metric.Cymar
@Ruffi I'm hoping to find something that takes everything into account.Guessrope
Are you wanting some correlation coefficient?Mccartney
Hi Trivision, did you try my answer?Saltzman
Yes, pretty much. And I have all the data in one dictionary, as different users would mean adding more dictionary variables, so I just tried to cram all the data in multiple nested dictionaries, lolGuessrope
@Backtrack About to plug in the variables now!Guessrope
Are the values really floats? Or should they be integers?Mccartney
@PeterWood Floats are not necessary to this, no.Guessrope
@TrivisionZero, Updated a new answer check : "Update2: Normalized length and considered keys"Saltzman
import math

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

def cosine_similarity(vec1, vec2):
    # Note: this pairs values by position, ignoring keys, so it assumes
    # both dicts list their ratings in the same order.
    sum11, sum12, sum22 = 0, 0, 0
    for i in range(len(vec1)):
        x, y = vec1[i], vec2[i]
        sum11 += x * x
        sum22 += y * y
        sum12 += x * y
    return sum12 / math.sqrt(sum11 * sum22)

list1 = list(ratings['Shane'].values())
list2 = list(ratings['Joe'].values())

sim = cosine_similarity(list1, list2)
print(sim)

Output: 0.9205746178983233

Update 1: When I use:

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
         'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

Output: 0.9574271077563381

Update2: Normalized length and considered keys

from math import sqrt

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

def square_rooted(x):
    return round(sqrt(sum(a * a for a in x)), 3)

def cosine_similarity(x, y):
    # Iterate over the larger dict's keys so every one of its keys is considered.
    if len(x) > len(y):
        input1, input2 = x, y
    else:
        input1, input2 = y, x

    vector1 = list(input1.values())
    vector2 = []

    for k in input1.keys():    # normalizing the input vectors
        if k in input2:
            vector2.append(float(input2[k]))  # value for a common key from input2
        else:
            vector2.append(0.0)               # missing key counts as 0

    numerator = sum(a * b for a, b in zip(vector2, vector1))
    denominator = square_rooted(vector1) * square_rooted(vector2)
    return round(numerator / float(denominator), 3)


print("Similarity between Shane and Joe")
print(cosine_similarity(ratings['Shane'], ratings['Joe']))

print("Similarity between Joe and Bob")
print(cosine_similarity(ratings['Joe'], ratings['Bob']))

print("Similarity between Shane and Bob")
print(cosine_similarity(ratings['Shane'], ratings['Bob']))

Output:

Similarity between Shane and Joe
0.887
Similarity between Joe and Bob
0.346
Similarity between Shane and Bob
0.615

Nice explanation of the difference between Jaccard and cosine: https://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

I am using Python 3.4.

NOTE: I have assigned 0 to missing values, but you can assign more suitable values too. Refer: http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/
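One possible "more suitable value" is sketched below (my own illustration, not from the answer; the mean-imputation choice and the `mean_filled_cosine` name are assumptions): fill each user's missing ratings with that user's own mean rating before taking the cosine, so an unwatched film is treated as an average opinion rather than a zero.

```python
from math import sqrt

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

def mean_filled_cosine(d1, d2):
    """Cosine similarity over the union of keys, imputing each
    user's own mean rating for films they have not rated."""
    keys = set(d1) | set(d2)
    mean1 = sum(d1.values()) / len(d1)
    mean2 = sum(d2.values()) / len(d2)
    v1 = [d1.get(k, mean1) for k in keys]
    v2 = [d2.get(k, mean2) for k in keys]
    num = sum(a * b for a, b in zip(v1, v2))
    den = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return num / den

print(mean_filled_cosine(ratings['Shane'], ratings['Joe']))  # ≈ 0.97
```

With this data the score lands a little above the zero-filled version, because Shane's imputed rating for Taken 3 is no longer penalized as a hard disagreement.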

Saltzman answered 14/3, 2016 at 6:43 Comment(28)
You're not looking at values are you? Just keys. Or did I miss something?Bartolemo
@JLPeyret, Ah, Cool Change it to values. Simple ?Saltzman
"can't multiply sequence by non-int of type 'str'" is what I keep getting on the "sum11 += x*x" line.Guessrope
When changing the dictionary to this: ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}, it says that they are exactly alike with the output being "1.0" when they're not. I simply changed the 127 Hours key from 3.0 to 5.0Guessrope
Did you get 1.0? No way, I am getting 0.9574271077563381 for the above o/p in the commentSaltzman
@Backtrack Okay, so I figured out the issue, I think... I'm using Python 2.7.10. Do I need to find a new IDE, or is there some part of the code I could change to work with my version? It's weird though, I got the same value before I changed that one variable, it's just after I did it got really weird.Guessrope
Here's another error I run into in 3.4 when comparing the new user, Bob with another user: "Traceback (most recent call last): File "python", line 17, in <module> File "python", line 8, in cosine_similarity IndexError: list index out of range" when adding another person with only two ratings: "ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}, 'Bob': {'Panic Room':5.0,'Nonstop':5.0}}"Guessrope
Hi @TrivisionZero, Check my "Update1" I have done with all 3 inputs. Verify itSaltzman
@TrivisionZero Would you want {'Avatar': 5.0} to be closer to {'127 Hours': 5.0} than {'Avatar': 1.0}? Cosine similarity implemented with just values doesn't account for that. It would give the same similarity.Mccartney
@PeterWood I think the same similarity for that is fine because we don't know the similarity of the actual movies.Guessrope
@TrivisionZero But that doesn't make sense. If I rate a film 5 and you rate a different film 5, there should be no similarity (unless you're comparing how enthusiastic we are).Mccartney
Hi @PeterWood, Check my update2. Now i have normalized the length and considered keys.Saltzman
@PeterWood, If you are satisfied consider +1 and add your commentsSaltzman
Your normalization implies that if person didn't watch a film, it rates it zero, that is obviously wrong.Cymar
Hi @Lol4t0, say I have 2 docs, doc1 = "Cat and rat" and doc2 = "dog and cat". In this scenario, what will be your vector, tf? Won't you assign zero (0) to the missing word in doc1 and doc2? Correct me if I am wrong, but I will not accept it if you just mention that it is wrongSaltzman
Films != Texts. Nature of data is extremely important in ml cases like thisCymar
Okay lets say, Lol4t0:{'A': 2.0, 'B' : '3.0'} backtrack:{'C':1.0, A:'3.7'}. In this scenario, How do we find similarity between us ? meaning the vectors ?Saltzman
I don't know right now, that's why I didn't write an answer. But if you haven't watched some film, it doesn't mean you wouldn't like it, right?Cymar
Refer this functionspace.com/articles/63/… and ". Note that we consider missing values as 0, the implications of which will be discussed later on."Saltzman
As far as i know, I have been using 0 for missing values. Indeed it has some effect on it. But it is not completely wrong.Saltzman
Later, they do say that zero is poor decision and propose better values for missing values, read the whole article. Good reference though!Cymar
Yes, as I said, read my whole comment (the previous one). It has some impact, but 0 is a simpler way to handle missing values, I would say. And they did not mention it is completely wrongSaltzman
The similarity of Shane and Bob seems much more realistic, I remember setting the ratings opposite for bob and it gave me a number near .98 on Python 3. I'm away from a computer, so I will test it laterGuessrope
Now I'm getting "numerator = sum(a*b for a,b in zip(vector2,vector1)) 190 denominator = square_rooted(vector1)*square_rooted(vector2) --> 191 return round(numerator/float(denominator),3) 192 193 ZeroDivisionError: float division by zero " randomlyGuessrope
How would I fix that?Guessrope
@TrivisionZero, What was the input.Saltzman
@TrivisionZero, One thing i am suspecting is that when you give input that are disjoint set. in that case the denominator will become zeroSaltzman
No, they had common movies. It only seems to happen with a very large set of data, such as 600 entries. I think I came up with a solution though, that I will post if it ends up workingGuessrope
https://en.m.wikipedia.org/wiki/Jaccard_index

And now some cleaned-up sample code.

def jac(s1,s2):
    """the jaccard index between 2 sets"""
    s_union = s1.union(s2)
    s_inter = s1.intersection(s2)

    len_union = len(s_union)
    if not len_union:
        return 0

    return len(s_inter)*1.0/len_union

from itertools import permutations

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
     'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
     'Bob': {'Panic Room':5.0,'Nonstop':5.0}}

def common_movie(dict0, dict1):
    """have we rated the same movies?"""
    set0 = set(dict0.items())
    set1 = set(dict1.items())
    return jac(set0, set1)

def movies_and_ratings(dict0, dict1):
    """how do our movies and ratings line up?"""

    set_keys0 = set(dict0.keys())
    set_keys1 = set(dict1.keys())

    key_commonality = jac(set_keys0, set_keys1)

    set0 = set(dict0.items())
    set1 = set(dict1.items())

    item_commonality = jac(set0, set1)

    # OK, so now we give some proximity credit for key matches, even if key + value don't match
    return 0.3 * key_commonality + 0.7 * item_commonality

def common_movie_ratings(dict0, dict1):
    """how do our ratings correspond on the same movies?"""

    set_keys0 = set(dict0.keys())
    set_keys1 = set(dict1.keys())

    set_common = set_keys0.intersection(set_keys1)

    set0 = set([v for k, v in dict0.items() if k in set_common])
    set1 = set([v for k, v in dict1.items() if k in set_common])

    return jac(set0, set1)

for pair in permutations(ratings.keys(), 2):

    dict0, dict1 = ratings[pair[0]], ratings[pair[1]]
    print "\n %s vs %s" % (pair)

    # make no assumption on key/value order coming out of a
    # dictionary, so sort the items before printing
    li = sorted(dict0.items())
    print "  %s" % (li)
    li = sorted(dict1.items())
    print "  %s" % (li)

    print "     common_movie    :%s" % common_movie(dict0, dict1)
    print "     movies_and_ratings:%s" % movies_and_ratings(dict0, dict1)
    print "     common_movie_ratings  :%s" % common_movie_ratings(dict0, dict1)

The output:

 Shane vs Bob
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
     common_movie    :0.25
     movies_and_ratings:0.25
     common_movie_ratings  :1.0

 Shane vs Joe
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
     common_movie    :0.166666666667
     movies_and_ratings:0.341666666667
     common_movie_ratings  :0.333333333333

 Bob vs Shane
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
     common_movie    :0.25
     movies_and_ratings:0.25
     common_movie_ratings  :1.0

 Bob vs Joe
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
     common_movie    :0.0
     movies_and_ratings:0.06
     common_movie_ratings  :0.0

 Joe vs Shane
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
     common_movie    :0.166666666667
     movies_and_ratings:0.341666666667
     common_movie_ratings  :0.333333333333

 Joe vs Bob
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
     common_movie    :0.0
     movies_and_ratings:0.06
     common_movie_ratings  :0.0
Bartolemo answered 14/3, 2016 at 6:40 Comment(15)
This did not seem to work in my situation... "AttributeError: 'set' object has no attribute 'intersect'"Guessrope
Not on a comp so you may need to check syntax. I did correct intersect. But I've used Jaccard before and they are pretty good at 0..1 ratings.Bartolemo
Now it's giving a 'unsupported operand type(s) for /: 'set' and 'set'' error on the division part.Guessrope
Arggg. Sorry about that. You need to look at the sizes. I added len to both. My bad.Bartolemo
@JLPeyret integer division... need to qualify Python versionMccartney
It now gives an output, but it's just 0.0Guessrope
Tell you what. If you aint got an answer accepted tomorrow I'll write it up in unitest form. Use my jac function though, not the orig calc I had.Bartolemo
Uhm, still zero... It could be that our Python versions are different? I'm on 2.7, what are you on? Edit: On an online Python3 IDE, it still says 0... this is weirdGuessrope
2.7. However, I think I see the problem. Not one of your keys and values is a match. Same issue with Backtrack's stuff. You need to look at keys, values, and (keys, values) as a whole. Right now, I am just on the 3rd bit.Bartolemo
@TrivisionZero Would you want {'Avatar': 5.0} to be closer to {'Avatar': 3.0} than {'Avatar': 1.0}? The Jaccard index implemented with key value pairs doesn't account for that. It would give the same similarity.Mccartney
Updated the example, with some extra comparisons. One thing to keep in mind about Jaccard and dictionaries (rather than the more specific example of movie ratings) is that the exact contents of the dictionaries do not matter. Which is also a shortcoming, as per @PeterWood remark of 5.0 being closer to 3.0 than 1.0. The order of the keys() and values() coming out of a dictionary also does not matter, since this is set-based. But it might in a list-based approach.Bartolemo
Is there any way to make it look at the values, instead of just the keys, as someone who rates Avatar 5 stars and another who rates it 1 star gets a similar rating of 1.0Guessrope
Not that I know. It's not a question of values vs keys. It's a question that if you look at the original definition, it checks for equality in values or keys and has no notion of "1 is closer to 2 than it is to 5". It's a really clever similarity definition since you can feed it almost anything that you can put in a set. But it's a better fit for your question title than for what you actually want. If you go to the Wikipedia entry for Jaccard, you can see references to other similarity metrics. "Programming Collective Intelligence" is also a book covering this type of subject in Python.Bartolemo
What you might want to do is to write up some test data (with edge cases), decide for yourself which pairs should rate highly and then use unittests to run different algos through them and check. I would reason in terms of relative position rather than exact numbers. Ex: assertTrue(("Jane", "Joe") > ("Jane", "Bob")). It all depends on how important this classification is to your goals.Bartolemo
I think I actually came up with a perfect solution for me, if it works, I'll post it as another answer option.Guessrope
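The relative-ordering test idea from the comments can be sketched roughly like this (assumptions: `cosine_over_common_keys` is a hypothetical stand-in similarity that zero-fills missing films, as the answers above do, and the expected orderings are ones you would choose for your own data):

```python
from math import sqrt

def cosine_over_common_keys(d1, d2):
    # Hypothetical stand-in similarity: cosine over the union of keys,
    # treating unrated films as 0.
    keys = set(d1) | set(d2)
    v1 = [d1.get(k, 0.0) for k in keys]
    v2 = [d2.get(k, 0.0) for k in keys]
    num = sum(a * b for a, b in zip(v1, v2))
    den = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return num / den if den else 0.0

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

# Assert only relative positions, not exact numbers: Shane should be
# more similar to Joe (3 shared films) than to Bob (1 shared film).
sim = cosine_over_common_keys
assert sim(ratings['Shane'], ratings['Joe']) > sim(ratings['Shane'], ratings['Bob'])
assert sim(ratings['Joe'], ratings['Bob']) < sim(ratings['Shane'], ratings['Joe'])
print("orderings hold")
```

Swapping in a different metric then only requires rebinding `sim` and rerunning the assertions.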
This is my implementation of the Jaccard similarity from the datascience.stackexchange post mentioned above.

Suppose you have a Counter output from the collections library that counts the number of times a certain key is present in an iterable, as such:

d1 = {'a': 2, 'b': 1}
d2 = {'a': 1, 'c': 1}

def get_jaccard_similarity(d1, d2):
    if not isinstance(d1, dict) or not isinstance(d2, dict):
        raise TypeError(f'd1 and d2 should be of type dict'
                        f' and not {type(d1).__name__}, {type(d2).__name__}')
    if not d1 and not d2:
        return 1
    elif (d1 and not d2) or (d2 and not d1):
        return 0
    else:
        set_of_all_keys = {*d1.keys(), *d2.keys()}
        nb_of_common_elements_dict = {k: min(d1.get(k, 0), d2.get(k, 0))
                                      for k in set_of_all_keys}
        nb_of_total_elements_dict = {k: max(d1.get(k, 0), d2.get(k, 0))
                                     for k in set_of_all_keys}
        return sum(nb_of_common_elements_dict.values()) / sum(nb_of_total_elements_dict.values())

print(get_jaccard_similarity(d1, d2))

Output: 0.25

The datascience.stackexchange post derives the Jaccard similarity from the notion of sets. I believe this implementation gives the same result as the set-based one (a dictionary with all values equal to 1), with the advantage that it weights by the number of times a key appears in both (counter) dictionaries.
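That reduction to the set case can be checked directly. A small sketch (assumptions: `weighted_jaccard` mirrors the min/max branch of the answer above, and `set_jaccard` is the plain set-based index):

```python
def weighted_jaccard(d1, d2):
    # min of counts over the union of keys, divided by max of counts
    keys = set(d1) | set(d2)
    common = sum(min(d1.get(k, 0), d2.get(k, 0)) for k in keys)
    total = sum(max(d1.get(k, 0), d2.get(k, 0)) for k in keys)
    return common / total

def set_jaccard(s1, s2):
    # classic Jaccard index: |intersection| / |union|
    return len(s1 & s2) / len(s1 | s2)

d1 = {'a': 1, 'b': 1}  # all counts 1 -> behaves like the set {'a', 'b'}
d2 = {'a': 1, 'c': 1}
assert weighted_jaccard(d1, d2) == set_jaccard(set(d1), set(d2)) == 1 / 3
print("weighted form matches set form")
```

With counts greater than 1 the two diverge, which is exactly the weighting the answer describes.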

Tasset answered 14/5, 2020 at 11:7 Comment(0)
