Return a 'similarity score' based on two dictionaries' similarity in Python?
I know it's possible to return how similar two strings are by using the following function:

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

In [37]: similar("Hey, this is a test!","Hey, man, this is a test, man.")
Out[37]: 0.76
In [38]: similar("This should be one.","This should be one.")
Out[38]: 1.0

But is it possible to score two dictionaries based on the similarity of their keys and corresponding values? Not the number of keys they have in common, or which items match, but a score from 0 to 1, like the string example above.

I'm trying to find the similarity score between ratings['Shane'] and ratings['Joe'] in this dictionary:

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

I am using Python 2.7.10

Guessrope answered 14/3, 2016 at 6:32 Comment(11)
And what's the expected output? The number of common keys (like the keys are the same except Taken 3)? Or the actual values? What about multi-level dictionaries?Ruffi
Check out the Jaccard index: en.m.wikipedia.org/wiki/Jaccard_index. Pretty easy to implement from a suitable set.Bartolemo
Result will depend on your metric.Cymar
@Ruffi I'm hoping to find something that takes everything into account.Guessrope
Are you wanting some correlation coefficient?Mccartney
Hi Trivision, did you try my answer?Saltzman
Yes, pretty much. And I have all the data in one dictionary, as different users would mean adding more dictionary variables, so I just tried to cram all the data in multiple nested dictionaries, lolGuessrope
@Backtrack About to plug in the variables now!Guessrope
Are the values really floats? Or should they be integers?Mccartney
@PeterWood Floats are not necessary to this, no.Guessrope
@TrivisionZero, Updated a new answer check : "Update2: Normalized length and considered keys"Saltzman
import math

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

def cosine_similarity(vec1, vec2):
    # Note: this pairs values by position, ignoring keys, so it assumes
    # both dicts list their ratings in the same order.
    sum11, sum12, sum22 = 0, 0, 0
    for i in range(len(vec1)):
        x, y = vec1[i], vec2[i]
        sum11 += x * x
        sum22 += y * y
        sum12 += x * y
    return sum12 / math.sqrt(sum11 * sum22)

list1 = list(ratings['Shane'].values())
list2 = list(ratings['Joe'].values())

sim = cosine_similarity(list1, list2)
print(sim)

Output: 0.9205746178983233

Update 1: When I use:

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
         'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

Output: 0.9574271077563381

Update2: Normalized length and considered keys

from math import sqrt

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

def square_rooted(x):
    return round(sqrt(sum(a * a for a in x)), 3)

def cosine_similarity(x, y):
    # Iterate over the larger dict's keys so every one of its keys is considered.
    if len(x) > len(y):
        input1, input2 = x, y
    else:
        input1, input2 = y, x

    vector1 = list(input1.values())
    vector2 = []

    for k in input1.keys():    # normalizing the input vectors
        if k in input2:
            vector2.append(float(input2[k]))  # value for a common key from input2
        else:
            vector2.append(0.0)               # missing key counts as 0

    numerator = sum(a * b for a, b in zip(vector2, vector1))
    denominator = square_rooted(vector1) * square_rooted(vector2)
    return round(numerator / float(denominator), 3)


print("Similarity between Shane and Joe")
print(cosine_similarity(ratings['Shane'], ratings['Joe']))

print("Similarity between Joe and Bob")
print(cosine_similarity(ratings['Joe'], ratings['Bob']))

print("Similarity between Shane and Bob")
print(cosine_similarity(ratings['Shane'], ratings['Bob']))

Output:

Similarity between Shane and Joe
0.887
Similarity between Joe and Bob
0.346
Similarity between Shane and Bob
0.615

Nice explanation of the difference between Jaccard and cosine: https://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

I am using Python 3.4.

NOTE: I have assigned 0 to missing values, but you can assign more suitable values too. Refer: http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/
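One possible "more suitable value" is sketched below (my own illustration, not from the answer; the mean-imputation choice and the `mean_filled_cosine` name are assumptions): fill each user's missing ratings with that user's own mean rating before taking the cosine, so an unwatched film is treated as an average opinion rather than a zero.

```python
from math import sqrt

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

def mean_filled_cosine(d1, d2):
    """Cosine similarity over the union of keys, imputing each
    user's own mean rating for films they have not rated."""
    keys = set(d1) | set(d2)
    mean1 = sum(d1.values()) / len(d1)
    mean2 = sum(d2.values()) / len(d2)
    v1 = [d1.get(k, mean1) for k in keys]
    v2 = [d2.get(k, mean2) for k in keys]
    num = sum(a * b for a, b in zip(v1, v2))
    den = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return num / den

print(mean_filled_cosine(ratings['Shane'], ratings['Joe']))  # ≈ 0.97
```

With this data the score lands a little above the zero-filled version, because Shane's imputed rating for Taken 3 is no longer penalized as a hard disagreement.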

Saltzman answered 14/3, 2016 at 6:43 Comment(28)
You're not looking at values are you? Just keys. Or did I miss something?Bartolemo
@JLPeyret, Ah, Cool Change it to values. Simple ?Saltzman
"can't multiply sequence by non-int of type 'str'" is what I keep getting on the "sum11 += x*x" line.Guessrope
When changing the dictionary to this: ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}, it says that they are exactly alike with the output being "1.0" when they're not. I simply changed the 127 Hours key from 3.0 to 5.0Guessrope
Did you get 1.0? No way, I am getting 0.9574271077563381 for the above o/p in the commentSaltzman
@Backtrack Okay, so I figured out the issue, I think... I'm using Python 2.7.10. Do I need to find a new IDE, or is there some part of the code I could change to work with my version? It's weird though, I got the same value before I changed that one variable, it's just after I did it got really weird.Guessrope
Here's another error I run into in 3.4 when comparing the new user, Bob with another user: "Traceback (most recent call last): File "python", line 17, in <module> File "python", line 8, in cosine_similarity IndexError: list index out of range" when adding another person with only two ratings: "ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}, 'Bob': {'Panic Room':5.0,'Nonstop':5.0}}"Guessrope
Hi @TrivisionZero, Check my "Update1" I have done with all 3 inputs. Verify itSaltzman
@TrivisionZero Would you want {'Avatar': 5.0} to be closer to {'127 Hours': 5.0} than {'Avatar': 1.0}? Cosine similarity implemented with just values doesn't account for that. It would give the same similarity.Mccartney
@PeterWood I think the same similarity for that is fine because we don't know the similarity of the actual movies.Guessrope
@TrivisionZero But that doesn't make sense. If I rate a film 5 and you rate a different film 5, there should be no similarity (unless you're comparing how enthusiastic we are).Mccartney
Hi @PeterWood, Check my update2. Now i have normalized the length and considered keys.Saltzman
@PeterWood, If you are satisfied consider +1 and add your commentsSaltzman
Your normalization implies that if person didn't watch a film, it rates it zero, that is obviously wrong.Cymar
Hi @Lol4t0, say I have 2 docs, doc1 = "Cat and rat" and doc2 = "dog and cat". In this scenario, what will be your vector, tf? Won't you assign zero (0) to the missing word in doc1 and doc2? Correct me if I am wrong, but I will not accept it if you just mention that it is wrongSaltzman
Films != Texts. Nature of data is extremely important in ml cases like thisCymar
Okay lets say, Lol4t0:{'A': 2.0, 'B' : '3.0'} backtrack:{'C':1.0, A:'3.7'}. In this scenario, How do we find similarity between us ? meaning the vectors ?Saltzman
I don't know right now, that's why I didn't write an answer. But if you haven't watched some film, it doesn't mean you wouldn't like it, right?Cymar
Refer this functionspace.com/articles/63/… and ". Note that we consider missing values as 0, the implications of which will be discussed later on."Saltzman
As far as i know, I have been using 0 for missing values. Indeed it has some effect on it. But it is not completely wrong.Saltzman
Later, they do say that zero is poor decision and propose better values for missing values, read the whole article. Good reference though!Cymar
Yes, as I said, read my whole comment (the previous one). It has some impact, but 0 is a simpler way to handle missing values, I would say. And they did not mention it is completely wrongSaltzman
The similarity of Shane and Bob seems much more realistic, I remember setting the ratings opposite for bob and it gave me a number near .98 on Python 3. I'm away from a computer, so I will test it laterGuessrope
Now I'm getting "numerator = sum(a*b for a,b in zip(vector2,vector1)) 190 denominator = square_rooted(vector1)*square_rooted(vector2) --> 191 return round(numerator/float(denominator),3) 192 193 ZeroDivisionError: float division by zero " randomlyGuessrope
How would I fix that?Guessrope
@TrivisionZero, What was the input.Saltzman
@TrivisionZero, One thing i am suspecting is that when you give input that are disjoint set. in that case the denominator will become zeroSaltzman
No, they had common movies. It only seems to happen with a very large set of data, such as 600 entries. I think I came up with a solution though, that I will post if it ends up workingGuessrope
https://en.m.wikipedia.org/wiki/Jaccard_index

And now some cleaned-up sample code.

def jac(s1,s2):
    """the jaccard index between 2 sets"""
    s_union = s1.union(s2)
    s_inter = s1.intersection(s2)

    len_union = len(s_union)
    if not len_union:
        return 0

    return len(s_inter)*1.0/len_union

from itertools import permutations

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
     'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
     'Bob': {'Panic Room':5.0,'Nonstop':5.0}}

def common_movie(dict0, dict1):
    """have we rated the same movies?"""
    set0 = set(dict0.items())
    set1 = set(dict1.items())
    return jac(set0, set1)

def movies_and_ratings(dict0, dict1):
    """how do our movies and ratings line up?"""

    set_keys0 = set(dict0.keys())
    set_keys1 = set(dict1.keys())

    key_commonality = jac(set_keys0, set_keys1)

    set0 = set(dict0.items())
    set1 = set(dict1.items())

    item_commonality = jac(set0, set1)

    # OK, so now we give some proximity credit for key matches, even if key + value don't match
    return 0.3 * key_commonality + 0.7 * item_commonality

def common_movie_ratings(dict0, dict1):
    """how do our ratings correspond on the same movies?"""

    set_keys0 = set(dict0.keys())
    set_keys1 = set(dict1.keys())

    set_common = set_keys0.intersection(set_keys1)

    set0 = set([v for k, v in dict0.items() if k in set_common])
    set1 = set([v for k, v in dict1.items() if k in set_common])

    return jac(set0, set1)

for pair in permutations(ratings.keys(), 2):

    dict0, dict1 = ratings[pair[0]], ratings[pair[1]]
    print "\n %s vs %s" % (pair)

    # make no assumption on key/value order coming out of a
    # dictionary, so sort the items before printing
    li = sorted(dict0.items())
    print "  %s" % (li)
    li = sorted(dict1.items())
    print "  %s" % (li)

    print "     common_movie    :%s" % common_movie(dict0, dict1)
    print "     movies_and_ratings:%s" % movies_and_ratings(dict0, dict1)
    print "     common_movie_ratings  :%s" % common_movie_ratings(dict0, dict1)

The output:

 Shane vs Bob
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
     common_movie    :0.25
     movies_and_ratings:0.25
     common_movie_ratings  :1.0

 Shane vs Joe
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
     common_movie    :0.166666666667
     movies_and_ratings:0.341666666667
     common_movie_ratings  :0.333333333333

 Bob vs Shane
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
     common_movie    :0.25
     movies_and_ratings:0.25
     common_movie_ratings  :1.0

 Bob vs Joe
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
     common_movie    :0.0
     movies_and_ratings:0.06
     common_movie_ratings  :0.0

 Joe vs Shane
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
     common_movie    :0.166666666667
     movies_and_ratings:0.341666666667
     common_movie_ratings  :0.333333333333

 Joe vs Bob
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
     common_movie    :0.0
     movies_and_ratings:0.06
     common_movie_ratings  :0.0
Bartolemo answered 14/3, 2016 at 6:40 Comment(15)
This did not seem to work in my situation... "AttributeError: 'set' object has no attribute 'intersect'"Guessrope
Not on a comp so you may need to check syntax. I did correct intersect. But I've used Jaccard before and they are pretty good at 0..1 ratings.Bartolemo
Now it's giving a 'unsupported operand type(s) for /: 'set' and 'set'' error on the division part.Guessrope
Arggg. Sorry about that. You need to look at the sizes. I added len to both. My bad.Bartolemo
@JLPeyret integer division... need to qualify Python versionMccartney
It now gives an output, but it's just 0.0Guessrope
Tell you what. If you aint got an answer accepted tomorrow I'll write it up in unitest form. Use my jac function though, not the orig calc I had.Bartolemo
Uhm, still zero... It could be that our Python versions are different? I'm on 2.7, what are you on? Edit: On an online Python3 IDE, it still says 0... this is weirdGuessrope
2.7. However, I think I see the problem. Not one of your keys and values is a match. Same issue with Backtrack's stuff. You need to look at keys, values, and (keys, values) as a whole. Right now, I am just on the 3rd bit.Bartolemo
@TrivisionZero Would you want {'Avatar': 5.0} to be closer to {'Avatar': 3.0} than {'Avatar': 1.0}? The Jaccard index implemented with key value pairs doesn't account for that. It would give the same similarity.Mccartney
Updated the example, with some extra comparisons. One thing to keep in mind about Jaccard and dictionaries (rather than the more specific example of movie ratings) is that the exact contents of the dictionaries do not matter. Which is also a shortcoming, as per @PeterWood remark of 5.0 being closer to 3.0 than 1.0. The order of the keys() and values() coming out of a dictionary also does not matter, since this is set-based. But it might in a list-based approach.Bartolemo
Is there any way to make it look at the values, instead of just the keys, as someone who rates Avatar 5 stars and another who rates it 1 star gets a similar rating of 1.0Guessrope
Not that I know. It's not a question of values vs keys. It's a question that if you look at the original definition, it checks for equality in values or keys and has no notion of "1 is closer to 2 than it is to 5". It's a really clever similarity definition since you can feed it almost anything that you can put in a set. But it's a better fit for your question title than for what you actually want. If you go to the Wikipedia entry for Jaccard, you can see references to other similarity metrics. "Programming Collective Intelligence" is also a book covering this type of subject in Python.Bartolemo
What you might want to do is to write up some test data (with edge cases), decide for yourself which pairs should rate highly and then use unittests to run different algos through them and check. I would reason in terms of relative position rather than exact numbers. Ex: assertTrue(("Jane", "Joe") > ("Jane", "Bob")). It all depends on how important this classification is to your goals.Bartolemo
I think I actually came up with a perfect solution for me, if it works, I'll post it as another answer option.Guessrope
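The relative-ordering test idea from the comments can be sketched roughly like this (assumptions: `cosine_over_common_keys` is a hypothetical stand-in similarity that zero-fills missing films, as the answers above do, and the expected orderings are ones you would choose for your own data):

```python
from math import sqrt

def cosine_over_common_keys(d1, d2):
    # Hypothetical stand-in similarity: cosine over the union of keys,
    # treating unrated films as 0.
    keys = set(d1) | set(d2)
    v1 = [d1.get(k, 0.0) for k in keys]
    v2 = [d2.get(k, 0.0) for k in keys]
    num = sum(a * b for a, b in zip(v1, v2))
    den = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return num / den if den else 0.0

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

# Assert only relative positions, not exact numbers: Shane should be
# more similar to Joe (3 shared films) than to Bob (1 shared film).
sim = cosine_over_common_keys
assert sim(ratings['Shane'], ratings['Joe']) > sim(ratings['Shane'], ratings['Bob'])
assert sim(ratings['Joe'], ratings['Bob']) < sim(ratings['Shane'], ratings['Joe'])
print("orderings hold")
```

Swapping in a different metric then only requires rebinding `sim` and rerunning the assertions.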
This is my implementation of the Jaccard similarity from the datascience.stackexchange post mentioned above.

Suppose you have a Counter output from the collections library that counts the number of times a certain key is present in an iterable, as such:

d1 = {'a': 2, 'b': 1}
d2 = {'a': 1, 'c': 1}

def get_jaccard_similarity(d1, d2):
    if not isinstance(d1, dict) or not isinstance(d2, dict):
        raise TypeError(f'd1 and d2 should be of type dict'
                        f' and not {type(d1).__name__}, {type(d2).__name__}')
    if not d1 and not d2:
        return 1
    elif (d1 and not d2) or (d2 and not d1):
        return 0
    else:
        set_of_all_keys = {*d1.keys(), *d2.keys()}
        nb_of_common_elements_dict = {k: min(d1.get(k, 0), d2.get(k, 0))
                                      for k in set_of_all_keys}
        nb_of_total_elements_dict = {k: max(d1.get(k, 0), d2.get(k, 0))
                                     for k in set_of_all_keys}
        return sum(nb_of_common_elements_dict.values()) / sum(nb_of_total_elements_dict.values())

print(get_jaccard_similarity(d1, d2))

Output: 0.25

The datascience.stackexchange post derives the Jaccard similarity from the notion of sets. I believe this implementation gives the same result as the set-based one (a dictionary with all values equal to 1), with the advantage that it weights by the number of times a key appears in both (counter) dictionaries.
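That reduction to the set case can be checked directly. A small sketch (assumptions: `weighted_jaccard` mirrors the min/max branch of the answer above, and `set_jaccard` is the plain set-based index):

```python
def weighted_jaccard(d1, d2):
    # min of counts over the union of keys, divided by max of counts
    keys = set(d1) | set(d2)
    common = sum(min(d1.get(k, 0), d2.get(k, 0)) for k in keys)
    total = sum(max(d1.get(k, 0), d2.get(k, 0)) for k in keys)
    return common / total

def set_jaccard(s1, s2):
    # classic Jaccard index: |intersection| / |union|
    return len(s1 & s2) / len(s1 | s2)

d1 = {'a': 1, 'b': 1}  # all counts 1 -> behaves like the set {'a', 'b'}
d2 = {'a': 1, 'c': 1}
assert weighted_jaccard(d1, d2) == set_jaccard(set(d1), set(d2)) == 1 / 3
print("weighted form matches set form")
```

With counts greater than 1 the two diverge, which is exactly the weighting the answer describes.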

Tasset answered 14/5, 2020 at 11:7 Comment(0)
