python: How to calculate the cosine similarity of two word lists?
I want to calculate the cosine similarity of two lists like the following:

A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)', u'factory']

B = [u'home (private)', u'school', u'bank', u'shopping mall']

I know the cosine similarity of A and B should be

3/(sqrt(7)*sqrt(4)).

I tried to join the lists into strings like 'home bank bank building factory', which looks like a sentence. However, some elements (e.g. 'home (private)') contain spaces and some contain brackets, so I find it difficult to count the word occurrences.

Do you know how to count the word occurrences in such a list, so that for list B the counts can be represented as

{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?

Or do you know how to calculate the cosine similarity of these two lists?

Thank you very much.

Werby asked 2/3, 2015 at 20:55 Comment(2)
How would you define cosine similarity? Where does the value 3/(sqrt(7)*sqrt(4)) come from? – Illa
I know one way to define cosine similarity: dot(A, B)/(|A|·|B|). For example, with A = [2, 1, 1, 1, 0, 0] and B = [1, 1, 0, 0, 1, 1], the cosine similarity is 3/(sqrt(7)*sqrt(4)). – Werby
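As a quick sanity check of the formula in that comment (plain Python, standard library only; variable names are just for illustration):

from math import sqrt

# example vectors from the comment above
A = [2, 1, 1, 1, 0, 0]
B = [1, 1, 0, 0, 1, 1]

dot    = sum(x * y for x, y in zip(A, B))   # 3
norm_a = sqrt(sum(x * x for x in A))        # sqrt(7)
norm_b = sqrt(sum(y * y for y in B))        # sqrt(4) == 2.0
print(dot / (norm_a * norm_b))              # 0.5669467095138409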
from collections import Counter

# word-lists to compare
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']

# count word occurrences
a_vals = Counter(a)
b_vals = Counter(b)

# convert to word-vectors
words  = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]        # e.g. [0, 0, 1, 1, 2, 1] (set order varies)
b_vect = [b_vals.get(word, 0) for word in words]        # e.g. [1, 1, 1, 0, 1, 0]

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             # sqrt(7)
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             # sqrt(4)
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))    # 3
cosine = dot / (len_a * len_b)                          # 0.5669467
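
Equivalently, if SciPy happens to be available (an assumption; it isn't used elsewhere in this answer), its distance module gives the same number. Note that scipy.spatial.distance.cosine returns the cosine distance, i.e. 1 - similarity:

from scipy.spatial.distance import cosine

# scipy computes cosine *distance*, so subtract from 1
cosine_sim = 1 - cosine(a_vect, b_vect)                 # 0.5669467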
Cherlycherlyn answered 2/3, 2015 at 21:22 Comment(2)
Thanks a lot for your answer. It looks very cool, but words = list(a_vals.keys() | b_vals.keys()) raises TypeError: unsupported operand type(s) for |: 'list' and 'list'. Any idea? – Werby
Sorry, I tested in Python 3.4. For 2.x you would do words = list(set(a_vals) | set(b_vals)). – Cherlycherlyn

First build a dictionary (this is the technical term for the set of all distinct words in a corpus), mapping each word to a unique index.

vocab = {}

# loop through both lists; map each distinct word to a
# unique number starting at zero
for word in A + B:
    if word not in vocab:
        vocab[word] = len(vocab)

The vocab dictionary now maps each word to a unique number starting at zero. We'll use these numbers as indices into an array (or vector).
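With the example lists above (A first, then B), vocab ends up as:

{u'home (private)': 0, u'bank': 1, u'building(condo/apartment)': 2,
 u'factory': 3, u'school': 4, u'shopping mall': 5}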

In the next step we'll create something called a term frequency vector for each input list. We're going to use a library called numpy here. It's a very popular tool for this sort of scientific computation, and if you're interested in cosine similarity (or other machine learning techniques), it's worth your time to learn.

import numpy as np

# create a numpy array (vector) for each input, filled with zeros
a = np.zeros(len(vocab))
b = np.zeros(len(vocab))

# loop through each input and create a corresponding vector for it
# this vector counts occurrences of each word in the dictionary

for word in A:
    index = vocab[word] # get index from dictionary
    a[index] += 1 # increment count for that index

for word in B:
    index = vocab[word]
    b[index] += 1

The final step is the actual calculation of the cosine similarity:

# use numpy's dot product to calculate the cosine similarity
sim = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

The variable sim now contains your answer. You can pull each of these sub-expressions out and verify that they match your original formula.
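For example (names here are just for illustration):

dot_ab = np.dot(a, b)                 # 3.0
norm_a = np.sqrt(np.dot(a, a))        # sqrt(7)
norm_b = np.sqrt(np.dot(b, b))        # sqrt(4) == 2.0
sim    = dot_ab / (norm_a * norm_b)   # 0.5669467095138409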

With a little refactoring this technique is pretty scalable (relatively large numbers of input lists, with a relatively large number of distinct words). For really large corpora (like Wikipedia) you should check out Natural Language Processing libraries made for this sort of thing. Here are a few good ones (a small gensim sketch follows the list):

  1. nltk
  2. gensim
  3. spaCy
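
For instance, a minimal sketch of the same bag-of-words approach in gensim (assuming gensim is installed; Dictionary, doc2bow and matutils.cossim are its standard bag-of-words utilities):

from gensim import corpora, matutils

# build the shared vocabulary from both word lists
dictionary = corpora.Dictionary([A, B])

# convert each list to a sparse bag-of-words vector: [(word_id, count), ...]
bow_a = dictionary.doc2bow(A)
bow_b = dictionary.doc2bow(B)

# cosine similarity between the two sparse vectors
sim = matutils.cossim(bow_a, bow_b)   # ~0.5669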
Cuprum answered 3/11, 2015 at 16:29 Comment(0)
