First, build a dictionary (this is the technical term for the collection of all distinct words in a corpus), mapping each distinct word to a unique index.
vocab = {}
i = 0
# loop through each list, find distinct words and map them to a
# unique number starting at zero
for word in A:
    if word not in vocab:
        vocab[word] = i
        i += 1
for word in B:
    if word not in vocab:
        vocab[word] = i
        i += 1
The vocab dictionary now maps each word to a unique number starting at zero. We'll use these numbers as indices into an array (or vector).
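To make this concrete, here's a small sketch with two hypothetical input lists (the sentences are made up for illustration and are not from your question; any two lists of words will do). It just re-runs the loops from above:

A = "the quick brown fox".split()
B = "the brown fox is quick".split()

vocab = {}
i = 0
for word in A:
    if word not in vocab:
        vocab[word] = i
        i += 1
for word in B:
    if word not in vocab:
        vocab[word] = i
        i += 1

print(vocab)  # {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'is': 4}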
In the next step we'll create something called a term frequency vector for each input list. We're going to use a library called numpy here; it's a very popular library for this sort of scientific computation, and if you're interested in cosine similarity (or other machine learning techniques) it's worth your time to learn.
import numpy as np

# create a numpy array (vector) for each input, filled with zeros
a = np.zeros(len(vocab))
b = np.zeros(len(vocab))

# loop through each input and create a corresponding vector for it
# this vector counts occurrences of each word in the dictionary
for word in A:
    index = vocab[word]  # get index from dictionary
    a[index] += 1        # increment count for that index
for word in B:
    index = vocab[word]
    b[index] += 1
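Continuing the hypothetical example from above, the two count vectors come out like this (each position corresponds to the index stored in vocab):

import numpy as np

a = np.zeros(len(vocab))
b = np.zeros(len(vocab))
for word in A:
    a[vocab[word]] += 1
for word in B:
    b[vocab[word]] += 1

print(a)  # [1. 1. 1. 1. 0.]  -- 'is' never appears in A
print(b)  # [1. 1. 1. 1. 1.]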
The final step is the actual calculation of the cosine similarity.
# use numpy's dot product to calculate the cosine similarity
sim = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))
The variable sim now contains your answer. You can pull each of these sub-expressions out and verify that they match your original formula.
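For example, with the hypothetical vectors above you can check each piece of the standard formula sim = a.b / (|a| * |b|) by hand:

dot_ab = np.dot(a, b)           # 4.0
norm_a = np.sqrt(np.dot(a, a))  # sqrt(4) = 2.0
norm_b = np.sqrt(np.dot(b, b))  # sqrt(5) ~ 2.236
print(dot_ab / (norm_a * norm_b))  # ~ 0.894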
With a little refactoring this technique is pretty scalable (relatively large numbers of input lists, each with a relatively large number of distinct words); one possible refactor is sketched after the list below. For really large corpora (like Wikipedia) you should check out Natural Language Processing libraries built for this sort of thing. Here are a few good ones.
- nltk
- gensim
- spaCy
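If it helps, here is one way such a refactor might look. This is just a sketch; the cosine_similarity function name and its signature are my own, not part of any of the libraries above:

import numpy as np

def cosine_similarity(words_a, words_b):
    # build the shared dictionary of distinct words
    vocab = {}
    for word in words_a + words_b:
        if word not in vocab:
            vocab[word] = len(vocab)
    # build the term frequency vectors
    a = np.zeros(len(vocab))
    b = np.zeros(len(vocab))
    for word in words_a:
        a[vocab[word]] += 1
    for word in words_b:
        b[vocab[word]] += 1
    # same formula as above
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

print(cosine_similarity("the quick brown fox".split(),
                        "the brown fox is quick".split()))  # ~ 0.894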