First, apologies for being long-winded.
I'm not a mathematician, so I'm hoping there's a "dumbed-down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP, and I'm open to all feedback. My primary question: is the method described below an accurate way to find similarities (in wording, sentiment, etc.) between two bodies of text? If not, how would you build such a recommendation engine (new methods, new data, etc.)?
I currently have two dictionaries. One, called personality_feature_dict, maps each personality type to its descriptor words: {'Type 1': ['able', 'accepting', 'according', 'accountable'...]}. The other, called book_feature_dict, maps book titles to their own descriptor words, which were extracted using TF-IDF: {'Book Title': ['actually', 'administration', 'age', 'allow', 'anti'...]}.
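To make the structure concrete, here are hypothetical toy stand-ins for the two dictionaries (the keys and word lists are invented; my real data has 99 books and many more words per entry):

```python
# Toy versions of the two dictionaries, just to illustrate the
# shape assumed by the code below (all entries are made up).
personality_feature_dict = {
    'Type 1': ['able', 'accepting', 'according', 'accountable'],
    'Type 2': ['bold', 'brave', 'calm', 'careful'],
}
book_feature_dict = {
    'Book A': ['actually', 'administration', 'age', 'allow'],
    'Book B': ['anti', 'able', 'calm', 'allow'],
}
```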
As it stands, I'm using the following code to calculate the percentage similarity between the values of the two dictionaries. First, I build a combined corpus from all dictionary values.
import numpy as np
from gensim.corpora import Dictionary

np.random.seed(1)  # for reproducibility

book_values = list(book_feature_dict.values())
personality_values = list(personality_feature_dict.values())
texts = book_values + personality_values  # one "document" per book/personality

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
Then I train an LDA model on this corpus to identify similarities. My knowledge of LDA modeling is limited, so if you spot an error here, I'd appreciate you flagging it!
from gensim.models import ldamodel
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4, minimum_probability=1e-8)
Finally, I iterate through the word lists as bags of words and compare the first personality type, list(personality_feature_dict.values())[personality_num], against each of the 99 book word lists by computing the Hellinger distance between their LDA topic distributions.
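For reference, the Hellinger distance between two discrete distributions p and q is sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2): it's 0 for identical distributions and 1 for distributions with no overlap, which is what makes the 100 - distance*100 conversion below behave like a percentage. A minimal NumPy sketch (function name is my own, not gensim's):

```python
import numpy as np

def hellinger_dist(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(hellinger_dist([0.5, 0.5], [0.5, 0.5]))  # identical -> 0.0
print(hellinger_dist([1.0, 0.0], [0.0, 1.0]))  # disjoint  -> 1.0
```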
from gensim.matutils import hellinger

personality_num = 0

# Topic distribution for the chosen personality type (computed once)
e_0 = list(personality_feature_dict.values())[personality_num]
e_0_bow = model.id2word.doc2bow(e_0)
e_0_lda_bow = model[e_0_bow]

book_titles = list(book_feature_dict.keys())
book_values = list(book_feature_dict.values())

for i in range(len(book_values)):  # all 99 books
    s_0_bow = model.id2word.doc2bow(book_values[i])
    s_0_lda_bow = model[s_0_bow]
    # Hellinger distance lies in [0, 1]; convert it to a % similarity
    x = 100 - hellinger(e_0_lda_bow, s_0_lda_bow) * 100

Finally, still inside the loop, I print every book whose similarity to the personality type exceeds 50%:

    if x > 50:
        print(list(personality_feature_dict.keys())[personality_num])
        print('similarity to', book_titles[i], 'is')
        print(x, '%', '\n\n')
The result looks something like this:
Personality Type
similarity to Name of Book 1 is
84.6029228744518 %
Personality Type
similarity to Name of Book 2 is
83.09513184950528 %
Personality Type
similarity to Name of Book 3 is
85.44322295890642 %
...
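For what it's worth, instead of the fixed 50% cutoff I've also considered ranking all books by similarity and taking the top k as recommendations. A sketch with made-up scores (the (title, similarity) pairs here are invented; in practice they'd be collected inside the loop above):

```python
# Hypothetical (title, % similarity) pairs; values are made up
scores = [('Book 1', 84.6), ('Book 2', 43.1), ('Book 3', 85.4)]

# Sort by similarity, highest first, and keep the top k
top_k = sorted(scores, key=lambda ts: ts[1], reverse=True)[:2]
print(top_k)  # [('Book 3', 85.4), ('Book 1', 84.6)]
```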