First, apologies for being long-winded.
I'm not a mathematician, so I'm hoping there's a "dumbed-down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP, and I'm open to all feedback. My primary question: is the method described below an accurate way to find similarities (in wording, sentiment, etc.) between two bodies of text? If not, how would you build such a recommendation engine (new methods, new data, etc.)?
I currently have two dictionaries. One, called personality_feature_dict, maps each personality type to its descriptor words: {'Type 1': ['able', 'accepting', 'according', 'accountable'...]}. The other, called book_feature_dict, maps book titles to their own descriptor words, which were extracted using TF-IDF: {'Book Title': ['actually', 'administration', 'age', 'allow', 'anti'...]}.
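To make the structure concrete, here are hypothetical toy stand-ins for the two dictionaries (the keys and word lists are invented; my real data has 99 books and many more words per entry):

```python
# Toy versions of the two dictionaries, just to illustrate the
# shape assumed by the code below (all entries are made up).
personality_feature_dict = {
    'Type 1': ['able', 'accepting', 'according', 'accountable'],
    'Type 2': ['bold', 'brave', 'calm', 'careful'],
}
book_feature_dict = {
    'Book A': ['actually', 'administration', 'age', 'allow'],
    'Book B': ['anti', 'able', 'calm', 'allow'],
}
```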
As it stands, I'm using the following code to calculate the percentage similarity between the values of the two dictionaries. First, I build a combined corpus from all dictionary values.
import numpy as np
from gensim.corpora import Dictionary

np.random.seed(1)  # for reproducibility

book_values = list(book_feature_dict.values())
personality_values = list(personality_feature_dict.values())
texts = book_values + personality_values  # one "document" per book/personality

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
Then I train an LDA model on this corpus to identify similarities. My knowledge of LDA modeling is limited, so if you spot an error here, I'd appreciate you flagging it!
from gensim.models import ldamodel
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4, minimum_probability=1e-8)
Finally, I iterate through the word lists as bags of words and compare the first personality type, list(personality_feature_dict.values())[personality_num], against each of the 99 book word lists by computing the Hellinger distance between their LDA topic distributions.
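For reference, the Hellinger distance between two discrete distributions p and q is sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2): it's 0 for identical distributions and 1 for distributions with no overlap, which is what makes the 100 - distance*100 conversion below behave like a percentage. A minimal NumPy sketch (function name is my own, not gensim's):

```python
import numpy as np

def hellinger_dist(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(hellinger_dist([0.5, 0.5], [0.5, 0.5]))  # identical -> 0.0
print(hellinger_dist([1.0, 0.0], [0.0, 1.0]))  # disjoint  -> 1.0
```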
from gensim.matutils import hellinger

personality_num = 0

# Topic distribution for the chosen personality type (computed once)
e_0 = list(personality_feature_dict.values())[personality_num]
e_0_bow = model.id2word.doc2bow(e_0)
e_0_lda_bow = model[e_0_bow]

book_titles = list(book_feature_dict.keys())
book_values = list(book_feature_dict.values())

for i in range(len(book_values)):  # all 99 books
    s_0_bow = model.id2word.doc2bow(book_values[i])
    s_0_lda_bow = model[s_0_bow]
    # Hellinger distance lies in [0, 1]; convert it to a % similarity
    x = 100 - hellinger(e_0_lda_bow, s_0_lda_bow) * 100

Finally, still inside the loop, I print every book whose similarity to the personality type exceeds 50%:

    if x > 50:
        print(list(personality_feature_dict.keys())[personality_num])
        print('similarity to', book_titles[i], 'is')
        print(x, '%', '\n\n')
The result looks something like this:
Personality Type
similarity to Name of Book 1 is
84.6029228744518 %
Personality Type
similarity to Name of Book 2 is
83.09513184950528 %
Personality Type
similarity to Name of Book 3 is
85.44322295890642 %
...
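For what it's worth, instead of the fixed 50% cutoff I've also considered ranking all books by similarity and taking the top k as recommendations. A sketch with made-up scores (the (title, similarity) pairs here are invented; in practice they'd be collected inside the loop above):

```python
# Hypothetical (title, % similarity) pairs; values are made up
scores = [('Book 1', 84.6), ('Book 2', 43.1), ('Book 3', 85.4)]

# Sort by similarity, highest first, and keep the top k
top_k = sorted(scores, key=lambda ts: ts[1], reverse=True)[:2]
print(top_k)  # [('Book 3', 85.4), ('Book 1', 84.6)]
```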