One-sentence backdrop: I have text data from auto-transcribed talks, and I want to compare the similarity of their content (i.e. what they are talking about) to do clustering and recommendation. I am quite new to NLP.
Data: The data I am using is available here. For all the lazy ones:
git clone https://github.com/TMorville/transcribed_data
and here is a snippet of code to put it in a df:
import os, json
import pandas as pd

def td_to_df():
    path_to_json = '#FILL OUT PATH'
    json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('td.json')]
    tddata = pd.DataFrame(columns=['trans', 'confidence'])
    for index, js in enumerate(json_files):
        with open(os.path.join(path_to_json, js)) as json_file:
            # pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
            json_text = pd.json_normalize(json.load(json_file))
            # .loc with row and column labels grows the empty frame;
            # chained indexing (tddata['trans'].loc[index]) would not
            tddata.loc[index, 'trans'] = str(json_text['trans'][0])
            tddata.loc[index, 'confidence'] = str(json_text['confidence'][0])
    return tddata
Approach: So far, I have only used the spaCy package to do "out of the box" similarity. I simply run the nlp model over each full transcript and compare a baseline document to all the others.
import spacy

def similarity_get():
    tddata = td_to_df()
    nlp = spacy.load('en_core_web_lg')
    baseline = nlp(tddata.trans[0])
    for text in tddata.trans:
        print(baseline.similarity(nlp(text)))
Problem: Practically all similarities come out as > 0.95, more or less independently of which document is used as the baseline. Now, this may not come as a major surprise given the lack of preprocessing.
Solution strategy: Following the advice in this post, I would like to do the following (using spaCy where possible):
1) Remove stop words.
2) Remove the most frequent words.
3) Merge word pairs.
4) Possibly use Doc2Vec outside of spaCy.
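For what it's worth, steps 1 and 2 can be sketched in plain Python. The stop-word set below is a tiny illustrative stand-in (in practice you would use spaCy's token.is_stop), and the top_n cutoff for frequent words is an arbitrary choice:

```python
from collections import Counter

# Tiny illustrative stop-word list; in practice use spaCy's token.is_stop
STOP_WORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it'}

def preprocess(docs, top_n=2):
    """Lower-case, drop stop words, then drop the top_n most frequent
    remaining words across the whole corpus (top_n is arbitrary here)."""
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    too_frequent = {w for w, _ in counts.most_common(top_n)}
    return [[w for w in doc if w not in too_frequent] for doc in tokenized]

docs = ["the talk is about talk and data",
        "a talk about data in the cloud"]
print(preprocess(docs))
```

The surviving tokens can then be joined back into strings and fed to nlp() as before, so the averaged vectors are no longer dominated by function words.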
Questions: Does the above seem like a sound strategy? If not, what's missing? If yes, how much of this is already happening under the hood when using the pre-trained model loaded with nlp = spacy.load('en_core_web_lg')?
I can't seem to find documentation that explains what exactly these models are doing, or how to configure them. A quick Google search yields nothing, and even the (very neat) API documentation does not seem to help. Perhaps I am looking in the wrong place?
By calling nlp = spacy.load('en_core_web_lg') you load the word vectors which will be used in doc2vec. – Sweaty