NLP, spaCy: Strategy for improving document similarity

One sentence backdrop: I have text data from auto-transcribed talks, and I want to compare the similarity of their content (e.g. what they are talking about) to do clustering and recommendation. I am quite new to NLP.


Data: The data I am using is available here. For all the lazy ones:

git clone https://github.com/TMorville/transcribed_data

and here is a snippet of code to put it in a df:

import os, json
import pandas as pd


def td_to_df():

    path_to_json = '#FILL OUT PATH'
    json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('td.json')]

    tddata = pd.DataFrame(columns=['trans', 'confidence'])

    for index, js in enumerate(json_files):
        with open(os.path.join(path_to_json, js)) as json_file:
            # flatten the nested JSON into a one-row frame
            json_text = pd.json_normalize(json.load(json_file))

            # assign via .loc on the frame itself to avoid chained assignment
            tddata.loc[index, 'trans'] = str(json_text['trans'][0])
            tddata.loc[index, 'confidence'] = str(json_text['confidence'][0])

    return tddata

Approach: So far, I have only used the spaCy package to do "out of the box" similarity. I simply apply the nlp model to the full text of each talk and compare a baseline talk to all the others.

import spacy


def similarity_get():

    tddata = td_to_df()

    nlp = spacy.load('en_core_web_lg')

    # use the first transcript as the baseline and compare every transcript to it
    baseline = nlp(tddata.trans[0])

    for text in tddata.trans:
        print(baseline.similarity(nlp(text)))

Problem: Practically all similarities come out as > 0.95. This is more or less independent of the baseline. Now, this may not come as a major surprise given the lack of preprocessing.


Solution strategy: Following the advice in this post, I would like to do the following (using spaCy where possible): 1) Remove stop words. 2) Remove most frequent words. 3) Merge word pairs. 4) Possibly use Doc2Vec outside of spaCy.
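To make this concrete, a minimal sketch of steps 1 and 2 with spaCy could look like the following (the preprocess helper and the n_most_common cutoff are just placeholders I made up, not something I have tuned):

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_lg')

def preprocess(texts, n_most_common=50):
    # parse every transcript once
    docs = [nlp(text) for text in texts]

    # corpus-wide frequencies, used to drop the most frequent words
    freq = Counter(tok.lower_ for doc in docs for tok in doc if tok.is_alpha)
    too_common = {word for word, _ in freq.most_common(n_most_common)}

    cleaned = []
    for doc in docs:
        kept = [tok.text for tok in doc
                if tok.is_alpha and not tok.is_stop and tok.lower_ not in too_common]
        cleaned.append(nlp(' '.join(kept)))
    return cleaned

# docs = preprocess(tddata.trans.tolist())
# print(docs[0].similarity(docs[1]))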


Questions: Does the above seem like a sound strategy? If no, what's missing? If yes, how much of this is already happening under the hood when using the pre-trained model loaded in nlp = spacy.load('en_core_web_lg')?

I can't seem to find documentation that explains what exactly these models are doing, or how to configure them. A quick Google search yields nothing, and even the (very neat) API documentation does not seem to help. Perhaps I am looking in the wrong place?

Archway answered 7/6, 2018 at 14:25 Comment(1)
Regarding the questions you asked: it seems a reasonable approach to use a method like doc2vec. Such methods usually depend on word vector embeddings, which are generated by word2vec or GloVe. So, the answer to the last question is: by loading nlp = spacy.load('en_core_web_lg'), you load the word vectors which will be used in doc2vec. – Sweaty
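A minimal sketch of that Doc2Vec route with gensim might look like this (the hyperparameters and the gensim 4.x model.dv attribute are illustrative assumptions):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # texts: list of transcript strings, e.g. tddata.trans.tolist()
    texts = ["first transcript ...", "second transcript ..."]

    # give each document a tag so it can be looked up later
    documents = [TaggedDocument(words=text.lower().split(), tags=[i])
                 for i, text in enumerate(texts)]

    # small illustrative hyperparameters; tune vector_size and epochs on real data
    model = Doc2Vec(documents, vector_size=100, window=5, min_count=2, epochs=40)

    # infer a vector for a document and find its nearest neighbours
    vec = model.infer_vector(texts[0].lower().split())
    print(model.dv.most_similar([vec], topn=3))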

You can do most of that with spaCy and some regexes.

So, you have to take a look at the spaCy API documentation.

Basic steps in any NLP pipeline are the following:

  1. Language detection (self-explanatory: if you're working with some dataset, you know what the language is and can adapt your pipeline to that). When you know the language, you have to download the correct models from spaCy. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en and then import it into the preprocessing script like this:

    import spacy
    nlp = spacy.load('en')
    
  2. Tokenization - the process of splitting the text into words. It's not enough to just do text.split() (e.g. there's would be treated as a single word, but it's really two words: there and is). So here we use tokenizers. In spaCy you can do something like:

    doc = nlp(text)
    

where text is your dataset corpus or a sample from a dataset. You can read more about the document instance here
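To inspect the result, you can simply iterate over the tokens of the resulting Doc (a tiny illustrative snippet, assuming the doc created above):

    # print the first ten tokens produced by the tokenizer
    for token in doc[:10]:
        print(token.text)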

  3. Punctuation removal - a pretty self-explanatory process, done on the Doc created in the previous step. To remove punctuation, just type:

    import re
    
    # removing punctuation tokens
    text_no_punct = [token.text for token in doc if not token.is_punct]
    
    # remove punctuation that is still attached to a word string, e.g. 'bye!' -> 'bye'
    REPLACE_PUNCT = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
    text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
    
  4. POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:

    A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
    software/NN that/WDT reads/VBZ text/NN in/IN some/DT
    language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
    each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
    adjective/NN ,/, etc./FW ./.
    

where the uppercase codes after the slash are standard word tags. A list of tags can be found here

In spaCy, this is already done by running the text through the nlp instance. You can get the tags with:

    for token in doc:
        print(token.text, token.tag_)
  5. Morphological processing: lemmatization - the process of transforming words into their linguistically valid base form, called the lemma:

    nouns → singular nominative form
    verbs → infinitive form
    adjectives → singular, nominative, masculine, indefinite, positive form
    

In spaCy, this is also already done for you by running the text through the nlp instance. You can get the lemma of every word with:

    for token in doc:
        print(token.text, token.lemma_)
  6. Removing stopwords - stopwords are words that do not add any new information or meaning to a sentence and can be omitted. You guessed it: this is also already done for you by the nlp instance. To filter out the stopwords, just type:

    text_without_stopwords = [token.text for token in doc if not token.is_stop]
    doc = nlp(' '.join(text_without_stopwords))
    

Now you have a clean dataset. You can use word2vec or GloVe pretrained models to create word vectors and feed your data into some model. Alternatively, you can use TF-IDF to create the word vectors, which downweights the most common words. Also, contrary to the usual process, you may want to keep the most specific words, since your task is to better differentiate between two texts. I hope this is clear enough :)
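As a rough illustration of the TF-IDF route, a sketch with scikit-learn could look like this (the placeholder texts and the max_df cutoff are assumptions to adapt to the actual cleaned transcripts):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # cleaned transcripts, e.g. the joined token lists from the steps above
    texts = ["first cleaned transcript ...", "second cleaned transcript ..."]

    # max_df drops terms that occur in more than half of the documents,
    # removing corpus-wide filler that would otherwise inflate similarities
    vectorizer = TfidfVectorizer(max_df=0.5)
    tfidf = vectorizer.fit_transform(texts)

    # pairwise cosine similarity matrix between all documents
    print(cosine_similarity(tfidf))

Cosine similarity over sparse TF-IDF vectors tends to spread the scores out much more than averaged word vectors, which should help with the > 0.95 problem described in the question.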

Lathy answered 28/1, 2019 at 7:10 Comment(0)
