Merge related words in NLP

I'd like to define a new word which includes count values from two (or more) different words. For example:

    Words      Frequency
0   mom        250
1   2020       151
2   the        124
3   19         82
4   mother     81
..  ...        ...
10  London     6
11  life       6
12  something  6

I would like to define mother as mom + mother:

    Words      Frequency
0   mother     331
1   2020       151
2   the        124
3   19         82
..  ...        ...
9   London     6
10  life       6
11  something  6

This would be an alternative way to define a group of words that share a meaning (at least for my purpose).
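
The merging step itself could look like this minimal pandas sketch (the DataFrame contents and the manually defined canonical mapping are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({'Words': ['mom', '2020', 'the', '19', 'mother'],
                   'Frequency': [250, 151, 124, 82, 81]})

# map each variant to a canonical word, then aggregate the counts
canonical = {'mom': 'mother'}
df['Words'] = df['Words'].replace(canonical)
df = (df.groupby('Words', as_index=False)['Frequency'].sum()
        .sort_values('Frequency', ascending=False, ignore_index=True))
print(df)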

Any suggestion would be appreciated.

Oswaldooswalt answered 2/9, 2020 at 12:44 Comment(11)
I guess you need to do some kind of text mining. The link below can help: towardsdatascience.com/…Halfbreed
Thanks Chandan. However, it doesn't seem accurate. For example, if I look for teaching, teacher would not be included as a synonym (or even as a similar word)Oswaldooswalt
Teacher is a noun, teaching is a verb. Instructor and teacher are synonyms, and teaching and lecturing can be considered synonyms. In any case, take a look here - bionlp-www.utu.fi/wv_demo for using word2vec similarity to find similar words. Another option is WordNet.Homans
Thank you Adnan. Yes, what you say makes sense and you are right. Sometimes lemmatisation does not work properly, so I was wondering whether it would be possible to create a new word that includes these terms. For me it would be like lemmatisation (not synonymy) but with more control (if the list is short)Oswaldooswalt
I retitled this "Merge related words in NLP" since that seems to be your intent. The actual merging action (on a dict or counter or CountVectorizer) is the trivial part; the hard part as you say is inferring which words are related, by looking up some knowledge-base/thesaurus/ using word2vec similarity etc.Hepsiba
Also tagged cluster-analysis word2vec wordnet, but you can edit/remove those if you like.Hepsiba
Following @Hepsiba, you need to define first what "a related word" means to you. Word2vec (or another word embedding) can give you "similar" words; however, their word-pair suggestions can be far from what you need.Lucero
Thanks for your comments and edit. Yes, I think the title matches my intent better. In terms of definition, I think word2vec could help me group these words together (but I would need to understand how to use it, so answers that demonstrate its effectiveness for this purpose would also be greatly appreciated). Otherwise, I have thought of doing it manually, i.e. defining different lists as a dictionary/vocabulary of words related to each other and therefore counted together when I compute their frequency. They are just ideas...Oswaldooswalt
I would get word embedding vectors first and then use some clustering algorithm like K-Means. In this case, you need to decide how many clusters you want to get.Jareb
@Val I posted something new to my answer.Milden
@Life is complex. Thanks a lot for your update. It is very interesting what you have done and built!Oswaldooswalt

UPDATE 10-21-2020

I decided to build a Python module to handle the tasks that I outlined in this answer. The module is called wordhoard and can be downloaded from PyPI.
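
A minimal usage sketch (this assumes the Synonyms class described in wordhoard's PyPI documentation; check the current docs, as the API may have changed):

from wordhoard import Synonyms

# query multiple online sources for synonyms of a word
synonym_results = Synonyms(search_string='mother').find_synonyms()
print(synonym_results)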


I have attempted to use Word2vec and WordNet in projects where I needed to determine the frequency of a keyword (e.g. healthcare) and the keyword's synonyms (e.g. wellness program, preventive medicine). I found that most NLP libraries didn't produce the results that I needed, so I decided to build my own dictionary with custom keywords and synonyms. This approach has worked for both analyzing and classifying text in multiple projects.

I'm sure that someone versed in NLP technology might have a more robust solution, but the one below is similar to ones that have worked for me time and time again.

I coded my answer to match the Words Frequency data in your question, but it can be modified to use any keyword and synonym dataset.

import string

# Python dictionary
# I manually created these word relationships - primary_word: synonyms
word_relationship = {"father": ["dad", "daddy", "old man", "pa", "pappy", "papa", "pop"],
                     "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers
input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
             'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
             'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
             'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
             'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
             'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# convert the input text to lowercase and split the words on whitespace
wordlist = input_text.lower().split()

# remove all punctuation from the wordlist
remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)
                      for s in wordlist]

# list for word frequencies
wordfreq = []

# count the frequency of each word
for w in remove_punctuation:
    wordfreq.append(remove_punctuation.count(w))

word_frequencies = dict(zip(remove_punctuation, wordfreq))

word_matches = []

# loop through the dictionaries
for word, frequency in word_frequencies.items():
    for keyword, synonyms in word_relationship.items():
        match = [x for x in synonyms if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # append the keyword (mother), synonym (mom) and frequency to a list
            word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keywords and their frequencies
synonym_matches = [(entry[0], entry[2]) for entry in word_matches]

# iterate synonym_matches and total the frequency count for each keyword
# (dict.get avoids a running-counter bug when keywords interleave)
for keyword, frequency in synonym_matches:
    final_results[keyword] = final_results.get(keyword, 0) + frequency

print(final_results)
# output
{'mother': 3, 'father': 2}

Other Methods

Below are some other methods and their out-of-the-box output.


NLTK WORDNET

In this example, I looked up the synonyms for the word 'mother'. Note that WordNet does not have the synonyms 'mom' or 'mum' linked to the word mother, even though both words appear in my sample text above. Also note that the word 'father' is listed as a synonym for 'mother'.

from nltk.corpus import wordnet

synonyms = []
word = 'mother'
for synset in wordnet.synsets(word):
    # only consider synsets that have more than one lemma
    if len(synset.lemma_names()) > 1:
        for lemma in synset.lemmas():
            synonyms.append(lemma.name())

print(synonyms)
['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']

PyDictionary

In this example, I looked up the synonyms for the word 'mother' using PyDictionary, which queries synonym.com. The synonyms in this example include the words 'mom' and 'mum', as well as additional synonyms that WordNet did not generate.

BUT, PyDictionary also produced a synonym list for 'mum' that has nothing to do with the word 'mother'. It seems that PyDictionary pulled this list from the adjective section of the page instead of the noun section. It's hard for a computer to distinguish between the adjective mum and the noun mum.

from PyDictionary import PyDictionary
dictionary_mother = PyDictionary('mother')

print(dictionary_mother.getSynonyms())
# output 
[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]

dictionary_mum = PyDictionary('mum')

print(dictionary_mum.getSynonyms())
# output 
[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]

Some other possible approaches are using the Oxford Dictionary API or querying thesaurus.com. Both of these methods also have pitfalls. For instance, the Oxford Dictionary API requires an API key and a paid subscription based on query volume, and thesaurus.com is missing potential synonyms that could be useful in grouping words.

https://www.thesaurus.com/browse/mother
synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator

UPDATE

Producing a precise synonym list for each potential word in your corpus is hard and will require a multi-pronged approach. The code below uses WordNet and PyDictionary to create a superset of synonyms. Like all the other answers, this combined method also leads to some over-counting of word frequencies. I've been trying to reduce this over-counting by combining key and value pairs within my final dictionary of synonyms. The latter problem is much harder than I anticipated and might require me to open my own question to solve. In the end, I think that based on your use case you need to determine which approach works best, and you will likely need to combine several approaches.

Thanks for posting this question, because it allowed me to look at other methods for solving a complex problem.

from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from PyDictionary import PyDictionary

input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
         that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of
         her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She
         has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the
         greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,
         This to me you have always been. Through the good times and the bad, Your understanding I have had."""


def normalize_textual_information(text):
    # split text into tokens by white space
    tokens = text.split()

    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [word.translate(table) for word in tokens]

    # remove any tokens that are not alphabetic
    tokens = [word.lower() for word in tokens if word.isalpha()]

    # filter out English stop words
    stop_words = set(stopwords.words('english'))

    # you could add additional stops like this
    stop_words.add('cannot')
    stop_words.add('could')
    stop_words.add('would')

    tokens = [word for word in tokens if word not in stop_words]

    # filter out any short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens


def generate_word_frequencies(words):
    # list to hold word frequencies
    word_frequencies = []

    # loop through the tokens and generate a word count for each token
    for word in words:
        word_frequencies.append(words.count(word))

    # aggregate the words and word_frequencies into tuples and convert them into a dictionary
    word_frequencies = dict(zip(words, word_frequencies))

    # sort the words by frequency, from low to high
    sorted_frequencies = {key: value for key, value in
                          sorted(word_frequencies.items(), key=lambda item: item[1])}

    return sorted_frequencies


def get_synonyms_internet(word):
    dictionary = PyDictionary(word)
    synonym = dictionary.getSynonyms()
    return synonym


words = normalize_textual_information(input_text)

all_synsets_1 = {}
for word in words:
    for synset in wordnet.synsets(word):
        if word != synset.name() and len(synset.lemma_names()) > 1:
            for item in synset.lemmas():
                if word != item.name():
                    all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

all_synsets_2 = {}
for word in words:
    word_synonyms = get_synonyms_internet(word)
    for synonym in word_synonyms:
        if word != synonym and synonym is not None:
            all_synsets_2.update(synonym)

word_relationship = {**all_synsets_1, **all_synsets_2}

frequencies = generate_word_frequencies(words)
word_matches = []
duplication_check = set()

for word, frequency in frequencies.items():
    for keyword, synonyms in word_relationship.items():
        match = [x for x in synonyms if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # only record a word/match pair once, to reduce over-counting
            if word not in duplication_check or match not in duplication_check:
                duplication_check.add(word)
                duplication_check.add(match)
                word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keywords and their frequencies
synonym_matches = [(entry[0], entry[2]) for entry in word_matches]

# iterate synonym_matches and total the frequency count for each keyword
for keyword, frequency in synonym_matches:
    final_results[keyword] = final_results.get(keyword, 0) + frequency

# do something with the final results
Milden answered 7/9, 2020 at 4:5 Comment(0)

This is a hard problem, and the best solution depends on the use case you are trying to solve. It is hard because combining words requires understanding their semantics. You can combine mom and mother together because they are semantically related.

One way to identify whether two words are semantically related is by relying on distributed word embeddings (vectors) like word2vec, GloVe, fastText, etc. You can compute the cosine similarity between the vectors of all the words with respect to a given word, pick, say, the top 5 closest words, and create new words from them.

Example using word2vec

# Load a pretrained word2vec model
import gensim.downloader as api
model = api.load('word2vec-google-news-300')

# words of interest taken from the corpus (define your own list here)
words = ['mom', 'mother', 'london', 'life', 'teach', 'teacher']

vectors = [model.get_vector(w) for w in words]
for i, w in enumerate(words):
    # rank all words by cosine similarity to the current word (index 0 is the word itself)
    ranked = model.cosine_similarities(vectors[i], vectors).argsort()[::-1]
    first_best_match = ranked[1]
    second_best_match = ranked[2]

    print(f"{words[i]} + {words[first_best_match]}")
    print(f"{words[i]} + {words[second_best_match]}")

Output:

mom + mother
mom + teacher
mother + mom
mother + teacher
london + mom
london + life
life + mother
life + mom
teach + teacher
teach + mom
teacher + teach
teacher + mother

You can try putting a threshold on the cosine similarity and only selecting the pairs whose cosine similarity is greater than that threshold.
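
For example, a minimal sketch of such a filter, reusing the model, words and vectors from above (the threshold value 0.55 is an arbitrary assumption to tune for your data):

threshold = 0.55
for i, w in enumerate(words):
    sims = model.cosine_similarities(vectors[i], vectors)
    for j, score in enumerate(sims):
        # skip the word itself and keep only sufficiently similar pairs
        if i != j and score > threshold:
            print(f"{w} + {words[j]} ({score:.2f})")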

One problem with semantic similarity is that semantically opposite words can still have similar vectors (man and woman), while on the other hand a pair like man and king is similar because the two words genuinely share semantics.

Emilyemina answered 7/9, 2020 at 14:14 Comment(2)
Hi mujjiga, may I ask what I should pass in to be analysed: the full text or the words?Procora
Words. Find the frequently used words which are not stop words, compute the cosine similarity between all of them, and check the top matches to see if they make any sense.Emilyemina

What you are trying to achieve is semantic textual similarity.

I want to recommend the TensorFlow Universal Sentence Encoder.

For example:

# Load the Universal Sentence Encoder's TF Hub module
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
# alternative: "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(input):
  return model(input)

def plot_similarity(labels, features, rotation):
  corr = np.inner(features, features)
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, 90)

messages = [
    "Mother",
    "Mom",
    "Mama",
    "Dog",
    "Cat"
]

run_and_plot(messages)

[figure: semantic textual similarity heatmap for the five messages]

The example is written in Python, but I also created an example of loading the model in JVM-based languages:

https://github.com/ntedgi/universal-sentence-encoder

Retha answered 7/9, 2020 at 15:19 Comment(2)
Hi Naor, I am getting this error: ValueError: Must pass 2-d inputProcora
colab.research.google.com/github/tensorflow/hub/blob/master/…. Please run it from this colab.Retha

One other wacky way to address this is to use the good old PyDictionary lib. You can use the

dictionary.getSynonyms()

function to loop through all the words in your list and group them. All available synonyms listed will be covered and mapped to one group, thereby allowing you to assign the final variable and sum up the synonyms. In your example, you would choose the final word as Mother, which displays the final count of its synonyms.
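
A minimal sketch of that idea, reusing the PyDictionary usage shown in the accepted answer (the counts dict and the choice of 'mother' as the final word are assumptions):

from PyDictionary import PyDictionary

counts = {'mom': 250, 'mother': 81, 'mum': 6}  # hypothetical frequencies

# fetch the synonym list for the chosen final word
synonyms = PyDictionary('mother').getSynonyms()[0]['mother']

# sum the counts of the final word and any of its listed synonyms
total = sum(freq for word, freq in counts.items()
            if word == 'mother' or word in synonyms)
print({'mother': total})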

Demilune answered 7/9, 2020 at 6:58 Comment(0)

You can generate word embedding vectors and use a clustering algorithm on them. In the end, you need to tune the algorithm's hyperparameters to achieve high accuracy.

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

import spacy

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load the large English model
nlp = spacy.load("en_core_web_lg")

tokens = nlp("dog cat banana apple teaching teacher mom mother mama mommy berlin paris")

# Generate word embedding vectors
vectors = np.array([token.vector for token in tokens])
vectors.shape
# (12, 300)

Let's use the Principal Component Analysis algorithm to visualize our embeddings in 3-dimensional space:

pca_vecs = PCA(n_components=3).fit_transform(vectors)
pca_vecs.shape
# (12, 3)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
xs, ys, zs = pca_vecs[:, 0], pca_vecs[:, 1], pca_vecs[:, 2]
_ = ax.scatter(xs, ys, zs)

for x, y, z, label in zip(xs, ys, zs, tokens):
    ax.text(x + 0.3, y, z, str(label))

[figure: 3-D scatter plot of the PCA-reduced word vectors, annotated with their tokens]

Let's use the DBSCAN algorithm to cluster the words:

model = DBSCAN(eps=5, min_samples=1)
model.fit(vectors)

for word, cluster in zip(tokens, model.labels_):
    print(word, '->', cluster)

Output:

dog -> 0
cat -> 0
banana -> 1
apple -> 2
teaching -> 3
teacher -> 3
mom -> 4
mother -> 4
mama -> 4
mommy -> 4
berlin -> 5
paris -> 6
Jareb answered 10/9, 2020 at 9:3 Comment(1)
I found your approach to my question very interesting. I have tried to apply it to sentences, but without much success, so I opened a question and started a bounty. In case you want to have a look at it: #63780375Oswaldooswalt

matthewreagan/WebstersEnglishDictionary

The idea is to use this dictionary to identify similar words.

In short: run some knowledge-discovery algorithm that extracts relationships according to English grammar.

Here is a thesaurus: it's 18 MB.

HERE is an excerpt from the thesaurus; you may try to match the word alternates via some algorithm (see the sketch after the excerpt).

{"word": "ma", "key": "ma_1", "pos": "noun", "synonyms": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

For a quick fix using an external API, here is the link: it lets you do much more, like getting synonyms, finding multiple definitions, finding rhyming words, and so on.

WORDAPI
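
A quick sketch of querying such an API with requests (the endpoint and headers follow WordsAPI's RapidAPI documentation as I recall it; treat them as assumptions and check the current docs):

import requests

# hypothetical request; replace YOUR_KEY with a real RapidAPI key
url = "https://wordsapiv1.p.rapidapi.com/words/mother/synonyms"
headers = {"X-RapidAPI-Key": "YOUR_KEY",
           "X-RapidAPI-Host": "wordsapiv1.p.rapidapi.com"}

response = requests.get(url, headers=headers)
print(response.json())  # expected shape: {"word": "mother", "synonyms": [...]}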

Luscious answered 5/9, 2020 at 19:4 Comment(3)
Hi nikhli, thank you for your answer. I would like to understand more about what you suggest in the last step, "HERE is an excerpt from the thesaurus". Are you defining a class, saying that if I find one of those words I can consider all the terms included in it as the same word, which is a noun?Oswaldooswalt
Don't think this "thesaurus" is going to help much - the entry for "mom" doesn't list "mother": {"word": "mom", "key": "mom_1", "pos": "noun", "synonyms": ["mamma", "momma", "mama", "mammy", "ma", "mumm", "mommy", "mum"]} - and "mother" doesn't reference "mom". And of course there is also an entry for "mother" as a verb. Very non-trivial problem.Leprous
Match single words exactly and filter out collocations, i.e. avoid collocations of words like "blink of an eye" = "quick". It's indeed very non-trivial, but 1-word-to-1-word grouping would be great. I don't know much about ML, but there is something called cosine similarity which identifies how close 2 words are, e.g. with K-means. You can also use WordsAPI for a quick fix, which I added to my answer.Luscious
