UPDATE 10-21-2020
I decided to build a Python module to handle the tasks that I outlined in this answer. The module is called wordhoard and can be downloaded from PyPI.
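For reference, here is a minimal usage sketch based on the wordhoard documentation (the Synonyms class and find_synonyms method are what the docs describe; names may differ across versions):

from wordhoard import Synonyms

# query the module's bundled sources for synonyms of a single word
synonym_results = Synonyms(search_string='mother').find_synonyms()
print(synonym_results)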
I have attempted to use Word2vec and WordNet in projects where I needed to determine the frequency of a keyword (e.g., healthcare) and the keyword's synonyms (e.g., wellness program, preventive medicine). I found that most NLP libraries didn't produce the results that I needed, so I decided to build my own dictionary with custom keywords and synonyms. This approach has worked for both analyzing and classifying text in multiple projects.
I'm sure that someone who is versed in NLP technology might have a more robust solution, but the one below is similar to ones that have worked for me time and time again.
I coded my answer to match the Words Frequency data in your question, but it can be modified to use any keyword and synonym dataset.
import string

# Python dictionary
# I manually created these word relationships - primary_word: synonyms
word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],
                     "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers
input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
             'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
             'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
             'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
             'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
             'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# convert the input text to lowercase and split the words on whitespace
wordlist = input_text.lower().split()

# remove all punctuation from the wordlist
remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)
                      for s in wordlist]

# list for word frequencies
wordfreq = []

# count the frequency of each word
for w in remove_punctuation:
    wordfreq.append(remove_punctuation.count(w))

word_frequencies = dict(zip(remove_punctuation, wordfreq))

word_matches = []

# loop through the word frequencies and the keyword/synonym dictionary
for word, frequency in word_frequencies.items():
    for keyword, synonyms in word_relationship.items():
        match = [x for x in synonyms if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # append the keyword (mother), synonym (mom) and frequency to a list
            word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# total the frequency counts for each primary keyword
for keyword, frequency in synonym_matches:
    final_results[keyword] = final_results.get(keyword, 0) + frequency

print(final_results)

# output
{'mother': 3, 'father': 2}
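As an aside, the same counting can be done more compactly with collections.Counter and a reverse synonym-to-keyword lookup. A minimal sketch of that idea, reusing word_relationship and input_text from the code above (the reverse_lookup name is mine, not part of the code above):

from collections import Counter
import string

# build a reverse lookup that maps each synonym to its primary keyword
reverse_lookup = {synonym: keyword
                  for keyword, synonyms in word_relationship.items()
                  for synonym in synonyms}

# strip leading/trailing punctuation from each lowercased token
tokens = [token.strip(string.punctuation) for token in input_text.lower().split()]

# map each token to its primary keyword before counting
counts = Counter(reverse_lookup.get(token, token) for token in tokens
                 if token in reverse_lookup or token in word_relationship)
print(dict(counts))  # {'mother': 3, 'father': 2}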
Other Methods
Below are some other methods and their out-of-the-box output.
NLTK WORDNET
In this example, I looked up the synonyms for the word 'mother.' Note that WordNet does not have the synonyms 'mom' or 'mum' linked to the word 'mother,' yet both of these words appear in my sample text above. Also note that the word 'father' is listed as a synonym for 'mother.'
from nltk.corpus import wordnet

synonyms = []
word = 'mother'

for synonym in wordnet.synsets(word):
    for item in synonym.lemmas():
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            synonyms.append(item.name())

print(synonyms)

# output
['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']
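The verb lemmas (e.g., 'father' from the sense 'to beget') can be filtered out by restricting the lookup to noun synsets; a minimal sketch using WordNet's pos parameter:

from nltk.corpus import wordnet

word = 'mother'
noun_synonyms = set()

# limit the search to noun senses so verb lemmas such as 'father' are excluded
for synset in wordnet.synsets(word, pos=wordnet.NOUN):
    for lemma in synset.lemmas():
        noun_synonyms.add(lemma.name().lower())

print(noun_synonyms)

This still won't surface 'mom' or 'mum,' since WordNet simply doesn't link them to 'mother,' but it removes the cross-part-of-speech noise.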
PyDictionary
In this example, I looked up the synonyms for the word 'mother' using PyDictionary, which queries synonym.com. The synonyms in this example include the words 'mom' and 'mum.' This example also includes additional synonyms that WordNet did not generate.
BUT, PyDictionary also produced a synonym list for 'mum,' which has nothing to do with the word 'mother.' It seems that PyDictionary pulled this list from the adjective section of the page instead of the noun section, and it's hard for a computer to distinguish between the adjective 'mum' and the noun 'mum.'
from PyDictionary import PyDictionary
dictionary_mother = PyDictionary('mother')
print(dictionary_mother.getSynonyms())
# output
[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]
dictionary_mum = PyDictionary('mum')
print(dictionary_mum.getSynonyms())
# output
[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]
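One partial mitigation is to part-of-speech tag each token in context before deciding whether a synonym list applies; a rough sketch using NLTK's pos_tag (this only disambiguates the corpus side, not the thesaurus side):

from nltk import pos_tag, word_tokenize

# requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages
sentence = 'When I think about my mum, I just cannot help but smile.'
tagged = pos_tag(word_tokenize(sentence))

# keep tokens only when they are tagged as nouns (NN, NNS, NNP, NNPS)
noun_tokens = [word for word, tag in tagged if tag.startswith('NN')]
print(noun_tokens)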
Some of the other possible approaches are using the Oxford Dictionary API or querying thesaurus.com. Both of these methods also have pitfalls. For instance, the Oxford Dictionary API requires an API key and a paid subscription based on query volume, and thesaurus.com is missing potential synonyms that could be useful in grouping words.
https://www.thesaurus.com/browse/mother
synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator
UPDATE
Producing precise synonym lists for each potential word in your corpus is hard and will require a multi-pronged approach. The code below uses WordNet and PyDictionary to create a superset of synonyms. Like all the other answers, this combined method also leads to some over-counting of word frequencies. I've been trying to reduce this over-counting by merging key and value pairs within my final dictionary of synonyms (see the sketch after the code below). The latter problem is much harder than I anticipated and might require me to open my own question to solve. In the end, I think that, based on your use case, you need to determine which approach works best; you will likely need to combine several approaches.
Thanks for posting this question, because it allowed me to look at other methods for solving a complex problem.
from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from PyDictionary import PyDictionary
input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of
her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She
has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the
greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,
This to me you have always been. Through the good times and the bad, Your understanding I have had."""
def normalize_textual_information(text):
    # split text into tokens by white space
    token = text.split()

    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    token = [word.translate(table) for word in token]

    # remove any tokens that are not alphabetic
    token = [word.lower() for word in token if word.isalpha()]

    # filter out English stop words
    stop_words = set(stopwords.words('english'))

    # you could add additional stops like this
    stop_words.add('cannot')
    stop_words.add('could')
    stop_words.add('would')
    token = [word for word in token if word not in stop_words]

    # filter out any short tokens
    token = [word for word in token if len(word) > 1]
    return token
def generate_word_frequencies(words):
    # list to hold word frequencies
    word_frequencies = []

    # loop through the tokens and generate a word count for each token
    for word in words:
        word_frequencies.append(words.count(word))

    # aggregate the words and word_frequencies into tuples and convert them into a dictionary
    word_frequencies = dict(zip(words, word_frequencies))

    # sort the words by frequency, from low to high
    sorted_frequencies = {key: value for key, value in
                          sorted(word_frequencies.items(), key=lambda item: item[1])}
    return sorted_frequencies
def get_synonyms_internet(word):
    dictionary = PyDictionary(word)
    synonym = dictionary.getSynonyms()
    return synonym
words = normalize_textual_information(input_text)

all_synsets_1 = {}
for word in words:
    for synonym in wordnet.synsets(word):
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            for item in synonym.lemmas():
                if word != item.name():
                    all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

all_synsets_2 = {}
for word in words:
    # PyDictionary returns a list of {word: [synonyms]} dictionaries, or None when the lookup fails
    word_synonyms = get_synonyms_internet(word)
    if word_synonyms:
        for synonym in word_synonyms:
            if synonym is not None:
                all_synsets_2.update(synonym)

word_relationship = {**all_synsets_1, **all_synsets_2}
frequencies = generate_word_frequencies(words)
word_matches = []
duplication_check = set()

for word, frequency in frequencies.items():
    for keyword, synonyms in word_relationship.items():
        match = [x for x in synonyms if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # only count each corpus word once, even if it matches multiple keywords
            if word not in duplication_check:
                duplication_check.add(word)
                word_matches.append([keyword, match, frequency])
# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# total the frequency counts for each primary keyword
for keyword, frequency in synonym_matches:
    final_results[keyword] = final_results.get(keyword, 0) + frequency

# do something with the final results
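To chip away at the over-counting mentioned above, one option is to merge synonym entries whose value lists overlap before the matching step. A rough sketch (the merge_overlapping helper is hypothetical, not part of the code above):

def merge_overlapping(relationships):
    """Merge dictionary entries whose synonym lists share any word."""
    merged = {}
    for keyword, synonyms in relationships.items():
        synonyms = set(synonyms)
        # look for an existing group that overlaps with this entry
        target = None
        for existing, group in merged.items():
            if keyword in group or synonyms & group:
                target = existing
                break
        if target is None:
            merged[keyword] = synonyms | {keyword}
        else:
            merged[target] |= synonyms | {keyword}
    return merged

# collapse overlapping entries before building word_relationship
combined_relationships = merge_overlapping({**all_synsets_1, **all_synsets_2})

Note that a single pass like this can miss transitive overlaps (A overlaps B, and B overlaps a later C), so a complete solution would need repeated passes or a union-find structure, which is part of why I found this problem harder than anticipated.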