Using word2vec to classify words in categories

BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john','jay','dan','nathan','bob']  -> 'Names'
['yellow', 'red','green'] -> 'Colors'
['tokyo','bejing','washington','mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple", then I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.

APPROACH

I did some research and came across word2vec. This library has "similarity" and "most_similar" functions which I can use. So one brute-force approach I thought of is the following:

  1. Take the new input.
  2. Calculate its similarity with each word in each vector and take the average.

So, for instance, for the input "pink" I can calculate its similarity with the words in the "Names" vector and take the average, and then do the same for the other two vectors. The vector that gives me the highest average similarity would be the correct vector for the input to belong to.
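
For concreteness, here is a minimal sketch of that brute-force averaging idea using gensim's pre-trained vectors (the model name "glove-wiki-gigaword-100" and the helper function are illustrative choices of mine, not something fixed by the problem; note 'beijing' is spelled correctly here so it is found in the vocabulary):

import gensim.downloader as api

# Pre-trained vectors; any gensim-downloadable embedding model would do here.
model = api.load('glove-wiki-gigaword-100')

data = {
  'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
  'Colors': ['yellow', 'red', 'green'],
  'Places': ['tokyo', 'beijing', 'washington', 'mumbai'],
}

def predict_category(word):
  # Average the similarity of `word` to every member of each category,
  # then pick the category with the highest average.
  averages = {
    category: sum(model.similarity(word, w) for w in words) / len(words)
    for category, words in data.items()
  }
  return max(averages, key=averages.get)

print(predict_category('pink'))     # expected: Colors
print(predict_category('calgary'))  # expected: Places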

ISSUE

Given my limited knowledge of NLP and machine learning, I am not sure this is the best approach, so I am looking for help and suggestions on better ways to solve my problem. I am open to all suggestions, and please also point out any mistakes I may have made, as I am new to the machine learning and NLP world.

Acetamide answered 6/12, 2017 at 4:16 Comment(2)
Use spaCy's NER; you can also train the spaCy model with your own data. — Petuu
@AyodhyankitPaul I will google that right now! Thanks for the feedback; if possible I would love it if you could provide a small demo. — Acetamide
ANSWER (score: 32)

If you're looking for the simplest / fastest solution, then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

Here's my solution below:

import numpy as np

# Category -> words
data = {
  'Names': ['john','jay','dan','nathan','bob'],
  'Colors': ['yellow', 'red','green'],
  'Places': ['tokyo','bejing','washington','mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}

# Processing the query
def process(query):
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
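    # Dot-product similarity with this category word; dividing by the category
    # size means each category's final score is an average over its words.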
    dist = query_embed.dot(embed)
    dist /= len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))

In order to run it, you'll have to download and unpack the pre-trained GloVe data from nlp.stanford.edu/projects/glove (careful, ~800 MB!). Upon running, it should produce something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}

... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf score. Remember that the model size only depends on the data you have and the words you might want to be able to query.
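
As a rough illustration of that trimming idea, here is a sketch that writes a smaller embeddings file containing only a chosen vocabulary (the keep_words set and the output file name are placeholders of mine; in practice they would come from your data or a tf-idf filter):

# Placeholder vocabulary; reuses `categories` from the code above plus a few test queries.
keep_words = set(categories) | {'pink', 'frank', 'moscow'}

with open('glove.6B.100d.txt', encoding='utf-8') as src, open('glove.small.100d.txt', 'w', encoding='utf-8') as dst:
  for line in src:
    # Keep only the lines whose first token (the word) is in our vocabulary.
    if line.split(' ', 1)[0] in keep_words:
      dst.write(line)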

Ative answered 11/12, 2017 at 17:54 Comment(12)
This is very interesting. So obviously the word embeddings have already been created. When I tried print(process('kobe')) it classified 'kobe' as a place, even though 'kobe' is a name; however, when I added 'kobe' to the data dictionary under Names, it classified it as a name. I am trying to understand what is happening under the hood. It gave the highest score to Names (9.38), but the score for the Places category was pretty close (9.08). — Acetamide
Some terms are naturally on the border. Remember that embeddings are learned from texts. E.g., 'paris' is frequently used both as a city and as a name (Paris Hilton). Same for 'kobe': I know only one usage as a name, though a very popular one, but it's also a place in Japan - en.wikipedia.org/wiki/Kobe. This is a common problem in classification. For general understanding, see this answer - https://mcmap.net/q/666469/-why-are-word-embedding-actually-vectors - and the further links it refers to. — Ative
Also, I found that when I did print(process('a2')) I got negative scores for all 3 categories. Then I added a new category called "Id" with values like a1, a2, b1, b2. Then I did print(process('a2')) and print(process('c2')) and got a high score for the "Id" category in both cases. So is the code above learning the meaning of new values under the hood? Since I added a new category called "Id", it is somehow able to figure out that values like a1, b2, c3 are closely related. — Acetamide
Also, would it make a difference if a value like "kobe" occurs several times in the Names category? Does this code take frequency of occurrence into consideration? — Acetamide
1) Of course it would, but you'd have to change the Python dict to a list of tuples. It would be simpler to keep a separate index of coefficients per word, if you want to go this way. 2) Negative scores are absolutely possible, no problem here. 3) This solution uses an already trained model. If you want to train it yourself, it's totally possible, but bear in mind that the training data must be very large to make a difference - something comparable to the size of Wikipedia. — Ative
I am confused: for a word like 'paris' it makes sense that there is a word embedding, but terms like "a1" and "a2" are not English words, so I assumed there are no embeddings for them. So how exactly is it able to classify them into one category? I see the results I want, but I want to understand how it is happening. — Acetamide
It knows a lot of words because it was trained on an enormous text corpus, which, apparently, contains something about a1, a2, ... Describing GloVe in detail would need a lot of space; you can start here: nlp.stanford.edu/projects/glove — Ative
I think it's a nice & elegant solution (+1); terms on the border, such as 'kobe' (which I also knew as a place, not a name), can be addressed with additional post-processing rules (e.g., when the difference between the two highest scores is below a threshold, return both, etc.). — Etruscan
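
A tiny sketch of that post-processing rule on top of process() from the answer above (the 15% margin is an arbitrary placeholder, not a value suggested in the thread):

def predict(query, margin=0.15):
  scores = process(query)
  ranked = sorted(scores, key=scores.get, reverse=True)
  best, runner_up = ranked[0], ranked[1]
  # If the top two categories are too close, return both instead of guessing.
  if scores[best] - scores[runner_up] < margin * abs(scores[best]):
    return [best, runner_up]
  return [best]

print(predict('kobe'))  # likely returns both 'Names' and 'Places'
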
@Ative This looks good; I tested it out. Just wondering: what if I had a category with bigrams or trigrams? Let's say I have a bunch of addresses ('10 hacker road', '123 washington street'), etc. Would it still be possible to use this approach? — Acetamide
@Acetamide If I understand your question right, it's still possible, but it will require a bit more work. — Ative
@Ative I would love it if you could add that approach to your answer or describe it. — Acetamide
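
That extra work is not spelled out in the thread; one common way to handle multi-word strings (my assumption, not the answerer's stated approach) is to embed a phrase as the average of its word vectors and reuse the same scoring, building on the dictionaries from the answer above:

import numpy as np

def phrase_embedding(phrase):
  # Mean of the available word vectors; unknown words are simply skipped.
  vectors = [embeddings_index[w] for w in phrase.lower().split() if w in embeddings_index]
  return np.mean(vectors, axis=0) if vectors else None

def process_phrase(phrase):
  # Same scoring as process(), but on the averaged phrase vector.
  query_embed = phrase_embedding(phrase)
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    dist = query_embed.dot(embed) / len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores

print(process_phrase('10 hacker road'))
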
Maybe a little late to ask, but I need to know the meaning of the scores. I'm using this in a project and I don't know how to explain it beyond "the more points, the more accurate". — Owain
ANSWER (score: 1)

Also, for what it's worth, PyTorch (via torchtext) has a good and faster implementation of GloVe these days.
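
For reference, here is a small sketch of loading GloVe through torchtext (the text library in the PyTorch ecosystem); it assumes torchtext is installed and uses the same 6B / 100-dimensional vectors as the answer above:

import torch
from torchtext.vocab import GloVe

# Downloads and caches the vectors on first use.
glove = GloVe(name='6B', dim=100)

pink, yellow = glove['pink'], glove['yellow']   # 100-dimensional torch tensors
print(torch.cosine_similarity(pink.unsqueeze(0), yellow.unsqueeze(0)).item())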

Scorify answered 14/5, 2018 at 16:43 Comment(0)
