How to use vector representation of words (as obtained from Word2Vec,etc) as features for a classifier?
I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus, which becomes the length of our feature vector. For each sentence/document, we then put a 0 or 1 in each position depending on the absence or presence of the corresponding vocabulary word in that sentence/document.
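
For concreteness, here is a minimal sketch of that binary BOW scheme (the toy corpus is illustrative only):

# Minimal sketch of the binary BOW scheme described above.
corpus = ["the cat sat", "the dog barked"]

# Global vocabulary: one feature slot per distinct word in the corpus.
vocab = sorted({word for doc in corpus for word in doc.split()})

def bow_vector(doc, vocab):
    """Return a 0/1 vector of length |V| marking word presence."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

for doc in corpus:
    print(doc, "->", bow_vector(doc, vocab))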

However, now that I am trying to use vector representation of each word, is creating a global vocabulary essential?

Shivers answered 26/10, 2014 at 3:45 Comment(2)
What is "a global vocabulary"? – Douzepers
I need a fixed-length feature vector for each sentence, although the number of words in each sentence is different. So I need to count the vocabulary size of my entire corpus and keep the feature vector length equal to the vocabulary size. This is what I mean by global vocabulary. Sorry for the confusion; I was not clear enough with my words. – Shivers

Suppose the size of the word vectors is N (usually between 50 and 500). The naive way of generalizing traditional BOW is to replace each 0 bit (in BOW) with N zeros, and each 1 bit (in BOW) with the real-valued vector for that word (say, from Word2Vec). The size of the feature vector is then N * |V| (compared to |V| features in BOW, where |V| is the size of the vocabulary). This simple generalization should work fine for a decent number of training instances.
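
A minimal sketch of this naive generalization, assuming a dict w2v mapping words to N-dimensional numpy vectors (the toy vocabulary and random vectors are illustrative only):

import numpy as np

# Each of the |V| BOW slots becomes an N-dim block: the word's vector
# if the word occurs in the document, N zeros otherwise.
N = 4
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "sat", "the"]
w2v = {w: rng.normal(size=N) for w in vocab}  # stand-in for Word2Vec

def naive_features(doc, vocab, w2v, n):
    words = set(doc.split())
    blocks = [w2v[w] if w in words else np.zeros(n) for w in vocab]
    return np.concatenate(blocks)  # length N * |V|

print(naive_features("the cat sat", vocab, w2v, N).shape)  # (16,)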

To make the feature vectors smaller, people use various techniques such as recursively combining vectors with various operations (see Recursive/Recurrent Neural Networks and similar approaches, for example: http://web.engr.illinois.edu/~khashab2/files/2013_RNN.pdf or http://papers.nips.cc/paper/4204-dynamic-pooling-and-unfolding-recursive-autoencoders-for-paraphrase-detection.pdf).
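
As a rough illustration of the recursive-combination idea (a toy sketch, not the exact models from the linked papers), adjacent word vectors can be merged pairwise with a shared weight matrix until a single fixed-size vector remains:

import numpy as np

N = 4
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(N, 2 * N))  # learned in a real model

def compose(v1, v2):
    # merge two N-dim vectors into one N-dim vector
    return np.tanh(W @ np.concatenate([v1, v2]))

def sentence_vector(word_vecs):
    vecs = list(word_vecs)
    while len(vecs) > 1:
        # greedily merge the first pair; real models pick pairs via a
        # parse tree or a reconstruction-error criterion
        vecs = [compose(vecs[0], vecs[1])] + vecs[2:]
    return vecs[0]

words = [rng.normal(size=N) for _ in range(5)]
print(sentence_vector(words).shape)  # (4,) regardless of sentence length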

Douzepers answered 5/11, 2014 at 8:35 Comment(1)
I don't understand why the scheme in your first paragraph is any better than plain bag of words. Anything a classifier can learn from this representation (with N*|V| features), it ought to be able to learn from a BOW representation (with |V| features). For instance, consider logistic regression; any model on this representation is equivalent to a corresponding model on a BOW representation. So this seems pointless. Am I missing something? – Autocephalous

To get a fixed-length feature vector for each sentence, even though the number of words per sentence differs, do as follows:

  1. tokenize each sentence into its constituent words
  2. for each word, look up its word vector (if the word is not in the vocabulary, ignore it)
  3. average all the word vectors you got
  4. this always gives you a d-dimensional vector (d is the word vector dimensionality)

Below is the code snippet:

import numpy as np

def getWordVecs(words, w2v_dict):
    """Average the word vectors of words; out-of-vocabulary tokens are skipped."""
    vecs = []
    for word in words:
        word = word.replace('\n', '')
        try:
            # 300 is the dimensionality of the word vectors used here
            vecs.append(w2v_dict[word].reshape((1, 300)))
        except KeyError:
            # word not in the embedding vocabulary: ignore it
            continue
    vecs = np.concatenate(vecs)
    vecs = np.array(vecs, dtype='float')
    # averaging (step 3) keeps the result independent of sentence length
    final_vec = np.mean(vecs, axis=0)
    return final_vec

words is a list of tokens obtained by tokenizing a sentence.
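
Hypothetical usage, with a toy embedding dict standing in for a real Word2Vec lookup (assumes the function above and numpy):

rng = np.random.default_rng(2)
toy_w2v = {w: rng.normal(size=300) for w in ["the", "cat", "sat"]}

tokens = "the cat sat on the mat".split()  # "on"/"mat" are OOV and skipped
features = getWordVecs(tokens, toy_w2v)
print(features.shape)  # (300,) for every sentence, whatever its length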

Embracery answered 6/3, 2018 at 8:33 Comment(0)