How to do text classification using word2vec

I want to perform text classification using word2vec. I have already obtained vectors for the words:

import numpy as np
from gensim.models import Word2Vec

# split the raw text ('lines' holds the input string) into sentences, then into word tokens
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())

model = Word2Vec(ls, min_count=1, size=4)  # in gensim >= 4.0 'size' is called 'vector_size'
words = list(model.wv.vocab)  # in gensim >= 4.0: list(model.wv.key_to_index)
print(words)

# collect the learned vector for each word
vectors = []
for word in words:
    vectors.append(model.wv[word].tolist())  # index through model.wv, not model directly
data = np.array(vectors)
data

output:

array([[ 0.00933912,  0.07960335, -0.04559333,  0.10600036],
       [ 0.10576613,  0.07267512, -0.10718666, -0.00804013],
       [ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
       [-0.09893986,  0.01500741, -0.04796079, -0.04447284],
       [ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
       [ 0.09352681, -0.03864434, -0.01743148,  0.11251986], ...])

How can I perform classification (product vs. non-product)?

Vandyke answered 4/4, 2018 at 6:10 Comment(0)

You already have the array of word vectors in model.wv.syn0 (renamed to model.wv.vectors in gensim >= 4.0). If you print it, you will see one row per vocabulary word, each row being that word's vector.
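
A minimal, version-tolerant way to grab that matrix (the attribute lookup is the only assumption here: older gensim exposes syn0, gensim >= 4.0 exposes vectors):

import numpy as np

# gensim < 4.0 exposes the matrix as model.wv.syn0,
# gensim >= 4.0 renamed it to model.wv.vectors
word_matrix = getattr(model.wv, "vectors", None)
if word_matrix is None:
    word_matrix = model.wv.syn0
print(np.asarray(word_matrix).shape)  # (vocabulary_size, vector_size)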

Here is a complete example using Python 3:

import pandas as pd
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression


# Read a csv file with text data and lower-case everything
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())

train = []
# use only the first 4 columns of the file; each cell is appended as one text string
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)

# Create a list of token lists using nltk (one per text string)
tokens = [nl.word_tokenize(sentences) for sentences in train]

Now it's time to use the vector model; in this example we will fit a LogisticRegression classifier.

# Pick ONE of the two methods below.

# method 1 - pass tokens to the Word2Vec constructor itself, so no separate train() call is needed
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

# method 2 - create an empty Word2Vec model, then build the vocabulary and train explicitly
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....

# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)

Y_dataset = []
# take the last character of each line; in this file it is the department number,
# which serves as the class label (0, 1, or another numeric class)
# (to use word labels you would need to extract them differently; this works for numbers)
with open("dbSubset.csv", "r") as f:
    for line in f:
        lastchar = line.strip()[-1]
        if lastchar.isdigit():
            Y_dataset.append(int(lastchar))
        else:
            Y_dataset.append(40)  # fallback label for lines that do not end in a digit


clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
clf.fit(model.wv.syn0, Y_dataset[:max_dataset_size])

# Predict the classes of the first 15 word vectors
predict = clf.predict(model.wv.syn0[:15, :])
# Score the predictions on the full (training) data
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)

You can also calculate the similarity between words in your model's vocabulary:

print("\n\nSimilarity value : ",model.wv.similarity('women','men'))

You can find more functions to use in the gensim word2vec documentation.
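
For instance, most_similar lists the vocabulary words closest to a given word by cosine similarity (assuming 'women' is in your vocabulary, as in the similarity call above):

# top-5 nearest neighbours of 'women'
print(model.wv.most_similar('women', topn=5))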

Pigeonhearted answered 9/10, 2018 at 16:28 Comment(6)
In the first line you have created the Word2Vec model. Why do you need to train the model on the tokens (4th line, model.train(tokens, total_examples=len(tokens), epochs=4000))? - Breviary
Doesn't train.extend(sentences) create a list of characters rather than a list of tokens? Shouldn't it be train.append()? - Multifoil
@Joel and Krishna, are you sure the above code works? When I try to run it, it shows the error AttributeError: 'KeyedVectors' object has no attribute 'syn0'. I think the issue is here: model.wv.syn0 - Omer
@Omer At the time I wrote the code it was working; maybe library version changes are the issue when you run it. You have the code base, it is just a matter of updating some parts to have it running smoothly :) I wish I could help you more, but I am currently on vacation and the answer is from 2018, so I cannot remember the details :/ - Pigeonhearted
Please share the versions of the libraries; I will downgrade mine and try again. Thank you. I would be very happy if I could run your code or mine: #68494594 - Omer
Joel, would you please share your dataset! - Omer

Your question is rather broad, but I will try to give you a first approach to classifying text documents.

First of all, I would decide how I want to represent each document as one vector. You need a method that takes a list of word vectors and returns one single vector, and you want to avoid letting the length of the document influence what this vector represents. You could, for example, choose the mean:

def document_vector(array_of_word_vectors):
    # average over the word axis: one fixed-length vector per document
    return array_of_word_vectors.mean(axis=0)

where array_of_word_vectors is, for example, data in your code.
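
A minimal usage sketch, assuming a trained gensim model as in your code (the example sentence is made up; words missing from the vocabulary are skipped):

import numpy as np

doc = "this product works well".split()
# keep only words the model knows ('in model.wv' works across gensim versions)
vecs = np.array([model.wv[w] for w in doc if w in model.wv])
doc_vec = document_vector(vecs)
print(doc_vec.shape)  # (vector_size,)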

Now you can either play around with distances (cosine distance, for example, would be a nice first choice) and see how far certain documents are from each other, or, and that's probably the approach that brings faster results, you can use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example Logistic Regression.

The document vectors will become your matrix X, and your vector y is an array of 1s and 0s, depending on the binary category that you want the documents to be classified into, as in the sketch below.
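
Putting it together, a minimal sketch, where documents (a list of token lists) and labels (1 = product, 0 = non-product) are placeholders you would supply, and every document is assumed to contain at least one in-vocabulary word:

import numpy as np
from sklearn.linear_model import LogisticRegression

# one mean vector per document; out-of-vocabulary words are skipped
X = np.array([document_vector(np.array([model.wv[w] for w in doc if w in model.wv]))
              for doc in documents])
y = np.array(labels)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))  # predicted categories of the first five documents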

Shanonshanta answered 4/4, 2018 at 9:15 Comment(0)
