word2vec, how to classify text with SVM?
Asked Answered
L

0

0

I have a csv file, it has 2 columns: class and text_data. I first extract biGram and TriGrams, then tried to use SVM on my data for classification. But it shows "TypeError: sequence item 0: expected a bytes-like object, str found". I have used Gensim=4.0.0. Your help highly appreciated. Code:

# all packages imported
df_covid = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df_covid['label'] = df_covid['class'].map({
    'covidshutdown': 0,
    'manufactoring':1,
    'corporate':2,
    'environmental':3,
    'infrastructure':4,
    'other':5
})
X = df_covid['text_data']
y = df_covid['label']
corpus=X
lst_corpus = []
for string in corpus:
lst_words = string.split()
lst_grams = [" ".join(lst_words[i:i+1])
            for i in range(0, len(lst_words), 1)]
lst_corpus.append(lst_grams)

bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
            delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
            delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

cv = gensim.models.word2vec.Word2Vec(lst_corpus, size=300, window=8,
            min_count=1, sg=1, iter=30)

X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y, 
            test_size=0.20, random_state=42)
clf = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
y_pred = clf.predict(X_testCv)
print(classification_report(X_testCv, y_pred))

Complete Error message: enter image description here

This is a dataset click here to download

if, I do not use biGrams and triGrams and change word2Vec model to this

cv= gensim.models.Word2Vec(lst_corpus, vector_size=100, window=5, min_count=5, workers=4)

New error message appears:

Traceback (most recent call last):

File "D:\Dropbox\AAA\50-2\word2Vec_trails\word2Vec_triGrams_34445.py", line 65, in <module>
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y, test_size=0.20, random_state=42)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2172, in train_test_split
arrays = indexable(*arrays)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 299, in indexable
check_consistent_length(*result)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
lengths = [_num_samples(X) for X in arrays if X is not None]

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
lengths = [_num_samples(X) for X in arrays if X is not None]

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 202, in _num_samples
raise TypeError("Singleton array %r cannot be considered"

TypeError: Singleton array array(<gensim.models.word2vec.Word2Vec object at 0x000001A41DF59820>,
  dtype=object) cannot be considered a valid collection.

I made this attempt (below code) to use word2vec vectors like count vectorizer or TFIDF. It generates output,but not corrent one. I think I should make list of lists of vectors. Help appreciated. this is Code:

from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
df_covid = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df['label'] = df['class'].map({'covidshutdowncsv': 0,  'manufactoringcsv':1, 'corporatecsv':2, 'environmentalcsv':3,'infrastructurecsv':4, 'other':5})
X = df['message']
y = df['label']
X=X.to_string ()
ls = []
rows = X.splitlines(True)
print('size of rows:', len(rows))   # size of rows: 852
for i in rows:
    ls.append(i.split(' '))
print('total words:', len(ls))    # total words: 852
model = Word2Vec(ls, min_count=1, size = 4)
words = list(model.wv.vocab)
print('words in vocabolary :',len(words)) # words in vocabolary:3110
print(words)
words=words[0:852]    # problem
vectors = []
for word in words:
    vectors.append(model[word].tolist())
data = np.array(vectors)    
print('vectors of words:', len(data))   # vectors of words: 852     
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(data, y, test_size=0.20, random_state=42)
clf_covid = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
clf_covid.score(X_testCv,y_testCv)
# score: 0.50299
Lotty answered 23/7, 2021 at 4:35 Comment(14)
Can you add the full error message to your question, including the traceback showing involved lines-of-code?Assyrian
For the future, quoted text is better than a screenshot. But, two things: (1) your bigram_detector then trigram_detector code doesn't actually feed into later steps. Is it really important? Are you sure you want to be trying advanced Phrases combinations before having something more simple working? (2) Is your code modeled after something elsewhere? The choice to set delimiter=" ".encode() may be a little off, and the error seems to highlight a problem with the set delimiter. Why do you need a non-default delimiter, anyway?Assyrian
thank you for the advice. (1) I agree with you. advanced phrases combinations can wait before simple version work. (2) do you think the problem is in data? yes words separated via a space.Lotty
If you're putting aside phrase-combos for now, which is a good idea, there's no info in the question about any current error you're hitting - only the error from the phrasing code. So hard to have any theories! What's the blocking error once the phrasing steps are skipped?Assyrian
Thanks for the update. Your cv variable is an instance of the Word2Vec model, with all sorts of internal state, including among other things a collection of word-vectors that can be looked up by word keys. You are passing that cv instance into scikit-learn's train_test_split() function. It's not expecting a Word2Vec object, but some ordered collection of input rows, one per item that's to be classified.Assyrian
thank you, could you show me a solution for that!Lotty
Well, what do you want to classify, and how did you expect a Word2Vec model to help?Assyrian
I want to classify a dataset. my dataset is there to download gojomo.Lotty
If you want someone to download your data, interpret vague requirements, and write code for you, you may be in the market for a consultant. But if you have well-formed questions about specific blocks you've hit in your own code-writing efforts, you could provide the key summary details here and maybe get more help.Assyrian
I succeeded partially to use word2vec with train_test-split. let me extend my question above. Basically, there are 5 classes and some 850 texts, each row is one text, and every text has a label. I want to use 80% of the text to train the SVM model and test the model on the remaining 20 %.Lotty
How were you hoping to turn a Word2Vec model, with its vectors for a vocabulary of 3,110 words, into vectors for 852 multi-word texts instead? (Separately: a corpus with a mere 3k unique words is unlikely to train a very good word2vec model – as the algorithm requires large, varid training sets to work well. And, min_count=1 is almost always a bad idea with word2vec - it needs multiple contrasting examples of a word to place it well, so much so that discarding rare words, as the default min_count=5 does, usually gives better results than keeping them.)Assyrian
Let us continue this discussion in chat.Lotty
Unfortunately, chat seems to hide anything discussed from future searchers, so I prefer not to create non-reusable content like that.Assyrian
agree, the chat was StackOverflow's recommendation.Lotty

© 2022 - 2024 — McMap. All rights reserved.