word2vec, how to classify text with SVM?

I have a csv file, it has 2 columns: class and text_data. I first extract biGram and TriGrams, then tried to use SVM on my data for classification. But it shows "TypeError: sequence item 0: expected a bytes-like object, str found". I have used Gensim=4.0.0. Your help highly appreciated. Code:

# all packages imported
df_covid = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df_covid['label'] = df_covid['class'].map({
    'covidshutdown': 0,
    'manufactoring':1,
    'corporate':2,
    'environmental':3,
    'infrastructure':4,
    'other':5
})
X = df_covid['text_data']
y = df_covid['label']
corpus=X
lst_corpus = []
for string in corpus:
lst_words = string.split()
lst_grams = [" ".join(lst_words[i:i+1])
            for i in range(0, len(lst_words), 1)]
lst_corpus.append(lst_grams)

bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
            delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
            delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

cv = gensim.models.word2vec.Word2Vec(lst_corpus, size=300, window=8,
            min_count=1, sg=1, iter=30)

X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y, 
            test_size=0.20, random_state=42)
clf = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
y_pred = clf.predict(X_testCv)
print(classification_report(X_testCv, y_pred))

Complete Error message:

This is a dataset click here to download

if, I do not use biGrams and triGrams and change word2Vec model to this

cv= gensim.models.Word2Vec(lst_corpus, vector_size=100, window=5, min_count=5, workers=4)

New error message appears:

Traceback (most recent call last):

File "D:\Dropbox\AAA\50-2\word2Vec_trails\word2Vec_triGrams_34445.py", line 65, in <module>
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y, test_size=0.20, random_state=42)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2172, in train_test_split
arrays = indexable(*arrays)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 299, in indexable
check_consistent_length(*result)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
lengths = [_num_samples(X) for X in arrays if X is not None]

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
lengths = [_num_samples(X) for X in arrays if X is not None]

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 202, in _num_samples
raise TypeError("Singleton array %r cannot be considered"

TypeError: Singleton array array(<gensim.models.word2vec.Word2Vec object at 0x000001A41DF59820>,
  dtype=object) cannot be considered a valid collection.

I made this attempt (below code) to use word2vec vectors like count vectorizer or TFIDF. It generates output,but not corrent one. I think I should make list of lists of vectors. Help appreciated. this is Code:

from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
df_covid = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df['label'] = df['class'].map({'covidshutdowncsv': 0,  'manufactoringcsv':1, 'corporatecsv':2, 'environmentalcsv':3,'infrastructurecsv':4, 'other':5})
X = df['message']
y = df['label']
X=X.to_string ()
ls = []
rows = X.splitlines(True)
print('size of rows:', len(rows))   # size of rows: 852
for i in rows:
    ls.append(i.split(' '))
print('total words:', len(ls))    # total words: 852
model = Word2Vec(ls, min_count=1, size = 4)
words = list(model.wv.vocab)
print('words in vocabolary :',len(words)) # words in vocabolary:3110
print(words)
words=words[0:852]    # problem
vectors = []
for word in words:
    vectors.append(model[word].tolist())
data = np.array(vectors)    
print('vectors of words:', len(data))   # vectors of words: 852     
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(data, y, test_size=0.20, random_state=42)
clf_covid = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
clf_covid.score(X_testCv,y_testCv)
# score: 0.50299

Recommended topics

Hot tags