I wrote a text classification program. When I run it, it crashes with the following error (transcribed from the screenshot):
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
Here is my code:
from sklearn.model_selection import train_test_split
from gensim.models.word2vec import Word2Vec
from sklearn.preprocessing import scale
from sklearn.linear_model import SGDClassifier
import nltk, string, json
import numpy as np
def cleanText(corpus):
    reviews = []
    for dd in corpus:
        #for d in dd:
        try:
            words = nltk.word_tokenize(dd['description'])
            words = [w.lower() for w in words]
            reviews.append(words)
            #break
        except:
            pass
    return reviews

with open('C:\\NLP\\bad.json') as fin:
    text = json.load(fin)
    neg_rev = cleanText(text)

with open('C:\\NLP\\good.json') as fin:
    text = json.load(fin)
    pos_rev = cleanText(text)

#1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(pos_rev)), np.zeros(len(neg_rev))))
x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_rev, neg_rev)), y, test_size=0.2)
The data I am using is available here:
How would I go about fixing this error?
What is the shape of your concatenated reviews and your y variable? – Diaeresis

The error says n_samples=0. So work backward from there and figure out what actually comes out of your parsing in pos_rev and neg_rev, because if you get no errors, it seems likely that the len() of each is 0. – Diaeresis
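Following that suggestion, one way pos_rev and neg_rev can both end up empty is that the bare except in cleanText silently swallows every failure in the loop, for example when the JSON file holds a dict of records rather than a list (iterating a dict yields string keys, and indexing a string with 'description' raises TypeError). A minimal sketch of that diagnosis, using str.split in place of nltk tokenization and hypothetical JSON shapes:

```python
import json

def clean_text(corpus):
    # Same structure as cleanText, but WITHOUT the bare except,
    # so any failure inside the loop is visible instead of silent.
    reviews = []
    for dd in corpus:
        words = [w.lower() for w in dd['description'].split()]
        reviews.append(words)
    return reviews

# Hypothetical shapes for illustration: the same record as a dict
# keyed by id, and as a list of records.
dict_shaped = json.loads('{"1": {"description": "Terrible product"}}')
list_shaped = json.loads('[{"description": "Terrible product"}]')

# Iterating the dict yields the key "1", so dd['description']
# raises TypeError; the original bare except would hide this and
# return an empty list, which is what makes n_samples=0 later.
try:
    clean_text(dict_shaped)
    print("no error")
except TypeError as exc:
    print("TypeError:", exc)

# The list-of-records shape parses as intended.
print(len(clean_text(list_shaped)))
```

So before the train_test_split call, printing len(pos_rev) and len(neg_rev) should confirm whether the parsing step, not the split itself, is producing zero samples.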