I'm playing around with sklearn and NLP for the first time, and thought I understood everything I was doing until I hit an error I don't know how to fix. Here is the relevant code (largely adapted from http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html):
import os

import numpy as np
from pandas import DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from xgboost import XGBClassifier

def read_files(path):
    for article in os.listdir(path):
        with open(os.path.join(path, article)) as f:
            text = f.read()
        yield os.path.join(path, article), text
def build_data_frame(path, classification):
    rows = []
    index = []
    for filename, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(filename)
    df = DataFrame(rows, index=index)
    return df
data = DataFrame({'text': [], 'class': []})
for path, classification in SOURCES:  # SOURCES is a list of tuples
    data = data.append(build_data_frame(path, classification))
data = data.reindex(np.random.permutation(data.index))
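(SOURCES isn't shown above; it's just a list pairing each directory of documents with the label for every file in it, something like this, with made-up paths:)

```python
# Hypothetical shape of SOURCES: each tuple is (directory, class label).
SOURCES = [
    ('data/positive', 'pos'),
    ('data/negative', 'neg'),
]
```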
classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('svd', TruncatedSVD(algorithm='randomized', n_components=300)),
        ])),
        ('words', Pipeline([('wscaler', StandardScaler())])),
    ])),
    ('clf', XGBClassifier(silent=False)),
])
classifier.fit(data['text'].values, data['class'].values)
The data loaded into the DataFrame is preprocessed text with all stopwords, punctuation, unicode, capitals, etc. taken care of. This is the error I get once I call fit on the classifier, where the ... represents one of the documents that should have been vectorized in the pipeline:
ValueError: could not convert string to float: ...
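In case it helps, my guess is that some step is receiving raw strings where it expects numbers. This minimal snippet (nothing to do with my real data) raises the same exception class just by feeding strings to a numeric transformer:

```python
from sklearn.preprocessing import StandardScaler

# StandardScaler expects numeric input; raw strings trigger a
# "could not convert string to float" ValueError.
try:
    StandardScaler().fit([['some text'], ['more text']])
except ValueError as e:
    print(type(e).__name__, e)
```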
I first thought TfidfVectorizer() wasn't working, causing an error in the SVD step, but after I extracted each step out of the pipeline and ran them sequentially, the same error only came up on XGBClassifier.fit().
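Roughly, the sequential version looked like this (placeholder documents here, not my real corpus; the TF-IDF and SVD steps go through without complaint):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ['first toy document', 'second toy document', 'third toy document']

tfidf = TfidfVectorizer().fit_transform(docs)                # sparse numeric matrix
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)  # dense (n_docs, 2) array
print(reduced.shape)  # (3, 2)
```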
Even more confusing to me: I tried to pick this script apart step by step in the interpreter, but when I tried to import either read_files or build_data_frame, the same ValueError came up with one of my strings, even though all I had run was:
from classifier import read_files
I have no idea how that could be happening. If anyone has any idea what my glaring errors may be, I'd really appreciate it. I'm trying to wrap my head around these concepts on my own, but coming across a problem like this leaves me feeling pretty incapacitated.