Sklearn Pipeline ValueError: could not convert string to float

I'm playing around with sklearn and NLP for the first time, and thought I understood everything I was doing up until I didn't know how to fix this error. Here is the relevant code (largely adapted from http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html):

import os

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from xgboost import XGBClassifier
from pandas import DataFrame, concat

def read_files(path):
    # Yield (file path, file contents) for every article in the directory
    for article in os.listdir(path):
        with open(os.path.join(path, article)) as f:
            text = f.read()
        yield os.path.join(path, article), text

def build_data_frame(path, classification):
    # One row per article, indexed by file path
    rows = []
    index = []
    for filename, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(filename)
    df = DataFrame(rows, index=index)
    return df

# SOURCES is a list of (path, classification) tuples
data = concat(build_data_frame(path, classification) for path, classification in SOURCES)
data = data.reindex(np.random.permutation(data.index))  # shuffle the rows

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('svd', TruncatedSVD(algorithm='randomized', n_components=300)),
        ])),
        ('words', Pipeline([('wscaler', StandardScaler())])),
    ])),
    ('clf', XGBClassifier(silent=False)),
])
classifier.fit(data['text'].values, data['class'].values)

The data loaded into the DataFrame is preprocessed text with all stopwords, punctuation, unicode, capitals, etc. taken care of. This is the error I get once I call fit on the classifier, where the ... stands for one of the documents that should have been vectorized in the pipeline:

ValueError: could not convert string to float: ...

I first thought that TfidfVectorizer() was not working and was causing an error in the SVD step, but after I pulled each step out of the pipeline and ran them sequentially, the same error only came up on XGBClassifier.fit().

Even more confusing to me, when I tried to step through this script in the interpreter, importing either read_files or build_data_frame raised the same ValueError with one of my strings, and that was after nothing more than:

from classifier import read_files

I have no idea how that could be happening. If anyone has any idea what my glaring errors may be, I'd really appreciate it. I'm trying to wrap my head around these concepts on my own, but coming across a problem like this leaves me feeling pretty incapacitated.

Murrah asked 31/8, 2018 at 21:59

The first part of your pipeline is a FeatureUnion. A FeatureUnion passes all the data it receives to each of its internal parts in parallel. The second part of your FeatureUnion is a Pipeline containing a single StandardScaler. That's the source of the error.
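A small sketch of that behaviour, using made-up numeric data and two hypothetical branch names (`identity`, `doubled`) that are not part of the question's code. Each branch receives the same X, and the outputs are stacked side by side:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def double(a):
    # Hypothetical transformation, only here to make the two branches differ
    return a * 2


X = np.array([[1.0, 2.0], [3.0, 4.0]])

union = FeatureUnion([
    ("identity", FunctionTransformer()),       # returns its input unchanged
    ("doubled", FunctionTransformer(double)),  # same input, doubled
])

# Both branches see the full X; their outputs are concatenated column-wise
combined = union.fit_transform(X)
print(combined.shape)
```

With two input columns and two branches, the combined matrix has four columns: the first two come from `identity`, the last two from `doubled`.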

This is your data flow:

X --> classifier, Pipeline
            |
            |  <== X is passed to FeatureUnion
            \/
      features, FeatureUnion
                      |
                      |  <== X is duplicated and passed to both parts
        ______________|__________________
       |                                 |
       |  <===   X contains text  ===>   |                         
       \/                               \/
   text, Pipeline                   words, Pipeline
           |                                  |   
           |  <===    Text is passed  ===>    |
          \/                                 \/ 
       tfidf, TfidfVectorizer            wscaler, StandardScaler  <== Error
                 |                                   |
                 | <==Text converted to floats       |
                \/                                   |
              svd, TruncatedSVD                      |
                       |                             |
                       |                             |
                      \/____________________________\/
                                      |
                                      |
                                     \/
                                   clf, XGBClassifier

Since raw text is passed to StandardScaler, the error is thrown: StandardScaler can only work with numerical features.
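This can be reproduced in isolation. A minimal sketch, with two made-up documents standing in for the real text column:

```python
from sklearn.preprocessing import StandardScaler

# Made-up documents, one per row, standing in for the text column
texts = [["first preprocessed document"], ["second preprocessed document"]]

try:
    StandardScaler().fit(texts)
    error = None
except ValueError as exc:
    # StandardScaler tries to cast its input to float and fails on text
    error = exc

print(type(error).__name__, error)
```

The scaler fails before any scaling happens, which is why the message quotes one of the documents.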

Just as you are converting text to numbers using TfidfVectorizer, before sending that to TruncatedSVD, you need to do the same before StandardScaler, or else only provide numerical features to it.

Looking at the description in the question, did you intend to apply StandardScaler to the output of TruncatedSVD?
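If that was the intent, the steps can be chained in a single Pipeline so that StandardScaler only ever sees the numeric output of TruncatedSVD. A minimal sketch under assumptions: a tiny made-up corpus, n_components=2 so the SVD fits that corpus, and LogisticRegression swapped in for XGBClassifier so the example depends only on scikit-learn:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up documents and labels, for illustration only
docs = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project status and agenda",
    "buy cheap offer now",
    "monday project meeting notes",
]
labels = [1, 1, 0, 0, 1, 0]

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),                           # text -> sparse floats
    ("svd", TruncatedSVD(n_components=2, random_state=0)),  # sparse -> dense floats
    ("scaler", StandardScaler()),                           # now receives numbers only
    ("clf", LogisticRegression()),                          # stand-in for XGBClassifier
])
classifier.fit(docs, labels)
predictions = classifier.predict(docs)
print(predictions)
```

The FeatureUnion is no longer needed in this layout, because there is only one branch of features.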

Extroversion answered 1/9, 2018 at 3:10

Comment from Murrah: Awesome, this is so helpful. I did not understand that FeatureUnion parallelizes its input. StandardScaler is another piece of the puzzle I'm still wrapping my head around. I've read that a lot of classifiers require it to normalize the data, so I kept it for that reason.

I was getting the same error while working with the Titanic dataset. The error I got after calling fit() on the pipe object was:

ValueError                                Traceback (most recent call last)
Input In [51], in <cell line: 1>()
----> 1 pipe.fit(X_train, y_train)

ValueError: could not convert string to float: 'male'

If you are getting the same error, check the ColumnTransformer object you created. With remainder='passthrough', every column not named in a transformer is passed on unchanged, so a string column such as the one containing 'male' reaches the estimator unencoded. Remove that argument (the default, remainder='drop', discards the unlisted columns) and call fit(X_train, y_train) on the training data again. The error should then be resolved.
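Removing remainder='passthrough' drops the untransformed columns altogether. If the string column should stay in the model, an alternative is to encode it explicitly. A sketch under assumptions: made-up rows shaped like the Titanic data, a `sex` column holding the strings, and LogisticRegression as a stand-in estimator:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows shaped like the Titanic data, for illustration only
X_train = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 27.0],
})
y_train = [0, 1, 1, 0, 0, 1]

preprocess = ColumnTransformer(
    [("sex", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",  # fine here: the only remaining column is numeric
)
pipe = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# The string column is now two indicator columns, followed by age
transformed = pipe.named_steps["preprocess"].transform(X_train)
print(transformed.shape)
```

Here passthrough is harmless because every string column is listed in a transformer, so only numeric data reaches the estimator.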

Windsail answered 7/2, 2023 at 15:33
