Train Model fails because 'list' object has no attribute 'lower'
Asked Answered
W

4

15

I am training a classifier over tweets for sentiment analysis purposes.

The code is the following:

df = pd.read_csv('Trainded Dataset Sentiment.csv', error_bad_lines=False)
df.head(5)

enter image description here

#TWEET
X = df[['SentimentText']].loc[2:50000]
#SENTIMENT LABEL
y = df[['Sentiment']].loc[2:50000]

#Apply Normalizer function over the tweets
X['Normalized Text'] = X.SentimentText.apply(text_normalization_sentiment)
X = X['Normalized Text']

After normalization, the dataframe looks like:

enter image description here

X_train, X_test, y_train, y_test =
sklearn.cross_validation.train_test_split(X, y, 
test_size=0.2, random_state=42)

#Classifier
vec = TfidfVectorizer(min_df=5, max_df=0.95, sublinear_tf=True,
                      use_idf=True, ngram_range=(1,2))
svm_clf = svm.LinearSVC(C=0.1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', svm_clf)])
vec_clf.fit(X_train, y_train) #Problem
joblib.dump(vec_clf, 'svmClassifier.pk1', compress=3)

It fails with the following error:

AttributeError: 'list' object has no attribute 'lower'

enter image description here

Full Traceback:
--------------------------------------------------------------------------- AttributeError                            Traceback (most recent call last) <ipython-input-33-4264de810c2b> in <module>()
      4 svm_clf = svm.LinearSVC(C=0.1)
      5 vec_clf = Pipeline([('vectorizer', vec), ('pac', svm_clf)])
----> 6 vec_clf.fit(X_train, y_train)
      7 joblib.dump(vec_clf, 'svmClassifier.pk1', compress=3)

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    255             This estimator
    256         """
--> 257         Xt, fit_params = self._fit(X, y, **fit_params)
    258         if self._final_estimator is not None:
    259             self._final_estimator.fit(Xt, y, **fit_params)

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\pipeline.py in
_fit(self, X, y, **fit_params)
    220                 Xt, fitted_transformer = fit_transform_one_cached(
    221                     cloned_transformer, None, Xt, y,
--> 222                     **fit_params_steps[name])
    223                 # Replace the transformer of the step with the fitted
    224                 # transformer. This is necessary when loading the transformer

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py in __call__(self, *args, **kwargs)
    360 
    361     def __call__(self, *args, **kwargs):
--> 362         return self.func(*args, **kwargs)
    363 
    364     def call_and_shelve(self, *args, **kwargs):

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\pipeline.py in
_fit_transform_one(transformer, weight, X, y, **fit_params)
    587                        **fit_params):
    588     if hasattr(transformer, 'fit_transform'):
--> 589         res = transformer.fit_transform(X, y, **fit_params)
    590     else:
    591         res = transformer.fit(X, y, **fit_params).transform(X)

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)    1379             Tf-idf-weighted document-term matrix.    1380         """
-> 1381         X = super(TfidfVectorizer, self).fit_transform(raw_documents)    1382         self._tfidf.fit(X)  1383         # X is already a transformed view of raw_documents so

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
    867 
    868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
    870 
    871         if self.binary:

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    790         for doc in raw_documents:
    791             feature_counter = {}
--> 792             for feature in analyze(doc):
    793                 try:
    794                     feature_idx = vocabulary[feature]

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    264 
    265             return lambda doc: self._word_ngrams(
--> 266                 tokenize(preprocess(self.decode(doc))), stop_words)
    267 
    268         else:

C:\Users\Monviso\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
    230 
    231         if self.lowercase:
--> 232             return lambda x: strip_accents(x.lower())
    233         else:
    234             return strip_accents

AttributeError: 'list' object has no attribute 'lower'
Wryneck answered 25/8, 2017 at 14:29 Comment(6)
I assume is error is in X['Normalized Text'] = X.SentimentText.apply(text_normalization_sentiment) line, but hard to understand without full tracebackCarloscarlota
what is text_normalization_sentiment doing?Allhallowtide
It tokenizes the tweets, and normalizes the textWryneck
I added a preview of the normalized textWryneck
can you post the actual code for the normalization function please?Solfeggio
vec_clf.fit(X_train, y_train) #ProblemPseudaxis
A
25

The TFIDF Vectorizer should expect an array of strings. So if you pass him an array of arrays of tokenz, it crashes.

Allhallowtide answered 25/8, 2017 at 15:2 Comment(7)
How can I pass the tokens then?Wryneck
Give him the normalized string but not tokenized. It will tokenize it.Allhallowtide
I'm tokenizing them taking into account specific features. Is it not possible to preserve my tokens somehow?Wryneck
Yes, see docu. Parameter "tokenizer". scikit-learn.org/stable/modules/generated/…Allhallowtide
you don't need to keep them tokenized. join them back together after you normalize so that there is one string in each row.Solfeggio
like cameron wrote, just rejoin the tokens with whitespaces after the normalization (if you really need to tokenize them for normalization in the first place). vectorizer wants an array [ "this is a test" , "this is another test"]Allhallowtide
vec_clf.fit(X_train, y_train) #Problem --> I had the same problem, just change it to array of strings, not array of arrays of string - like this: vec_clf.fit([' '.join(arr) for arr in x_train], [' '.join(arr) for arr in y_train])Pseudaxis
S
6

add this code .apply(lambda x: ' '.join(x)) after X_train and y_train and it should work.

Stores answered 29/9, 2020 at 21:2 Comment(0)
C
4

Answer from http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/

from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    return doc

tfidf = CountVectorizer(
    tokenizer=dummy,
    preprocessor=dummy,
)  

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]

tfidf.fit(docs)
tfidf.get_feature_names()
# ['.', 'again', 'hello', 'world']
Cherin answered 23/4, 2019 at 13:41 Comment(0)
E
3

Apply

X = df.text.astype(str)

I had the similar problem but instead of extracting values using .loc[] or .iloc[], I simply used

X = df.text
y = df.target

which converts the dataframe column to Series having list as each row and tokenized items as objects in each row. The series looked similar to what Alex had:

print(X)

Directly converted series

So, only .astype(str) worked for me.

Result:

Series after applying .astype(str)

Enrichetta answered 29/10, 2021 at 4:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.