Apply CountVectorizer to column with list of words in rows in Python

I wrote a preprocessing step for text analysis that stems each token and removes stop words, like this:

test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

train[col] = train[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

I now have a column in which each row is a list of "cleaned words". Here are three rows from that column:

['size']
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps']
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']

I now want to apply CountVectorizer to this column:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False) # will leave only 1500 words
X_train = cv.fit_transform(train[col])

But I get an error:

TypeError: expected string or bytes-like object

It seems a bit strange to build a string from each list and then have CountVectorizer split it apart again.

Marker answered 8/12, 2017 at 9:42 Comment(1)
It's unclear from your code and discussion whether you're using pandas to handle columns (and rows), but if you're not, I recommend it. Especially since you say 'I've got a column with list of "cleaned words"' but you don't show a pandas DataFrame column, only a Python list of lists. - Intertexture
M
4

Since I found no other way to avoid the error, I joined each list in the column back into a single string:

train[col] = train[col].apply(lambda x: " ".join(x))
test[col] = test[col].apply(lambda x: " ".join(x))

Only after that did fit_transform produce a result:

X_train = cv.fit_transform(train[col])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train = pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names_out())
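
One thing to watch with the join approach: the default token_pattern of CountVectorizer only keeps tokens of two or more word characters, so a cleaned token such as 'x' is silently dropped when the joined string is re-tokenized. Below is a minimal sketch of that effect on two made-up rows (the names rows, joined, default_cv and keep_all are hypothetical, and token_pattern=r"\S+" is just one possible way to keep every whitespace-separated token, not something from the original post):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned rows, shortened from the question's sample data.
rows = [["pcs", "new", "x", "kraft"], ["brand", "new", "coach"]]
joined = [" ".join(r) for r in rows]

# Default token_pattern keeps only tokens with at least two word characters.
default_cv = CountVectorizer(lowercase=False)
default_cv.fit(joined)
print("x" in default_cv.vocabulary_)

# Assumption for illustration: treat every run of non-whitespace as a token.
keep_all = CountVectorizer(lowercase=False, token_pattern=r"\S+")
keep_all.fit(joined)
print("x" in keep_all.vocabulary_)
```

If single-character tokens matter for your data, this is a reason to check the resulting vocabulary after joining.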
Marker answered 8/12, 2017 at 18:10 Comment(0)
B
9

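To apply CountVectorizer to lists of words, bypass the built-in analyzer by passing a callable that returns each list unchanged:

from sklearn.feature_extraction.text import CountVectorizer
x = [['ab', 'cd'], ['ab', 'de']]
vectorizer = CountVectorizer(analyzer=lambda doc: doc)  # each row is already tokenized
vectorizer.fit_transform(x).toarray()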

Out:
array([[1, 1, 0],
       [1, 0, 1]], dtype=int64)
Bandwagon answered 12/9, 2020 at 10:4 Comment(0)
I
1

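The input should be a list of strings or bytes; in this case you seem to be providing a list of lists.

It looks like you have already tokenized each string into a separate list of tokens. One workaround is a hack like the one below: join each list with a separator that cannot occur in the tokens, then tell CountVectorizer to split on that separator.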

inp = [['size'],
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 
'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 
'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 
'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps'],
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']]


inp = ["<some_space>".join(x) for x in inp]

vectorizer = CountVectorizer(tokenizer = lambda x: x.split("<some_space>"), analyzer="word")

vectorizer.fit_transform(inp)
Isisiskenderun answered 28/11, 2018 at 14:48 Comment(0)
M
0

When you use fit_transform, the argument has to be an iterable of strings or bytes-like objects. It looks like you should be applying it over each row of your column instead.

X_train = train[col].apply(lambda x: cv.fit_transform(x))

You can read the docs for fit_transform here.

Matless answered 8/12, 2017 at 9:53 Comment(3)
Unfortunately it raises an error: "ValueError: empty vocabulary; perhaps the documents only contain stop words" - Marker
Is it possible that some of your rows have empty "cleaned words"? - Matless
Don't you simply get a row of all 0s in that case? - Marker
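
For what it's worth, the error in the comments is consistent with how per-row fitting behaves: calling fit_transform inside apply treats each row as its own corpus and builds a separate vocabulary for it, so a row with no usable tokens has nothing to fit on. A minimal sketch of the difference, using made-up rows (train_rows and test_rows are hypothetical) and a pass-through analyzer as an assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned rows; the last training row is empty after cleaning.
train_rows = [["brand", "new", "coach"], ["new", "bag"], []]
test_rows = [["new", "coach", "wallet"]]

# Per-row fitting: an empty row is an empty corpus, so no vocabulary can be built.
try:
    CountVectorizer().fit_transform(train_rows[2])
    empty_row_failed = False
except ValueError:
    empty_row_failed = True
print(empty_row_failed)

# Fit once on the whole training column, then reuse that vocabulary for test data.
cv = CountVectorizer(analyzer=lambda doc: doc)
X_train = cv.fit_transform(train_rows)
X_test = cv.transform(test_rows)
print(X_train.toarray())
print(X_test.toarray())
```

Fitting once on the full column gives every row the same feature columns, and the empty row becomes a row of zeros rather than an error.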
