Currently I'm using the following code to lemmatize some text data with spaCy and compute TF-IDF values with scikit-learn:
import pandas as pd
import spacy
import sklearn.feature_extraction.text

nlp = spacy.load('en')  # English model; df holds the source text

# Lemmatize each document, trying to drop punctuation and pronoun placeholders
lemma = []
for doc in nlp.pipe(df['col'].astype('unicode').values, batch_size=9844,
                    n_threads=3):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct | n.lemma_ != "-PRON-"])
    else:
        lemma.append(None)
df['lemma_col'] = lemma

# Re-join the lemmas into strings and compute TF-IDF features
lemmas = df['lemma_col'].apply(lambda x: ' '.join(x))
vect = sklearn.feature_extraction.text.TfidfVectorizer()
features = vect.fit_transform(lemmas)
feature_names = vect.get_feature_names()
dense = features.todense()
denselist = dense.tolist()

# Attach the TF-IDF columns back to the original frame
tfidf_df = pd.DataFrame(denselist, columns=feature_names)
df = pd.concat([df, tfidf_df], axis=1)
I need to strip out proper nouns, punctuation, and stop words, but am having trouble doing that within my current code. I've read some documentation and other resources, but am now running into the error below (a sketch of the filtering I'm aiming for follows the traceback):
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-21-e924639f7822> in <module>()
7 if doc.is_parsed:
8 tokens.append([n.text for n in doc])
----> 9 lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
10 pos.append([n.pos_ for n in doc])
11 else:
<ipython-input-21-e924639f7822> in <listcomp>(.0)
7 if doc.is_parsed:
8 tokens.append([n.text for n in doc])
----> 9 lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
10 pos.append([n.pos_ for n in doc])
11 else:
AttributeError: 'str' object has no attribute 'is_punct'
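From the traceback, it looks like the attributes I want (is_punct, is_stop, pos_) live on the Token object rather than on the lemma_ string, so I believe the filter has to test the token itself. Here's a minimal sketch of what I think the filtering should look like; the exact set of conditions is my guess at what I need, and I haven't verified this is the idiomatic approach:

# Sketch: filter on Token attributes, then take the lemma string.
# Assumes nlp and df are set up as in the code above.
lemma = []
for doc in nlp.pipe(df['col'].astype('unicode').values, batch_size=9844,
                    n_threads=3):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc
                      if not n.is_punct            # drop punctuation
                      and not n.is_stop            # drop stop words
                      and n.pos_ != 'PROPN'        # drop proper nouns
                      and n.lemma_ != '-PRON-'])   # drop spaCy's pronoun placeholder
    else:
        lemma.append(None)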
Is there an easier way to strip this stuff out of the text, without having to drastically change my approach?
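One idea I've considered (untested) is to skip the join-then-retokenize round trip by handing the lemma lists straight to the vectorizer with a callable analyzer, though I'm not sure how well it fits the rest of my pipeline:

# Untested idea: pass the token lists directly; a callable analyzer
# receives each list as-is, so no re-tokenization happens.
vect = sklearn.feature_extraction.text.TfidfVectorizer(analyzer=lambda tokens: tokens)
features = vect.fit_transform(df['lemma_col'].dropna())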
Full code available here.