How to extract vocabulary from Pipeline

I can extract the vocabulary from a CountVectorizerModel in the following way:

from pyspark.ml.feature import StopWordsRemover, CountVectorizer

fl = StopWordsRemover(inputCol="words", outputCol="filtered")
df = fl.transform(df)
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
model = cv.fit(df)  # model is a CountVectorizerModel

print(model.vocabulary)

The above code prints the vocabulary as a list, where each term's position in the list is its index (id).

Now I have built a pipeline from the above code as follows:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, CountVectorizer

rm_stop_words = StopWordsRemover(inputCol="words", outputCol="filtered")
count_freq = CountVectorizer(inputCol=rm_stop_words.getOutputCol(), outputCol="rawFeatures")

pipeline = Pipeline(stages=[rm_stop_words, count_freq])
model = pipeline.fit(dfm)
df = model.transform(dfm)

print(model.vocabulary)  # This won't work: model is a PipelineModel, not a CountVectorizerModel

It throws the following error:

print(len(model.vocabulary))

AttributeError: 'PipelineModel' object has no attribute 'vocabulary'

So how can I extract a model attribute, such as the vocabulary, from the fitted pipeline?

Pug answered 12/10, 2017 at 17:27

Do it the same way as for any other stage attribute. Extract the fitted stages from the PipelineModel:

stages = model.stages

find the one(s) you're interested in:

from pyspark.ml.feature import CountVectorizerModel

vectorizers = [s for s in stages if isinstance(s, CountVectorizerModel)]

and read the desired fields:

[v.vocabulary for v in vectorizers]
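
If you need this in more than one place, the three steps above can be wrapped in a small helper. This is only a sketch: `stages_of_type` and `first_vocabulary` are hypothetical names, not part of PySpark, and the only thing they rely on is the `stages` list of the fitted `PipelineModel`.

```python
def stages_of_type(pipeline_model, stage_type):
    """Return every fitted stage that is an instance of stage_type."""
    return [s for s in pipeline_model.stages if isinstance(s, stage_type)]


def first_vocabulary(pipeline_model, vectorizer_type):
    """Return the vocabulary of the first matching stage, or raise if none exists."""
    matches = stages_of_type(pipeline_model, vectorizer_type)
    if not matches:
        raise ValueError(f"no stage of type {vectorizer_type.__name__} in pipeline")
    return matches[0].vocabulary


# Usage with the fitted pipeline from the question (requires PySpark):
#   from pyspark.ml.feature import CountVectorizerModel
#   vocab = first_vocabulary(model, CountVectorizerModel)
#   print(len(vocab))
```

Since the stages keep the order they were given to `Pipeline`, and `CountVectorizer` is the second stage in the question, `model.stages[1].vocabulary` should work too. Filtering by type is just less fragile if the pipeline gets reordered later.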
Arawak answered 12/10, 2017 at 17:38
