'CrossValidatorModel' object has no attribute 'featureImportances'

I'm trying to extract the feature importances of a random forest classifier model that I trained using PySpark. I referred to the following article to get the feature importance scores:

PySpark & MLLib: Random Forest Feature Importances

However, when I use the method described in this article, I get the following error:

'CrossValidatorModel' object has no attribute 'featureImportances'

Here is the code I used to train my model

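from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer, IndexToString
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

cols = new_data.columns
# fillna returns a new DataFrame, so the result has to be assigned back;
# nulls are filled before the VectorAssembler runs
new_data = new_data.fillna(0, subset=cols)

stages = []
label_stringIdx = StringIndexer(inputCol='Bought_Fibre', outputCol='label')
stages += [label_stringIdx]
numericCols = new_data.schema.names[1:-1]
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
stages += [assembler]

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(new_data)
new_data = pipelineModel.transform(new_data)
new_data.printSchema()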


train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 1045)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()

train_sampled = train_initial.sampleBy("label", fractions={0: 0.1, 1: 1.0}, seed=0)
train_sampled.groupBy("label").count().orderBy("label").show()



labelIndexer = StringIndexer(inputCol='label',
                             outputCol='indexedLabel').fit(train_sampled)

featureIndexer = VectorIndexer(inputCol='features',
                               outputCol='indexedFeatures',
                               maxCategories=2).fit(train_sampled)

from pyspark.ml.classification import RandomForestClassifier
rf_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)


pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])

paramGrid = ParamGridBuilder() \
    .addGrid(rf_model.numTrees, [200, 400, 600, 800, 1000]) \
    .addGrid(rf_model.impurity, ['entropy', 'gini']) \
    .addGrid(rf_model.maxDepth, [2, 3, 4, 5]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)


train_model = crossval.fit(train_sampled)

Please help me resolve the above error and extract the feature importances.

Geomorphic answered 3/12, 2018 at 2:57 Comment(0)

That's because CrossValidatorModel has no featureImportances attribute; the fitted random forest model (RandomForestClassificationModel) does.

Since you are using a Pipeline and a CrossValidator to fit your data, you need to get the underlying stage from the best fitted model:

# 2 is the index of the fitted random forest in the pipeline's stages list
# (train_model is the CrossValidatorModel returned by crossval.fit in the question)
your_model = train_model.bestModel.stages[2]
var_imp = your_model.featureImportances
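If you would rather not hard-code the index, one option is to look for the stage that actually exposes featureImportances and then pair the scores with your column names. The following is only a sketch: find_importance_stage and rank_importances are hypothetical helper names, the feature names and scores are made up, and stand-in objects replace the fitted pipeline so the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def find_importance_stage(pipeline_model):
    """Return the first stage that exposes a featureImportances attribute."""
    for stage in pipeline_model.stages:
        if hasattr(stage, "featureImportances"):
            return stage
    raise ValueError("no stage exposes featureImportances")


def rank_importances(feature_names, importances):
    """Pair each feature name with its score, highest score first."""
    # A PySpark ML vector offers toArray(); a plain list is used as-is.
    values = importances.toArray() if hasattr(importances, "toArray") else importances
    scores = [float(v) for v in values]
    if len(scores) != len(feature_names):
        raise ValueError("feature_names and importances differ in length")
    return sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)


# Stand-in for train_model.bestModel (assumption: made-up names and scores).
fake_best_model = SimpleNamespace(stages=[
    SimpleNamespace(name="labelIndexer"),
    SimpleNamespace(name="featureIndexer"),
    SimpleNamespace(name="rf_model", featureImportances=[0.2, 0.7, 0.1]),
    SimpleNamespace(name="labelConverter"),
])

rf_stage = find_importance_stage(fake_best_model)
ranked = rank_importances(["age", "usage", "tenure"], rf_stage.featureImportances)
print(ranked)
```

With the real objects from the question you would presumably call find_importance_stage(train_model.bestModel) and pass numericCols as the feature names, assuming the importance vector follows the order of the VectorAssembler input columns.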
Through answered 3/12, 2018 at 9:6 Comment(4)
@eliasha. Thanks for your suggestion. I was just wondering whether the [2] in the code has any significance. I am very new to the PySpark world, so this might be a very silly question, but I am just trying to understand every bit of your code. (Geomorphic)
It means that we are fetching the element at index 2 of the stages list, which is the third stage. (Through)
Understood, and if I am not wrong this would be because features are at the second position in my pipeline, Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter]). (Geomorphic)
List indices start at 0, so they run 0, 1, 2 and so on. The random forest is the third element of the stages list, which gives it index 2. (Through)
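
To make the indexing concrete, here is a small illustration in plain Python. The strings only stand in for the fitted stages of the pipeline in the question; nothing here touches Spark.

```python
# Stage names in the same order as Pipeline(stages=[...]) in the question.
stage_names = ["labelIndexer", "featureIndexer", "rf_model", "labelConverter"]

# Print each position next to its stage; counting starts at 0.
for index, name in enumerate(stage_names):
    print(index, name)

# The random forest is the third entry, so its index is 2.
rf_index = stage_names.index("rf_model")
print(rf_index)
```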
