Pyspark random forest feature importance mapping after column transformations

Asked 19/6, 2018 at 22:8 Answered 10/9, 2020 at 16:14

apache-spark pyspark apache-spark-sql apache-spark-mllib

I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.

Since I have both textual categorical variables and numeric ones, I had to use a pipeline along these lines:

1. Use a StringIndexer to index the string columns.
2. Use a one-hot encoder on the indexed columns.
3. Use a VectorAssembler to create the feature column containing the feature vector.

Some sample code from the docs for steps 1, 2 and 3:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status",
                      "occupation", "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoderEstimator to convert categorical variables into binary SparseVectors
    # (in Spark 3.0+ this class is named OneHotEncoder)
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index", outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

numericCols = ["age", "fnlwgt", "education_num", "capital_gain", 
"capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)

Finally, train the model.

After training and evaluation, I can use "model.featureImportances" to get the feature rankings. However, I don't get the feature/column names, just the feature indices, something like this:
```
print(dtModel_1.featureImportances)

(38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
```

How do I map it back to the initial column names and values, so that I can plot them?

Instable answered 19/6, 2018 at 22:8 Comment(0)

Extract the metadata, as shown in an answer by user6910411:

from itertools import chain

# (idx, name) pairs for every slot of the assembled "features" vector
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in (
        chain(*dataset.schema["features"].metadata["ml_attr"]["attrs"].values())
    )
)

and combine with feature importance:

[
    (name, dtModel_1.featureImportances[idx])
    for idx, name in attrs
    if dtModel_1.featureImportances[idx]
]
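A minimal, Spark-free sketch of the same mapping, useful for checking the logic on a small example. The `attrs_meta` dictionary is hypothetical sample data shaped like `dataset.schema["features"].metadata["ml_attr"]["attrs"]`, and `indices` / `values` stand in for the `indices` and `values` of a sparse `featureImportances` vector. None of the names or numbers come from the question's data.

```python
from itertools import chain

# Hypothetical metadata, shaped like
# dataset.schema["features"].metadata["ml_attr"]["attrs"]
attrs_meta = {
    "binary": [
        {"idx": 0, "name": "sexclassVec_Male"},
        {"idx": 1, "name": "raceclassVec_White"},
        {"idx": 2, "name": "raceclassVec_Black"},
    ],
    "numeric": [
        {"idx": 3, "name": "age"},
        {"idx": 4, "name": "hours_per_week"},
    ],
}

# Hypothetical sparse importances: only the non-zero slots are stored
indices = [1, 3, 4]
values = [0.2, 0.5, 0.3]


def importances_by_name(attrs_meta, indices, values):
    """Return (name, importance) pairs sorted by descending importance."""
    # Flatten all attribute groups into one idx -> name lookup
    idx_to_name = {
        attr["idx"]: attr["name"] for attr in chain(*attrs_meta.values())
    }
    pairs = [(idx_to_name[i], v) for i, v in zip(indices, values)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = importances_by_name(attrs_meta, indices, values)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The sorted pairs can then be handed to a plotting library, for example as the labels and heights of a bar chart.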

Bide answered 19/6, 2018 at 23:43 Comment(2)

Yes, I was actually able to figure it out. I did it slightly differently: I created a pandas DataFrame with the idx and feature names, then converted it to a dictionary, which I broadcast. Code: – Instable 20/6, 2018 at 21:0

pandasDF = pd.DataFrame(dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"]+dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")
feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)

– Instable 20/6, 2018 at 21:15

The transformed dataset's metadata has the required attributes. Here is an easy way to do it:

Create a pandas DataFrame (the feature list is generally not huge, so storing it in a pandas DataFrame causes no memory issues):

import pandas as pd

attrs = dataset.schema["features"].metadata["ml_attr"]["attrs"]
pandasDF = pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")

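Then create a dictionary that maps each index to its name. Broadcasting it is only needed if the mapping is used in code that runs on the executors (for example, inside a UDF); for plotting on the driver, the plain dictionary is enough.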

feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"])) 

feature_dict_broad = sc.broadcast(feature_dict)
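With a dictionary like `feature_dict`, the importances of the one-hot slots can also be rolled up to the original columns. The sketch below is one common convention (summing the slots), not something the answer itself does. It assumes slot names of the form `<assembler input>_<category>`, such as `raceclassVec_White`; check the names in your own metadata before relying on it. All data and the helper names `source_column` and `aggregate` are hypothetical.

```python
from collections import defaultdict

# Assumption: the same list that was passed to VectorAssembler(inputCols=...)
assembler_inputs = ["sexclassVec", "raceclassVec", "age", "hours_per_week"]

# Hypothetical idx -> slot name mapping (like feature_dict) and importances
feature_dict = {
    0: "sexclassVec_Male",
    1: "raceclassVec_White",
    2: "raceclassVec_Black",
    3: "age",
    4: "hours_per_week",
}
importances = {0: 0.1, 1: 0.2, 2: 0.05, 3: 0.4, 4: 0.25}


def source_column(slot_name, input_cols):
    """Find the assembler input a slot belongs to (longest match wins)."""
    matches = [
        col for col in input_cols
        if slot_name == col or slot_name.startswith(col + "_")
    ]
    if not matches:
        raise KeyError(f"no assembler input matches {slot_name!r}")
    return max(matches, key=len)


def aggregate(feature_dict, importances, input_cols):
    """Sum slot importances per assembler input column."""
    totals = defaultdict(float)
    for idx, importance in importances.items():
        totals[source_column(feature_dict[idx], input_cols)] += importance
    return dict(totals)


per_column = aggregate(feature_dict, importances, assembler_inputs)
print(per_column)
```

Choosing the longest matching input avoids attributing a slot to a shorter column name that happens to be a prefix of another one.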

Instable answered 20/6, 2018 at 21:26 Comment(1)

When I do this, it doesn't show my numeric column names, it just says "numeric_feature_1", "numeric_feature_2" ... I have a few transformations that I do to my numeric variables. Would this make them disappear? – Imponderable 17/4, 2020 at 15:51

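When creating your assembler you passed a list of columns (assemblerInputs), and their order is preserved in the 'features' vector. So you can label the importances directly with a pandas DataFrame, provided every input contributes exactly one slot (with one-hot encoded inputs, as in the question, the vector has more entries than assemblerInputs and the lengths will not match):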

import pandas as pd

features_imp_pd = pd.DataFrame(
    dtModel_1.featureImportances.toArray(),
    index=assemblerInputs,
    columns=['importance'],
)
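The positional approach assumes one importance value per assembler input. A minimal sketch of a guard that makes this explicit, using plain Python lists and hypothetical numbers (the function name `label_importances` is made up for illustration):

```python
def label_importances(importances, input_cols):
    """Pair importances with input column names, positionally."""
    importances = list(importances)
    if len(importances) != len(input_cols):
        raise ValueError(
            f"{len(importances)} importances but {len(input_cols)} input columns; "
            "vector-valued inputs (e.g. one-hot encoded columns) occupy several slots"
        )
    return dict(zip(input_cols, importances))


# Hypothetical all-numeric example: one slot per assembler input
numeric_inputs = ["age", "fnlwgt", "hours_per_week"]
labeled = label_importances([0.6, 0.1, 0.3], numeric_inputs)
print(labeled)
```

With a real model, the first argument would be `dtModel_1.featureImportances.toArray()`.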

Unsaddle answered 10/9, 2020 at 16:14 Comment(0)
