How to map features from the output of a VectorAssembler back to the column names in Spark ML?

Asked 21/3, 2017 at 18:56 Answered 6/8, 2018 at 16:42

python apache-spark machine-learning pyspark apache-spark-ml

I'm trying to run a linear regression in PySpark, and I want to create a table of summary statistics such as coefficients, p-values and t-values for each column in my dataset. However, to train a linear regression model I had to create a feature vector using Spark's VectorAssembler, so each row now has a single feature vector and the target column. Spark's built-in regression summary statistics come back as raw lists of numbers, with nothing to indicate which attribute each value corresponds to, and working that out manually is really difficult with a large number of columns. How do I map these values back to the column names?

For example, I have my current output as something like this:

Coefficients: [-187.807832407,-187.058926726,85.1716641376,10595.3352802,-127.258892837,-39.2827730493,-1206.47228704,33.7078197705,99.9956812528]

P-Value: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18589731365614548, 0.275173571416679, 0.0]

t-statistic: [-23.348593508995318, -44.72813283953004, 19.836508234714472, 144.49248881747755, -16.547272230754242, -9.560681351483941, -19.563547400189073, 1.3228378389036228, 1.0912415361190977, 20.383256127350474]

Coefficient Standard Errors: [8.043646497811427, 4.182131353367049, 4.293682291754585, 73.32793120907755, 7.690626652102948, 4.108783841348964, 61.669402913526625, 25.481445101737247, 91.63478289909655, 609.7007361468519]

These numbers mean nothing unless I know which attribute they correspond to. But in my DataFrame I only have one column called "features" which contains rows of sparse Vectors.

This is an even bigger problem when I have one-hot encoded features, because if I have one variable with an encoding of length n, I will get n corresponding coefficients, p-values, t-values and so on.

Gopak answered 21/3, 2017 at 18:56 Comment(0)

As of today, Spark doesn't provide a method that does this for you, so you have to create your own. Let's say your data looks like this:

import random
random.seed(1)

df = sc.parallelize([(
    random.choice([0.0, 1.0]), 
    random.choice(["a", "b", "c"]),
    random.choice(["foo", "bar"]),
    random.randint(0, 100),
    random.random(),
) for _ in range(100)]).toDF(["label", "x1", "x2", "x3", "x4"])

and is processed using the following pipeline:

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

indexers = [
    StringIndexer(inputCol=c, outputCol="{}_idx".format(c)) for c in ["x1", "x2"]]
encoders = [
    OneHotEncoder(
        inputCol=idx.getOutputCol(),
        outputCol="{0}_enc".format(idx.getOutputCol())) for idx in indexers]
assembler = VectorAssembler(
    inputCols=[enc.getOutputCol() for enc in encoders] + ["x3", "x4"],
    outputCol="features")

pipeline = Pipeline(
    stages=indexers + encoders + [assembler, LinearRegression()])
model = pipeline.fit(df)

Get the LinearRegressionModel:

lrm = model.stages[-1]

Transform the data:

transformed = model.transform(df)

Extract and flatten ML attributes:

from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"]) for attr in (chain(*transformed
        .schema[lrm.summary.featuresCol]
        .metadata["ml_attr"]["attrs"].values())))

and map to the output:

[(name, lrm.summary.pValues[idx]) for idx, name in attrs]

[('x1_idx_enc_a', 0.26400012641279824),
 ('x1_idx_enc_c', 0.06320192217171572),
 ('x2_idx_enc_foo', 0.40447778902400433),
 ('x3', 0.1081883594783335),
 ('x4', 0.4545851609776568)]

[(name, lrm.coefficients[idx]) for idx, name in attrs]

[('x1_idx_enc_a', 0.13874401585637453),
 ('x1_idx_enc_c', 0.23498565469334595),
 ('x2_idx_enc_foo', -0.083558932128022873),
 ('x3', 0.0030186112903237442),
 ('x4', -0.12951394186593695)]
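The summary lists in the question have ten entries while the coefficient list has nine. According to the Spark documentation for pValues, tValues and coefficientStandardErrors, when fitIntercept is true the last element corresponds to the intercept, so it has no feature name. The sketch below builds one table from the metadata and labels that trailing entry. It works on the plain dict found under metadata["ml_attr"], so it runs without Spark. The helper name named_summary and the sample values are made up for illustration.

```python
def named_summary(ml_attr, coefficients, p_values, intercept=None):
    """Pair each coefficient and p-value with its feature name.

    ml_attr is the plain dict stored under metadata["ml_attr"] of the
    assembled vector column; nothing here depends on Spark itself.
    """
    # Flatten every attribute group (binary, numeric, ...) and order by vector index.
    attrs = sorted(
        (attr["idx"], attr["name"])
        for group in ml_attr["attrs"].values()
        for attr in group
    )
    rows = [(name, coefficients[idx], p_values[idx]) for idx, name in attrs]
    # One extra trailing p-value belongs to the intercept, not to a feature.
    if len(p_values) == len(attrs) + 1:
        rows.append(("(intercept)", intercept, p_values[-1]))
    return rows


# Hand-written metadata in the shape shown in this answer (illustrative values only).
example_attr = {
    "attrs": {
        "numeric": [{"idx": 3, "name": "x3"}, {"idx": 4, "name": "x4"}],
        "binary": [
            {"idx": 0, "name": "x1_idx_enc_a"},
            {"idx": 1, "name": "x1_idx_enc_c"},
            {"idx": 2, "name": "x2_idx_enc_foo"},
        ],
    },
    "num_attrs": 5,
}
example_rows = named_summary(
    example_attr,
    coefficients=[0.1, 0.2, -0.3, 0.4, -0.5],
    p_values=[0.26, 0.06, 0.40, 0.11, 0.45, 0.01],
    intercept=1.5,
)
for row in example_rows:
    print(row)
```

With the objects from this answer, the call would presumably be named_summary(transformed.schema[lrm.summary.featuresCol].metadata["ml_attr"], lrm.coefficients, lrm.summary.pValues, intercept=lrm.intercept).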

Ilmenite answered 22/3, 2017 at 18:22 Comment(1)

Hi @Ilmenite I tried your solution but somehow I couldn't find attrs in the metadata. When I checked lr_transformed.schema[lrm.summary.featuresCol].metadata I only got {'ml_attr': {'num_attrs': 105}} Could you please give me some guidance on this problem? Thank you! – Keilakeily 26/1, 2021 at 19:57

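You can see the actual order of the columns in the metadata of the features column (df here is the DataFrame that already contains the assembled "features" column):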

df.schema["features"].metadata["ml_attr"]["attrs"]

There will usually be two groups, ["binary"] and ["numeric"]:

import pandas as pd

attrs = df.schema["features"].metadata["ml_attr"]["attrs"]
pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")

This should give the exact order of all the columns.
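The order in which the groups are concatenated does not matter, because the final sort on "idx" decides the position of every column. A minimal sketch that does not hard-code the group names, and that tolerates metadata containing only num_attrs (the situation reported in a comment above), could look like this. The function name feature_names and the placeholder naming scheme are assumptions for illustration, not part of Spark.

```python
def feature_names(ml_attr, prefix="features"):
    """Return feature names ordered by their position in the assembled vector.

    ml_attr is the dict from df.schema[col].metadata["ml_attr"].
    """
    groups = ml_attr.get("attrs")
    if not groups:
        # Only the vector size is known: fall back to placeholder names (assumed scheme).
        return ["{}_{}".format(prefix, i) for i in range(ml_attr["num_attrs"])]
    # Merge all groups, whatever they are called, then order by vector index.
    flat = [attr for group in groups.values() for attr in group]
    flat.sort(key=lambda attr: attr["idx"])
    return [attr.get("name", "{}_{}".format(prefix, attr["idx"])) for attr in flat]


# Illustrative metadata with three groups listed in arbitrary order.
sample_attr = {
    "attrs": {
        "numeric": [{"idx": 2, "name": "x3"}],
        "nominal": [{"idx": 3, "name": "x5"}],
        "binary": [{"idx": 0, "name": "x1_a"}, {"idx": 1, "name": "x1_c"}],
    },
    "num_attrs": 4,
}
print(feature_names(sample_attr))
print(feature_names({"num_attrs": 3}))
```

The placeholder names only keep positions readable; they do not recover the original column names, which in that case would have to be tracked from the inputs given to the VectorAssembler.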

Ginetteginevra answered 5/2, 2018 at 13:46 Comment(3)

Awesome. You can do it without pandas like this: ``` [x["name"] for x in sorted(df.schema["features"].metadata["ml_attr"]["attrs"]["binary"]+ df.schema["features"].metadata["ml_attr"]["attrs"]["numeric"], key=lambda x: x["idx"])] ``` – Solubility 6/8, 2018 at 16:35

The best answer! – Peti 26/9, 2022 at 9:13

How did you know the binary one came first? – Begonia 18/11, 2022 at 1:52

Here's the one-line answer:

[x["name"] for x in sorted(train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["binary"]+
   train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["numeric"], 
   key=lambda x: x["idx"])]

Thanks to @pratiklodha for the core of this.

Solubility answered 6/8, 2018 at 16:42 Comment(0)
