I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, p-values and t-values for each column in my dataset. However, in order to train a linear regression model I had to create a feature vector using Spark's VectorAssembler, and now for each row I have a single feature vector and the target column.
When I try to access Spark's built-in regression summary statistics, I get a raw list of numbers for each statistic, with no indication of which attribute each value corresponds to. Working that out manually is very difficult with a large number of columns.
How do I map these values back to the column names?
For example, I have my current output as something like this:
Coefficients: [-187.807832407,-187.058926726,85.1716641376,10595.3352802,-127.258892837,-39.2827730493,-1206.47228704,33.7078197705,99.9956812528]
P-Value: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18589731365614548, 0.275173571416679, 0.0]
t-statistic: [-23.348593508995318, -44.72813283953004, 19.836508234714472, 144.49248881747755, -16.547272230754242, -9.560681351483941, -19.563547400189073, 1.3228378389036228, 1.0912415361190977, 20.383256127350474]
Coefficient Standard Errors: [8.043646497811427, 4.182131353367049, 4.293682291754585, 73.32793120907755, 7.690626652102948, 4.108783841348964, 61.669402913526625, 25.481445101737247, 91.63478289909655, 609.7007361468519]
These numbers mean nothing unless I know which attribute they correspond to. But in my DataFrame I only have one column called "features", which contains rows of sparse Vectors.
This is an even bigger problem when I have one-hot encoded features, because if I have one variable with an encoding of length n, I will get n corresponding coefficients/p-values/t-values etc.
I expected to find attrs in the metadata, but when I checked lr_transformed.schema[lrm.summary.featuresCol].metadata I only got {'ml_attr': {'num_attrs': 105}}
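To make the desired mapping concrete, here is a minimal sketch in plain Python (no Spark session needed). It assumes the features column metadata, when fully populated, has the shape {'ml_attr': {'attrs': {<type>: [{'idx': ..., 'name': ...}, ...]}, 'num_attrs': ...}}, and that the summary lists carry one extra trailing entry for the intercept when the model is fitted with an intercept. The helper names feature_names_from_metadata and summary_rows and the sample values are hypothetical, made up purely for illustration. When attrs is missing, as in the metadata shown above, the sketch can only fall back to placeholder names.

```python
def feature_names_from_metadata(metadata, n_features):
    """Return one name per vector slot, ordered by slot index.

    Slots without a name in the metadata keep a placeholder name.
    """
    names = [f"feature_{i}" for i in range(n_features)]
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    for group in attrs.values():  # groups such as "numeric", "binary", "nominal"
        for attr in group:
            if "name" in attr:
                names[attr["idx"]] = attr["name"]
    return names


def summary_rows(names, coefficients, std_errors, t_values, p_values, intercept=None):
    """Pair each name with its statistics.

    Assumption: if the statistic lists are one longer than the coefficient
    list, the extra last entry belongs to the intercept.
    """
    labels = list(names)
    coefs = list(coefficients)
    if len(p_values) == len(coefs) + 1:
        labels.append("(intercept)")
        coefs.append(intercept)
    return list(zip(labels, coefs, std_errors, t_values, p_values))


# Illustrative metadata and statistics only; not taken from a real model.
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}],
            "binary": [{"idx": 1, "name": "city_A"}, {"idx": 2, "name": "city_B"}],
        },
        "num_attrs": 3,
    }
}

names = feature_names_from_metadata(example_metadata, 3)
rows = summary_rows(
    names,
    coefficients=[1.5, -2.0, 0.25],
    std_errors=[0.1, 0.2, 0.3, 0.4],
    t_values=[15.0, -10.0, 0.83, 5.0],
    p_values=[0.0, 0.0, 0.41, 0.0],
    intercept=2.0,
)
for row in rows:
    print(row)
```

With a fitted model, the inputs would presumably come from lr_transformed.schema[lrm.summary.featuresCol].metadata, lrm.coefficients, lrm.intercept, lrm.summary.coefficientStandardErrors, lrm.summary.tValues and lrm.summary.pValues, but that wiring is untested here.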
Could you please give me some guidance on this problem? Thank you! – Keilakeily