How to extract a value from a Vector in a column of a Spark Dataframe [duplicate]
Asked Answered
G

2

13

When using SparkML to predict labels the result Dataframe is:

scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.6,0.4]|           1.0|
|  [0.6,0.4]|           1.0|
|  [1.0,0.0]|           1.0|
|  [0.9,0.1]|           1.0|
|  [0.9,0.1]|           1.0|
|  [1.0,0.0]|           1.0|
|  [1.0,0.0]|           1.0|
+-----------+--------------+
only showing top 20 rows

I want to create a new Dataframe with a new column named prob which is the first value from the Vector in probability column of original Dataframe e.g.:

+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.6,0.4]|           1.0|       0.6|
|  [0.6,0.4]|           1.0|       0.6|
|  [1.0,0.0]|           1.0|       1.0|
|  [0.9,0.1]|           1.0|       0.9|
|  [0.9,0.1]|           1.0|       0.9|
|  [1.0,0.0]|           1.0|       1.0|
|  [1.0,0.0]|           1.0|       1.0|
+-----------+--------------+----------+

How can extract this value into a new column?

Gargoyle answered 2/5, 2017 at 6:8 Comment(1)
import org.apache.spark.ml.functions.vector_to_array and then model.withColumn("prob", vector_to_array(col("probability")).getItem(1)).show(false))Basin
G
7

You can use the capabilities of Dataset and the wonderful functions library to accomplish what you need:

result.withColumn("prob", $"probability".getItem(0))

This adds a new Column called prob whose value is derived from the probability Column by taking the first item (at index 0--we are computer scientists after all) in the array.

I would mention also that UDFs should be your last resort because the Catalyst optimizer cannot currently optimize UDFs, so you should always prefer the built-in functions to get the most out of Catalyst.

Grannias answered 2/5, 2017 at 12:43 Comment(2)
This would work with ArrayType not with org.apache.spark.ml.linalg.VectorUDT.Braud
Name: org.apache.spark.sql.AnalysisException Message: Can't extract value from probability#16416; StackTrace: at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)Requisite
G
1

It is fairly simple if you use Spark UDF(s). Like this:

val headValue = udf((arr: Seq[Double]) => arr.head)

result.withColumn("prob", headValue(result("probability"))).show

It will give you desired output:

+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.6,0.4]|           1.0|       0.6|
|  [0.6,0.4]|           1.0|       0.6|
|  [1.0,0.0]|           1.0|       1.0|
|  [0.9,0.1]|           1.0|       0.9|
|  [0.9,0.1]|           1.0|       0.9|
|  [1.0,0.0]|           1.0|       1.0|
|  [1.0,0.0]|           1.0|       1.0|
+-----------+--------------+----------+
Goldstein answered 2/5, 2017 at 6:27 Comment(2)
probability output is org.apache.spark.ml.linalg.VectorUDT not ArrayType(DoubleType).Braud
Could you please convert this to pythonOfficiate

© 2022 - 2024 — McMap. All rights reserved.