How to extract a value from a Vector in a column of a Spark Dataframe [duplicate]

Asked 2/5, 2017 at 6:8 Answered 2/5, 2017 at 12:43

scala apache-spark dataframe apache-spark-sql apache-spark-mllib

When using SparkML to predict labels the result Dataframe is:

scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.6,0.4]|           1.0|
|  [0.6,0.4]|           1.0|
|  [1.0,0.0]|           1.0|
|  [0.9,0.1]|           1.0|
|  [0.9,0.1]|           1.0|
|  [1.0,0.0]|           1.0|
|  [1.0,0.0]|           1.0|
+-----------+--------------+
only showing top 20 rows

I want to create a new Dataframe with a new column named prob which is the first value from the Vector in probability column of original Dataframe e.g.:

+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.6,0.4]|           1.0|       0.6|
|  [0.6,0.4]|           1.0|       0.6|
|  [1.0,0.0]|           1.0|       1.0|
|  [0.9,0.1]|           1.0|       0.9|
|  [0.9,0.1]|           1.0|       0.9|
|  [1.0,0.0]|           1.0|       1.0|
|  [1.0,0.0]|           1.0|       1.0|
+-----------+--------------+----------+

How can extract this value into a new column?

Gargoyle answered 2/5, 2017 at 6:8 Comment(1)

import org.apache.spark.ml.functions.vector_to_array and then model.withColumn("prob", vector_to_array(col("probability")).getItem(1)).show(false)) – Basin 20/8, 2021 at 12:31

You can use the capabilities of Dataset and the wonderful functions library to accomplish what you need:

result.withColumn("prob", $"probability".getItem(0))

This adds a new Column called prob whose value is derived from the probability Column by taking the first item (at index 0--we are computer scientists after all) in the array.

I would mention also that UDFs should be your last resort because the Catalyst optimizer cannot currently optimize UDFs, so you should always prefer the built-in functions to get the most out of Catalyst.

Grannias answered 2/5, 2017 at 12:43 Comment(2)

This would work with ArrayType not with org.apache.spark.ml.linalg.VectorUDT. – Braud 7/11, 2018 at 22:1

Name: org.apache.spark.sql.AnalysisException Message: Can't extract value from probability#16416; StackTrace: at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) – Requisite 28/8, 2019 at 2:50

It is fairly simple if you use Spark UDF(s). Like this:

val headValue = udf((arr: Seq[Double]) => arr.head)

result.withColumn("prob", headValue(result("probability"))).show

It will give you desired output:

+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.6,0.4]|           1.0|       0.6|
|  [0.6,0.4]|           1.0|       0.6|
|  [1.0,0.0]|           1.0|       1.0|
|  [0.9,0.1]|           1.0|       0.9|
|  [0.9,0.1]|           1.0|       0.9|
|  [1.0,0.0]|           1.0|       1.0|
|  [1.0,0.0]|           1.0|       1.0|
+-----------+--------------+----------+

Goldstein answered 2/5, 2017 at 6:27 Comment(2)

probability output is org.apache.spark.ml.linalg.VectorUDT not ArrayType(DoubleType). – Braud 7/11, 2018 at 22:2

Could you please convert this to python – Officiate 24/7, 2020 at 3:1

Recommended topics

Hot tags