What Type should the dense vector be, when using UDF function in Pyspark? [duplicate]
Asked Answered
L

1

9

I want to change List to Vector in pySpark, and then use this column to Machine Learning model for training. But my spark version is 1.6.0, which does not have VectorUDT(). So what type should I return in my udf function?

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *


conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)


df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
    print '?????',column
    return Vectors.dense(column)
getVector = udf(lambda y: list2vec(y),DenseVector() )
df.withColumn('b',getVector(col('a'))).show()

I have tried many Types , and this DenseVector() give me error:

Traceback (most recent call last):
  File "t.py", line 21, in <module>
    getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly 2 arguments (1 given)

Help me, please.

Lorimer answered 3/4, 2018 at 6:31 Comment(0)
D
15

You can use vectors and VectorUDT with UDF,

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

ud_f = F.udf(lambda r : Vectors.dense(r),VectorUDT())
df = df.withColumn('b',ud_f('a'))
df.show()
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)

About VectorUDT, http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html

Dread answered 3/4, 2018 at 7:19 Comment(4)
Thank you, but my spark version is 1.6.0, which does not have VectorUDT, this's why I asked this questionLorimer
1.6 has VectorUDT in mllib. Just import as , from pyspark.mllib.linalg import Vectors, VectorUDT, spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/…Dread
You are right, thank uLorimer
what is the equivalent in scala (ud_f = F.udf(lambda r : Vectors.dense(r),VectorUDT())) ?Setser

© 2022 - 2024 — McMap. All rights reserved.