Vector assembler in Pyspark is creating tuple of multiple vectors instead of a single vector, how to solve the issue? [duplicate]
My python version is 3.6.3 and spark version is 2.2.1. Here is my code:

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession.builder.appName("Data Preprocessor") \
        .config("spark.some.config.option", "1") \
        .getOrCreate()

dataset = spark.createDataFrame([(0, 59.0, 0.0, Vectors.dense([2.0, 0.0, 
          0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 9.0, 9.0]), 1.0)],
          ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(inputCols=["hour", "mobile", "userFeatures"],
                            outputCol="features")

output = assembler.transform(dataset)
output.select("features").show(truncate=False)

Instead of getting a single vector, I am getting the following output:

(12,[0,2,9,10,11],[59.0,2.0,9.0,9.0,9.0])

Heartstrings answered 6/2, 2018 at 9:1 Comment(0)

The vector returned by VectorAssembler is a single vector, just stored in SparseVector form: 12 is the length of the vector, [0,2,9,10,11] are the indices of the non-zero entries, and [59.0,2.0,9.0,9.0,9.0] are the corresponding non-zero values. All other positions are 0.0.
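To see how the sparse triple maps back to a dense vector, here is a minimal sketch in plain Python (no Spark required); the `(size, indices, values)` triple is copied from the output in the question:

```python
# Reconstruct the dense form of Spark's sparse vector output by hand.
# The triple below comes from (12,[0,2,9,10,11],[59.0,2.0,9.0,9.0,9.0]).

def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size          # start with all zeros
    for i, v in zip(indices, values):
        dense[i] = v              # place each non-zero value at its index
    return dense

dense = sparse_to_dense(12, [0, 2, 9, 10, 11], [59.0, 2.0, 9.0, 9.0, 9.0])
print(dense)
# [59.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 9.0, 9.0]
```

Index 0 is "hour" (59.0), index 1 is "mobile" (0.0), and indices 2–11 are the ten "userFeatures" values. Inside PySpark you can get the same dense array directly by calling `.toArray()` on the vector, e.g. `row["features"].toArray()`.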

Goldthread answered 6/2, 2018 at 9:28 Comment(0)