Type conversion error from LabeledPoint in pyspark.mllib, for using linear regression model in pyspark.ml

Asked 14/2, 2017 at 16:38 Answered 13/11, 2017 at 15:14

I have the following code for linear regression using the pyspark.ml package. However, the last line, where the model is fit, fails with this error message:

IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.

Does anyone have an idea what is missing? Is there a replacement in pyspark.ml for LabeledPoint from pyspark.mllib?

from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
import numpy as np
from pandas import *


data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[1], [values[0]])


points_df = data.map(parsePoint).toDF()

lr = LinearRegression()

model = lr.fit(points_df, {lr.regParam:0.0})

Korikorie answered 14/2, 2017 at 16:38 Comment(4)

Could you let me know which Spark version you are using and share a sample of the file you are trying to import? – Sufferable 14/2, 2017 at 17:27

pyspark.ml uses the DataFrame API, but your DataFrame has no column names such as 'label' and 'features'. Is there some part of the code that you have not posted? – Sufferable 14/2, 2017 at 18:7

This is the entire code causing the error. Here are the first few lines of the data file, which is read with the function parsePoint: 0.656992798279138,2.5834056958606 0.716673783763451,2.36159163031627 0.259623437084048,1.69482312701634 – Korikorie 15/2, 2017 at 10:46

I am using Spark version 2.0. @GauravDhama – Korikorie 15/2, 2017 at 10:49

The problem is that newer versions of Spark have their own vector classes in pyspark.ml.linalg, so you do not need to get them from pyspark.mllib.linalg. The pyspark.ml estimators also do not accept pyspark.mllib.linalg.VectorUDT columns, which is what LabeledPoint produces. Here is code that should work for you:

from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors  # ml vectors, not mllib vectors


# sc is the existing SparkContext (predefined in the pyspark shell and notebooks)
data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

def parsePoint(line):
    # Each line is "feature,label"; return (label, features) as a plain tuple
    values = [float(x) for x in line.split(',')]
    return (values[1], Vectors.dense([values[0]]))


# Name the columns the way LinearRegression expects by default
points_df = data.map(parsePoint).toDF(['label','features'])

lr = LinearRegression()

model = lr.fit(points_df)
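
If you would rather keep the LabeledPoint-based parsePoint from the question, converting each point before building the DataFrame might also work. This is only a minimal sketch: to_ml_row and to_ml_dataframe are hypothetical helper names, and it assumes the features are pyspark.mllib.linalg vectors, which provide asML() from Spark 2.0 onwards.

```python
def to_ml_row(labeled_point):
    """Return (label, features) with the features as a pyspark.ml vector.

    Assumes labeled_point.features is a pyspark.mllib.linalg vector, which
    exposes asML() from Spark 2.0 onwards.
    """
    return float(labeled_point.label), labeled_point.features.asML()


def to_ml_dataframe(labeled_points):
    """Convert an RDD of LabeledPoint into a DataFrame for pyspark.ml.

    Needs an active SparkSession, because RDD.toDF is only available then.
    """
    return labeled_points.map(to_ml_row).toDF(["label", "features"])
```

With the question's original parsePoint, this would be used as points_df = to_ml_dataframe(data.map(parsePoint)) before calling lr.fit(points_df). If the DataFrame has already been built, pyspark.mllib.util.MLUtils.convertVectorColumnsToML(points_df, "features") should be a DataFrame-level alternative.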

Sufferable answered 15/2, 2017 at 12:47 Comment(0)

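Newer Spark versions do not accept pyspark.mllib.linalg.VectorUDT in pyspark.ml (you do not need to get vectors from mllib.linalg).

Try replacing

from pyspark.mllib.regression import LabeledPoint

with:

from pyspark.ml.linalg import Vectors

and have parsePoint return a (label, Vectors.dense(...)) tuple instead of a LabeledPoint.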

Pageantry answered 13/11, 2017 at 15:14 Comment(0)
