Create labeledPoints from a Spark DataFrame using Pyspark

Asked 29/6, 2018 at 10:17 Answered 29/6, 2018 at 10:34

Solved pyspark rdd apache-spark-mllib random-forest

I have a Spark DataFrame with two columns, "label" and "sparse Vector", obtained after applying CountVectorizer to a corpus of tweets.

When trying to train a Random Forest Regressor model, I found that it accepts only the LabeledPoint type.

Does anyone know how to convert my Spark DataFrame to LabeledPoint?

Agretha answered 29/6, 2018 at 10:17 Comment(0)

Which Spark version are you using? I suggest using Spark ML (pyspark.ml), which works directly on DataFrames, instead of mllib.

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql import functions as F

# Reuse the active session (or create one) instead of relying on a predefined sqlContext.
spark = SparkSession.builder.getOrCreate()

# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# Fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

model = cv.fit(df)

# Add a constant label column so the classifier has something to train on.
result = model.transform(df).withColumn('label', F.lit(0))
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
rf.fit(result)
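
To make the CountVectorizer step above more concrete, here is a minimal pure-Python sketch of what it computes for the two example rows. It is not Spark's implementation: the function names `fit_vocabulary` and `count_vector` are hypothetical, the sparse vector is represented as a plain `(size, indices, values)` tuple, and ties between equally frequent terms are broken alphabetically purely as an assumption of this sketch.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Pick up to vocab_size terms that occur in at least min_df documents."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for words in docs:
        term_freq.update(words)
        doc_freq.update(set(words))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Most frequent terms first; the alphabetical tie-break is an assumption.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]


def count_vector(words, vocabulary):
    """Return (size, indices, values): a sparse form of the term counts."""
    counts = Counter(w for w in words if w in vocabulary)
    indices = [i for i, t in enumerate(vocabulary) if counts[t] > 0]
    values = [float(counts[vocabulary[i]]) for i in indices]
    return len(vocabulary), indices, values


# Same toy corpus as in the answer.
docs = ["a b c".split(" "), "a b b c a".split(" ")]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [count_vector(words, vocabulary) for words in docs]
```

Each resulting tuple plays the role of one row of the "features" column: the vector length, the positions of the non-zero counts, and the counts themselves.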

If you insist on mllib:

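from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Convert each DataFrame row to a LabeledPoint (label, dense feature array).
rdd = result.rdd.map(
    lambda row: LabeledPoint(row['label'], row['features'].toArray()))
# Arguments: data, numClasses=2, categoricalFeaturesInfo={}, numTrees=3.
RandomForest.trainClassifier(rdd, 2, {}, 3)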
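
The `.toArray()` call above expands each sparse count vector into a dense array before it is wrapped in a LabeledPoint. As a rough illustration only, the pure-Python sketch below mimics that expansion. `sparse_to_dense` and `to_labeled_pair` are hypothetical helpers, not part of PySpark, the rows are plain dicts rather than Spark Row objects, and a `(label, features)` tuple stands in for LabeledPoint.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def to_labeled_pair(row):
    """Map a row-like dict to (label, dense features), standing in for LabeledPoint."""
    size, indices, values = row["features"]
    return float(row["label"]), sparse_to_dense(size, indices, values)


# Hypothetical rows shaped like the "label" / "features" columns.
rows = [
    {"label": 0, "features": (3, [0, 1, 2], [1.0, 1.0, 1.0])},
    {"label": 0, "features": (3, [0, 2], [2.0, 1.0])},
]
pairs = [to_labeled_pair(r) for r in rows]
```

Positions that are absent from the sparse form come back as 0.0, which is what makes the dense array the same length for every row.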

Anything answered 29/6, 2018 at 10:34 Comment(14)

Hi, I am using Spark version 2.3.0. – Agretha 29/6, 2018 at 10:37

Hey, it's the same version for me. Can you use the ml module instead of mllib? LabeledPoint works with RDDs. – Anything 29/6, 2018 at 10:40

Can you help me set its parameters? In mllib I used trainRegressor, but in ml it is a different class. This is the method I used: pyspark.ml.regression.RandomForestRegressor(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="variance", subsamplingRate=1.0, seed=None, numTrees=20, featureSubsetStrategy="auto") – Agretha 29/6, 2018 at 10:49

You should convert the DataFrame to an RDD and then import LabeledPoint from mllib.regression. The answer is updated. – Anything 29/6, 2018 at 10:55

Does it support the DataFrame or the LabeledPoint type? – Agretha 29/6, 2018 at 11:7

It does not support DataFrame. – Anything 29/6, 2018 at 11:12

But in this example I think it was trained with a DataFrame:

>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
>>> model = rf.fit(df)

– Agretha 29/6, 2018 at 11:23

Can you try it with from pyspark.mllib.tree import RandomForest? It works with RDDs. The answer is updated. – Anything 29/6, 2018 at 14:44

Yes, I already tried it with pyspark.mllib.tree, but it doesn't work with either a DataFrame or an RDD. I know it doesn't support DataFrame, but with an RDD I don't understand where the problem comes from. – Agretha 2/7, 2018 at 9:27

It worked with an RDD in my case. What error do you get? Can you post a sample of your DataFrame? – Anything 2/7, 2018 at 11:13

Please take a look at my code example and sample data here: github.com/jowwel/Sen_analyser-with-Pyspark-.git – Agretha 2/7, 2018 at 11:25

Can you try:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

rdd = transformed.rdd.map(
    lambda row: LabeledPoint(row['label'], row['CV_vector'].toArray()))
RandomForest.trainClassifier(rdd, 2, {}, 3)

– Anything 2/7, 2018 at 11:34

Hamza, I finally solved the problem, but not with LabeledPoint: I used pyspark.ml.RandomForestRegressor with a DataFrame. However, it seems to give wrong predictions. Can you guess the cause of these wrong predictions? I updated the previous link; kindly take a look: github.com/jowwel/Sen_analyser-with-Pyspark- – Agretha 2/7, 2018 at 12:59

Maybe because of indexes: ids, categorical ids, and ml ids can get mixed up. Can you accept my answer as the correct one? I can't comment on questions. – Anything 2/7, 2018 at 14:22
