Create LabeledPoints from a Spark DataFrame using PySpark

I have a Spark DataFrame with two columns, "label" and "sparse Vector", obtained by applying CountVectorizer to a corpus of tweets.

When trying to train a Random Forest Regressor model, I found that it accepts only the LabeledPoint type.

Does anyone know how to convert my Spark DataFrame to LabeledPoint?

Agretha answered 29/6, 2018 at 10:17 Comment(0)

Which Spark version are you using? Use spark.ml, which works on DataFrames directly, instead of mllib.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# Fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

model = cv.fit(df)

# Add a constant dummy label so the example has a "label" column to train on.
result = model.transform(df).withColumn('label', F.lit(0))
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
rf.fit(result)

If you insist on mllib:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

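# mllib trains on an RDD of LabeledPoint, so convert each DataFrame row.
rdd = result \
    .rdd \
    .map(lambda row: LabeledPoint(row['label'], row['features'].toArray()))
# Arguments: data, numClasses=2, categoricalFeaturesInfo={}, numTrees=3
RandomForest.trainClassifier(rdd, 2, {}, 3)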
Anything answered 29/6, 2018 at 10:34 Comment(14)
Hi, I am using Spark 2.3.0. - Agretha
Hey, that is the same version I have. Can you use the ml module instead of mllib? LabeledPoint works with RDDs. - Anything
Can you help me set its parameters? In mllib I used trainRegressor, but in ml it is a different class. This is the one I used: pyspark.ml.regression.RandomForestRegressor(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="variance", subsamplingRate=1.0, seed=None, numTrees=20, featureSubsetStrategy="auto") - Agretha
You should convert the DataFrame to an RDD and then import LabeledPoint from mllib.regression. The answer is updated. - Anything
Does it support the DataFrame or the LabeledPoint type? - Agretha
It does not support DataFrames. - Anything
But in this example I think it was trained with a DataFrame:
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
>>> model = rf.fit(df)
- Agretha
Can you try it with from pyspark.mllib.tree import RandomForest? It works with an RDD. The answer is updated. - Anything
Yes, I already tried pyspark.mllib.tree, but it doesn't work with either a DataFrame or an RDD. I know it doesn't support DataFrames, but with an RDD I don't understand where the problem comes from. - Agretha
It worked with an RDD in my case. What is the error? Can you post a sample of your DataFrame? - Anything
Please look at my code example and sample data here: github.com/jowwel/Sen_analyser-with-Pyspark-.git - Agretha
Can you try:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest
rdd = transformed \
    .rdd \
    .map(lambda row: LabeledPoint(row['label'], row['CV_vector'].toArray()))
RandomForest.trainClassifier(rdd, 2, {}, 3)
- Anything
Hamza, I finally solved the problem, but not with LabeledPoint: I used pyspark.ml.regression.RandomForestRegressor with a DataFrame. However, it seems to give wrong predictions. Can you guess the cause? I updated the previous link (github.com/jowwel/Sen_analyser-with-Pyspark-); kindly take a look. - Agretha
Maybe because of indexes: ids, CategoricalIds and ml ids can get mixed up. Can you accept my answer as the correct one? I can't comment on questions. - Anything
