How to print the probability of prediction in LogisticRegressionWithLBFGS for pyspark
I am using Spark 1.5.1. In PySpark, after I fit the model using:

model = LogisticRegressionWithLBFGS.train(parsedData)

I can print the prediction using:

model.predict(p.features)

Is there a function to print the probability score also along with the prediction?

Rockery answered 6/11, 2015 at 6:33 Comment(0)

You have to clear the threshold first; note that this works only for binary classification:

 from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
 from pyspark.mllib.regression import LabeledPoint

 parsed_data = [LabeledPoint(0.0, [4.6,3.6,1.0,0.2]),
                LabeledPoint(0.0, [5.7,4.4,1.5,0.4]),
                LabeledPoint(1.0, [6.7,3.1,4.4,1.4]),
                LabeledPoint(0.0, [4.8,3.4,1.6,0.2]),
                LabeledPoint(1.0, [4.4,3.2,1.3,0.2])]   

 model = LogisticRegressionWithLBFGS.train(sc.parallelize(parsed_data))
 model.threshold
 # 0.5
 model.predict(parsed_data[2].features)
 # 1

 model.clearThreshold()
 model.predict(parsed_data[2].features)
 # 0.9873840020002339
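For what it's worth, the value `predict()` returns after `clearThreshold()` is just the logistic sigmoid of the model's linear margin, so you can sanity-check it from `model.weights` and `model.intercept` without Spark at all. A minimal sketch - the weights, intercept, and features below are made-up stand-ins, not taken from the trained model:

```python
import math

def sigmoid(z):
    # standard logistic function: maps the linear margin to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical values standing in for model.weights / model.intercept
weights = [0.8, -0.4, 1.3, 2.1]
intercept = 0.0
features = [6.7, 3.1, 4.4, 1.4]

margin = sum(w * x for w, x in zip(weights, features)) + intercept
prob = sigmoid(margin)          # what predict() returns after clearThreshold()
label = 1 if prob > 0.5 else 0  # what predict() returns with threshold = 0.5
```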
Ev answered 6/11, 2015 at 8:41 Comment(6)
From the documentation I couldn't find a way to do the same for multiclass classification. Are you aware if it is possible? The only way I thought of is to do a manual one-vs-all.Sherwin
@MpizosDimitris, This requires changing the actual function. I have just implemented this in Scala and can provide an answer for a new questionSchmid
@BrianVanover #36152068Sherwin
@desertnaut, looks like there is no change regarding support for multiclass classification with Spark 2.2.0 MLlib. Is the Spark community recommending the ML package instead? Wondering why Spark is lacking these classifiers when the support is available in scipy and even in Octave.Immunogenetics
@Immunogenetics Indeed MLlib is headed for deprecation - ML is the recommended package nowEv
@desertnaut, I was getting a little fed up moving between ML and MLlib, so instead of trying the LBFGS classifier I tried a RandomForest classifier and it worked. The only challenge is that it is computationally intense, as the error goes down with an increase in the number of trees and depth. Since it had no accuracy method like the above, I had to make one, which I posted in #28819192Immunogenetics
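Regarding the multiclass question in the comments above: since MLlib does not expose multiclass probabilities directly, they can be recovered by hand from the per-class margins. In MLlib's parameterization for numClasses > 2, class 0 is the pivot with an implicit zero margin and each remaining class gets its own weight vector. A hedged sketch with made-up numbers - the weights, intercepts, and features below are hypothetical, not from a real model:

```python
import math

def multiclass_probs(class_weights, class_intercepts, features):
    # pivot parameterization: class 0 has an implicit margin of 0,
    # classes 1..K-1 each contribute exp(margin_k) to the denominator
    margins = [sum(w * x for w, x in zip(wv, features)) + b
               for wv, b in zip(class_weights, class_intercepts)]
    exps = [math.exp(m) for m in margins]
    denom = 1.0 + sum(exps)
    return [1.0 / denom] + [e / denom for e in exps]

# hypothetical 3-class model on 2 features
probs = multiclass_probs([[0.5, -0.2], [1.1, 0.3]], [0.1, -0.4], [1.0, 2.0])
```

The returned list always sums to 1, so the predicted class is simply the index of the largest entry.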

I presume the question is about computing the probability score when predicting the entire training set. If so, I did the following to compute it. Not sure if the post is still active, but this is how I did it:

#get the original training data before it was converted to rows of LabeledPoint.
#let us assume it is otd (of type Spark DataFrame)
#let us extract the feature set as an RDD:
fs = otd.rdd.map(lambda x: x[1:])  # assuming label is col 0

#the below is just a sample way of creating LabeledPoint rows
parsedData = otd.rdd.map(lambda x: reg.LabeledPoint(int(x[0] - 1), x[1:]))

# now convert otd to a pandas DataFrame:
ptd = otd.toPandas()
m = ptd.shape[0]
# train and get the model
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=10)

#now store the model.predict RDD structures
predict = model.predict(fs)
pr = predict.collect()

correct = ((ptd.label - 1) == pr).sum()
print((correct / m) * 100)

Note the above is for multi-class classification.
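The accuracy arithmetic in the last lines of the snippet boils down to an elementwise comparison between the 1-based labels and the 0-based predictions. Mirrored in plain Python with toy values (the labels and predictions below are made up for illustration):

```python
# toy stand-ins for ptd.label (1-based labels) and pr (0-based predictions)
labels = [1, 2, 3, 1, 2]
preds = [0, 1, 2, 0, 0]

# a prediction is counted correct when label - 1 == prediction
correct = sum(int(l - 1 == p) for l, p in zip(labels, preds))
accuracy = (correct / len(labels)) * 100
# here: 4 of 5 correct -> 80.0
```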

Immunogenetics answered 17/7, 2017 at 7:12 Comment(7)
@desertnaut, please take a look if this makes sense.Immunogenetics
1) trainingdata is nowhere defined 2) fs is nowhere used 3) it is not clear what the outcome of your code is, and if it indeed provides probabilities; that's why it is a good practice to provide dummy data and demonstrate the results, as I have done 4) toPandas is not a good idea, since it will only work for 'small' datasets (where you don't even need Spark) 5) the issue has been mostly resolved in ML: #43631531Ev
@desertnaut, I ran this code against my data-set which we were discussing on a different post. fs is passed as an argument to predict. My training data is a 5000x400 matrix containing labels 1-10 for a multi-class classifier. It is hand-written digits containing numbers from 1-10. I understand toPandas() is not efficient, but the goal was to compute probability.Immunogenetics
1) probably, by trainingData you meant parsedData 2) if you can use toPandas() the way you use it, there is absolutely no reason to use Spark at all - you would do your job better with pandas & scikit-learnEv
Compared with scikit-learn and similar packages, the functionality in Spark ML/MLlib is really primitive; the only reason to use it is if your data do not fit into a single machine's main memory ('big data'), and hence you need to work on a distributed computing environment (cluster).Ev
@desertnaut, I am evaluating which one to go for - should I do edge computing using scikit and then send aggregated data to the cloud, or do I collect the data and do it centrally? Moreover, the two parallel tracks of ML and MLlib also add to the confusion. To your point on toPandas(), note that I wouldn't have to use toPandas() and convert to a DataFrame if I had stored the data originally in a format that allows me to compute probability as above. It is a convenience, and there was probably a reason why the Spark folks allow converting to a pandas DataFrame, just like collect() converts an RDD to a list.Immunogenetics
toPandas & collect exist solely to allow for local processing of the results of (possibly successive) aggregations of large data that don't fit in memory. Let me repeat - if you do your processing on a laptop with Spark, you are simply imposing unnecessary pain on yourself without any benefit at all...Ev

© 2022 - 2024 — McMap. All rights reserved.