PySpark & MLLib: Class Probabilities of Random Forest Predictions
I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract class probabilities from a RandomForestModel classifier in PySpark?

Here's the sample code provided in the documentation that only provides the final class (not the probability):

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features)) 

I don't see any model.predict_proba() method -- what should I do?

Trautman answered 2/3, 2015 at 20:15 Comment(2)
Late but there is a fork with a scala solution: github.com/apache/spark/compare/master...mqk:masterCarpentaria
The issue has now been (mostly) resolved in the new Spark ML library: #43631531Dishpan
As far as I can tell, this is not supported in the current version (1.2.1). The Python wrapper over the native Scala code (tree.py) defines only 'predict' functions which, in turn, call their respective Scala counterparts (treeEnsembleModels.scala). The latter make decisions by taking a majority vote among the trees' binary predictions. A much cleaner solution would be a probabilistic prediction that can be thresholded arbitrarily or used for ROC computation, as in sklearn. This feature should be added in a future release!

As a workaround, I implemented predict_proba as a pure Python function (see the example below). It is neither elegant nor very efficient, as it runs a loop over the individual decision trees in the forest. The trick - or rather a dirty hack - is to access the array of Java decision tree models and cast them to their Python counterparts. After that you can compute each model's predictions over the entire dataset and accumulate their sum in an RDD using 'zip'. Dividing by the number of trees gives the desired result. For large datasets, a loop over a small number of decision trees on the master node should be acceptable.

The code below is rather tricky due to the difficulties of integrating Python with Spark (which runs on the JVM). One must be very careful not to send any complex data to worker nodes, as that causes crashes due to serialization problems. No code referring to the Spark context can run on a worker node, and no code referring to any Java object can be serialized. For example, it may be tempting to use len(trees) instead of ntrees in the code below - bang! Writing such a wrapper in Java/Scala would be much more elegant, for example by running the loop over decision trees on the worker nodes and thereby reducing communication costs. A minimal sketch of the pitfall follows.
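
To make the pitfall concrete, here is a minimal sketch (using the rf_model and scores names from the code below); the point is that a closure shipped to the workers may only capture plain Python values, never the py4j JavaArray:

# BAD: the lambda captures 'trees', a py4j JavaArray living on the driver;
# Spark tries to pickle it when shipping the closure to the workers and fails.
# scores.map(lambda x: x / len(trees))

# GOOD: copy the value into a plain Python int on the driver first, so the
# closure only captures a picklable primitive.
ntrees = rf_model.numTrees()
scores = scores.map(lambda x: x / ntrees)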

The test function below demonstrates that predict_proba yields the same test error as the predict method used in the original example.

from pyspark.mllib.tree import DecisionTreeModel

def predict_proba(rf_model, data):
    '''
    This wrapper overcomes the "binary" nature of predictions in the native
    RandomForestModel.
    '''

    # Collect the individual decision tree models by calling the underlying
    # Java model. These are returned as a JavaArray defined by py4j.
    trees = rf_model._java_model.trees()
    ntrees = rf_model.numTrees()
    scores = DecisionTreeModel(trees[0]).predict(data.map(lambda x: x.features))

    # For each remaining decision tree, apply its prediction to the entire
    # dataset and accumulate the results using 'zip'.
    for i in range(1, ntrees):
        dtm = DecisionTreeModel(trees[i])
        scores = scores.zip(dtm.predict(data.map(lambda x: x.features)))
        scores = scores.map(lambda x: x[0] + x[1])

    # Divide the accumulated scores by the number of trees to get the
    # fraction of votes for the positive class.
    return scores.map(lambda x: x / ntrees)

def testError(lap):
    # 'lap' is an RDD of (label, prediction) pairs.
    testErr = lap.filter(lambda vp: vp[0] != vp[1]).count() / float(lap.count())
    print('Test Error = ' + str(testErr))


def testClassification(trainingData, testData):

    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=50, maxDepth=30)

    # Compute test error by thresholding probabilistic predictions
    threshold = 0.5
    scores = predict_proba(model, testData)
    pred = scores.map(lambda x: 0 if x < threshold else 1)
    lab_pred = testData.map(lambda lp: lp.label).zip(pred)
    testError(lab_pred)

    # Compute test error by comparing binary predictions
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testError(labelsAndPredictions)

All-in-all, this was a nice exercise to learn Spark!

Phototube answered 10/3, 2015 at 12:29 Comment(2)
Thanks. It looks good, but your probabilities differ from those obtained by running RandomForest.trainRegressor() on the (binary) response and taking the model's predictions as probabilities. Conceptually, how is your method different from just taking the regression output?Trautman
I did not consider and have not used random forests for regression. For classification, one can trivially interpret the fraction of votes for the positive class as a probability, and this is precisely what my code does. I don't know how the probabilistic prediction is computed for regression.Phototube
This is now available.

Spark ML provides:

  • a predictionCol that contains the predicted label
  • and a probabilityCol that contains a vector with the probabilities for each label; this is what you were looking for!
  • You can also access the raw counts (via rawPredictionCol)

For more details, here is the Spark Documentation: http://spark.apache.org/docs/latest/ml-classification-regression.html#output-columns-predictions
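
For illustration, here is a minimal sketch of that DataFrame-based API (assuming Spark 2.x and the sample libsvm file from the question):

from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# Load the same sample data as in the question, but as a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
(train, test) = data.randomSplit([0.7, 0.3])

model = RandomForestClassifier(numTrees=50).fit(train)

# 'prediction' holds the predicted label, 'probability' a vector of
# per-class probabilities, and 'rawPrediction' the raw vote counts.
model.transform(test).select("prediction", "probability", "rawPrediction").show(5)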

Tropical answered 17/2, 2016 at 14:13 Comment(1)
Indeed - see here for an example: #43631531Dishpan
It will, however, be available with Spark 1.5.0 and the new Spark-ML API.

Panorama answered 4/8, 2015 at 9:50 Comment(0)
People have probably moved on from this post, but I hit the same problem today when trying to compute the accuracy of a multi-class classifier against a training set. So I thought I'd share my experience in case someone is trying this with MLlib...

The accuracy can be computed fairly easily as follows:

# Say you have a test set against which you want to run your classifier
(trainingset, testset) = data.randomSplit([0.7, 0.3])

# Convert the Spark DataFrame containing the test data to pandas
ptd = testset.toPandas()

# Count the number of labels matching the predictions; 'predictions'
# holds the model's predicted labels for the test set.
# The labels had to be shifted from 1-10 to 0-9, since labels take
# values from 0 .. numClasses-1.
correct = ((ptd.label - 1) == predictions).sum()

m = ptd.shape[0]
print((correct / float(m)) * 100)
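
As an aside, here is a sketch of an alternative that stays entirely in MLlib, using the built-in MulticlassMetrics evaluator (assuming Spark 2.x, where it exposes an accuracy property, and that 'model' is a trained classifier and 'testset' an RDD of LabeledPoint):

from pyspark.mllib.evaluation import MulticlassMetrics

# Pair each prediction with its true label as an RDD of (prediction, label).
predictionsAndLabels = model.predict(testset.map(lambda lp: lp.features)) \
                            .zip(testset.map(lambda lp: lp.label))
metrics = MulticlassMetrics(predictionsAndLabels)
print(metrics.accuracy * 100)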
Cachalot answered 17/7, 2017 at 4:12 Comment(0)
