Random Forests and ROC Curves in Julia

    using DecisionTree
    using RDatasets
    using MLBase

    quakes_data = dataset("datasets", "quakes");

    # Add a binary column to use as the class label for classification
    quakes_data[:MagGT5] = convert(Array{Int32,1}, quakes_data[:Mag] .> 5.0)

    # Get features and labels, where label = 1 is mag > 5 and label = 2 is mag <= 5
    features = convert(Array, quakes_data[:, [1:3;5]]);
    labels = convert(Array, quakes_data[:, 6]);
    labels[labels.==0] = 2

    # Create a random forest model with the tuning parameters I want
    r_f_model = RandomForestClassifier(nsubfeatures = 3, ntrees = 50, partialsampling=0.7, maxdepth = 4)

    # Train the model in-place on the dataset (there isn't a fit function without the in-place functionality)
    DecisionTree.fit!(r_f_model, features, labels)

    # Apply the trained model to the test features data set (here I haven't partitioned into training and test)
    r_f_prediction = convert(Array{Int64,1}, DecisionTree.predict(r_f_model, features))

    # Apply the model to the training set and look at the model stats
    TrainingROC = roc(labels, r_f_prediction)

    # Stats for the model applied to the training set:
    # p::T   # positive in ground-truth
    # n::T   # negative in ground-truth
    # tp::T  # correct positive prediction
    # tn::T  # correct negative prediction
    # fp::T  # (incorrect) positive prediction when ground-truth is negative
    # fn::T  # (incorrect) negative prediction when ground-truth is positive

The task in binary classification is to assign a 0/1 (or true/false, red/blue) label to a new, unlabeled data point. Most classification algorithms are designed to output a continuous real value. This value is optimized to be higher for points whose label is 1 and lower for points whose label is 0. To turn this value into a 0/1 prediction, an additional threshold is used: points with a value above the threshold are predicted to be 1, and points with a value below it are predicted to be 0.
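As a minimal sketch of that thresholding step, assuming you already have a vector of real-valued scores (the `scores` values and the `apply_threshold` helper below are made up for illustration and are not part of DecisionTree or MLBase):

```julia
# Hypothetical scores from some classifier; higher means "more likely label 1".
scores = [0.10, 0.35, 0.62, 0.80, 0.95]

# Predict 1 when the score is strictly above the threshold, 0 otherwise.
apply_threshold(scores, threshold) = Int.(scores .> threshold)

low  = apply_threshold(scores, 0.3)   # lower threshold: more points labeled 1
high = apply_threshold(scores, 0.7)   # higher threshold: fewer points labeled 1

println("threshold 0.3 -> ", low)
println("threshold 0.7 -> ", high)
```

The same scores give different 0/1 predictions depending only on where the threshold sits.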

Why is this setup useful? Because sometimes mispredicting a 0 instead of a 1 is more costly, and then you can set the threshold low, making the algorithm predict 1s more often.

In the extreme case where predicting 0 instead of 1 costs the application nothing (while the opposite error still has a cost), you can set the threshold at infinity so that the classifier always outputs 0. That is obviously the best solution, since it incurs no cost.
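One way to make the cost argument concrete is to count both kinds of error at each candidate threshold and weight them. This is only a sketch: the data, the cost values, and the helper names `error_counts` and `total_cost` are assumptions chosen for illustration.

```julia
# Count the two kinds of error at a given threshold.
function error_counts(scores, truth, threshold)
    predicted = scores .> threshold
    missed = count((truth .== 1) .& (.!predicted))    # predicted 0, really 1
    false_alarms = count((truth .== 0) .& predicted)  # predicted 1, really 0
    return missed, false_alarms
end

# Weighted total cost of the errors made at a given threshold.
function total_cost(scores, truth, threshold; cost_missed, cost_false_alarm)
    missed, false_alarms = error_counts(scores, truth, threshold)
    return cost_missed * missed + cost_false_alarm * false_alarms
end

# Hypothetical scores and true labels.
scores = [0.10, 0.35, 0.62, 0.80, 0.95]
truth  = [0, 1, 0, 1, 1]

# Candidate thresholds: every observed score plus the two extremes.
candidates = [-Inf; sort(scores); Inf]

# Case 1: predicting 0 when the truth is 1 is ten times as costly as the reverse.
costs = [total_cost(scores, truth, t; cost_missed=10, cost_false_alarm=1)
         for t in candidates]
best_threshold = candidates[argmin(costs)]

# Case 2: predicting 0 when the truth is 1 costs nothing, so always predicting 0 is free.
cost_always_zero = total_cost(scores, truth, Inf; cost_missed=0, cost_false_alarm=1)

println("best threshold when misses are costly: ", best_threshold)
println("cost of always predicting 0 when misses are free: ", cost_always_zero)
```

When missed 1s are expensive the cheapest threshold ends up low, and when they are free the infinite threshold incurs no cost at all.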

The threshold trick cannot eliminate errors from the classifier: no classifier in real-world problems is perfect or free from noise. What it can do is change the ratio between the 0-when-really-1 errors and the 1-when-really-0 errors in the final classification.

As you increase the threshold, more points are classified with a 0 label. Consider a chart with the fraction of points classified as 0 on the x-axis, and the fraction of points with a 0-when-really-1 error on the y-axis. For each value of the threshold, plot a point for the resulting classifier on this chart. Plotting a point for every threshold gives a curve. This is (some variant of) the ROC curve, which summarizes the abilities of the classifier. The standard ROC curve puts the false positive rate (1-when-really-0 errors as a fraction of all true 0s) on the x-axis and the true positive rate (correctly predicted 1s as a fraction of all true 1s) on the y-axis. An often-used metric for the quality of classification is the AUC, or area under the curve, of this chart, but in fact the whole curve can be of interest in applications.
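Here is a rough sketch of tracing such a curve by hand, using the standard axes (false positive rate against true positive rate) rather than the variant described above. It uses only Base Julia; `roc_points`, `auc`, and the toy data are hypothetical names and values for illustration, not the MLBase API.

```julia
# Return false positive rates and true positive rates, one pair per threshold.
function roc_points(scores, truth)
    n_pos = count(truth .== 1)
    n_neg = count(truth .== 0)
    # Sweep thresholds from high to low so the curve runs from (0, 0) to (1, 1).
    thresholds = [sort(unique(scores), rev=true); -Inf]
    fpr = Float64[]
    tpr = Float64[]
    for t in thresholds
        predicted = scores .> t   # predict 1 above the threshold
        push!(tpr, count(predicted .& (truth .== 1)) / n_pos)
        push!(fpr, count(predicted .& (truth .== 0)) / n_neg)
    end
    return fpr, tpr
end

# Area under the curve by the trapezoidal rule.
function auc(fpr, tpr)
    area = 0.0
    for i in 2:length(fpr)
        area += (fpr[i] - fpr[i-1]) * (tpr[i] + tpr[i-1]) / 2
    end
    return area
end

# Hypothetical scores and true labels.
scores = [0.10, 0.35, 0.62, 0.80, 0.95]
truth  = [0, 1, 0, 1, 1]

fpr, tpr = roc_points(scores, truth)
area = auc(fpr, tpr)

for (x, y) in zip(fpr, tpr)
    println("FPR = ", x, ", TPR = ", y)
end
println("AUC = ", area)
```

Note that this needs real-valued scores. If the classifier only hands back hard 0/1 (or 1/2) labels, there is a single operating point rather than a curve.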

A summary like this appears in many texts on machine learning, which are a Google query away.

Hope this clarifies the role of the threshold and its relation to ROC curves.
