Correctness of logistic regression in Vowpal Wabbit?

I have started using Vowpal Wabbit for logistic regression, however I am unable to reproduce the results it gives. Perhaps there is some undocumented "magic" it does, but has anyone been able to replicate / verify / check the calculations for logistic regression?

For example, with the simple data below, we aim to model the way age predicts label. It is obvious there is a strong relationship as when age increases the probability of observing 1 increases.

As a simple unit test, I used the 12 rows of data below:

Now, performing a logistic regression on this dataset, using R , SPSS or even by hand, produces a model which looks like L = 0.2294*age - 14.08. So if I substitude the age, and use the logit transform prob=1/(1+EXP(-L)) I can obtain the predicted probabilities which range from 0.0001 for the first row, to 0.9864 for the last row, as reasonably expected.

If I plug in the same data in Vowpal Wabbit,

-1 'P1 |f age:20
-1 'P2 |f age:25
-1 'P3 |f age:30
-1 'P4 |f age:35
-1 'P5 |f age:40
-1 'P6 |f age:50
1 'P7 |f age:60
-1 'P8 |f age:65
1 'P9 |f age:70
1 'P10 |f age:75
1 'P11 |f age:77
1 'P12 |f age:80

And then perform a logistic regression using

vw -d data.txt -f demo_model.vw --loss_function logistic --invert_hash aaa

(command line consistent with How to perform logistic regression using vowpal wabbit on very imbalanced dataset ) , I obtain a model L= -0.00094*age - 0.03857 , which is very different.

The predicted values obtained using -r or -p further confirm this. The resulting probabilities end up nearly all the same, for example 0.4857 for age=20, and 0.4716 for age=80, which is extremely off.

I have noticed this inconsistency with larger datasets too. In what sense is Vowpal Wabbit carrying out the logistic regression differently, and how are the results to be interpreted?

This is a common misunderstanding of vowpal wabbit.

One cannot compare batch learning with online learning.

vowpal wabbit is not a batch learner. It is an online learner. Online learners learn by looking at examples one at a time and slightly adjusting the weights of the model as they go.

There are advantages and disadvantages to online learning. The downside is that convergence to the final model is slow/gradual. The learner doesn't do a "perfect" job at extracting information from each example, because the process is iterative. Convergence on a final result is deliberately restrained/slow. This can make online learners appear weak on tiny data-sets like the above.

There are several upsides though:

Online learners don't need to load the full data into memory (they work by examining one example at a time and adjusting the model based on the real-time observed per-example loss) so they can scale easily to billions of examples. A 2011 paper by 4 Yahoo! researchers describes how vowpal wabbit was used to learn from a tera (10^12) feature data-set in 1 hour on 1k nodes. Users regularly use vw to learn from billions of examples data-sets on their desktops and laptops.
Online learning is adaptive and can track changes in conditions over time, so it can learn from non-stationary data, like learning against an adaptive adversary.
Learning introspection: one can observe loss convergence rates while training and identify specific issues, and even gain significant insights from specific data-set examples or features.
Online learners can learn in an incremental fashion so users can intermix labeled and unlabeled examples to keep learning while predicting at the same time.
The estimated error, even during training, is always "out-of-sample" which is a good estimate of the test error. There's no need to split the data into train and test subsets or perform N-way cross-validation. The next (yet unseen) example is always used as a hold-out. This is a tremendous advantage over batch methods from the operational aspect. It greatly simplifies the typical machine-learning process. In addition, as long as you don't run multiple-passes over the data, it serves as a great over-fitting avoidance mechanism.

Online learners are very sensitive to example order. The worst possible order for an online learner is when classes are clustered together (all, or almost all, -1s appear first, followed by all 1s) like the example above does. So the first thing to do to get better results from an online learner like vowpal wabbit, is to uniformly shuffle the 1s and -1s (or simply order by time, as the examples typically appear in real-life).

OK now what?

Q: Is there any way to produce a reasonable model in the sense that it gives reasonable predictions on small data when using an online learner?

A: Yes, there is!

You can emulate what a batch learner does more closely, by taking two simple steps:

Uniformly shuffle 1 and -1 examples.
Run multiple passes over the data to give the learner a chance to converge

Caveat: if you run multiple passes until error goes to 0, there's a danger of over-fitting. The online learner has perfectly learned your examples, but it may not generalize well to unseen data.

The second issue here is that the predictions vw gives are not logistic-function transformed (this is unfortunate). They are akin to standard deviations from the middle point (truncated at [-50, 50]). You need to pipe the predictions via utl/logistic (in the source tree) to get signed probabilities. Note that these signed probabilities are in the range [-1, +1] rather than [0, 1]. You may use logistic -0 instead of logistic to map them to a [0, 1] range.

So given the above, here's a recipe that should give you more expected results:

# Train:
vw train.vw -c --passes 1000 -f model.vw --loss_function logistic --holdout_off


# Predict on train set (just as a sanity check) using the just generated model:
vw -t -i model.vw train.vw -p /dev/stdout | logistic | sort -tP -n -k 2

Giving this more expected result on your data-set:

-0.95674145247658 P1
-0.930208359811439 P2
-0.888329575506748 P3
-0.823617739247262 P4
-0.726830630992614 P5
-0.405323815830325 P6
0.0618902961794472 P7
0.298575998150221 P8
0.503468453150847 P9
0.663996516371277 P10
0.715480084449868 P11
0.780212725426778 P12

You could make the results more/less polarized (closer to 1 on the older ages and closer to -1 on the younger) by increasing/decreasing the number of passes. You may also be interested in the following options for training:

--max_prediction <arg>     sets the max prediction to <arg>
--min_prediction <arg>     sets the min prediction to <arg>
-l <arg>                   set learning rate to <arg>

For example, by increasing the learning rate from the default 0.5 to a large number (e.g. 10) you can force vw to converge much faster when training on small data-sets thus requiring less passes to get there.

Update

As of mid 2014, vw no longer requires the external logistic utility to map predictions back to [0,1] range. A new --link logistic option maps predictions to the logistic function [0, 1] range. Similarly --link glf1 maps predictions to a generalized logistic function [-1, 1] range.

Recommended topics

Hot tags