VowpalWabbit: Differences and scalability
I am trying to ascertain how VowpalWabbit's "state" is maintained as the size of our input set grows. In a typical machine learning environment, if I have 1000 input vectors, I would expect to send all of those at once, wait for a model building phase to complete, and then use the model to create new predictions.

In VW, it appears that the "online" nature of the algorithm shifts this paradigm to be more performant and capable of adjusting in real-time.

  1. How is this real-time model modification implemented?

  2. Does VW consume increasing resources as the total input data size grows over time? That is, as I add more data to my VW model (while it is small), do the real-time adjustment calculations begin to take longer once the cumulative number of feature-vector inputs grows into the thousands, tens of thousands, or millions?

Perusse answered 30/1, 2012 at 14:13 Comment(1)
It depends what you mean by "[higher] total input data size over time". If you're using categorical features, then if "more data" implies "more levels in the categoricals", and especially if you turn on higher-order interactions, you get hash collisions which impact accuracy, so yes, you eventually need to increase the '-b' hash bit-depth, and thus your memory requirement for the feature hash. However, the weight vector (which is kept in memory) is still small. – Travelled
Just to add to carlosdc's good answer.

Some of the features that set vowpal wabbit apart, and allow it to scale to tera-feature (10^12) data sizes, are:

The online weight vector: vowpal wabbit maintains an in-memory weight vector, which is essentially the vector of weights for the model it is building. This is what you call "the state" in your question.
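
Roughly, that "state" is just the weight vector, updated one example at a time. Here is a minimal sketch of the idea in Python — this is not VW's actual code (VW's update is adaptive, normalized, and importance-aware; plain squared-loss SGD is used here only to illustrate that the state is a fixed-size weight vector):

```python
# Minimal sketch of an online learner whose only in-memory state is a weight vector.
# NOT VW's actual implementation: plain SGD with squared loss stands in for
# VW's adaptive/normalized/importance-aware updates.

def predict(w, x):
    # x: sparse example as a dict of feature index -> value
    return sum(w.get(i, 0.0) * v for i, v in x.items())

def learn_one(w, x, y, lr=0.1):
    # One example at a time: predict, then take a single gradient step.
    err = predict(w, x) - y
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - lr * err * v

w = {}  # the entire model state; its size depends on #features, not #examples
stream = [({0: 1.0, 3: 2.0}, 1.0),   # in practice these would be read one at a
          ({1: 1.0, 3: 0.5}, 0.0)]   # time from disk or a pipe, never all at once
for x, y in stream:
    learn_one(w, x, y)
print(w)
```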

Unbounded data size: The size of the weight vector is proportional to the number of features (independent input variables), not to the number of examples (instances). This is what makes vowpal wabbit, unlike many other (non-online) learners, scale in space. Since it doesn't need to load all the data into memory like a typical batch learner does, it can still learn from data sets that are too big to fit in memory.

Cluster mode: vowpal wabbit supports running on multiple hosts in a cluster, imposing a binary tree graph structure on the nodes and using the all-reduce reduction from leaves to root.
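
Very loosely, all-reduce sums a value (e.g. a gradient or a set of weights) up the tree from the leaves to the root and then broadcasts the total back down, so every node ends up with the same aggregate. A toy illustration of that data flow (the node layout and values here are hypothetical; VW's real cluster mode uses a spanning-tree server and network sockets):

```python
# Toy illustration of a tree all-reduce: sum partial values up from the leaves
# to the root, then push the total back down so every node holds the same result.
# Only the data flow is mimicked here, not VW's actual cluster protocol.

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)
        self.result = None

def reduce_up(node):
    # Sum this node's value with the reduced sums of its subtrees.
    return node.value + sum(reduce_up(c) for c in node.children)

def broadcast_down(node, total):
    node.result = total
    for c in node.children:
        broadcast_down(c, total)

# A small binary tree of workers, each holding a partial statistic.
leaves = [Node(1.0), Node(2.0), Node(3.0), Node(4.0)]
root = Node(0.5, [Node(0.25, leaves[:2]), Node(0.25, leaves[2:])])

broadcast_down(root, reduce_up(root))
print([leaf.result for leaf in leaves])  # every node sees the same global sum
```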

Hash trick: vowpal wabbit employs what's called the hashing trick. All feature names get hashed into an integer using murmurhash-32. This has several advantages: it is very simple and time-efficient, since there is no hash-table management or collision resolution to deal with, and features are simply allowed to occasionally collide. It turns out (in practice) that a small number of feature collisions in a training set with thousands of distinct features acts similarly to adding an implicit regularization term. This, counter-intuitively, often improves model accuracy rather than decreasing it. It is also agnostic to the sparseness (or density) of the feature space. Finally, it allows the input feature names to be arbitrary strings, unlike most conventional learners which require the feature names/IDs to be both a) numeric and b) unique.
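
A back-of-the-envelope version of the trick looks like this (md5 stands in for murmurhash-32 only so the example stays self-contained; in VW the table size is controlled by the -b option, which sets the number of hash bits):

```python
import hashlib

# Sketch of the hashing trick: arbitrary feature-name strings are mapped to
# indices in a fixed-size weight vector. VW uses murmurhash-32 and the -b flag;
# md5 is used here only to keep the example self-contained and deterministic.

BITS = 18                    # analogous to VW's default of 18 hash bits
NUM_BUCKETS = 1 << BITS      # the weight vector has 2^18 slots regardless of data size

def feature_index(name: str) -> int:
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % NUM_BUCKETS

weights = [0.0] * NUM_BUCKETS

# Arbitrary string feature names all land somewhere in the fixed-size table;
# the occasional collision acts like a mild, implicit regularizer.
for name in ("user=alice", "country=NZ", "word^vowpal"):
    print(name, "->", feature_index(name))
```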

Parallelism: vowpal wabbit exploits multi-core CPUs by running the parsing and the learning in two separate threads, adding further to its speed. This is what makes vw able to learn as fast as it can read the data. It turns out that most of the algorithms supported by vw are, counter-intuitively, bottlenecked by I/O speed rather than by learning speed.
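
The structure is a classic producer/consumer pipeline: one thread parses raw text into examples, the other consumes them and updates the weights. A schematic sketch of that structure (Python threads and a tiny in-line "file" stand in for VW's C++ internals; the input format shown only loosely resembles vw's):

```python
import threading, queue

# Schematic of a two-thread pipeline: one thread parses raw input into examples,
# the other consumes them and updates the model. Only the structure is shown;
# VW's real threads are C++ and far faster.

examples = queue.Queue(maxsize=1024)            # bounded buffer between the threads
raw_lines = ["1 | a:1 b:2", "0 | a:0.5 c:1"]    # stand-in for a huge input file

def parser():
    for line in raw_lines:                      # I/O + parsing thread
        label, _, feats = line.partition("|")
        x = {}
        for tok in feats.split():
            name, _, val = tok.partition(":")
            x[name] = float(val) if val else 1.0
        examples.put((float(label), x))
    examples.put(None)                          # sentinel: no more data

def learner():
    w = {}
    while (item := examples.get()) is not None: # learning thread
        y, x = item
        err = sum(w.get(k, 0.0) * v for k, v in x.items()) - y
        for k, v in x.items():
            w[k] = w.get(k, 0.0) - 0.1 * err * v
    print("learned weights:", w)

t1, t2 = threading.Thread(target=parser), threading.Thread(target=learner)
t1.start(); t2.start(); t1.join(); t2.join()
```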

Checkpointing and incremental learning: vowpal wabbit allows you to save your model to disk while you learn, and then to load the model and continue learning where you left off with the --save_resume option.
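
For example, a daily incremental-training workflow might look roughly like this (invoked through Python's subprocess here to keep all the examples in one language; the file names are placeholders, and vw must be on your PATH):

```python
import subprocess

# Sketch of incremental learning with checkpoints: train on today's data, save
# the full learner state, then continue tomorrow from where we left off.
# "day1.vw", "day2.vw" and "model.vw" are placeholder file names.

subprocess.run(["vw", "-d", "day1.vw", "-f", "model.vw", "--save_resume"], check=True)
subprocess.run(["vw", "-d", "day2.vw", "-i", "model.vw", "-f", "model.vw",
                "--save_resume"], check=True)
```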

Test-like error estimate: The average loss calculated by vowpal wabbit "as it goes" is always on unseen (out-of-sample) data (*). This eliminates the need to bother with pre-planned hold-out sets or to do cross-validation. The error rate you see during training is 'test-like'.
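
The trick is simply that each example is scored before the learner updates on it, so the running average loss is always computed on data the model has not yet seen ("progressive validation"). A schematic of the bookkeeping, again with plain SGD standing in for VW's learner:

```python
# Schematic of progressive ("test-like") loss: score each example BEFORE
# learning from it, so the running average is always on unseen data.
# Plain squared-loss SGD stands in for VW's actual learner here.

stream = [({0: 1.0}, 1.0), ({0: 0.5}, 0.0), ({0: 2.0}, 2.0)]

w, total_loss = {}, 0.0
for n, (x, y) in enumerate(stream, start=1):
    pred = sum(w.get(i, 0.0) * v for i, v in x.items())  # prediction first...
    total_loss += (pred - y) ** 2                         # ...loss on the unseen example
    err = pred - y                                        # ...and only then the update
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - 0.1 * err * v
    print(f"after {n} examples, average loss = {total_loss / n:.4f}")
```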

Beyond linear models: vowpal wabbit supports several algorithms, including matrix factorization (roughly sparse matrix SVD), Latent Dirichlet Allocation (LDA), and more. It also supports on-the-fly generation of term interactions (bi-linear, quadratic, cubic, and feed-forward sigmoid neural-net with user-specified number of units), multi-class classification (in addition to basic regression and binary classification), and more.
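
For illustration, here is roughly how a few of these modes are switched on from the command line (again via subprocess to stay in one language; "train.vw" is a placeholder, and the exact options should be checked against vw --help for your version):

```python
import subprocess

# Hypothetical invocations showing how some of these modes are enabled.
# "train.vw" is a placeholder data file.

# Quadratic interactions between namespaces 'a' and 'b', plus a single
# hidden layer of 10 sigmoid units:
subprocess.run(["vw", "-d", "train.vw", "-q", "ab", "--nn", "10"], check=True)

# One-against-all multi-class classification with 5 classes:
subprocess.run(["vw", "-d", "train.vw", "--oaa", "5"], check=True)
```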

There are tutorials and many examples in the official vw wiki on github.

(*) One exception is if you use multiple passes with --passes N option.

Hellenhellene answered 21/3, 2013 at 3:59 Comment(7)
I thought vw only did 1-layer neural networks? (e.g. --nn 10 for a single layer with 10 neurons). How would you do a multi-layer network? – Worldweary
@Zach: I think you're correct. I edited the answer accordingly. Thanks and sorry. – Hellenhellene
Regarding VW exploiting multiple cores, I'm under the impression VW only uses one core no matter how many cores you have. Yes, it has 2 threads: one for I/O and one for computing. That's all. – Herodias
Yes, vw runs in two threads, so if you have multiple cores they run on two cores. If you time vw .... you can see how the time summary includes CPU utilization exceeding 100% (150% is typical). Similar utilities like libSVM or liblinear don't have this feature. – Hellenhellene
Can you elaborate on the part where you say "most supported algorithms in VW are bottlenecked by IO speed"? Judging from their paper it actually seems that that is not the case. – Goodfellowship
On modern hardware vw runs in two threads: an I/O thread (file read + example parse + feature-name hashing + feature-value atof) and a learning thread (basically SGD with all its enhancements). The read (I/O) thread is the slower of the two and is what limits the throughput (bytes per second). In this sense, you get all the learning "for free". Also, the I/O thread is slower than a simple cat (pure I/O) of the same data by only a small constant (about 5 in my tests). If you want to make vw faster, you should focus on making the I/O thread faster (faster atof, faster hashing, etc.). – Hellenhellene
It's conceivable one could then do the parsing in more than one thread as well. – Thicket
VW is a (very) sophisticated implementation of stochastic gradient descent. You can read more about stochastic gradient descent here.

It turns out that a good implementation of stochastic gradient descent is basically I/O bound: it goes as fast as you can feed it the data, so VW has some sophisticated data structures to "compile" the data.

Therefore the answer to question (1) is: by doing stochastic gradient descent, and the answer to question (2) is: definitely not.

Grace answered 30/1, 2012 at 17:27 Comment(0)
