Scalable, Efficient Hierarchical Softmax in Tensorflow?

I'm interested in implementing a hierarchical softmax model that can handle large vocabularies, say on the order of 10M classes. What is the best way to implement this so that it is both scalable to large class counts and efficient? For instance, at least one paper has shown that HS can achieve a ~25x speedup for large vocabs when using a 2-level tree where each node has sqrt(N) classes. I'm also interested in a more general version for an arbitrary depth tree with an arbitrary branching factor.

There are a few options that I see here:

1) Run tf.gather for every batch, where we gather the indices and splits. This creates problems with large batch sizes and fat trees, where the coefficients end up duplicated many times over, leading to OOM errors. (A rough sketch of this variant follows the list.)

2) Similar to #1, we could use tf.embedding_lookup, which would help with the OOM errors but now keeps everything on the CPU and slows things down quite a bit.

3) Use tf.map_fn with parallel_iterations=1 to process each sample separately and go back to using gather. This is much more scalable but does not really get close to the 25x speedup due to the serialization.
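
For concreteness, here is roughly the shape of option 1 for a balanced 2-level tree (uniform clusters, so no ragged splits are needed). This is only a sketch: the sizes, the (cluster, word-within-cluster) target encoding, and the variable names are placeholders, not a reference implementation.

    import numpy as np
    import tensorflow as tf

    # Placeholder sizes (much smaller than 10M so the sketch is cheap to run).
    N = 100_000                      # vocabulary size
    K = int(np.ceil(np.sqrt(N)))     # number of clusters, ~sqrt(N)
    M = int(np.ceil(N / K))          # words per cluster (last cluster is padded)
    d, B = 300, 32                   # hidden size, batch size

    # Level 1: softmax over clusters. Level 2: softmax over the words inside
    # the target cluster, with the word weights stored grouped by cluster.
    W_cluster = tf.Variable(tf.random.normal([d, K], stddev=0.01))
    b_cluster = tf.Variable(tf.zeros([K]))
    W_word = tf.Variable(tf.random.normal([K, M, d], stddev=0.01))
    b_word = tf.Variable(tf.zeros([K, M]))

    def hs_loss(hidden, cluster_ids, word_ids):
        """hidden: [B, d]; targets given as (cluster id, word id within that cluster)."""
        cluster_logits = tf.matmul(hidden, W_cluster) + b_cluster      # [B, K]
        loss_cluster = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=cluster_ids, logits=cluster_logits)

        # This gather copies an [M, d] slice per example -- the duplication
        # that blows up memory for big batches and fat trees.
        W_sel = tf.gather(W_word, cluster_ids)                         # [B, M, d]
        b_sel = tf.gather(b_word, cluster_ids)                         # [B, M]
        word_logits = tf.einsum('bd,bmd->bm', hidden, W_sel) + b_sel   # [B, M]
        loss_word = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=word_ids, logits=word_logits)
        return tf.reduce_mean(loss_cluster + loss_word)

Option 2 is essentially the same computation with tf.nn.embedding_lookup in place of tf.gather, and option 3 wraps the second level in tf.map_fn over the batch instead of gathering per-example slices.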

Is there a better way to implement HS? Are there different ways for deep and narrow vs. short and wide trees?

Willin asked 23/5, 2017 at 16:36 Comment(3)
They vary based on the task. Language models have larger batches around 400 with hidden sizes around 300; other tasks may have smaller batch sizes and larger hidden sizes, like ImageNet classification. VRAM and RAM are pretty large relative to the problem (though GPU RAM is not).Willin
Can I have a look at your HS implementation in Tensorflow? I currently need it too.Middelburg
It's a little messy, but see here: github.com/tansey/sdp/blob/… -- in retrospect, I would suggest using pytorch or another dynamic graph framework.Willin

You mention that you want GPU-class performance:

but now keeps everything on the CPU and slows things down quite a bit

and wish to use a 300-unit hidden size and a 10M-word dictionary.

This means that, assuming float32, you'll need 4 * 300 * 10M * 2 bytes = 24 GB just to store the parameters and their gradients for the output layer.

Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training.

Realistically, you'll need a lot more GPU memory, because you'll also need to store:

  • other parameters and their gradients

  • optimizer data, e.g. velocities in momentum training (a rough estimate follows this list)

  • activations and backpropagated temporary data

  • framework-specific overhead
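
To give a feel for the optimizer-state bullet, here is a rough estimate for this one output layer alone (the slot counts are the standard ones for these optimizers; the layer is the same 10M x 300 float32 layer as above):

    # Optimizer state for just the 10M x 300 float32 output layer.
    params = 10_000_000 * 300
    for name, slots in [("SGD + momentum", 1), ("Adam", 2)]:
        print(name, 4 * params * slots / 1e9, "GB")   # 12 GB and 24 GB respectively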

Therefore, if you want to do all computation on GPUs, you'll have no choice but to distribute this layer across multiple high-memory GPUs.

However, you now have another problem:

To make this concrete, let's suppose you have a 2-level HSM with 3K classes, with 3K words per class (9M words in total). You distribute the 3K classes across 8 GPUs, so that each hosts 384 classes.

What if all target words in a batch are from the same 384 classes, i.e. they belong to the same GPU? One GPU will be doing all the work, while the other 7 wait for it.

The problem is that even if the target words in a batch belong to different GPUs, you'll still get the same performance as in the worst-case scenario if you do this computation in TensorFlow, because TensorFlow is a "specify-and-run" framework: the computational graph is the same for the best case and the worst case.
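
Here is a tiny NumPy illustration of that skew, reading "3K" as 3072 so that 3072 / 8 = 384 classes per GPU (the batch contents are made up):

    import numpy as np

    num_classes, num_gpus, batch = 3072, 8, 400
    gpu_of_class = np.arange(num_classes) // (num_classes // num_gpus)   # 384 classes per GPU

    rng = np.random.default_rng(0)
    spread_batch = rng.integers(0, num_classes, size=batch)   # targets spread over all classes
    skewed_batch = rng.integers(0, 384, size=batch)           # every target on GPU 0's classes

    print(np.bincount(gpu_of_class[spread_batch], minlength=num_gpus))   # roughly even
    print(np.bincount(gpu_of_class[skewed_batch], minlength=num_gpus))   # [400, 0, ..., 0]
    # With a static graph every GPU executes the same ops in both cases,
    # so the step time is always gated by the busiest shard.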

What is the best way to implement this so that it is both scalable to large class counts and efficient?

The above inefficiency of model parallelism (each GPU must process the whole batch) suggests that one should try to keep everything in one place.

Let us suppose that you are either implementing everything on the host, or on 1 humongous GPU.

  1. If you are not modeling sequences, or if you are, but there is only one output for the whole sequence, then the memory overhead from copying the parameters, to which you referred, is negligible compared to the memory requirements described above:

    400 == batch size << number of classes == 3K

    In this case, you could simply use gather or embedding_lookup (although the copying is inefficient).

  2. However, if you do model sequences of length, say, 100, with output at every time step, then the parameter copying becomes a big issue (a rough estimate of the copy volume follows this list).

    In this case, I think you'll need to drop down to C++ / CUDA C and implement this whole layer and its gradient as a custom op.
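
To put a rough number on that copying (a batch of 400, sequences of length 100, about 3K words per cluster, and a 300-unit hidden size; all of these are illustrative):

    # Volume of the temporary weight copies made by gather when every
    # time step of every sequence needs its own [words_per_cluster, hidden] slice.
    batch, seq_len, words_per_cluster, hidden = 400, 100, 3000, 300
    copies_gb = batch * seq_len * words_per_cluster * hidden * 4 / 1e9
    print(f"{copies_gb:.0f} GB of gathered slices per forward pass")   # ~144 GB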

Policlinic answered 5/6, 2017 at 7:24 Comment(7)
So you're saying the only efficient way to implement this is to use the standard embedding_lookup that I suggested in #2? It seems reasonable, but I'd wonder to what extent you really will see the GPU stalling that you're describing on real-world datasets, which is sort of what I'm looking for. Also, sampled softmax is compared against in the paper that I linked to and has been compared against thoroughly in a number of other papers.Willin
Also, what if one could handle everything on one GPU? Say in the future I have a 32GB GPU for instance.Willin
@WesleyTansey "sampled softmax is compared against" - I see it now. See this and other updates.Policlinic
Thanks. So that seems like it's sort of just agreeing with me. What I'm looking for here is some hard numbers showing that there is (or isn't) a better way to do this than gather. How would one prevent copying via CUDA? What would the performance gain really be?Willin
@WesleyTansey "So that seems like it's sort of just agreeing with me. " I pointed out that your memory problem starts before you even train the network (that wasn't in your Q). I also pointed out the inherent inefficiency of trying to do this on multiple GPUs in TF (ditto). While I know C++ and CUDA C, and it's obvious to me that this can be done, implementing this for you is too much work, sorry.Policlinic
"Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training." - this is not true. Hierarchical softmax greatly reduce the number of trainable parameters and thus memory requirements. For simple dense layer + softmax you have n_l times n_o weights where n_l us the dimentionality of the last layer of the network and n_o is the dimentionality of the output. For hierarchical softmax of depth 1 you have n_l times n_h plus n_l times n_o/n_h. If n_h is approximately sqrt(n_o) this means a lot less parametersYellowthroat
@Yellowthroat "Hierarchical softmax greatly reduce the number of trainable parameters and thus memory requirements." 3000x parameter savings?! That's laughably wrong! You must have totally misunderstood the tutorials or whatever you were reading. Each time step in HSM involves fewer parameters. However, these are different parameter subsets for each time step. Overall, there are no savings in the number of parameters.Policlinic
