You mention that you want GPU-class performance:
"but now it keeps everything on the CPU and slows things down quite a bit"
and wish to use a 300-unit hidden size and a 10M-word dictionary.
This means that, assuming float32, you'll need 4 * 300 * 10M * 2 bytes = 24 GB just to store the parameters and the gradients for the output layer (4 bytes per float, 300 hidden units, 10M output words, and a factor of 2 for the weights plus their gradients).
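As a quick sanity check, the arithmetic in Python (this counts only the output-layer weight matrix and its gradient, nothing else):

```python
bytes_per_float32 = 4
hidden_size = 300
vocab_size = 10 * 10**6

# One float32 per (hidden unit, word) for the weights, and one more for the gradients.
output_layer_bytes = bytes_per_float32 * hidden_size * vocab_size * 2
print(output_layer_bytes / 10**9)  # 24.0 (GB)
```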
Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training.
Realistically, you'll need a lot more GPU memory, because you'll also need to store:
other parameters and their gradients
optimizer data, e.g. velocities in momentum training
activations and backpropagated temporary data
framework-specific overhead
Therefore, if you want to do all computation on GPUs, you'll have no choice but to distribute this layer across multiple high-memory GPUs.
However, you now have another problem:
To make this concrete, let's suppose you have a 2-level HSM with 3K classes and 3K words per class (9M words in total). You distribute the 3K classes across 8 GPUs, so that each hosts 384 classes.
What if all target words in a batch are from the same 384 classes, i.e. they belong to the same GPU? One GPU will be doing all the work, while the other 7 wait for it.
The problem is that even if the target words in a batch are spread across different GPUs, you'll still get the same performance as in this worst-case scenario if you do the computation in TensorFlow: TensorFlow is a "specify-and-run" framework, so the computational graph is the same for the best case and the worst case, and every GPU still executes its full part of it on every step.
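To illustrate, here is a minimal TensorFlow 1.x-style sketch of such a model-parallel class-level layer (the variable names and shapes are illustrative, not taken from your code). Each GPU's matmul is baked into the static graph, so it runs over the full batch on every step, whether or not any of that GPU's classes actually occur in the batch:

```python
import tensorflow as tf  # TF 1.x-style graph code

num_gpus = 8
classes_per_gpu = 384     # ~3K classes spread over 8 GPUs
hidden_size = 300
batch_size = 400

# Hidden states coming out of the rest of the model.
hidden = tf.placeholder(tf.float32, [batch_size, hidden_size])

shard_logits = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        # Each GPU stores the class-level weights for its own shard of classes...
        w = tf.get_variable('class_weights_%d' % i,
                            [classes_per_gpu, hidden_size])
        # ...but multiplies the *whole* batch by them on every step, because the
        # static graph cannot skip a shard that happens to be unused in this batch.
        shard_logits.append(tf.matmul(hidden, w, transpose_b=True))

# Class-level logits for the full class set: [batch_size, num_gpus * classes_per_gpu].
class_logits = tf.concat(shard_logits, axis=1)
```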
You also ask: "What is the best way to do this to both be scalable to large class counts and efficient?"
The above inefficiency of model parallelism (each GPU must process the whole batch) suggests that one should try to keep everything in one place.
Let us suppose that you implement everything either on the host or on one humongous GPU.
If you are not modeling sequences, or if you are but there is only one output for the whole sequence, then the memory overhead from the parameter copying you referred to is negligible compared to the memory requirements described above:
400 == batch size << number of classes == 3K
In this case, you could simply use tf.gather or tf.nn.embedding_lookup (although the copying is inefficient).
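For instance, the word-level part of the 2-level HSM above could be written roughly like this (a sketch, assuming the word-level weights are stored as a single [num_classes, words_per_class, hidden_size] tensor; the names are made up):

```python
import tensorflow as tf

num_classes = 3000
words_per_class = 3000
hidden_size = 300
batch_size = 400

# All word-level output weights, grouped by class.
word_weights = tf.get_variable(
    'word_weights', [num_classes, words_per_class, hidden_size])

hidden = tf.placeholder(tf.float32, [batch_size, hidden_size])
target_class = tf.placeholder(tf.int32, [batch_size])  # class of each target word

# Copy out only the blocks needed for this batch:
# [batch_size, words_per_class, hidden_size]. This is the copying in question;
# it stays cheap as long as batch_size << num_classes.
blocks = tf.nn.embedding_lookup(word_weights, target_class)
# (tf.gather(word_weights, target_class) does the same thing here.)

# Per-example within-class logits: [batch_size, words_per_class].
word_logits = tf.einsum('bvh,bh->bv', blocks, hidden)
```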
However, if you do model sequences of length, say, 100, with output at every time step, then the parameter copying becomes a big issue.
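A back-of-the-envelope estimate, under the same assumptions as the sketch above (each target copies one [words_per_class, hidden_size] float32 block):

```python
bytes_per_float32 = 4
words_per_class = 3000
hidden_size = 300
batch_size = 400
seq_len = 100

block_bytes = bytes_per_float32 * words_per_class * hidden_size   # 3.6 MB per target
one_output = batch_size * block_bytes                             # ~1.4 GB per batch
every_step = batch_size * seq_len * block_bytes                   # ~144 GB per batch
print(one_output / 10**9, every_step / 10**9)
```

With a single output per sequence the copies stay small next to the ~24 GB of parameters and gradients, but with an output at every time step they dwarf them.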
In this case, I think you'll need to drop down to C++ / CUDA C and implement this whole layer and its gradient as a custom op.