Can anyone give me hints as to why XLA-JIT has better performance on the CPU backend?
I ran TensorFlow with and without XLA-JIT (manual mode) on the MNIST benchmark on a single CPU. With XLA-JIT enabled, I measured a 13.6x speedup over TensorFlow without it.
Since operation fusion is often cited as one of XLA-JIT's main advantages, I naturally suspected this technique might be the reason, so I read the source code and found the fusion procedure is roughly as follows (please correct me if anything is wrong):
- Check whether any operations in an HloComputation (CompOld) can be fused;
- If so, a new Fusion instruction is added to CompOld, and the fused operations are removed from CompOld;
- Then a new HloComputation (CompNew) is created, consisting of the fused operations. The Fusion instruction added to CompOld holds a pointer to CompNew.
- When it comes to the backend, LLVM IR is emitted independently for both CompOld and CompNew.
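To make sure I'm describing the rewrite correctly, here is a toy model of the steps above in plain Python (not XLA's actual C++ classes — the class names, the `FUSIBLE_OPS` set, and the single-group fusion heuristic are all simplifications I made up for illustration):

```python
# Toy model of the fusion rewrite described above:
# fusible instructions are moved out of the original computation (CompOld)
# into a new computation (CompNew), and a single Fusion instruction left
# behind in CompOld points at CompNew.

FUSIBLE_OPS = {"add", "multiply", "exp"}  # hypothetical elementwise ops

class Instruction:
    def __init__(self, op, name, fused_computation=None):
        self.op = op
        self.name = name
        # Only set on "fusion" instructions: the pointer to CompNew.
        self.fused_computation = fused_computation

class Computation:
    def __init__(self, name, instructions):
        self.name = name
        self.instructions = instructions

def run_fusion_pass(comp_old):
    """Fuse all fusible instructions of comp_old into one new computation."""
    # Step 1: check whether any instructions can be fused.
    fusible = [i for i in comp_old.instructions if i.op in FUSIBLE_OPS]
    if not fusible:
        return None
    # Step 3: build CompNew out of the fused instructions.
    comp_new = Computation(comp_old.name + ".fused", fusible)
    # Step 2: add a Fusion instruction to CompOld (pointing at CompNew)
    # and remove the fused instructions from CompOld.
    fusion_instr = Instruction("fusion", "fusion.0",
                               fused_computation=comp_new)
    comp_old.instructions = [
        i for i in comp_old.instructions if i.op not in FUSIBLE_OPS
    ] + [fusion_instr]
    return comp_new

comp = Computation("main", [
    Instruction("parameter", "p0"),
    Instruction("add", "add.1"),
    Instruction("multiply", "mul.2"),
    Instruction("dot", "dot.3"),  # not elementwise, stays in CompOld
])
comp_new = run_fusion_pass(comp)
# comp now contains: parameter, dot, fusion (pointing at comp_new)
# comp_new contains: add, multiply
```

In this simplified picture, fusion only changes the graph's structure; the backend would then still emit LLVM IR for CompOld and CompNew separately, which is exactly why I'm puzzled about where the 13.6x comes from.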
Considering the significant performance improvement, I suspect there must be something more that I'm missing or mistaken about. May I have your advice?