Can anyone give me hints as to why XLA-JIT has better performance on the CPU backend?
I ran TensorFlow with and without XLA-JIT (manual mode) on the MNIST benchmark on a single CPU. With XLA-JIT enabled, I measured a 13.6x speedup over TensorFlow without it.
Since operation fusion is often cited as one of XLA-JIT's main advantages, I naturally suspected this technique might be the reason, so I read the source code and found the fusion procedure is roughly as follows (please correct me if anything is wrong):
- Check whether any operations in an HloComputation (CompOld) can be fused;
- If so, a new Fusion instruction is added to CompOld, and the fused operations are removed from CompOld;
- Then a new HloComputation (CompNew) is created, consisting of the fused operations. The Fusion instruction added to CompOld holds a pointer to CompNew.
- When it comes to the backend, LLVM IR is emitted independently for both CompOld and CompNew.
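To make sure I'm describing the rewrite correctly, here is a toy model of the steps above in plain Python (not XLA's actual C++ classes — the class names, the `FUSIBLE_OPS` set, and the single-group fusion heuristic are all simplifications I made up for illustration):

```python
# Toy model of the fusion rewrite described above:
# fusible instructions are moved out of the original computation (CompOld)
# into a new computation (CompNew), and a single Fusion instruction left
# behind in CompOld points at CompNew.

FUSIBLE_OPS = {"add", "multiply", "exp"}  # hypothetical elementwise ops

class Instruction:
    def __init__(self, op, name, fused_computation=None):
        self.op = op
        self.name = name
        # Only set on "fusion" instructions: the pointer to CompNew.
        self.fused_computation = fused_computation

class Computation:
    def __init__(self, name, instructions):
        self.name = name
        self.instructions = instructions

def run_fusion_pass(comp_old):
    """Fuse all fusible instructions of comp_old into one new computation."""
    # Step 1: check whether any instructions can be fused.
    fusible = [i for i in comp_old.instructions if i.op in FUSIBLE_OPS]
    if not fusible:
        return None
    # Step 3: build CompNew out of the fused instructions.
    comp_new = Computation(comp_old.name + ".fused", fusible)
    # Step 2: add a Fusion instruction to CompOld (pointing at CompNew)
    # and remove the fused instructions from CompOld.
    fusion_instr = Instruction("fusion", "fusion.0",
                               fused_computation=comp_new)
    comp_old.instructions = [
        i for i in comp_old.instructions if i.op not in FUSIBLE_OPS
    ] + [fusion_instr]
    return comp_new

comp = Computation("main", [
    Instruction("parameter", "p0"),
    Instruction("add", "add.1"),
    Instruction("multiply", "mul.2"),
    Instruction("dot", "dot.3"),  # not elementwise, stays in CompOld
])
comp_new = run_fusion_pass(comp)
# comp now contains: parameter, dot, fusion (pointing at comp_new)
# comp_new contains: add, multiply
```

In this simplified picture, fusion only changes the graph's structure; the backend would then still emit LLVM IR for CompOld and CompNew separately, which is exactly why I'm puzzled about where the 13.6x comes from.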
Considering the significant performance improvement, I suspect there must be something more that I'm missing or mistaken about. May I have your advice?