The performance of operation fusion using TensorFlow XLA-JIT on CPU backend

Can anyone give me any hints as to why XLA-JIT achieves better performance on the CPU backend?

I ran an MNIST benchmark on a single CPU with TensorFlow, both with and without XLA-JIT (manual mode). Enabling XLA-JIT gives a 13.6x speedup over TensorFlow without it.
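For reference, this is roughly how I turn on XLA-JIT; the TF 1.x contrib API names below are real, but the tiny model is only an illustration, not my actual benchmark script:

```python
import tensorflow as tf
from tensorflow.contrib.compiler import jit  # TF 1.x contrib API

x = tf.placeholder(tf.float32, shape=[None, 784])
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Manual mode: only the ops created inside the jit_scope are clustered
# and handed to XLA for compilation.
with jit.experimental_jit_scope():
    logits = tf.matmul(x, w) + b
    probs = tf.nn.softmax(logits)

# Alternatively, global JIT can be enabled for the whole session.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
```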

Since operation fusion is often cited as one of the main advantages of XLA-JIT, I naturally suspected this technique might be the reason behind the speedup, so I studied the source code and found that the fusion procedure goes roughly like this (please correct me if anything is wrong):

  1. Check whether there are operations in an HloComputation (CompOld) that can be fused;
  2. If so, a new Fusion instruction is added to CompOld, and the fused operations are removed from CompOld;
  3. A new HloComputation (CompNew) is then created consisting of the fused operations, and the Fusion instruction added to CompOld holds a pointer to CompNew;
  4. In the backend, LLVM IR is emitted independently for both CompOld and CompNew (see the toy sketch after this list).
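
To check my understanding, here is a toy Python sketch of the procedure above. This is only my simplified mental model, not XLA's actual C++ implementation; the class and function names (Instruction, Computation, fuse_elementwise) are made up, and it simply fuses every elementwise op it finds:

```python
class Instruction:
    def __init__(self, name, opcode, operands=()):
        self.name = name
        self.opcode = opcode           # e.g. "add", "multiply", "fusion"
        self.operands = list(operands)
        self.fused_computation = None  # set only on fusion instructions

class Computation:
    def __init__(self, name, instructions):
        self.name = name
        self.instructions = list(instructions)

def fuse_elementwise(comp_old):
    # Step 1: find fusable (here: all elementwise) instructions.
    fusable = [i for i in comp_old.instructions
               if i.opcode in ("add", "multiply", "exp")]
    if not fusable:
        return None

    # Step 3: build CompNew out of the fused instructions.
    comp_new = Computation(comp_old.name + ".fused", fusable)

    # Step 2: replace them in CompOld with a single Fusion instruction.
    remaining = [i for i in comp_old.instructions if i not in fusable]
    fusion = Instruction("fusion.0", "fusion", operands=remaining)
    fusion.fused_computation = comp_new  # pointer to CompNew
    comp_old.instructions = remaining + [fusion]
    return comp_new

# Tiny example: out = exp(a * b + c)
a, b, c = (Instruction(n, "parameter") for n in "abc")
mul = Instruction("mul", "multiply", [a, b])
add = Instruction("add", "add", [mul, c])
out = Instruction("out", "exp", [add])

comp_old = Computation("entry", [a, b, c, mul, add, out])
comp_new = fuse_elementwise(comp_old)

print([i.opcode for i in comp_old.instructions])  # parameters + one fusion
print([i.opcode for i in comp_new.instructions])  # multiply, add, exp
# Step 4 would then emit LLVM IR for CompOld and, separately, a single
# fused loop body for CompNew.
```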

Given the significant performance improvement, I think there must be something more that I am missing or mistaken about. May I have your advice?
