What is the overhead of using Intel Last Branch Record?
Last Branch Record refers to a collection of register pairs (MSRs) that store the source and destination addresses of recently executed branches. The document http://css.csail.mit.edu/6.858/2012/readings/ia32/ia32-3b.pdf has more information in case you are interested.

  • a) Can someone give an idea of how much LBR slows down program execution of common programs, both CPU- and IO-intensive?
  • b) Will branch prediction be turned OFF when LBR tracing is ON?
Rockefeller answered 3/2, 2013 at 8:7 Comment(3)
How will you use Intel LBR? I think the overhead of LBR recording is small, and prediction is not turned off.Cistern
I simply enable LBR at the start of a program and disable it at the end. I too think the overhead should be relatively small, at least compared to software instrumentation. But it would be helpful if some official documentation on overhead existed.Rockefeller
The only place for official documentation is intel.com/content/www/us/en/processors/…Cistern

The paper Intel Code Execution Trace Resources (by Craig Pedersen and Jeff Acampora of Arium, April 29, 2012) lists three variants of branch tracing:

  • Last Branch Record (LBR) flag in the DebugCtlMSR and corresponding LastBranchToIP and LastBranchFromIP MSRs as well as LastExceptionToIP and LastExceptionFromIP MSRs.

  • Branch Trace Store (BTS) using either cache-as-RAM or system DRAM.

  • Architecture Event Trace (AET) captured off the XDP port and stored externally in a connected In-Target Probe.

As said on page 2, LBR saves information in MSRs, "does not impede any real-time performance," but is useful only for very short sections of code ("effective trace display is very shallow and typically may only show hundreds of instructions"). It only records the last 4–16 branches.

BTS allows capturing many pairs of branch "From" and "To" addresses, storing them either in cache (Cache-as-RAM, CAR) or in system DRAM. With CAR, trace depth/length is limited by cache size (and some constant); with DRAM, trace length is almost unlimited. The paper estimates the overhead of BTS at 20 to 100 percent due to the additional memory stores. BTS on Linux is easy to use with the proposed perf branch record (not yet in vanilla) or with the btrax project. The perf branch presentation gives some hints about the BTS organisation: there is a BTS buffer whose records contain "from" and "to" fields and a "predicted" flag. So branch prediction is not turned off when using BTS. Also, when the BTS buffer fills up to its maximum size, an interrupt is generated, and the BTS-handling module in the kernel (the perf_events subsystem or the btrax kernel module) must copy the data out of the BTS buffer to some other location.

So, in BTS mode there are two sources of overhead: Cache/Memory stores and interrupts from BTS buffer overflow.

AET uses an external agent to save debug and trace data. This agent is connected via the eXtended Debug Port (XDP) and interfaces with an In-Target Probe (ITP). According to the paper, the overhead of AET "can have a significant effect on system performance, which can be several orders of magnitude greater," because AET can generate/capture many more types of events. The collected data is, however, stored externally to the debugged platform.

The paper's "Summary" says:

LBR has no overhead, but is very shallow (4–16 branch locations, depending on the CPU). Trace data is available immediately out of reset.

BTS is much deeper, but has an impact on CPU performance and requires on-board RAM. Trace data is available as soon as CAR is initialized.

AET requires special ITP hardware and is not available on all CPU architectures. It has the advantage of storing the trace data off board.

Cistern answered 6/2, 2013 at 13:4 Comment(3)
Thanks for sharing the paper and summarizing the key points!Rockefeller
The paper is no longer available. Can you please try to find a working copy?Westerly
UPD. Found here, page 130. Consider updating the link.Westerly

This is an old question (with an old answer too) but it does come up in searches today.

In 2021, what you want to use for hardware tracing is Intel® Processor Trace (IPT).
Keep in mind the question is obviously about Intel/AMD desktop CPUs. AFAIK there are similar solutions for ARM CPUs, not covered here.

I've used both LBR and IPT setups on Windows using custom drivers, and the latter has by far the least overhead: somewhere in the low two digits percent-wise, or less, of slowdown when tracing a process.

Also, the other answer's statement:

LBR has no overhead,..

is technically true but impractical, because the overhead comes when you actually read the stored registers. Typically you will set LBR up to interrupt on every branch record, so we are talking about the overhead of handling an interrupt/exception/trap for every single branch instruction (call, jmp, jcc, int, etc.) in every thread that has tracing active via the trap/single-step flag.

The biggest downside of IPT is that it is available only on Intel CPUs, while the LBR feature is supported by AMD CPUs too.

Also, unfortunately, AFAIK (as of the last time I checked) the IPT feature is not yet supported by any commercial VM software. This means you will more than likely only be able to run an IPT session on direct hardware. Not a big deal unless you really wanted to do your tracing in a VM. For that matter, LBR might have the same limitation.

Some Linux kernels have native support for IPT. A good starting point for Windows is Alex Ionescu's WinIPT project:
https://ionescu007.github.io/winipt/

Cassidycassie answered 10/7, 2021 at 21:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.