Mysterious RTM aborts using Haswell TSX

I'm experimenting with the TSX extensions in Haswell by adapting an existing medium-sized codebase (thousands of lines) to use GCC's transactional memory extensions instead of coarse-grained locks. On this machine those extensions are implemented on top of Haswell TSX. I am using GCC's language extension (__transaction_atomic), not writing my own _xbegin / _xend calls directly, and I run with ITM_DEFAULT_METHOD=htm.
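For context, the htm method in GCC's libitm broadly follows the usual lock-elision pattern: run the block as a hardware transaction and, if it keeps aborting, fall back to a serialized path. The sketch below illustrates that pattern with raw RTM intrinsics for x86-64 with GCC or Clang. It is not libitm's code, and fallback_lock, MAX_HTM_TRIES, cpu_has_rtm and htm_increment are hypothetical names. It also suggests why a program can still produce correct results with zero hardware commits: every request simply ends up on the fallback path.

```c
#include <cpuid.h>
#include <immintrin.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_HTM_TRIES 3                /* hypothetical retry budget */
static atomic_int fallback_lock = 0;   /* hypothetical global lock: 0 free, 1 held */

/* CPUID.(EAX=7,ECX=0):EBX bit 11 reports RTM support. */
static bool cpu_has_rtm(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx >> 11) & 1u;
}

/* Increment *counter atomically: try an RTM transaction first, then the lock. */
__attribute__((target("rtm")))
static void htm_increment(long *counter, bool use_rtm) {
    if (use_rtm) {
        for (int attempt = 0; attempt < MAX_HTM_TRIES; attempt++) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                /* Reading the lock adds it to the read set, so a thread that
                   later takes the lock aborts this transaction. */
                if (atomic_load_explicit(&fallback_lock, memory_order_relaxed) != 0)
                    _xabort(0xff);  /* lock already held: leave the fast path */
                (*counter)++;
                _xend();
                return;
            }
            if (!(status & _XABORT_RETRY))
                break;  /* hardware hints that retrying will not help */
        }
    }
    /* Fallback: a plain test-and-set spinlock. */
    while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
        ;
    (*counter)++;
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
}
```

The retry budget and the spinlock are arbitrary choices for illustration; libitm makes its own decisions about retries and serialization.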

I'm having trouble getting it to perform well because hardware transactions abort at a very high rate for reasons I can't identify. As the counters below show, these aborts are caused neither by conflicts nor by capacity limits.

Here is the perf command I used to quantify the failure rate and underlying causes:

perf stat \
 -e cpu/event=0x54,umask=0x2,name=tx_mem_abort_capacity_write/ \
 -e cpu/event=0x54,umask=0x1,name=tx_mem_abort_conflict/ \
 -e cpu/event=0x5d,umask=0x1,name=tx_exec_misc1/ \
 -e cpu/event=0x5d,umask=0x2,name=tx_exec_misc2/ \
 -e cpu/event=0x5d,umask=0x4,name=tx_exec_misc3/ \
 -e cpu/event=0x5d,umask=0x8,name=tx_exec_misc4/ \
 -e cpu/event=0x5d,umask=0x10,name=tx_exec_misc5/ \
 -e cpu/event=0xc9,umask=0x1,name=rtm_retired_start/ \
 -e cpu/event=0xc9,umask=0x2,name=rtm_retired_commit/ \
 -e cpu/event=0xc9,umask=0x4,name=rtm_retired_aborted/pp \
 -e cpu/event=0xc9,umask=0x8,name=rtm_retired_aborted_misc1/ \
 -e cpu/event=0xc9,umask=0x10,name=rtm_retired_aborted_misc2/ \
 -e cpu/event=0xc9,umask=0x20,name=rtm_retired_aborted_misc3/ \
 -e cpu/event=0xc9,umask=0x40,name=rtm_retired_aborted_misc4/ \
 -e cpu/event=0xc9,umask=0x80,name=rtm_retired_aborted_misc5/ \
./myprogram -th 1 -reps 30000000

So the program processes 30 million requests, and each request executes exactly one GCC __transaction_atomic block. Only one thread runs in this test.

This perf command captures most of the relevant TSX performance events described in the Intel Software Developer's Manual, Volume 3.

The output from perf stat is the following:

             0 tx_mem_abort_capacity_write                                  [26.66%]
             0 tx_mem_abort_conflict                                        [26.65%]
    29,937,894 tx_exec_misc1                                                [26.71%]
             0 tx_exec_misc2                                                [26.74%]
             0 tx_exec_misc3                                                [26.80%]
             0 tx_exec_misc4                                                [26.92%]
             0 tx_exec_misc5                                                [26.83%]
    29,906,632 rtm_retired_start                                            [26.79%]
             0 rtm_retired_commit                                           [26.70%]
    29,985,423 rtm_retired_aborted                                          [26.66%]
             0 rtm_retired_aborted_misc1                                    [26.75%]
             0 rtm_retired_aborted_misc2                                    [26.73%]
    29,927,923 rtm_retired_aborted_misc3                                    [26.71%]
             0 rtm_retired_aborted_misc4                                    [26.69%]
           176 rtm_retired_aborted_misc5                                    [26.67%]

  10.583607595 seconds time elapsed
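The bracketed percentages are perf's multiplexing ratio: with 15 events and only a few hardware counters, each event was actually counted for about 27% of the run, and perf extrapolates the printed totals as raw × time_enabled / time_running. (26.7% is close to 4/15, which would fit four programmable counters rotated across fifteen events, though that is an inference.) A minimal sketch of the scaling, with hypothetical helper names:

```c
#include <stdint.h>

/* perf's multiplexing extrapolation: estimate = raw * time_enabled / time_running.
   Hypothetical helper names. */
static double scale_count(uint64_t raw, uint64_t time_enabled, uint64_t time_running) {
    if (time_running == 0)
        return 0.0;  /* the event never got onto a hardware counter */
    return (double)raw * (double)time_enabled / (double)time_running;
}

/* The bracketed percentage perf stat prints: share of time the event was counted. */
static double running_fraction(uint64_t time_enabled, uint64_t time_running) {
    return time_enabled ? (double)time_running / (double)time_enabled : 0.0;
}
```

Because each total is an extrapolation, small mismatches such as rtm_retired_aborted (29,985,423) exceeding rtm_retired_start (29,906,632) are most likely scaling noise rather than real extra aborts.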

As you can see from the output:

- The rtm_retired_start count is about 30 million, matching the input to the program.
- The rtm_retired_aborted count is about the same, and rtm_retired_commit is 0: no transaction commits at all.
- The tx_mem_abort_conflict and tx_mem_abort_capacity_write counts are 0, so neither is the cause. Besides, with only one thread running, conflicts should be rare.
- The only real leads are the high values of tx_exec_misc1 and rtm_retired_aborted_misc3, whose descriptions are somewhat similar.
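Hardware counters only give totals. If a transactional region is instead written directly with _xbegin(), the EAX status it returns on abort encodes the cause of each individual abort (Intel SDM Vol. 1, RTM abort status). A small decoder might look like the following; the RTM_ABORT_* names, abort_cause, classify_abort and xabort_code are hypothetical and only mirror the _XABORT_* macros in <immintrin.h>. An abort from an unfriendly instruction, an interrupt, or a fault may leave none of the cause bits set (the status can be 0), which is the kind of abort the MISC3 count points to.

```c
#include <stdint.h>

/* Bits of the status returned by _xbegin() after an abort (Intel SDM Vol. 1).
   Hypothetical names; they mirror _XABORT_EXPLICIT etc. in <immintrin.h>. */
#define RTM_ABORT_EXPLICIT (1u << 0)  /* aborted by an XABORT instruction */
#define RTM_ABORT_RETRY    (1u << 1)  /* may succeed if retried (clear when bit 0 is set) */
#define RTM_ABORT_CONFLICT (1u << 2)  /* data conflict with another logical processor */
#define RTM_ABORT_CAPACITY (1u << 3)  /* internal buffer overflowed */
#define RTM_ABORT_DEBUG    (1u << 4)  /* debug breakpoint hit */
#define RTM_ABORT_NESTED   (1u << 5)  /* abort happened inside a nested transaction */

typedef enum {
    CAUSE_EXPLICIT, CAUSE_CONFLICT, CAUSE_CAPACITY, CAUSE_DEBUG, CAUSE_OTHER
} abort_cause;

/* Classify an abort status, checking the cause bits in a fixed priority order. */
static abort_cause classify_abort(uint32_t status) {
    if (status & RTM_ABORT_EXPLICIT) return CAUSE_EXPLICIT;
    if (status & RTM_ABORT_CONFLICT) return CAUSE_CONFLICT;
    if (status & RTM_ABORT_CAPACITY) return CAUSE_CAPACITY;
    if (status & RTM_ABORT_DEBUG)    return CAUSE_DEBUG;
    /* No cause bit set (status may be 0): e.g. an unfriendly instruction,
       an interrupt, or a fault inside the transaction. */
    return CAUSE_OTHER;
}

/* The imm8 passed to XABORT lives in bits 31:24; only meaningful if bit 0 is set. */
static unsigned xabort_code(uint32_t status) {
    return (status & RTM_ABORT_EXPLICIT) ? (status >> 24) & 0xffu : 0u;
}
```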

The Intel manual (Vol. 3) defines the rtm_retired_aborted_misc3 counter as follows:

code: C9H 20H

mnemonic: RTM_RETIRED.ABORTED_MISC3

description: Number of times an RTM execution aborted due to HLE unfriendly instructions.

The definition of tx_exec_misc1 uses similar wording:

code: 5DH 01H

mnemonic: TX_EXEC.MISC1

description: Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this is the count of execution, it may not always cause a transactional abort.
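The event codes in the manual map directly onto perf's raw syntax: for Intel core events, cpu/event=0xc9,umask=0x20/ corresponds to a raw config of (umask << 8) | event, here 0x20c9. The same encoding can be passed to perf_event_open(2) to count an event around just the transactional region from inside the program. A minimal Linux sketch; raw_config, open_raw_counter and count_around are hypothetical helper names:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for the syscall() declaration in <unistd.h> */
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Intel core PMU raw encoding: event select in bits 7:0, umask in bits 15:8. */
static uint64_t raw_config(unsigned event, unsigned umask) {
    return ((uint64_t)(umask & 0xffu) << 8) | (event & 0xffu);
}

/* Open a user-space-only counter for the calling thread; returns an fd or -1. */
static int open_raw_counter(unsigned event, unsigned umask) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = raw_config(event, umask);
    attr.disabled = 1;        /* start stopped; enabled in count_around() */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid 0 = calling thread, cpu -1 = any CPU, no group leader, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, (pid_t)0, -1, -1, 0UL);
}

/* Reset and enable the counter, run fn(arg), disable it, and return the count
   (or -1 on error). */
static long long count_around(int fd, void (*fn)(void *), void *arg) {
    uint64_t value = 0;
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) == -1 ||
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == -1)
        return -1;
    fn(arg);
    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) == -1)
        return -1;
    if (read(fd, &value, sizeof value) != (ssize_t)sizeof value)
        return -1;
    return (long long)value;
}
```

For example, open_raw_counter(0xc9, 0x01) would count RTM_RETIRED.START around a function that runs a batch of requests. The open returns -1 when the kernel's perf_event_paranoid setting, a virtualized PMU, or missing hardware support rules out raw events.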

I located the aborting instruction with perf record / perf report, using precise (PEBS) sampling on rtm_retired_aborted. It points to a register-to-register mov, with no unusual instructions nearby.

Update:

Here are two things I've tried since then:

1) The tx_exec_misc1 and rtm_retired_aborted_misc3 signature we see here can be reproduced, for example, with a dummy block of the form

for (int i = 0; i < 10000000; i++){
  __transaction_atomic{
    _xabort(1);
  }
}

or one of the form

for (int i = 0; i < 10000000; i++){
  __transaction_atomic{
    printf("hello");
    fflush(stdout);
  }
}

In both cases the perf counters look similar to what I see in the real program. However, in both cases perf report for -e cpu/tx-abort/ points to the intuitively correct instruction: an xabort in the first example and a syscall in the second. In the real codebase, perf report instead points to a stack push right at the start of a function:

           :    00000000004167e0 <myns::myfun()>:
    100.00 :      4167e0:       push   %rbp
      0.00 :      4167e1:       mov    %rsp,%rbp
      0.00 :      4167e4:       push   %r15

2) I also ran the same command under the Intel Software Development Emulator (SDE). There the problem goes away: as far as the application is concerned, there are no aborts.
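To see what the hardware itself reports for the first dummy block, without libitm in between, the loop can also be written with raw RTM intrinsics and each abort status tallied. A minimal sketch for x86-64 with GCC or Clang; struct xabort_tally, probe_has_rtm and run_xabort_probe are hypothetical names, and the loop is skipped when CPUID does not report RTM:

```c
#include <cpuid.h>
#include <immintrin.h>
#include <stdbool.h>

/* Outcome counts for a raw-RTM version of the explicit-abort dummy loop. */
struct xabort_tally {
    bool rtm_available;   /* false: CPUID reports no RTM, loop was skipped */
    long commits;         /* transactions that reached _xend() */
    long explicit_code1;  /* aborts with the XABORT bit set and code 1 */
    long other_aborts;    /* any other abort status */
};

/* CPUID.(EAX=7,ECX=0):EBX bit 11 reports RTM support. */
static bool probe_has_rtm(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx >> 11) & 1u;
}

__attribute__((target("rtm")))
static struct xabort_tally run_xabort_probe(long iterations) {
    struct xabort_tally t = { probe_has_rtm(), 0, 0, 0 };
    if (!t.rtm_available)
        return t;
    for (long i = 0; i < iterations; i++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            _xabort(1);   /* always aborts; execution resumes at _xbegin() */
            _xend();      /* not reached inside a transaction */
            t.commits++;
        } else if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 1) {
            t.explicit_code1++;
        } else {
            t.other_aborts++;
        }
    }
    return t;
}
```

On hardware with working RTM one would expect nearly every iteration to land in explicit_code1, since XABORT sets bit 0 of the status and places its argument in bits 31:24.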
