I'm experimenting with the TSX extensions in Haswell by adapting an existing medium-sized (thousands of lines) codebase to use GCC's transactional memory extensions (which indirectly use Haswell TSX on this machine) instead of coarse-grained locks. I am not writing my own _xbegin / _xend directly, and I have set ITM_DEFAULT_METHOD=htm.
I'm having trouble getting it to run fast enough because I see a high rate of hardware transaction aborts for mysterious reasons. As shown below, these aborts are due neither to conflicts nor to capacity limitations.
Here is the perf command I used to quantify the failure rate and underlying causes:
perf stat \
-e cpu/event=0x54,umask=0x2,name=tx_mem_abort_capacity_write/ \
-e cpu/event=0x54,umask=0x1,name=tx_mem_abort_conflict/ \
-e cpu/event=0x5d,umask=0x1,name=tx_exec_misc1/ \
-e cpu/event=0x5d,umask=0x2,name=tx_exec_misc2/ \
-e cpu/event=0x5d,umask=0x4,name=tx_exec_misc3/ \
-e cpu/event=0x5d,umask=0x8,name=tx_exec_misc4/ \
-e cpu/event=0x5d,umask=0x10,name=tx_exec_misc5/ \
-e cpu/event=0xc9,umask=0x1,name=rtm_retired_start/ \
-e cpu/event=0xc9,umask=0x2,name=rtm_retired_commit/ \
-e cpu/event=0xc9,umask=0x4,name=rtm_retired_aborted/pp \
-e cpu/event=0xc9,umask=0x8,name=rtm_retired_aborted_misc1/ \
-e cpu/event=0xc9,umask=0x10,name=rtm_retired_aborted_misc2/ \
-e cpu/event=0xc9,umask=0x20,name=rtm_retired_aborted_misc3/ \
-e cpu/event=0xc9,umask=0x40,name=rtm_retired_aborted_misc4/ \
-e cpu/event=0xc9,umask=0x80,name=rtm_retired_aborted_misc5/ \
./myprogram -th 1 -reps 3000000
So the program runs some code with transactions in it 30 million times. Each request involves one GCC __transaction_atomic block. There is only one thread in this run.
This particular perf command captures most of the relevant TSX performance events described in the Intel Software Developer's Manual, vol. 3.
The output from perf stat is the following:
0 tx_mem_abort_capacity_write [26.66%]
0 tx_mem_abort_conflict [26.65%]
29,937,894 tx_exec_misc1 [26.71%]
0 tx_exec_misc2 [26.74%]
0 tx_exec_misc3 [26.80%]
0 tx_exec_misc4 [26.92%]
0 tx_exec_misc5 [26.83%]
29,906,632 rtm_retired_start [26.79%]
0 rtm_retired_commit [26.70%]
29,985,423 rtm_retired_aborted [26.66%]
0 rtm_retired_aborted_misc1 [26.75%]
0 rtm_retired_aborted_misc2 [26.73%]
29,927,923 rtm_retired_aborted_misc3 [26.71%]
0 rtm_retired_aborted_misc4 [26.69%]
176 rtm_retired_aborted_misc5 [26.67%]
10.583607595 seconds time elapsed
As you can see from the output:
- The rtm_retired_start count is 30 million (matches the input to the program).
- The rtm_retired_aborted count is about the same (no commits at all).
- The tx_mem_abort_conflict and tx_mem_abort_capacity_write counts are 0, so these are not the reasons. Also, recall that only one thread is running, so conflicts should be rare.
- The only actual leads here are the high values of tx_exec_misc1 and rtm_retired_aborted_misc3, which are somewhat similar in description.
The Intel manual (vol. 3) defines the rtm_retired_aborted_misc3 counter as follows:
code: C9H 20H
mnemonic: RTM_RETIRED.ABORTED_MISC3
description: Number of times an RTM execution aborted due to HLE unfriendly instructions.
The definition of tx_exec_misc1 has some similar wording:
code: 5DH 01H
mnemonic: TX_EXEC.MISC1
description: Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this is the count of execution, it may not always cause a transactional abort.
I checked the assembly location of the aborts using perf record / perf report with high-precision (PEBS) support for rtm_retired_aborted. The reported location is a register-to-register mov instruction. I see no unusual instructions nearby.
Update:
Here are two things I've tried since then:
1) The tx_exec_misc1 and rtm_retired_aborted_misc3 signature we see here can be obtained, for example, by a dummy block of the form
for (int i = 0; i < 10000000; i++) {
    __transaction_atomic {
        _xabort(1);
    }
}
or one of the form
for (int i = 0; i < 10000000; i++) {
    __transaction_atomic {
        printf("hello");
        fflush(stdout);
    }
}
In both cases, the perf counters look similar to what I see. However, in both cases the perf report for -e cpu/tx-abort/ points to the intuitively correct assembly lines: an xabort instruction for the first example and a syscall instruction for the second. In the real codebase, the perf report points to a stack push right at the start of a function:
: 00000000004167e0 <myns::myfun()>:
100.00 : 4167e0: push %rbp
0.00 : 4167e1: mov %rsp,%rbp
0.00 : 4167e4: push %r15
2) I have also run the same command under the Intel Software Development Emulator. The problem goes away in that case: I get no aborts as far as the application is concerned.