Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

A

5

22

Modern x86_64 Linux with glibc detects that the CPU supports the AVX extension and switches many string functions from the generic implementation to an AVX-optimized version (with the help of ifunc dispatchers: 1, 2).

This feature can be good for performance, but it prevents several tools, such as valgrind (older libVEX versions, before valgrind-3.8) and gdb's "target record" (Reverse Execution), from working correctly (Ubuntu "Z" 17.04 beta, gdb 7.12.50.20170207-0ubuntu2, gcc 6.3.0-8ubuntu1 20170221, Ubuntu GLIBC 2.24-7ubuntu2):

$ cat a.c
#include <string.h>
#define N 1000
int main(){
        char src[N], dst[N];
        memcpy(dst, src, N);
        return 0;
}
$ gcc a.c -o a -fno-builtin
$ gdb -q ./a
Reading symbols from ./a...(no debugging symbols found)...done.
(gdb) start
Temporary breakpoint 1 at 0x724
Starting program: /home/user/src/a

Temporary breakpoint 1, 0x0000555555554724 in main ()
(gdb) record
(gdb) c
Continuing.
Process record does not support instruction 0xc5 at address 0x7ffff7b60d31.
Process record: failed to record execution log.

Program stopped.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:416
416             VMOVU   (%rsi), %VEC(4)
(gdb) x/i $pc
=> 0x7ffff7b60d31 <__memmove_avx_unaligned_erms+529>:   vmovdqu (%rsi),%ymm4

gdb's implementation of "target record" prints the error message "Process record does not support instruction 0xc5" because AVX instructions are not supported by the record/replay engine (sometimes the problem is detected in the _dl_runtime_resolve_avx function instead): https://sourceware.org/ml/gdb/2016-08/msg00028.html "some AVX instructions are not supported by process record", https://bugs.launchpad.net/ubuntu/+source/gdb/+bug/1573786, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=836802, https://bugzilla.redhat.com/show_bug.cgi?id=1136403
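
To make the "instruction 0xc5" part of the message concrete: in 64-bit mode, 0xc5 is the first byte of the two-byte VEX prefix that AVX instructions use. The sketch below decodes that prefix. The byte sequence c5 fe 6f 26 is my own hand encoding of vmovdqu (%rsi),%ymm4 and is not taken from the gdb session; decode_vex2 is a hypothetical helper name.

```python
def decode_vex2(insn: bytes) -> dict:
    """Decode the two-byte VEX prefix (0xC5 xx) at the start of an instruction."""
    if len(insn) < 2 or insn[0] != 0xC5:
        raise ValueError("not a two-byte VEX prefix")
    b = insn[1]
    return {
        # VEX.L: 0 selects 128-bit (xmm), 1 selects 256-bit (ymm) operands
        "vector_bits": 256 if b & 0x04 else 128,
        # VEX.pp: implied legacy SIMD prefix
        "simd_prefix": {0: None, 1: 0x66, 2: 0xF3, 3: 0xF2}[b & 0x03],
        # VEX.vvvv: extra register operand, stored inverted in bits 6..3
        "vvvv": (~b >> 3) & 0x0F,
    }


# Assumed (hand-assembled) encoding of: vmovdqu (%rsi),%ymm4
info = decode_vex2(bytes.fromhex("c5fe6f26"))
print(info)
```

The decoded 256-bit vector length corresponds to the %ymm4 operand that gdb shows at the failing address.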

A solution was proposed in https://sourceware.org/ml/gdb/2016-08/msg00028.html: "You can recompile libc (thus ld.so), or hack __init_cpu_features and thus __cpu_features at runtime (see e.g. strcmp)." Another suggestion is to set LD_BIND_NOW=1. However, a recompiled glibc still has AVX, and LD_BIND_NOW doesn't help.

I heard that there are /etc/ld.so.nohwcap and LD_HWCAP_MASK configurations in glibc. Can they be used to disable ifunc dispatching to AVX-optimized string functions in glibc?

How does glibc (rtld?) detect AVX: using cpuid, with /proc/cpuinfo (probably not), or via the HWCAP aux vector entry (the command LD_SHOW_AUXV=1 /bin/echo |grep HWCAP gives AT_HWCAP: bfebfbff)?
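
As a side note on the AT_HWCAP value: on x86 Linux, AT_HWCAP carries the EDX output of cpuid leaf 1, whereas the AVX bit lives in ECX of leaf 1 and AVX2 in EBX of leaf 7. A minimal sketch that decodes a few EDX bits from the value above (the bit table is a small selection chosen for illustration, and decode_hwcap is a hypothetical helper):

```python
# Selected CPUID leaf 1 EDX feature bits (bit number -> flag name).
EDX_BITS = {
    0: "fpu", 4: "tsc", 15: "cmov", 23: "mmx",
    24: "fxsr", 25: "sse", 26: "sse2", 28: "htt",
}


def decode_hwcap(value: int) -> list:
    """Return the names of the selected EDX feature bits set in value."""
    return [name for bit, name in sorted(EDX_BITS.items()) if (value >> bit) & 1]


flags = decode_hwcap(0xBFEBFBFF)
print(flags)
```

None of the 32 bits in this word describes AVX, so masking AT_HWCAP alone would not be expected to change the ifunc selection.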

Asarum answered 25/2, 2017 at 3:1 Comment(4)

Selection code: github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/… ENTRY(__new_memcpy) .type __new_memcpy, @gnu_indirect_function .. .HAS_ARCH_FEATURE (Prefer_ERMS) where ..feature are defined at github.com/bminor/glibc/blob/master/sysdeps/x86/cpu-features.h; tested field is filled by init_cpu_features by using cpuid instruction of eax=7,ecx=0. How to hack into init_cpu_features and mask out AVX/ERMS in cpu_features->cpuid[COMMON_CPUID_INDEX_7].ecx? – Asarum 25/2, 2017 at 5:1

Have you ever figured out how to mask out AVX/SSE without recompiling glibc? Capabilities seem to be loaded in sysdeps/x86/libc-start.c (__libc_start_main calls init_cpu_features (&_dl_x86_cpu_features)), but at that point symbols already seem resolved (based on p *memcpy pointing to __memmove_avx_unaligned_erms). – Bonkers 9/6, 2017 at 14:6

@Lekensteyn, "how to mask out AVX/SSE without recompiling glibc" - I rebuilt unmodified glibc (with dpkg-buildpackage, without strip) AND binary-patched the __get_cpu_features function (get_common_indeces / get_common_indeces.constprop.1): after cpuid,.., just after cmp 0xf,.. ;je ..; cmp 0x6 I replaced jle with jg (0x7e to 0x7f), probably disabling all the code after if .. max_cpuid>=7 of sysdeps/x86/cpu-features.c. Alternatively, try more recent valgrind & gdb record tools or an older glibc, or implement the missing instruction emulation in gdb record if it is not done yet. – Asarum 10/6, 2017 at 0:40

As a possible workaround, Mozilla's rr works with AVX: stackoverflow.com/questions/40125154/… – Vaduz 11/9, 2017 at 9:41

B

8

There does not seem to be a straightforward runtime method to patch feature detection. This detection happens rather early in the dynamic linker (ld.so).

Binary patching the linker seems the easiest method at the moment. @osgx described one method where a jump is overwritten. Another approach is just to fake the cpuid result. Normally cpuid(eax=0) returns the highest supported function in eax while the manufacturer IDs are returned in registers ebx, ecx and edx. We have this snippet in glibc 2.25 sysdeps/x86/cpu-features.c:

__cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx);

/* This spells out "GenuineIntel".  */
if (ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69)
  {
      /* feature detection for various Intel CPUs */
  }
/* another case for AMD */
else
  {
    kind = arch_kind_other;
    get_common_indeces (cpu_features, NULL, NULL, NULL, NULL);
  }
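
To see why those three constants spell "GenuineIntel", here is a small sketch that reassembles the vendor string from the register values. cpuid0_vendor is a hypothetical helper; the only fact it relies on is that cpuid(eax=0) stores the string in EBX, EDX, ECX order, four little-endian bytes each.

```python
import struct


def cpuid0_vendor(ebx: int, ecx: int, edx: int) -> str:
    """Rebuild the 12-character vendor string returned by cpuid(eax=0)."""
    # The string is laid out across EBX, then EDX, then ECX.
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


vendor = cpuid0_vendor(0x756E6547, 0x6C65746E, 0x49656E69)
print(vendor)
```

Any other register contents fail the comparison, which is what the patch described next relies on.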

The __cpuid line translates to these instructions in /lib/ld-linux-x86-64.so.2 (/lib/ld-2.25.so):

172a8:       31 c0                   xor    eax,eax
172aa:       c7 44 24 38 00 00 00    mov    DWORD PTR [rsp+0x38],0x0
172b1:       00 
172b2:       c7 44 24 3c 00 00 00    mov    DWORD PTR [rsp+0x3c],0x0
172b9:       00 
172ba:       0f a2                   cpuid

So rather than patching branches, we could just as well change the cpuid into a nop instruction, which results in the last else branch being taken (as the registers will not contain "GenuineIntel"). Since eax is initially 0, cpu_features->max_cpuid will also be 0 and the if (cpu_features->max_cpuid >= 7) check will also be bypassed.

Binary patching cpuid(eax=0) into a nop can be done with this utility (works for both x86 and x86-64):

#!/usr/bin/env python
import re
import sys

infile, outfile = sys.argv[1:]
with open(infile, 'rb') as f:
    d = f.read()
# Match CPUID(eax=0): "xor eax,eax" (31 c0) followed closely by "cpuid" (0f a2),
# and replace the cpuid with a two-byte nop (66 90).
o = re.sub(b'(\x31\xc0.{0,32}?)\x0f\xa2', b'\\1\x66\x90', d)
# Fail loudly if nothing was patched.
assert d != o
with open(outfile, 'wb') as f:
    f.write(o)

An equivalent Perl variant; -0777 ensures that the file is read at once instead of being split into records at line feeds:

perl -0777 -pe 's/\x31\xc0.{0,32}?\K\x0f\xa2/\x66\x90/' < /lib/ld-linux-x86-64.so.2 > ld-linux-x86-64-patched.so.2
# Verify result, should display "Success"
cmp -s /lib/ld-linux-x86-64.so.2 ld-linux-x86-64-patched.so.2 && echo 'Not patched' || echo Success
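
A stricter check than "the files differ" is to confirm that exactly one cpuid was replaced and that only its two bytes changed. The following is a sketch under that assumption; patch_cpuid0 and changed_offsets are hypothetical helper names, and the sample input is synthetic (the instruction bytes from the disassembly above, padded with nops), not a real ld.so.

```python
import re

# "xor eax,eax" (31 c0), up to 32 arbitrary bytes, then "cpuid" (0f a2)
PATTERN = re.compile(b'(\x31\xc0.{0,32}?)\x0f\xa2')


def patch_cpuid0(data: bytes):
    """Replace matching cpuid instructions with 66 90; return (patched, count)."""
    return PATTERN.subn(b'\\1\x66\x90', data)


def changed_offsets(a: bytes, b: bytes) -> list:
    """Offsets at which two equally long byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Synthetic sample: nop padding around the xor/mov/mov/cpuid sequence.
sample = (b'\x90' * 4
          + bytes.fromhex('31c0')
          + bytes.fromhex('c744243800000000')
          + bytes.fromhex('c744243c00000000')
          + bytes.fromhex('0fa2')
          + b'\x90' * 4)
patched, count = patch_cpuid0(sample)
print(count, changed_offsets(sample, patched))
```

On a real loader, a count other than 1 would be a reason to inspect the matches by hand before using the patched file.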

That was the easy part. Now, I did not want to replace the system-wide dynamic linker, but execute only one particular program with this linker. Sure, that can be done with ./ld-linux-x86-64-patched.so.2 ./a, but the naive gdb invocations failed to set breakpoints:

$ gdb -q -ex "set exec-wrapper ./ld-linux-x86-64-patched.so.2" -ex start ./a
Reading symbols from ./a...done.
Temporary breakpoint 1 at 0x400502: file a.c, line 5.
Starting program: /tmp/a 
During startup program exited normally.
(gdb) quit
$ gdb -q -ex start --args ./ld-linux-x86-64-patched.so.2 ./a
Reading symbols from ./ld-linux-x86-64-patched.so.2...(no debugging symbols found)...done.
Function "main" not defined.
Temporary breakpoint 1 (main) pending.
Starting program: /tmp/ld-linux-x86-64-patched.so.2 ./a
[Inferior 1 (process 27418) exited normally]
(gdb) quit

A manual workaround is described in How to debug program with custom elf interpreter? It works, but it is unfortunately a manual action using add-symbol-file. It should be possible to automate it a bit using GDB Catchpoints though.

An alternative approach that does not involve binary patching is LD_PRELOADing a library that defines custom routines for memcpy, memmove, etc. These will then take precedence over the glibc routines. The full list of functions is available in sysdeps/x86_64/multiarch/ifunc-impl-list.c. Current HEAD has more symbols compared to the glibc 2.25 release, in total (grep -Po 'IFUNC_IMPL \(i, name, \K[^,]+' sysdeps/x86_64/multiarch/ifunc-impl-list.c):

memchr, memcmp, __memmove_chk, memmove, memrchr, __memset_chk, memset, rawmemchr, strlen, strnlen, stpncpy, stpcpy, strcasecmp, strcasecmp_l, strcat, strchr, strchrnul, strrchr, strcmp, strcpy, strcspn, strncasecmp, strncasecmp_l, strncat, strncpy, strpbrk, strspn, strstr, wcschr, wcsrchr, wcscpy, wcslen, wcsnlen, wmemchr, wmemcmp, wmemset, __memcpy_chk, memcpy, __mempcpy_chk, mempcpy, strncmp, __wmemset_chk,
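
For anyone without GNU grep's -P option, the same extraction can be sketched in Python. ifunc_names is a hypothetical helper, and the sample text is a shortened strcmp entry rather than the full file.

```python
import re


def ifunc_names(source: str) -> list:
    """Return the function names that open each IFUNC_IMPL (...) entry."""
    # IFUNC_IMPL_ADD lines do not match because "_ADD" follows "IFUNC_IMPL".
    return re.findall(r'IFUNC_IMPL \(i, name, ([^,]+)', source)


# Shortened excerpt in the style of ifunc-impl-list.c
SAMPLE = """
  IFUNC_IMPL (i, name, strcmp,
          IFUNC_IMPL_ADD (array, i, strcmp,
                  HAS_ARCH_FEATURE (AVX2_Usable),
                  __strcmp_avx2)
          IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2))
"""
names = ifunc_names(SAMPLE)
print(names)
```

Reading the real file with open(...).read() and passing it to the function should give the list above for the corresponding glibc revision.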

Bonkers answered 11/6, 2017 at 11:47 Comment(4)

"alternative approach" may fail if there is usage of AVX2-enabled functions in the ld.so (there are some, don't know if they used before preloading LD_PRELOAD). Try patchelf --set-interpreter /some/short/path/ld.so ./my_program where path to new ld.so is not longer than original ld.so path - to always use new ld.so for the program. – Asarum 11/6, 2017 at 15:40

Here's a one line perl command that does the cpuid to nop modification: perl -pe 's/\x{31}\x{c0}.{0,32}\K\x{0f}\x{a2}/\x{66}\x{90}/' < ld-linux-orig > ld-linux-patched This was easy to stick into a build system rule to patch the loader. In my case, a toolchain post staging install hook in buildroot. – Michaelemichaelina 28/9, 2018 at 16:23

@TrentP: that pattern could in theory have false positives for a 0F A2 that appeared inside another instruction, or spanning an instruction boundary, somewhere within 32 bytes of a 31 C0 xor eax,eax. Probably a good idea to cmp -l your input/output and make sure only one replacement happened. – Mckinney 2/5, 2019 at 0:22

@PeterCordes I updated the post with a non-greedy version just in case there is an adjacent cpuid (I don't expect that to be honest). I also added a Perl one-liner for convenience. – Bonkers 3/8, 2019 at 13:52

C

19

It looks like there is a nice workaround for this implemented in recent versions of glibc: a "tunables" feature that guides selection of optimized string functions. You can find a general overview of this feature here and the relevant code inside glibc in ifunc-impl-list.c.

Here's how I figured it out. First, I took the address being complained about by gdb:

Process record does not support instruction 0xc5 at address 0x7ffff75c65d4.

I then looked it up in the table of shared libraries:

(gdb) info shared
From                To                  Syms Read   Shared Object Library
0x00007ffff7fd3090  0x00007ffff7ff3130  Yes         /lib64/ld-linux-x86-64.so.2
0x00007ffff76366b0  0x00007ffff766b52e  Yes         /usr/lib/x86_64-linux-gnu/libubsan.so.1
0x00007ffff746a320  0x00007ffff75d9cab  Yes         /lib/x86_64-linux-gnu/libc.so.6
...
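
The lookup in this table is just a range check. As a sketch, with the three rows above copied in by hand (owner is a hypothetical helper):

```python
# (from, to, path) rows copied from the "info shared" output above
SHARED = [
    (0x00007ffff7fd3090, 0x00007ffff7ff3130, "/lib64/ld-linux-x86-64.so.2"),
    (0x00007ffff76366b0, 0x00007ffff766b52e, "/usr/lib/x86_64-linux-gnu/libubsan.so.1"),
    (0x00007ffff746a320, 0x00007ffff75d9cab, "/lib/x86_64-linux-gnu/libc.so.6"),
]


def owner(addr: int):
    """Return the path of the library whose address range contains addr, or None."""
    for start, end, path in SHARED:
        if start <= addr <= end:
            return path
    return None


lib = owner(0x7ffff75c65d4)
print(lib)
```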

You can see that this address is within glibc. But what function, specifically?

(gdb) disassemble 0x7ffff75c65d4
Dump of assembler code for function __strcmp_avx2:
   0x00007ffff75c65d0 <+0>:     mov    %edi,%eax
   0x00007ffff75c65d2 <+2>:     xor    %edx,%edx
=> 0x00007ffff75c65d4 <+4>:     vpxor  %ymm7,%ymm7,%ymm7

I can look in ifunc-impl-list.c to find the code that controls selecting the avx2 version:

  IFUNC_IMPL (i, name, strcmp,
          IFUNC_IMPL_ADD (array, i, strcmp,
                  HAS_ARCH_FEATURE (AVX2_Usable),
                  __strcmp_avx2)
          IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSE4_2),
                  __strcmp_sse42)
          IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSSE3),
                  __strcmp_ssse3)
          IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2_unaligned)
          IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2))

It looks like AVX2_Usable is the feature to disable. Let's rerun gdb accordingly:

GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable gdb...

On this iteration it complained about __memmove_avx_unaligned_erms, which appeared to be enabled by AVX_Usable - but I found another path in ifunc-memmove.h enabled by AVX_Fast_Unaligned_Load. Back to the drawing board:

GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Fast_Unaligned_Load gdb ...

On this final round I discovered an rdtscp instruction in the ASAN shared library, so I recompiled without the address sanitizer and at last, it worked.

In summary: with some work it's possible to disable these instructions from the command line and use gdb's record feature without severe hacks.
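
Since the list of features to disable tends to grow by trial and error, it can help to build the variable programmatically. This is a minimal sketch: hwcaps_tunable and gdb_env are hypothetical helpers, and the namespace parameter is my own addition because some older glibc releases spell the tunable glibc.tune.hwcaps instead of glibc.cpu.hwcaps.

```python
import os


def hwcaps_tunable(disabled, namespace="cpu"):
    """Build a GLIBC_TUNABLES value that turns off the given hwcaps features."""
    return "glibc.%s.hwcaps=%s" % (namespace, ",".join("-" + f for f in disabled))


def gdb_env(disabled, namespace="cpu"):
    """Copy of the current environment with GLIBC_TUNABLES set."""
    env = dict(os.environ)
    env["GLIBC_TUNABLES"] = hwcaps_tunable(disabled, namespace)
    return env


value = hwcaps_tunable(["AVX2_Usable", "AVX_Fast_Unaligned_Load"])
print(value)
# To launch gdb with it (not executed here):
#   subprocess.run(["gdb", "./a"], env=gdb_env(["AVX2_Usable", "AVX_Fast_Unaligned_Load"]))
```

The feature names accepted depend on the glibc version, so the names passed in should come from that version's sources.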

Constitutionality answered 5/4, 2020 at 19:32 Comment(6)

I'm trying to use this GLIBC_TUNABLES flag, but I'm still getting the same error. It looks like the GLIBC_TUNABLES setting is ignored. Any ideas? – Partlet 23/12, 2021 at 22:9

Check your glibc version, maybe? – Constitutionality 24/12, 2021 at 23:8

The problem really is the glibc. Unfortunately I can't upgrade. Do you think that static linking may help here? – Partlet 28/12, 2021 at 21:32

I'm not sure. If you know your glibc version it's probably worth investigating to see what is happening. I would take a look at the code itself. – Constitutionality 30/12, 2021 at 5:35

My version of glibc required the use of 'glibc.tune.hwcaps' instead of 'glibc.cpu.hwcaps' and then it worked. I used grep on ld-linux-x86-64.so strings to figure this out. – Intercolumniation 15/3, 2022 at 15:57

With glibc-2.35-20.fc36.x86_64 I had to use (some subset would be enough): GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX,-AVX2,-AVX512BW,-AVX512DQ,-AVX512F,-AVX512VL,-BMI1,-BMI2,-LZCNT,-MOVBE,-RTM,-SSE4_1,-SSE4_2,-SSSE3 – Resile 15/11, 2022 at 10:58

T

4

I encountered this problem recently as well, and ended up solving it using dynamic CPUID faulting to interrupt execution of the CPUID instruction and override its result, which avoids touching glibc or the dynamic linker. This requires processor support for CPUID faulting (Ivy Bridge+) as well as Linux kernel support (4.12+) for exposing it to userspace through the ARCH_GET_CPUID and ARCH_SET_CPUID subfunctions of arch_prctl(). When this feature is enabled, a SIGSEGV signal is delivered on each execution of CPUID, allowing a signal handler to emulate execution of the instruction and override the result.

The full solution is a bit involved since I also need to interpose the dynamic linker, because hardware capability detection was moved there starting with glibc 2.26. I've uploaded the full solution online at https://github.com/ddcc/libcpuidoverride .
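
To check whether the kernel exposes this interface at all, the current state can be queried from a script. This is only a sketch of the ARCH_GET_CPUID query, not the approach used in libcpuidoverride; cpuid_enabled is a hypothetical helper, and the constants are the values I know from <asm/prctl.h> and the x86-64 syscall table, so treat them as assumptions to verify against your headers.

```python
import ctypes
import platform

ARCH_GET_CPUID = 0x1011   # assumed value from <asm/prctl.h>
ARCH_SET_CPUID = 0x1012   # listed for reference, not used here
SYS_ARCH_PRCTL = 158      # assumed x86-64 syscall number of arch_prctl


def cpuid_enabled():
    """True if CPUID executes normally, False if it faults, None if unknown."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ret = libc.syscall(ctypes.c_long(SYS_ARCH_PRCTL),
                       ctypes.c_long(ARCH_GET_CPUID),
                       ctypes.c_long(0))
    if ret < 0:
        # Kernel too old, or the call is not available in this environment
        return None
    return ret == 1


state = cpuid_enabled()
print(state)
```

Enabling faulting from Python would not be useful by itself, because the SIGSEGV handler has to emulate the instruction at the machine level.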

Tenterhook answered 1/5, 2019 at 22:54 Comment(2)

Hello, thanks for your work. Will this feature work with any of "Atom" series of Intel CPUs (Bit 31 of MSR_PLATFORM_INFO = MSR CEh)? Any AMD? More details are in your post dcddcc.com/blog/… – Asarum 3/5, 2019 at 0:44

I'm not sure, and there doesn't seem to be much about this online. It should be straightforward to test though, if you have access to an Atom-based system. Intel FlexMigration supports another mode called CPUID masking, where the processor can directly mask out CPUID bits, but this doesn't seem to be implemented or supported in the Linux kernel: github.com/torvalds/linux/blob/master/arch/x86/include/asm/… . Likewise, AMD Extended Migration has a similar feature called CPUID Override that seems similar to masking, but I don't believe it's implemented in Linux either. – Tenterhook 3/5, 2019 at 5:7

A

2

Not the best or a complete solution, just the smallest bit-editing kludge that allows valgrind and gdb record to work for my task.

Lekensteyn asks:

how to mask out AVX/SSE without recompiling glibc

I did a full rebuild of unmodified glibc, which is rather easy on Debian and Ubuntu: just sudo apt-get source glibc, sudo apt-get build-dep glibc and cd glibc-*/; dpkg-buildpackage -us -uc (manual on how to get the ld.so without stripped debugging information).

Then I did binary (bit) patching of the output ld.so file, in the function used by __get_cpu_features. The target function was compiled from get_common_indeces of the source file sysdeps/x86/cpu-features.c under the name get_common_indeces.constprop.1 (it comes right after __get_cpu_features in the binary code). It has several cpuid instructions; the first one is cpuid eax=1 "Processor Info and Feature Bits", and later there is a comparison with 0x6 followed by a jle that jumps around the code for "cpuid eax=7 ecx=0 Extended Features", which is only there to get the AVX2 status. This is the source that was compiled into that logic:

get_common_indeces (struct cpu_features *cpu_features,
            unsigned int *family, unsigned int *model,
            unsigned int *extended_model, unsigned int *stepping)
{ ...
  if (cpu_features->max_cpuid >= 7)
    __cpuid_count (7, 0,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].eax,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].ebx,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].ecx,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].edx);

The cpu_features->max_cpuid field was filled in init_cpu_features of the same file by the line __cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx);. It was easier to disable the if statement by replacing the jle after cmp 0x6 with jg (byte 0x7e to 0x7f). (Actually this binary patch was reapplied manually to the __get_cpu_features function of the real system ld-linux.so.2: the first jle before mov 7 eax; xor ecx,ecx; cpuid was changed into jg.)

The recompiled package and the modified ld.so were not installed into the system; I used the command-line syntax ld.so ./my_program (or mv ld.so /some/short/path.so and patchelf --set-interpreter /some/short/path.so ./my_program).
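
The one-byte edit itself can be sketched as follows. flip_jle_to_jg is a hypothetical helper, the offset has to be found by hand in a disassembly, and the sample bytes are a synthetic cmp eax,6 / jle sequence rather than bytes from a real ld.so.

```python
JLE_REL8 = 0x7E  # short "jle" opcode
JG_REL8 = 0x7F   # short "jg" opcode


def flip_jle_to_jg(code: bytes, offset: int) -> bytes:
    """Return a copy of code with the jle at offset turned into jg."""
    if code[offset] != JLE_REL8:
        raise ValueError("byte at offset %#x is %#x, not jle" % (offset, code[offset]))
    patched = bytearray(code)
    patched[offset] = JG_REL8
    return bytes(patched)


# Synthetic example: cmp eax,6 (83 f8 06) followed by jle +0x10 (7e 10)
sample = bytes.fromhex("83f8067e10")
result = flip_jle_to_jg(sample, 3)
print(result.hex())
```

Checking the original byte before writing guards against patching the wrong offset.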
