intel-pmu Questions

2

I am attempting to run Alex Ionescu's WinIPT interface in a virtual machine, and having no success. (This is a Windows 10 Pro host running a Windows 10 VM and both are the 18363 update) I have suc...
Holophrastic asked 7/2, 2020 at 21:24

1

Solved

I am running a C++ benchmark test for a specific application. In this test, I open the performance counter file (__NR_perf_event_open syscall) before the critical section, proceed with the section ...
Ginnifer asked 30/9, 2021 at 16:36

2

Solved

Last Branch Record refers to a collection of register pairs (MSRs) that store the source and destination addresses related to recently executed branches. http://css.csail.mit.edu/6.858/2012/reading...
Rockefeller asked 3/2, 2013 at 8:7

5

Solved

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means DRAM (i.e., accesses that do not hit in any cache level).
Entresol asked 2/12, 2017 at 21:37

2

Is there a perf stat equivalent on Mac OS? I would like to do the same thing for a CLI command and googling is not yielding anything.
Caston asked 6/4, 2020 at 21:15

2

The description of the RESOURCE_STALLS.RS hardware performance event for Intel Broadwell is the following: This event counts stall cycles caused by absence of eligible entries in the reservatio...
Manhandle asked 5/10, 2018 at 0:15

2

Solved

I'm trying to understand the rdpmc instruction. As such I have the following asm code: segment .text global _start _start: xor eax, eax mov ebx, 10 .loop: dec ebx jnz .loop mov ecx, 1<<...
Casebook asked 17/5, 2019 at 19:43

1

Solved

Summary Consider the following loop: loop: movl $0x1,(%rax) add $0x40,%rax cmp %rdx,%rax jne loop where rax is initialized to the address of a buffer that is larger than the L3 cache size. Ever...
Planter asked 5/3, 2019 at 2:59

1

Solved

Some built-in perf events are mapped to offcore events. For example, LLC-loads and LLC-load-misses are mapped to OFFCORE_RESPONSE events. This can be easily determined as discussed here. Howeve...
Cornia asked 16/1, 2019 at 18:19

0

Consider the following simple code: #include <stdlib.h> #include <stdio.h> #include <string.h> #include <time.h> #include <err.h> int cpu_ms() { return (int)(clock(...
Ewers asked 29/9, 2018 at 5:9

2

Solved

Consider the following loop: .loop: add rsi, OFFSET mov eax, dword [rsi] dec ebp jg .loop where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss...
Swanhilda asked 26/9, 2018 at 23:25

2

Solved

I was experimenting with the AVX/AVX2 instruction sets to see the performance of streaming on consecutive arrays. I have the example below, where I do a basic memory read and store. #include <iostrea...
Tiemroth asked 27/10, 2013 at 18:8

0

On Intel x86, Linux uses the event l1d.replacements to implement its L1-dcache-load-misses event. This event is defined as follows: Counts L1D data line replacements including opportunistic re...
Preconscious asked 4/9, 2018 at 20:20

1

Solved

When I run perf list I see a bunch of Hardware Cache Events, as follows: $ perf list | grep 'cache event' L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware cache event] L1-...
Shields asked 4/9, 2018 at 16:58

1

Solved

I was investigating the capabilities of the branch unit on port 0 of my Haswell starting with a very simple loop: BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 10000000 .loop:...
Slumberous asked 28/8, 2018 at 9:32

1

Solved

Newer Linux kernels have a sysctl tunable /proc/sys/kernel/perf_event_paranoid which allows the user to adjust the available functionality of perf_events for non-root users, with higher numbers bein...
Barbabra asked 18/8, 2018 at 18:8
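
For readers unfamiliar with this tunable, here is a minimal Python sketch that reads the current level and lists what it restricts for unprivileged users. The function names (read_paranoid_level, restrictions_for) are hypothetical, and the level descriptions are paraphrased from the kernel's sysctl documentation, so treat them as an approximation that may differ between kernel versions and distributions.

```python
from pathlib import Path

# Location named in the question above.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")

# (threshold, restriction) pairs, paraphrased from the kernel sysctl docs.
# A restriction applies to unprivileged users when level >= threshold.
RESTRICTIONS = [
    (0, "raw and ftrace function tracepoint access"),
    (1, "CPU event access"),
    (2, "kernel profiling"),
]


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current level as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def restrictions_for(level):
    """List the restrictions that apply at the given level (hypothetical helper)."""
    return [what for threshold, what in RESTRICTIONS if level >= threshold]


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print(f"level {level}: disallows {restrictions_for(level) or 'nothing'}")
```

A level of -1 yields an empty list here, which matches the documented "allow (almost) all events" setting.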

2

Solved

I've profiled my code using Instruments' Time Profiler, and zooming in to the disassembly, here's a snippet of its results: I wouldn't expect a mov instruction to take 23.3% of the time while a ...
Isoagglutinin asked 21/1, 2018 at 16:58

2

I am trying to understand the multiplexing and scaling of the "cycles" event in the "perf" output. The following is the output of the perf tool: 144094.487583 task-clock (msec) # 1.017 CPUs utilized 53991...
Ervin asked 24/1, 2018 at 4:24
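
As background for this question, the sketch below shows the usual extrapolation applied when an event is multiplexed: the raw count is scaled by the ratio of the time the event was enabled to the time it was actually running on a counter. The function names are hypothetical, and the inputs are assumed to be the raw count plus the time_enabled and time_running values that perf_event_open can report.

```python
def scale_count(raw_count, time_enabled_ns, time_running_ns):
    """Extrapolate a multiplexed count to the full enabled period.

    Returns None when the event never ran, since no estimate is possible.
    """
    if time_running_ns == 0:
        return None
    return raw_count * time_enabled_ns / time_running_ns


def running_percent(time_enabled_ns, time_running_ns):
    """Share of the enabled time during which the event was on a counter."""
    if time_enabled_ns == 0:
        return 0.0
    return 100.0 * time_running_ns / time_enabled_ns


if __name__ == "__main__":
    # Made-up numbers for illustration only: the event was scheduled on a
    # hardware counter for half of the time it was enabled.
    raw, enabled, running = 500, 1_000, 500
    print(scale_count(raw, enabled, running))
    print(running_percent(enabled, running))
```

The scaled value is an estimate: it assumes the event rate during the unmeasured intervals matches the rate during the measured ones.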

1

Solved

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system): results[] = ... for (iteration = 0; iteration < num_iter...
Remonstrance asked 18/8, 2016 at 15:6
