cpu-cache Questions

1

My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x so I did some testing and found only ...
Melonie asked 19/6, 2022 at 5:11

1

I am conducting a test to measure the message synchronization latency between different cores of a CPU. Specifically, I am measuring how many clock cycles it takes for CPU2 to detect changes in the...
Chema asked 24/11, 2023 at 14:00

0

I've been reading through the "What every programmer should know about memory" paper and got confused with the measurements performed on pages 20-21 of the document. Sequential Read Acces...
Yettie asked 25/8, 2023 at 12:22

4

Solved

I want to write a program to get my cache sizes (L1, L2, L3). I know the general idea: allocate a big array, then access a part of it of a different size each time. So I wrote a little program. Here'...
Rushton asked 2/10, 2013 at 12:25
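
The idea described in this question (allocate a big array, then time accesses over working sets of different sizes) could be sketched roughly as below. This is not the asker's program, which is cut off in the excerpt. The function name `ns_per_access`, the fixed seed, and the choice of random pointer chasing are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: average nanoseconds per dependent load when chasing
// pointers through a working set of `bytes` bytes.
double ns_per_access(std::size_t bytes, std::size_t accesses) {
    const std::size_t n = bytes / sizeof(std::size_t);
    assert(n >= 2 && accesses > 0);

    // Build a single random cycle through all slots (Sattolo's algorithm),
    // so every slot is visited and the next address is hard to prefetch.
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng(12345);  // fixed seed, an arbitrary choice
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // Warm-up pass: touch every slot once before timing.
    std::size_t pos = 0;
    for (std::size_t i = 0; i < n; ++i) pos = next[pos];

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < accesses; ++i) pos = next[pos];
    const auto stop = std::chrono::steady_clock::now();

    // Use the result so the timed loop is not optimized away.
    volatile std::size_t sink = pos;
    (void)sink;

    return std::chrono::duration<double, std::nano>(stop - start).count() /
           static_cast<double>(accesses);
}
```

Calling it for working sets from a few KiB up to tens of MiB and looking for jumps in the per-access time gives a rough hint of where each cache level ends. The numbers are noisy and depend on the machine, so treat them as estimates only.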

1

Solved

Data is usually aligned to the size of its own type, i.e., a 32-bit int is usually aligned to 4 bytes, which makes loading and storing it more efficient for the processor. Now when does cache line alignment ...
Swearword asked 31/7, 2023 at 15:43

2

I stumbled upon a peculiar performance issue when running the following C++ code on some Intel Xeon processors: // array_a contains permutation of [0, n - 1] // array_b and inverse are initialized ...
Projective asked 7/9, 2020 at 15:23

10

Solved

What is the difference between "cache unfriendly code" and the "cache friendly" code? How can I make sure I write cache-efficient code?
Vain asked 22/5, 2013 at 18:37
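
A common illustration of the difference asked about here is traversal order over a matrix stored in one contiguous buffer. The sketch below is only an example under that assumption. The function names are hypothetical and it does not come from the answers to this question.

```cpp
#include <cstddef>
#include <vector>

// The matrix is stored row-major: element (r, c) lives at index r * cols + c.

// Cache-friendly: the inner loop walks consecutive addresses, so each cache
// line that is fetched gets fully used before moving on.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Cache-unfriendly for large matrices: the inner loop jumps `cols` elements
// at a time, so consecutive accesses usually land on different cache lines.
long long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Both functions return the same sum. Only the memory access pattern differs, and any speed difference typically shows up once the matrix is too large to fit in cache.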

1

Solved

I was tasked with implementing an optimised matrix multiplication micro-kernel that computes C = A*B in C++ starting from the following snippet of code. I am getting some counter intuitive behaviou...
Owing asked 4/3, 2023 at 17:52

1

I have an Intel Sapphire Rapids CPU with 56 cores. By default, SNC is not enabled. When core 0 accesses a certain memory address A, I think the following will happen: One of the cache agent is acc...
Sinuate asked 20/11, 2022 at 20:45

4

Solved

I have seen the related questions, including here and here, but it seems that the only instruction ever mentioned for serializing rdtsc is cpuid. Unfortunately, cpuid takes roughly 1000 cycles on my ...
Silverside asked 24/4, 2014 at 22:02

8

Solved

I'd like my program to read the cache line size of the CPU it's running on in C++. I know that this can't be done portably, so I will need a solution for Linux and another for Windows (Solutions fo...
Polak asked 29/9, 2008 at 19:35
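
One possible shape for such a per-platform query is sketched below. It assumes glibc's `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)` extension on Linux and `GetLogicalProcessorInformation` on Windows. The function name `cache_line_size` and the 64-byte fallback are assumptions for illustration, not something taken from the answers to this question.

```cpp
#include <cstddef>
#include <vector>

#if defined(_WIN32)
#include <windows.h>
#elif defined(__linux__)
#include <unistd.h>
#endif

// Hypothetical helper: returns the L1 data cache line size in bytes, or a
// 64-byte guess (an assumption) when the platform gives no answer.
std::size_t cache_line_size() {
    constexpr std::size_t kFallback = 64;
#if defined(_WIN32)
    // The first call only reports the buffer size needed.
    DWORD bytes = 0;
    GetLogicalProcessorInformation(nullptr, &bytes);
    if (bytes == 0) return kFallback;

    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &bytes)) return kFallback;

    for (const auto& entry : info) {
        if (entry.Relationship == RelationCache && entry.Cache.Level == 1 &&
            entry.Cache.LineSize > 0) {
            return static_cast<std::size_t>(entry.Cache.LineSize);
        }
    }
    return kFallback;
#elif defined(__linux__) && defined(_SC_LEVEL1_DCACHE_LINESIZE)
    // glibc extension; it may return 0 or -1 when the value is unknown.
    const long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    return n > 0 ? static_cast<std::size_t>(n) : kFallback;
#else
    return kFallback;
#endif
}
```

C++17 also defines `std::hardware_destructive_interference_size` in `<new>`, but that is a compile-time constant chosen by the toolchain, not a query of the CPU the program is actually running on, and not every compiler provides it.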

2

Solved

Cache lines are often 64 bytes, though other sizes also exist. My very simple question is: is there any theory behind this number, or is it just the result of the vast amount of tests and measurements th...
Unbearable asked 30/3, 2016 at 15:11

3

Solved

In languages like C, unsynchronized reads and writes to the same memory location from different threads are undefined behavior. But in the CPU, cache coherence says that if one core writes to a memo...
Sceptic asked 11/10, 2021 at 12:05

1

In Paul McKenney's famous paper "Memory Barriers: A Hardware View for Software Hackers", section 3.3 "Store Buffers and Memory Barriers": To see the second complication, a violation of global memory ...

1

I have a discrete NVIDIA GPU (say, Kepler or Maxwell). I want to clear my L2 cache before some kernel is scheduled, so as not to taint my test results. I could do something like allocate a large s...
Instill asked 15/7, 2015 at 11:39

0

In MESI protocol you write to the cache line only when holding it in the Exclusive/Modified state. To acquire the Exclusive state, you send an Invalidate request to all the cores holding the same c...
Ritual asked 27/8, 2022 at 6:48

0

I have a piece of code written in C and I want to find the bottleneck for this code! Using perf tool and annotating the assembly code, I see the push %r12 instruction at the start of the function i...
Saurel asked 31/5, 2022 at 9:34

1

Solved

I know store buffer and invalidate queues are reasons that cause memory reordering. What I don't know is if Out-of-Order-Execution can cause memory reordering. In my opinion, Out-of-Order-Execution...
Bailee asked 6/4, 2022 at 14:32

1

Solved

I found a comment from crossbeam. Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than 64. Sou...
Circumcise asked 5/5, 2022 at 11:44
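
In C++, the kind of padding that the quoted crossbeam comment describes could look roughly like the sketch below. The 128-byte figure is taken from that comment and is not verified here. The type names `PaddedCounter` and `Counters` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 128 bytes, following the crossbeam comment quoted in the
// question (two adjacent 64-byte lines pulled in by the spatial prefetcher).
constexpr std::size_t kPadAlign = 128;

// Each counter gets its own 128-byte aligned slot, so two counters never
// share a 64-byte line or an adjacent pair of lines.
struct alignas(kPadAlign) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(alignof(PaddedCounter) == kPadAlign, "unexpected alignment");
static_assert(sizeof(PaddedCounter) == kPadAlign, "size is rounded up to the alignment");

// Two counters meant to be updated by different threads.
struct Counters {
    PaddedCounter produced;
    PaddedCounter consumed;
};
```

The cost is memory: every counter now occupies 128 bytes. Note that allocating such over-aligned types with plain `new` only honors the alignment from C++17 onward.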

1

Solved

I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns. // compiled w...
Trochee asked 10/4, 2022 at 16:04

0

Env: x86-64; linux-centos; 8-cpu-core. For testing 'false sharing performance' I wrote C++ code like this: volatile int32_t a; volatile int32_t b; int64_t p1[7]; volatile int64_t c; int64_t p...
Sanity asked 8/11, 2021 at 7:17

1

Solved

Given this code snippet from this textbook that I am currently studying. Randal E. Bryant, David R. O’Hallaron - Computer Systems. A Programmer’s Perspective [3rd ed.] (2016, Pearson) (global editi...
Farad asked 10/7, 2021 at 6:48

2

Solved

So I am trying to learn the performance metrics of various components of a computer, such as L1 cache, L2 cache, main memory, ethernet, disk etc., as below: Latency Comparison Numbers ------------------------...
Forage asked 6/4, 2020 at 17:36

2

I'm struggling to solve this question. I've looked around, but all of the similar questions are more advanced than mine: they make use of logs, which is beyond what we've done in our class. Here's t...
Futch asked 11/6, 2014 at 12:58

2

Solved

The x86 INVD invalidates the cache hierarchy without writing the contents back to memory, apparently. I'm curious, what use is such an instruction? Given how one has very little control over what ...
Baccivorous asked 21/1, 2017 at 3:13

© 2022 - 2025 — McMap. All rights reserved.