Multiple accesses to main memory and out-of-order execution

Let us assume that I have two pointers that are pointing to unrelated addresses that are not cached, so they will both have to come all the way from main memory when being dereferenced.

int load_and_add(int *pA, int *pB)
{
    int a = *pA;   // will most likely miss in cache
    int b = *pB;   // will most likely miss in cache 

    // ...  some code that does not use a or b

    int c = a + b;
    return c;
}

Given that out-of-order execution lets the unrelated code run before c is computed, how will the fetches of a and b proceed on a modern Intel processor?

Are the two memory accesses completely serialized, or can the CPU's memory subsystem overlap (pipeline) the two fetches?

In other words, if we assume that an access to main memory costs 300 cycles, will fetching a and b cost 600 cycles, or does out-of-order execution allow the two fetches to overlap and perhaps cost fewer cycles?
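
To make the question concrete, here is a rough sketch of how I imagine the difference could be measured. The buffer names, the cache-line flushing and the __rdtscp timing are purely illustrative, and a single trial like this is noisy, but it contrasts the two cases I am asking about:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtscp, _mm_clflush, _mm_mfence

// Two static buffers far enough apart that touching one line
// does not pull in the other via the adjacent-line prefetcher.
static int buf_a[4096], buf_b[4096];

// Both addresses are known up front, like *pA and *pB in the question.
static uint64_t time_independent(void)
{
    unsigned aux;
    _mm_clflush(&buf_a[0]);
    _mm_clflush(&buf_b[0]);
    _mm_mfence();                            // wait for the flushes

    uint64_t t0 = __rdtscp(&aux);
    int a = *(volatile int *)&buf_a[0];      // miss #1
    int b = *(volatile int *)&buf_b[0];      // miss #2, independent of #1
    uint64_t t1 = __rdtscp(&aux);
    (void)(a + b);
    return t1 - t0;
}

// The second address comes from the first load's result, so the
// two misses cannot overlap no matter what the CPU does.
static uint64_t time_dependent(void)
{
    unsigned aux;
    // a's value encodes b's offset (pointer math across objects is
    // technically undefined, but fine for a demo)
    buf_a[0] = (int)((char *)&buf_b[0] - (char *)&buf_a[0]);
    _mm_clflush(&buf_a[0]);
    _mm_clflush(&buf_b[0]);
    _mm_mfence();

    uint64_t t0 = __rdtscp(&aux);
    int off = *(volatile int *)&buf_a[0];                  // miss #1
    int b = *(volatile int *)((char *)&buf_a[0] + off);    // miss #2 waits for #1
    uint64_t t1 = __rdtscp(&aux);
    (void)b;
    return t1 - t0;
}

int main(void)
{
    printf("independent loads: %llu cycles\n", (unsigned long long)time_independent());
    printf("dependent loads:   %llu cycles\n", (unsigned long long)time_dependent());
    return 0;
}

If the two misses really do overlap, I would expect the independent version to come in closer to one memory latency (the ~300 cycles above) than to two.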

Cretinism asked 2/5, 2016 at 19:15 Comment(3)
edited to use the right terminology. E.g. "hit" is usually used to describe a cache hit, so "hitting main memory" doesn't parse easily when skimming. "Consecutive" would normally be used when the memory addresses are consecutive. The question is whether they're handled in parallel (pipelined) or not.Nagoya
Thanks @PeterCordes, great rewording. I really struggled writing the question, but indeed, the bottom line I was trying to learn is if the memory reads were handled in parallel.Cretinism
No worries, it's often hard to ask a question the "right" way if you don't already know enough to search and find the answer yourself :PNagoya

Modern CPUs have multiple load buffers, so multiple loads can be outstanding at the same time. The memory subsystem is heavily pipelined, giving many parts of it much better throughput than latency. (For example, with prefetching Haswell can sustain an 8B load from main memory every clock, but the latency when the address isn't known ahead of time is in the hundreds of cycles.)

So yes, a Haswell core can keep track of up to 72 outstanding load uops waiting for data from cache / memory. (This is per-core. The shared L3 cache also needs some buffers to handle the whole system's loads / stores to DRAM and memory-mapped IO.)

Haswell's ReOrder Buffer (ROB) holds 192 uops, so up to 190 uops of work from the code that does not use a or b can be issued and executed while the loads of a and b are the oldest instructions that haven't retired. Instructions / uops retire in order to support precise exceptions. The ROB size is essentially the limit of the out-of-order window for hiding the latency of slow operations like cache misses.
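
To see the effect of that out-of-order window in practice, a classic demonstration (sketched below with illustrative sizes and names, not taken from any particular benchmark) is to compare chasing a single pointer chain, where every load address depends on the previous miss, against walking two independent chains at once, where the core can keep two misses in flight:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Build and walk randomized pointer chains; compile with e.g. gcc -O2 -std=gnu11.
#define NODES (1 << 21)   // ~2M nodes * 64 B = 128 MiB per chain, far bigger than L3

struct node { struct node *next; char pad[56]; };   // one node per 64-byte line

// Link the pool into a chain that visits the nodes in random order,
// so the hardware prefetchers can't guess the next address.
static struct node *build_chain(struct node *pool, size_t n)
{
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              // Fisher-Yates shuffle
        size_t j = (size_t)rand() % (i + 1);          // crude randomness, fine for a demo
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++) pool[idx[i]].next = &pool[idx[i + 1]];
    pool[idx[n - 1]].next = NULL;
    struct node *head = &pool[idx[0]];
    free(idx);
    return head;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    struct node *a = build_chain(calloc(NODES, sizeof(struct node)), NODES);
    struct node *b = build_chain(calloc(NODES, sizeof(struct node)), NODES);

    // One chain: each load's address depends on the previous load, no overlap possible.
    size_t count = 0;
    double t0 = seconds();
    for (struct node *p = a; p; p = p->next) count++;
    printf("1 chain:  %.1f ns per load (%zu loads)\n",
           (seconds() - t0) * 1e9 / count, count);

    // Two chains walked together: two independent misses can be outstanding at once.
    count = 0;
    t0 = seconds();
    for (struct node *p = a, *q = b; p && q; p = p->next, q = q->next) count += 2;
    printf("2 chains: %.1f ns per load (%zu loads)\n",
           (seconds() - t0) * 1e9 / count, count);
    return 0;
}

On hardware like Haswell you'd expect the per-load time of the two-chain walk to be roughly half that of the single chain, because the two dependency chains keep two misses in flight in those load buffers; adding more chains keeps helping until the buffers (or DRAM bandwidth) become the limit.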

Also see other links at the tag wiki to learn how CPUs work. Agner Fog's microarch guide is great for having a mental model of the CPU pipeline to let you understand approximately how code will execute.

(Block diagram of the Haswell core, from David Kanter's Haswell writeup.)

Nagoya answered 2/5, 2016 at 19:46 Comment(1)
It might be worth noting that miss under miss (i.e., starting a second cache-missing memory access while an earlier miss is still outstanding) does not require out-of-order execution; it only requires a scoreboard to track that the loaded values are not yet present (execution stalls only once a value is actually used). With only a scoreboard, a TLB miss on the second access would prevent memory parallelism because a precise exception could not be guaranteed. (A history or future file, while still issuing in-order, would allow such speculation.)Coniferous
