Multiple accesses to main memory and out-of-order execution

Let us assume that I have two pointers that are pointing to unrelated addresses that are not cached, so they will both have to come all the way from main memory when being dereferenced.

int load_and_add(int *pA, int *pB)
{
    int a = *pA;   // will most likely miss in cache
    int b = *pB;   // will most likely miss in cache 

    // ...  some code that does not use a or b

    int c = a + b;
    return c;
}

Given that out-of-order execution lets the unrelated code run before c is computed, how will the fetches of a and b proceed on a modern Intel processor?

Are the two memory accesses completely serialized, or can the CPU's memory subsystem overlap (pipeline) the two fetches?

In other words, if we assume that an access to main memory costs 300 cycles, will fetching a and b cost 600 cycles, or does out-of-order execution allow the two fetches to overlap and perhaps cost fewer cycles?
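
To make the question concrete, here is a rough sketch of how I imagine the difference could be measured. The buffer names, the cache-line flushing and the __rdtscp timing are purely illustrative, and a single trial like this is noisy, but it contrasts the two cases I am asking about:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtscp, _mm_clflush, _mm_mfence

// Two static buffers far enough apart that touching one line
// does not pull in the other via the adjacent-line prefetcher.
static int buf_a[4096], buf_b[4096];

// Both addresses are known up front, like *pA and *pB in the question.
static uint64_t time_independent(void)
{
    unsigned aux;
    _mm_clflush(&buf_a[0]);
    _mm_clflush(&buf_b[0]);
    _mm_mfence();                            // wait for the flushes

    uint64_t t0 = __rdtscp(&aux);
    int a = *(volatile int *)&buf_a[0];      // miss #1
    int b = *(volatile int *)&buf_b[0];      // miss #2, independent of #1
    uint64_t t1 = __rdtscp(&aux);
    (void)(a + b);
    return t1 - t0;
}

// The second address comes from the first load's result, so the
// two misses cannot overlap no matter what the CPU does.
static uint64_t time_dependent(void)
{
    unsigned aux;
    // a's value encodes b's offset (pointer math across objects is
    // technically undefined, but fine for a demo)
    buf_a[0] = (int)((char *)&buf_b[0] - (char *)&buf_a[0]);
    _mm_clflush(&buf_a[0]);
    _mm_clflush(&buf_b[0]);
    _mm_mfence();

    uint64_t t0 = __rdtscp(&aux);
    int off = *(volatile int *)&buf_a[0];                  // miss #1
    int b = *(volatile int *)((char *)&buf_a[0] + off);    // miss #2 waits for #1
    uint64_t t1 = __rdtscp(&aux);
    (void)b;
    return t1 - t0;
}

int main(void)
{
    printf("independent loads: %llu cycles\n", (unsigned long long)time_independent());
    printf("dependent loads:   %llu cycles\n", (unsigned long long)time_dependent());
    return 0;
}

If the two misses really do overlap, I would expect the independent version to come in closer to one memory latency (the ~300 cycles above) than to two.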

Cretinism asked 2/5, 2016 at 19:15 Comment(3)
edited to use the right terminology. E.g. "hit" is usually used to describe a cache hit, so "hitting main memory" doesn't parse easily when skimming. "Consecutive" would normally be used when the memory addresses are consecutive. The question is whether they're handled in parallel (pipelined) or not.Nagoya
Thanks @PeterCordes, great rewording. I really struggled writing the question, but indeed, the bottom line I was trying to learn is if the memory reads were handled in parallel.Cretinism
No worries, it's often hard to ask a question the "right" way if you don't already know enough to search and find the answer yourself :PNagoya

Modern CPUs have multiple load buffers, so multiple loads can be outstanding at the same time. The memory subsystem is heavily pipelined, giving many parts of it much better throughput than latency. (For example, with prefetching Haswell can sustain an 8B load from main memory every clock, but the latency when the address isn't known ahead of time is in the hundreds of cycles.)

So yes, a Haswell core can keep track of up to 72 outstanding load uops waiting for data from cache / memory. (This is per-core. The shared L3 cache also needs some buffers to handle the whole system's loads / stores to DRAM and memory-mapped IO.)

Haswell's ReOrder Buffer (ROB) holds 192 uops, so up to 190 uops of work from the code that does not use a or b can be issued and executed while the loads of a and b are the oldest instructions that haven't retired. Instructions / uops retire in order to support precise exceptions. The ROB size is essentially the limit of the out-of-order window for hiding the latency of slow operations like cache misses.
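
To see the effect of that out-of-order window in practice, a classic demonstration (sketched below with illustrative sizes and names, not taken from any particular benchmark) is to compare chasing a single pointer chain, where every load address depends on the previous miss, against walking two independent chains at once, where the core can keep two misses in flight:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Build and walk randomized pointer chains; compile with e.g. gcc -O2 -std=gnu11.
#define NODES (1 << 21)   // ~2M nodes * 64 B = 128 MiB per chain, far bigger than L3

struct node { struct node *next; char pad[56]; };   // one node per 64-byte line

// Link the pool into a chain that visits the nodes in random order,
// so the hardware prefetchers can't guess the next address.
static struct node *build_chain(struct node *pool, size_t n)
{
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              // Fisher-Yates shuffle
        size_t j = (size_t)rand() % (i + 1);          // crude randomness, fine for a demo
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++) pool[idx[i]].next = &pool[idx[i + 1]];
    pool[idx[n - 1]].next = NULL;
    struct node *head = &pool[idx[0]];
    free(idx);
    return head;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    struct node *a = build_chain(calloc(NODES, sizeof(struct node)), NODES);
    struct node *b = build_chain(calloc(NODES, sizeof(struct node)), NODES);

    // One chain: each load's address depends on the previous load, no overlap possible.
    size_t count = 0;
    double t0 = seconds();
    for (struct node *p = a; p; p = p->next) count++;
    printf("1 chain:  %.1f ns per load (%zu loads)\n",
           (seconds() - t0) * 1e9 / count, count);

    // Two chains walked together: two independent misses can be outstanding at once.
    count = 0;
    t0 = seconds();
    for (struct node *p = a, *q = b; p && q; p = p->next, q = q->next) count += 2;
    printf("2 chains: %.1f ns per load (%zu loads)\n",
           (seconds() - t0) * 1e9 / count, count);
    return 0;
}

On hardware like Haswell you'd expect the per-load time of the two-chain walk to be roughly half that of the single chain, because the two dependency chains keep two misses in flight in those load buffers; adding more chains keeps helping until the buffers (or DRAM bandwidth) become the limit.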

Also see other links at the tag wiki to learn how CPUs work. Agner Fog's microarch guide is great for having a mental model of the CPU pipeline to let you understand approximately how code will execute.

(Block diagram of the Haswell core, from David Kanter's Haswell writeup.)

Nagoya answered 2/5, 2016 at 19:46 Comment(1)
It might be worth noting that miss under miss (i.e., starting a second cache-missing memory access while an earlier miss is still outstanding) does not require out-of-order execution; it only requires a scoreboard to track that the loaded values are not yet present (execution stalls only once a value is actually used). With only a scoreboard, a TLB miss on the second access would prevent memory parallelism because a precise exception could not be guaranteed. (A history or future file, while still issuing in-order, would allow such speculation.)Coniferous
