What does the STREAM memory bandwidth benchmark really measure?

I have a few questions on STREAM (http://www.cs.virginia.edu/stream/ref.html#runrules) benchmark.

  1. Below is the comment from stream.c. What is the rationale behind the requirement that each array should be at least 4 times the size of the cache?
 *       (a) Each array must be at least 4 times the size of the
 *           available cache memory. I don't worry about the difference
 *           between 10^6 and 2^20, so in practice the minimum array size
 *           is about 3.8 times the cache size.
  2. I originally assumed STREAM measures the peak memory bandwidth. But I later found that when I add extra arrays and array accesses, I can get larger bandwidth numbers. So it looks to me that STREAM isn't guaranteed to saturate memory bandwidth. Then my question is: what does STREAM really measure, and how do you use the numbers reported by STREAM?

For example, I added two extra arrays and made sure to access them together with the original a/b/c arrays. I modified the byte accounting accordingly. With these two extra arrays, my bandwidth number is bumped up by ~11.5%.

> diff stream.c modified_stream.c
181c181,183
<                       c[STREAM_ARRAY_SIZE+OFFSET];
---
>                       c[STREAM_ARRAY_SIZE+OFFSET],
>                       e[STREAM_ARRAY_SIZE+OFFSET],
>                       d[STREAM_ARRAY_SIZE+OFFSET];
192,193c194,195
<     3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
<     3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
---
>     5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
>     5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
270a273,274
>             d[j] = 3.0;
>             e[j] = 3.0;
335c339
<           c[j] = a[j]+b[j];
---
>           c[j] = a[j]+b[j]+d[j]+e[j];
345c349
<           a[j] = b[j]+scalar*c[j];
---
>           a[j] = b[j]+scalar*c[j] + d[j]+e[j];

CFLAGS = -O2 -fopenmp -D_OPENMP -DSTREAM_ARRAY_SIZE=50000000

My last level cache is around 35MB.
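
(For reference: 50,000,000 doubles × 8 bytes ≈ 400 MB per array, so each array is more than ten times the 35 MB LLC and the "4 times the cache size" run rule is satisfied by a wide margin.)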

Any comment?

Thanks!

This is for a Skylake Linux server.

Kropp answered 11/5, 2019 at 3:44 Comment(8)
Also, I tried different numactl configs to pin the threads or memory to different NUMA nodes. My changed stream.c always reports a bandwidth number that is more than 10% higher, in all of the configurations. So I think we can exclude the possibility that NUMA-ness causes the variance.Kropp
A single thread generally can't saturate DRAM bandwidth, especially on an Intel server chip. Single-core bandwidth is limited by latency / max_concurrency of the number of outstanding off-core requests it can have in flight, not by DRAM controller bandwidth. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? compares a Broadwell-E vs. a quad-core Skylake desktop.Meador
Oh, but you're using OpenMP so I guess you're measuring aggregate bandwidth with all cores saturated? Your change looks like it shifts the balance more towards reads. IDK if you're getting any L3 cache hits. Presumably none of the data is shared between threads, in which case you'd expect more reads to help more.Meador
Yes, I am measuring aggregate bandwidth. I am confused now what STREAM really tells if it doesn't saturate bandwidth. Any other benchmarks to recommend? ThanksKropp
STREAM tells you how fast a loop like that can run. With all cores active it should usually be close to saturating DRAM bandwidth, but cache hits could inflate the total. Modern CPUs are extremely complex beasts, and there are many pitfalls in predicting performance of one loop from the performance of another. Benchmark your own application, or a key loop from it if you care about that. But for characterizing hardware, STREAM is one of the benchmarks that gets used, while others include SiSoft Sandra.Meador
e.g. if you google up modern hardware reviews, they often run Sandra and/or AIDA64 memory bandwidth benchmarks. pcworld.com/article/3298859/… or kitguru.net/components/motherboard/ryan-martin/…. I think those benchmarks are hand-written in asm to maybe try to come closer to saturating read, write, or copy bandwidth. STREAM TRIAD bandwidth (or yours with more input streams) is less purely synthetic, esp. when you compile it yourself.Meador
admin-magazine.com/HPC/Articles/… might be interesting; I've only barely skimmed it, but it makes some good points about bandwidth per core over various generations of many-core systems. Not exactly what you're asking about, though.Meador
I think you are forgetting that writes (unless using non-temporal/write-coalescing optimizations) include an implicit read. By adding two reads you are increasing apparent bandwidth by about 11% (3 apparent accesses with four actual accesses vs. 5 apparent accesses with six actual accesses; (5/6)/(3/4) = (10/9) ≈ 1.11). This appears to explain most of the difference.Eliathan

Memory accesses in modern computers are a lot more complex than one might expect, and it is very hard to tell when the "high-level" model falls apart because of some "low-level" detail that you did not know about before....

The STREAM benchmark code only measures execution time -- everything else is derived. The derived numbers are based on both decisions about what I think is "reasonable" and assumptions about how the majority of computers work. The run rules are the product of trial and error -- attempting to balance portability with generality.

The STREAM benchmark reports "bandwidth" values for each of the kernels. These are simple calculations based on the assumption that each array element on the right hand side of each loop has to be read from memory and each array element on the left hand side of each loop has to be written to memory. Then the "bandwidth" is simply the total amount of data moved divided by the execution time.
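
As a rough sketch of that derivation (a hypothetical helper, not the exact code in stream.c, which keeps per-kernel byte counts and reports the best rate over several trials):

/* Illustrative sketch of a STREAM-style "bandwidth" derivation.
 * Only the bytes explicitly read and written by the source code are counted,
 * e.g. Triad (a[j] = b[j] + scalar*c[j]) touches 3 arrays per element. */
#include <stddef.h>

double stream_style_bandwidth_MBps(size_t n_elements, int arrays_touched,
                                   double elapsed_seconds)
{
    double bytes = (double)arrays_touched * sizeof(double) * (double)n_elements;
    return 1.0e-6 * bytes / elapsed_seconds;   /* MB/s, using 10^6 bytes per MB */
}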

There are a surprising number of assumptions involved in this simple calculation.

  • The model assumes that the compiler generates code to perform all the loads, stores, and arithmetic instructions that are implied by the memory traffic counts. The approach used in STREAM to encourage this is fairly robust, but an advanced compiler might notice that all the array elements in each array contain the same value, so only one element from each array actually needs to be processed. (This is how the validation code works.)
  • Sometimes compilers move the timer calls out of their source-code locations. This is a (subtle) violation of the language standards, but is easy to catch because it usually produces nonsensical results.
  • The model assumes a negligible number of cache hits. (With cache hits, the computed value is still a "bandwidth", it is just not the "memory bandwidth".) The STREAM Copy and Scale kernels only load one array (and store one array), so if the stores bypass the cache, the total amount of traffic going through the cache in each iteration is the size of one array. Cache addressing and indexing are sometimes very complex, and cache replacement policies may be dynamic (either pseudo-random or based on run-time utilization metrics). As a compromise between size and accuracy, I picked 4x as the minimum array size relative to the cache size to ensure that most systems have a very low fraction of cache hits (i.e., low enough to have negligible influence on the reported performance).
  • The data traffic counts in STREAM do not "give credit" to additional transfers that the hardware does, but that were not explicitly requested. This primarily refers to "write allocate" traffic -- most systems read each store target address from memory before the store can update the corresponding cache line. Many systems have the ability to skip this "write allocate", either by allocating a line in the cache without reading it (POWER) or by executing stores that bypass the cache and go straight to memory (x86). (A worked example of this accounting is shown after this list.) More notes on this are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
  • Multicore processors with more than 2 DRAM channels are typically unable to reach asymptotic bandwidth using only a single core. The OpenMP directives that were originally provided for large shared-memory systems now must be enabled on nearly every processor with more than 2 DRAM channels if you want to reach asymptotic bandwidth levels.
  • Single-core bandwidth is still important, but is typically limited by the number of cache misses that a single core can generate, and not by the peak DRAM bandwidth of the system. The issues are presented in http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
  • For the single-core case, the number of outstanding L1 Data Cache misses is much too small to get full bandwidth -- for your Xeon Scalable processor about 140 concurrent cache misses are required for each socket, but a single core can only support 10-12 L1 Data Cache misses. The L2 hardware prefetchers can generate additional memory concurrency (up to ~24 cache misses per core, if I recall correctly), but reaching average values near the upper end of this range requires simultaneous accesses to multiple 4KiB pages. Your additional array reads give the L2 hardware prefetchers more opportunity to generate (close to) the maximum number of concurrent memory accesses. An increase of 11%-12% is completely reasonable.
  • Increasing the fraction of reads is also expected to increase the performance when using all the cores. In this case the benefit comes primarily from reducing the number of "read-write turnaround stalls" on the DDR4 DRAM interface. With no stores at all, sustained bandwidth should reach 90% of peak on this processor (using 16 or more cores per socket).
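
As a worked example of the "no credit for write allocate" point above, consider the Copy kernel with ordinary (non-streaming) stores:

    Copy: c[j] = a[j]
        counted by STREAM    : 16 bytes per element (read a[j], write c[j])
        moved by the hardware: 24 bytes per element (read a[j], read c[j] for the
                               write allocate, write back c[j])

so the reported number is roughly 2/3 of the DRAM traffic actually generated, unless the compiler emits cache-bypassing stores.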

Additional notes on avoiding "write allocate" traffic:

  1. In x86 architectures, cache-bypassing stores typically invalidate the corresponding address from the local caches and hold the data in a "write-combining buffer" until the processor decides to push the data to memory. Other processors are allowed to keep and use "stale" copies of the cache line during this period. When the write-combining buffer is flushed, the cache line is sent to the memory controller in a transaction that is very similar to an IO DMA write. The memory controller has the responsibility of issuing "global" invalidations on the address before updating memory. Care must be taken when these streaming stores are used to update memory that is shared across cores. The general model is to execute the streaming stores, execute a store fence, then execute an "ordinary" store to a "flag" variable. The store fence will ensure that no other processor can see the updated "flag" variable until the results of all of the streaming stores are globally visible. (With a sequence of "ordinary" stores, results always become visible in program order, so no store fence is required.) A minimal code sketch of this pattern is shown after this list.
  2. In the PowerPC/POWER architecture, the DCBZ (or DCLZ) instruction can be used to avoid write allocate traffic. If the line is in cache, its contents are set to zero. If the line is not in cache, a line is allocated in the cache with its contents set to zero. One downside of this approach is that the cache line size is exposed here. DCBZ on a PowerPC with 32-Byte cache lines will clear 32 Bytes. The same instruction on a processor with 128-Byte cache lines will clear 128 Bytes. This was irritating to a vendor who used both. I don't remember enough of the details of the POWER memory ordering model to comment on how/when the coherence transactions become visible with this instruction.
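
A minimal sketch of the x86 pattern from item 1 above (illustrative only; it assumes dst is 16-byte aligned, n is even, and SSE2 is available):

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical producer: fill 'dst' with cache-bypassing (non-temporal) stores,
 * then publish the result to other cores via an ordinary store to 'flag'. */
void publish_with_streaming_stores(double *dst, const double *src,
                                   size_t n, volatile int *flag)
{
    for (size_t j = 0; j < n; j += 2) {
        __m128d v = _mm_loadu_pd(&src[j]);
        _mm_stream_pd(&dst[j], v);  /* NT store: no write-allocate read of dst */
    }
    _mm_sfence();   /* make all the NT stores globally visible ...            */
    *flag = 1;      /* ... before any other core can observe the flag update  */
}
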
Selfstyled answered 12/5, 2019 at 21:14 Comment(7)
Cool, I didn't know you were on Stack Overflow. Consider changing your user-name so people know it's you. :) And BTW, even some dual-channel desktop/laptop CPUs don't fully saturate memory bandwidth with a single core when running glibc memcpy or memset for example. They come much closer than a single core on a big Xeon, depending on ratio of core clock speed vs. memory clock, but especially with fast DDR4 I think Skylake can bottleneck on the limited memory-parallelism one core can keep in flight with its limited line-fill buffers and/or L2 superqueue buffers.Meador
In addition to your talk, it's been discussed on Stack Overflow: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? and the Latency Bound Platforms section on Travis Downs' (@BeeOnRope's) answer on Enhanced REP MOVSB for memcpyMeador
"Many systems have the ability to skip this "write allocate", either by allocating a line in the cache without reading it". Any document about this feature? If the memory read is skipped, how does the processor make sure that the unmodified data in the same cache line is kept intact? ThanksKropp
@yeeha: see my answer: on x86 that's done with NT stores, which are not coherent. They only get to skip the read/modify/write step if you do a full-line write; that's why they're also called "streaming stores", because that's their use case. See Enhanced REP MOVSB for memcpy that I already linked for more about this feature. (That answer has further links.)Meador
@Peter Cordes -- minor nit: on x86 non-temporal stores are "coherent" in most (but perhaps not all) aspects. Non-temporal stores follow a different ordering model -- they can become visible later than expected. These are sometimes referred to as "weakly-ordered" stores or "non-globally-ordered" stores. The only aspect that could be called "non-coherent" is that (like an IO DMA write), when a write-combining buffer is flushed, an invalidation command is sent to all caches. This will invalidate even lines that are dirty, without causing a writeback of the dirty data.Selfstyled
@JohnDMcCalpin: Oh neat, I didn't know NT stores could avoid having other cores write-back the line they invalidate. I agree "non-coherent" doesn't exactly describe them. I was thinking that way because the data sits in an LFB where nothing can snoop it (except the current core if you do reload), but regular stores similarly sit in the store buffer before they commit into a Modified line in L1d. Intel does warn about using NT stores on lines containing a lock or other target of atomic RMW, but I'm not sure exactly why (their phrasing sounded like correctness, not just performance.)Meador
@PeterCordes Upon further investigation of the guts of the Intel coherence protocol, I found some evidence that suggests that SKX/CLX processors will write back M state lines before they are overwritten by DMA writes (or streaming stores), but I have not tried to test this yet. The WB may be required to update the cache tags/snoop filters/memory directories/etc properly. It should be rare in practice, so not a performance issue. The silent overwriting of M-state lines by DMA writes was a feature of at least one of the processors I have worked on.... ;-)Selfstyled

The key point here, as pointed out in Dr. Bandwidth's answer, is that STREAM only counts the useful bandwidth seen by the source code. (He's the author of the benchmark.)

In practice, the write stream will incur read bandwidth costs as well, for the RFO (Read For Ownership) requests. When a CPU wants to write 16 bytes (for example) to a cache line, it first has to load the original cache line and then modify it in L1d cache.

(Unless your compiler auto-vectorized with NT stores that bypass cache and avoid that RFO. Some compilers will do that for loops they expect to write an array too large for cache before any of it is re-read.)

See Enhanced REP MOVSB for memcpy for more about cache-bypassing stores that avoid an RFO.


So increasing the number of read streams vs. write streams brings the software-observed bandwidth closer to the actual hardware bandwidth. (Also, a workload that mixes reads and writes is not handled perfectly efficiently by the DRAM interface, because of read-write turnaround overhead.)
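
Putting numbers on the question's modification: the original kernels are counted as 3 accesses per element while the hardware performs 4 (the extra one being the write-allocate read of the store target), and the modified kernels are counted as 5 while the hardware performs 6. The ratio (5/6)/(3/4) ≈ 1.11, which is in line with the ~11.5% increase reported in the question.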

Meador answered 12/5, 2019 at 22:40 Comment(1)
I should have made my comment an answer, sigh.Eliathan

The purpose of the STREAM benchmark is not to measure the peak memory bandwidth (i.e., the maximum memory bandwidth that can be achieved on the system), but to measure the "memory bandwidth" of a number of kernels (COPY, SCALE, SUM, and TRIAD) that are important to the HPC community. So when the bandwidth reported by STREAM is higher, it means that HPC applications will probably run faster on the system.

It's also important to understand the meaning of the term "memory bandwidth" in the context of the STREAM benchmark, which is explained in the last section of the documentation. As mentioned in that section, there are at least three ways to count the number of bytes for a benchmark. The STREAM benchmark uses the STREAM method, which counts the number of bytes read and written at the source code level. For example, in the SUM kernel (a(i) = b(i) + c(i)), two elements are read and one element is written. Therefore, assuming that all accesses are to memory, the number of bytes accessed from memory per iteration is equal to the number of arrays multiplied by the size of an element (which is 8 bytes). STREAM calculates bandwidth by multiplying the total number of elements accessed (counted using the STREAM method) by the element size and dividing that by the execution time of the kernel. To take run-to-run variations into account, each kernel is run multiple times and the arithmetic average, minimum, and maximum bandwidths are reported.
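
For example, with STREAM_ARRAY_SIZE = 50,000,000 as in the question, the SUM kernel is counted as 3 × 8 B × 50,000,000 = 1.2 GB of traffic per pass, and the reported bandwidth is that figure divided by the measured execution time of the pass.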

As you can see, the bandwidth reported by STREAM is not the real memory bandwidth (at the hardware level), so it doesn't even make sense to say that it is the peak bandwidth. In addition, it's almost always much lower than the peak bandwidth. For example, this article shows how ECC and 2MB pages impact the bandwidth reported by STREAM. Writing a benchmark that actually achieves the maximum possible memory bandwidth (at the hardware level) on modern Intel processors is a major challenge and may be a good problem for a whole Ph.D. thesis. In practice, though, the peak bandwidth is less important than the STREAM bandwidth in the HPC domain. (Related: See my answer for information on the issues involved in measuring the memory bandwidth at the hardware level.)

Regarding your first question, notice that STREAM just assumes that all reads and writes are satisfied by the main memory and not by any cache. Allocating an array that is much larger than the size of the LLC helps in making it more likely that this is the case. Essentially, complex and undocumented aspects of the LLC including the replacement policy and the placement policy need to be defeated. It doesn't have to be exactly 4x larger than the LLC. My understanding is that this is what Dr. Bandwidth found to work in practice.

Fink answered 11/5, 2019 at 17:38 Comment(0)
