It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register.
But I've never seen any examples. No x86 CPUs are like this, and I think all high-performance CPUs can directly modify any byte in a cache-line, too. Are some microcontrollers or low-end CPUs different, if they have cache at all?
(I'm not counting word-addressable machines, or Alpha which is byte addressable but lacks byte load/store instructions. I'm talking about the narrowest store instruction the ISA natively supports.)
In my research while answering Can modern x86 hardware not store a single byte to memory?, I found that the reasons Alpha AXP omitted byte stores presumed they'd be implemented as true byte stores into cache, not an RMW update of the containing word. (So it would have made ECC protection for L1d cache more expensive, because it would need byte granularity instead of 32-bit).
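(Rough back-of-the-envelope numbers for that, assuming standard SEC-DED codes: a 64-bit ECC granule needs 8 check bits and a 32-bit granule needs 7 (~22% overhead), but an 8-bit granule needs 5, i.e. 20 check bits per 32-bit word (~63% overhead). So byte-granular ECC roughly triples the check-bit storage; the exact cost depends on the code chosen, so treat these as illustrative.)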
I'm assuming that word-RMW during commit to L1d cache wasn't considered as an implementation option for other more-recent ISAs that do implement byte stores.
All modern architectures (other than early Alpha) can do true byte loads/stores to uncacheable MMIO regions (not RMW cycles), which is necessary for writing device drivers for devices that have adjacent byte I/O registers. (e.g. with external enable/disable signals to specify which parts of a wider bus hold the real data, like the 2-bit TSIZ (transfer size) on this ColdFire CPU/microcontroller, or like PCI / PCIe single byte transfers, or like DDR SDRAM control signals that mask selected bytes.)
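As a concrete illustration of why drivers care, here's a hedged sketch in C of the usual pattern for byte-wide device registers; the device, base address, and register layout are invented. If the hardware turned these accesses into word-wide RMW bus cycles, writing the data register would also touch the adjacent control/status registers.

#include <stdint.h>

/* Hypothetical memory-mapped device with adjacent byte-wide registers.
 * The address and layout are made up for illustration. */
#define DEV_BASE    ((volatile uint8_t *)(uintptr_t)0x40001000u)
#define DEV_DATA    (DEV_BASE[0])   /* write: transmit one byte          */
#define DEV_CONTROL (DEV_BASE[1])   /* enable/config bits                */
#define DEV_STATUS  (DEV_BASE[2])   /* read: status flags, e.g. TX-ready */

void dev_send(uint8_t b) {
    while (!(DEV_STATUS & 0x01))   /* must be a true byte read  */
        ;                          /* spin until TX-ready       */
    DEV_DATA = b;                  /* must be a true byte store: a word-wide
                                      RMW on the bus would also touch
                                      CONTROL and STATUS */
}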
Maybe doing an RMW cycle in cache for byte stores would be something to consider for a microcontroller design, even though it's not for a high-end superscalar pipelined design aimed at SMP servers / workstations like Alpha?
I think this claim might come from word-addressable machines. Or from unaligned 32-bit stores requiring multiple accesses on many CPUs, and people incorrectly generalizing from that to byte stores.
Just to be clear, I expect that a byte store loop to the same address would run at the same cycles per iteration as a word store loop. So for filling an array, 32-bit stores can be up to 4x faster than 8-bit stores. (Maybe less if 32-bit stores saturate memory bandwidth but 8-bit stores don't.) But unless byte stores have an extra penalty, you won't get more than a 4x speed difference (or whatever the word width is).
And I'm talking about asm. A good compiler will auto-vectorize a byte or int store loop in C and use wider stores or whatever is optimal on the target ISA, if they're contiguous.
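For example (just a hedged illustration; the exact output depends on compiler and options), gcc or clang at -O2/-O3 will typically recognize a trivial fill loop like this and turn it into wide vector stores or a memset call, so it tells you nothing about byte-store cost:

// Plain byte-fill loop: without volatile, compilers are free to (and
// usually do) rewrite this with SIMD stores or a call to memset.
void fill_bytes(unsigned char *arr, unsigned char val, int n) {
    for (int i = 0; i < n; i++)
        arr[i] = val;
}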
(And store coalescing in the store buffer could also result in wider commits to L1d cache for contiguous byte-store instructions, so that's another thing to watch out for when microbenchmarking.)
For example, a loop like this that hammers byte stores into the same dword:
; x86-64 NASM syntax
mov rdi, rsp
; RDI holds a 32-bit aligned address
mov ecx, 1000000000
.loop: ; do {
mov byte [rdi], al
mov byte [rdi+2], dl ; store two bytes in the same dword
; no pointer increment, this is the same 32-bit dword every time
dec ecx
jnz .loop ; }while(--ecx != 0);
mov eax,60
xor edi,edi
syscall ; x86-64 Linux sys_exit(0)
Or a loop over an 8kiB array like this, storing 1 byte or 1 word out of every 8 bytes (the 8kiB figure assumes a C implementation with sizeof(unsigned int)=4 and CHAR_BIT=8, but it should compile to comparable functions on any C implementation, just with a different footprint if sizeof(unsigned int) isn't 4). ASM on Godbolt for a few different ISAs, with either no unrolling, or the same amount of unrolling for both versions.
// volatile defeats auto-vectorization
void byte_stores(volatile unsigned char *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i< 1024 ; i++)   // loop over 1024 * 2*sizeof(int) = 8k chars
            arr[i*2*sizeof(unsigned) + 1] = 123;    // touch one byte of every 2 words
}

// volatile to defeat auto-vectorization: x86 could use AVX2 vpmaskmovd
void word_stores(volatile unsigned int *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i< 1024 ; i++)   // same number of stores, covering the same 8kiB
            arr[i*2 + 0] = 123;       // touch every other int
}
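(In case anyone actually wants to time these: a minimal sketch of a harness, assuming POSIX clock_gettime and a plain malloc'ed 8kiB buffer. It does one warm-up pass of each function, per the caveats below, and you'd still want to check the generated asm before trusting the numbers.)

#define _POSIX_C_SOURCE 199309L   // for clock_gettime
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void byte_stores(volatile unsigned char *arr);   // the functions above
void word_stores(volatile unsigned int *arr);

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    unsigned char *buf = malloc(8192);    // 8kiB; malloc alignment is plenty for 4-byte words
    byte_stores(buf);                     // warm-up pass: page faults, TLB and cache misses
    double t0 = seconds();
    byte_stores(buf);
    double t1 = seconds();
    word_stores((unsigned int *)buf);     // warm-up for the word version
    double t2 = seconds();
    word_stores((unsigned int *)buf);
    double t3 = seconds();
    printf("byte_stores: %.3f ms   word_stores: %.3f ms\n",
           (t1 - t0) * 1e3, (t3 - t2) * 1e3);
    free(buf);
    return 0;
}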
Adjusting sizes as necessary, I'd be really curious if anyone can point to a system where word_stores() is faster than byte_stores().
(If actually benchmarking, beware of warm-up effects like dynamic clock speed, and the first pass triggering TLB misses and cache misses.)
Or, if C compilers for ancient platforms don't exist or generate sub-optimal code that doesn't bottleneck on store throughput, any hand-crafted asm that would show an effect is welcome.
Any other way of demonstrating a slowdown for byte stores is fine, I don't insist on strided loops over arrays or spamming writes within one word.
I'd also be fine with detailed documentation about CPU internals, or CPU cycle timing numbers for different instructions. I'm leery of optimization advice or guides that could be based on this claim without having tested, though.
- Any still-relevant CPU or microcontroller where cached byte stores have an extra penalty?
- Any still-relevant CPU or microcontroller where un-cacheable byte stores have an extra penalty?
- Any not-still-relevant historical CPUs (with or without write-back or write-through caches) where either of the above is true? What's the most recent example?
e.g. is this the case on any ARM Cortex-A or Cortex-M? Any older ARM microarchitecture? Any MIPS microcontroller or early MIPS server/workstation CPU? Any other random RISC like PA-RISC, or CISC like VAX or 486? (The CDC 6600 was word-addressable, so it doesn't count.)
Or construct a test-case involving loads as well as stores, e.g. showing word-RMW from byte stores competing with load throughput.
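(A hedged sketch of that idea in C, with made-up function names: interleave a stream of independent loads with either byte stores or the same number of word stores to a different cache line. If byte stores secretly read-modify-write the containing word in L1d, they compete with the loads for cache read bandwidth, so the byte version should lose by more than any pure-store loop shows. volatile stops the compiler from coalescing or reordering the accesses, but verify the asm.)

// loads[] and stores[] should point into different cache lines.
unsigned loads_plus_byte_stores(volatile unsigned *loads,
                                volatile unsigned char *stores, int iters) {
    unsigned sum = 0;
    for (int i = 0; i < iters; i++) {
        sum += loads[i & 15];               // load stream from one line
        stores[i & 15] = (unsigned char)i;  // byte-store stream to another line
    }
    return sum;
}

unsigned loads_plus_word_stores(volatile unsigned *loads,
                                volatile unsigned *stores, int iters) {
    unsigned sum = 0;
    for (int i = 0; i < iters; i++) {
        sum += loads[i & 15];               // same load stream
        stores[i & 15] = i;                 // same number of stores, word-sized
    }
    return sum;
}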
(I'm not interested in showing that store-forwarding from byte stores to word loads is slower than word->word, because it's normal that store-forwarding only works efficiently when a load is fully contained in the most recent store to touch any of the relevant bytes. But something that showed byte->byte forwarding being less efficient than word->word SF would be interesting, maybe with bytes that don't start at a word boundary.)
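(If someone wants to poke at that, here's a hedged sketch: a store/reload dependency chain through one memory location, so each iteration's reload has to store-forward from the previous store and the loop time is dominated by SF latency. Compare the byte chain against the word chain; to test the misaligned-byte case, pass a pointer that isn't at a word boundary. The function names are made up, and as usual check the asm.)

// Each reload depends on the store just before it, so the loop measures
// store-forwarding latency rather than throughput.
unsigned char byte_sf_chain(volatile unsigned char *p, int iters) {
    unsigned char x = 1;
    for (int i = 0; i < iters; i++) {
        *p = x;         // byte store
        x = *p + 1;     // byte reload forwarded from that store
    }
    return x;
}

unsigned word_sf_chain(volatile unsigned *p, int iters) {
    unsigned x = 1;
    for (int i = 0; i < iters; i++) {
        *p = x;         // word store
        x = *p + 1;     // word reload forwarded from that store
    }
    return x;
}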
(I didn't mention byte loads because that's generally easy: access a full word from cache or RAM and then extract the byte you want. That implementation detail is indistinguishable other than for MMIO, where CPUs definitely don't read the containing word.)
On a load/store architecture like MIPS, working with byte data just means you use lb or lbu to load and zero- or sign-extend it, then store it back with sb. (If you need truncation to 8 bits between steps in registers, then you might need an extra instruction, so local vars should usually be register sized. Unless you want the compiler to auto-vectorize with SIMD with 8-bit elements, then often uint8_t locals are good...) But anyway, if you do it right and your compiler is good, it shouldn't cost any extra instructions to have byte arrays.
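(For example, a simple byte-array update like this should come out as just a byte load, an add, and a byte store on a load/store RISC, with no separate truncation instruction, because the byte store only writes the low 8 bits of the register anyway. I haven't pasted compiler output, so treat the expected instruction sequence as something to verify on Godbolt for your target.)

#include <stdint.h>

// Expecting roughly lbu / addiu / sb on MIPS (plus address math), and the
// equivalent on other load/store ISAs -- no explicit truncation needed.
void increment_byte(uint8_t *arr, int i) {
    arr[i] = arr[i] + 1;
}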
I notice that gcc has sizeof(uint_fast8_t) == 1 on ARM, AArch64, x86, and MIPS. But IDK how much stock we can put in that. The x86-64 System V ABI defines uint_fast32_t as a 64-bit type on x86-64. If they're going to do that (instead of 32-bit, which is x86-64's default operand-size), uint_fast8_t should also be a 64-bit type. Maybe that's to avoid zero-extension when it's used as an array index after being passed as a function arg in a register? (Zero-extension would be free anyway if you had to load it from memory.)
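(If anyone wants to check their own compiler/ABI, something small like this prints them:)

#include <stdint.h>
#include <stdio.h>

int main(void) {
    printf("uint_fast8_t:  %zu\n", sizeof(uint_fast8_t));
    printf("uint_fast16_t: %zu\n", sizeof(uint_fast16_t));
    printf("uint_fast32_t: %zu\n", sizeof(uint_fast32_t));
    return 0;
}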