Are there any processors which have instructions to bypass the cache for specific data? This question also has an answer suggesting that SSE4.2 instructions bypass the cache. Can somebody enlighten me on that?
In general, the caching policy is controlled by the Memory Management Unit (MMU). A caching policy is decided upon for each address range; these tables are managed by the OS and live in system space. As a sidebar answer to a question you may have intended to ask: on architectures that have a cache, there are usually CPU instructions for synchronizing/invalidating/flushing the cache. However, much like the MMU tables, these instructions are also generally available only in system space.
The clflush instruction (to invalidate a single cache line) is not privileged. And, as the original poster mentioned, the movnt SSE instructions allow cache-bypassing stores - see #37570 for details. – Evilminded
The SuperH family (or at least the SuperH-2) has both implicit and explicit bypassing of its cache memory. This is done by using different areas of the memory address space, rather than through special instructions.
By setting the top 3 bits of an address to 001, you would access a cache-through mirror of the same address with the top 3 bits cleared. And some areas (like memory-mapped I/O registers) are never cached.
The Altera Nios II architecture has two specific instructions, ldio and stio, for loads/stores that bypass the cache. They're used for memory-mapped I/O.
http://www.csun.edu/~glaw/ee525/Lecture03Nios.pdf
Nios II is a soft processor generally used on Altera's FPGA boards. Although it can also be licensed for hard ASIC devices, I don't know of any commercial CPUs based on this architecture.
The SSE cache-bypass store instructions exist to avoid polluting the cache when writing to a region that won't be touched again soon, e.g. so you don't evict data that will be used again.
Also, x86 implementations normally read in a whole cache line when a write to any part of that line occurs. If the previous contents of the cache line are unneeded, this is a waste of memory bandwidth (e.g. the dest arg of memcpy or memset). I found some old discussion of this write-back (default) vs. write-combining (movntq / movntdq) effect for implementing memcpy. Be careful of using this if something else will read the output of memcpy right away.
Streaming loads only help when reading from USWC memory regions, where normal memcpy performs horribly. Streaming loads from normal (WB, write-back) memory are currently not special, and work like regular movdqa loads (i.e. the NT hint is ignored). Intel's optimization manual says you can use prefetchnta for pollution-reducing loads.
IDK if it's possible to write into cache (rather than bypassing with movnt) without triggering a read. Possibly AVX-512 will solve this problem for memcpy, because a 512-bit ZMM register is 64 bytes, i.e. a full cache line. A 64-byte-aligned store from a ZMM register to memory that wasn't already cached could be implemented in a way that doesn't read the RAM first, while still making the store visible right away to other CPU cores in the system.
(AVX-512 is going to be in Skylake Xeon (not the other Skylake CPUs), and also in Knights Landing, the massively-parallel high-throughput Xeon Phi compute accelerator.)
Depending on your definition of specific data, yes. Processors generally have cache-control registers/tables which define which regions of memory can be cached and which must not be. Generally, code running in user space cannot access those tables.