Are there any processors with instructions to bypass the cache?

Are there any processors with instructions to bypass the cache for specific data? This question also has an answer suggesting that SSE4.2 instructions bypass the cache. Can somebody enlighten me on that?

Gabbert answered 13/6, 2013 at 17:36 Comment(4)
I am curious: what practical application is there for bypassing the cache? All that comes to mind is improving the predictability of instruction timing. Are there others? – Tridimensional
Two cases I commonly encounter are sharing time-sensitive data between threads running on different cores, and writing to memory-mapped registers to interface with other hardware (such as a UART IC). – Chervonets
@wallyk: The typical purpose is to prevent cache pollution (e.g. if you're writing a lot of data and don't expect to read any of it "soon", and don't want the data you will need to get pushed out of the cache). – Chidester
Yes, many processors provide instructions to bypass the cache. See my detailed survey paper on cache bypassing techniques for CPUs, GPUs and CPU-GPU systems. It also discusses the benefits, challenges and trade-offs of bypassing. – Nephology

In general, the caching policy is controlled by the Memory Management Unit (MMU): a caching policy is decided for each address range. The tables that hold these policies are managed by the OS and are accessible only in system space. As a sidebar answer to a question you may have intended to ask: on architectures that have a cache, there are usually CPU instructions for synchronizing/invalidating/flushing the cache. Like the MMU tables, however, these instructions are generally available only in system space.
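
Not from the answer, but as a hedged illustration of the same point from the Linux side: user code can't edit the page-attribute tables directly, but it can ask the OS for an uncached mapping. For example, MMIO mapped through /dev/mem opened with O_SYNC is usually given an uncached mapping (details vary by architecture and kernel); the physical address below is made up.

    /* Hedged sketch: request an uncached mapping of a (made-up) MMIO region.
       The OS sets the page attributes; user space only asks for them. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0)
            return 1;

        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0x10000000); /* hypothetical phys addr */
        if (regs == MAP_FAILED)
            return 1;

        regs[0] = 0x1;        /* store goes straight to the device, not the cache */
        (void)regs[1];        /* uncached read */

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }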

Fatuity answered 13/6, 2013 at 18:29 Comment(1)
Slight clarification: on x86, the clflush instruction (which invalidates a single cache line) is not privileged. And, as the original poster mentioned, the movnt SSE instructions allow cache-bypassing stores – see #37570 for details. – Evilminded
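
A minimal user-space sketch of the two things this comment mentions, using the SSE2 intrinsics from <emmintrin.h> (the function names here are just illustrative):

    #include <emmintrin.h>   /* _mm_stream_si32, _mm_clflush, _mm_sfence */

    /* movnti: a store that bypasses the cache entirely. */
    void publish(int *slot, int value)
    {
        _mm_stream_si32(slot, value);
        _mm_sfence();        /* order the NT store before later stores */
    }

    /* clflush: evict one cache line; not a privileged instruction on x86. */
    void evict(const void *line)
    {
        _mm_clflush(line);
    }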

Are there any processors with instructions to bypass the cache for specific data?

The SuperH family (or at least the SuperH-2) has both implicit and explicit bypassing of its cache memory. This is done by using different areas of the memory address space, rather than through special instructions.
By setting the top 3 bits of an address to 001, you access a cache-through mirror of the same address with the top 3 bits cleared. And some areas (like memory-mapped I/O registers) are never cached.
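
As a rough sketch of that address-map trick (assuming the SH-2-style layout described above; exact regions differ between SuperH parts), getting an uncached view of a cached address is just pointer arithmetic:

    #include <stdint.h>

    /* Assumed SH-2 layout: the region at 0x20000000 mirrors the cached
       region at 0x00000000 but is "cache-through", i.e. bypasses the cache. */
    #define UNCACHED_BASE 0x20000000u   /* top 3 bits = 001 */
    #define ADDR_MASK     0x1FFFFFFFu   /* clears the top 3 bits */

    static volatile uint32_t *uncached_alias(volatile uint32_t *p)
    {
        uintptr_t a = (uintptr_t)p;
        return (volatile uint32_t *)((a & ADDR_MASK) | UNCACHED_BASE);
    }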

Song answered 13/6, 2013 at 17:44 Comment(0)

The Altera Nios II architecture has two dedicated instructions, ldio and stio, for loads/stores that bypass the cache. They're used for memory-mapped I/O.

http://www.csun.edu/~glaw/ee525/Lecture03Nios.pdf

Nios II is a soft processor generally used on Altera's FPGAs; although it can also be licensed for hard ASIC devices, I don't know of any commercial CPUs based on this architecture.
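
If I remember the Nios II HAL correctly, you normally don't write ldio/stio by hand: the IORD/IOWR macros from the HAL's <io.h> compile down to these cache-bypassing instructions. The peripheral base address and register offset below are hypothetical.

    #include <io.h>        /* Nios II HAL: IOWR_32DIRECT / IORD_32DIRECT */

    #define PERIPH_BASE  0x10001000   /* hypothetical peripheral base */
    #define REG_TXDATA   0x04         /* hypothetical register offset */

    static void send_byte(unsigned char c)
    {
        /* Emits an stio-family store, so the write goes straight to the
           bus and never touches the data cache. */
        IOWR_32DIRECT(PERIPH_BASE, REG_TXDATA, c);
    }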

Electroencephalogram answered 19/9, 2013 at 15:9 Comment(0)

The SSE cache-bypassing store instructions exist to avoid polluting the cache when writing to a region that won't be touched again soon, e.g. so you don't evict data that will be used again.

Also, x86 implementations normally read in a whole cache line when a write to any part of that line occurs (a read-for-ownership). If the previous contents of the line are unneeded (e.g. the destination of memcpy or memset), that's a waste of memory bandwidth. I found some old discussion of this write-back (default) vs. write-combining (movntq / movntdq) effect for implementing memcpy. Be careful with this if something else will read the output of the memcpy right away.
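
A minimal sketch (not from that discussion) of the write-combining-store idea, assuming SSE2, a 16-byte-aligned destination and a length that's a multiple of 16:

    #include <emmintrin.h>   /* SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a buffer with non-temporal stores: no read-for-ownership of the
       destination lines, and nothing useful gets evicted from the cache. */
    void fill_nt(void *dst, uint32_t value, size_t bytes)
    {
        __m128i v = _mm_set1_epi32((int)value);
        __m128i *p = (__m128i *)dst;

        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(p + i, v);   /* movntdq: write-combining store */

        _mm_sfence();   /* make the WC stores globally visible before returning */
    }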

Streaming loads (movntdqa) only help when reading from USWC memory regions, where normal loads (and thus a normal memcpy) perform horribly. Streaming loads from normal (WB, write-back) memory are currently not special and behave like regular movdqa loads (i.e. the NT hint is ignored). Intel's optimization manual says you can use prefetchnta for pollution-reducing loads.
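
A hedged sketch of that prefetchnta suggestion: read a large buffer once while hinting that it shouldn't displace other data in the cache hierarchy. The prefetch distance of 256 bytes is an arbitrary choice.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
    #include <stddef.h>
    #include <stdint.h>

    uint64_t sum_once(const uint8_t *buf, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 256 < n)
                _mm_prefetch((const char *)&buf[i + 256], _MM_HINT_NTA);
            sum += buf[i];
        }
        return sum;
    }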


I don't know if it's possible to write into the cache (rather than bypassing it with movnt) without triggering a read. Possibly AVX-512 will solve this problem for memcpy, because a 512-bit ZMM register is 64 bytes, i.e. a full cache line. A 64-byte-aligned store from a ZMM register to memory that wasn't already cached could be implemented without reading RAM first, while still making the store visible right away to other CPU cores in the system.

(AVX-512 is going to be in Skylake Xeon (not other Skylake CPUs), and also in Knights Landing, the massively parallel high-throughput Xeon Phi compute accelerator.)

Eelworm answered 30/4, 2015 at 22:36 Comment(0)

Depending on your definition of "specific data", yes. Processors generally have cache control registers/tables that define which regions of memory can be cached and which must not be. Generally, code running in user space cannot access those tables.

Vauntcourier answered 13/6, 2013 at 17:38 Comment(0)
