Do multi-core CPUs share the MMU and page tables?

7

24

On a single-core computer, only one thread executes at a time. On each context switch the scheduler checks whether the thread being scheduled in belongs to the same process as the previous one. If so, nothing needs to be done regarding the MMU (page table). Otherwise, the MMU must be updated to point at the new process's page table.

I am wondering how this works on a multi-core computer. I guess there is a dedicated MMU on each core, and if two threads of the same process run simultaneously on two cores, each core's MMU simply refers to the same page table. Is this true? Can you point me to good references on the subject?
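To make my mental model concrete, here is a toy sketch in Python (all names are made up; on real hardware the "base pointer" is a register such as ARM's TTBR or x86's CR3):

```python
# Toy model: each core has its own MMU state (a pointer to a page table),
# but threads of one process share a single page table object.
class PageTable:
    def __init__(self):
        self.entries = {}           # virtual page number -> physical frame

class CoreMMU:
    def __init__(self):
        self.ttbr = None            # "translation table base register"

    def context_switch(self, page_table):
        # Only reload the base pointer if the new thread belongs to
        # a different process (i.e. uses a different page table).
        if self.ttbr is not page_table:
            self.ttbr = page_table  # real hardware may also flush the TLB

process_a = PageTable()
core0, core1 = CoreMMU(), CoreMMU()

# Two threads of the same process scheduled on two cores:
core0.context_switch(process_a)
core1.context_switch(process_a)
assert core0.ttbr is core1.ttbr     # both MMUs reference the same table
```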

Utrillo answered 29/3, 2012 at 16:46 Comment(4)
@Gray: Depends what sort of programming you're doing! (Ah, I see your top tag is "java" :^)Enteric
Related: cache sharing: #945466 || simultaneous RAM access: programmers.stackexchange.com/questions/183686/…Antiseptic
Shouldn't you say "...If so nothing needs to be done..." instead of "...If not, nothing needs to be done..." ?Dactylic
@Gray : What makes you think so ? This is totally a programming question!Shellbark
21

Take a look at this diagram. It is a high-level view of everything in a single core of a Core i7 CPU. The picture is taken from Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron. You can access the diagrams here, section 9.21.

Computer Systems: A Programmer's Perspective, 2/E (CS:APP2e)Randal E. Bryant and David R. O'Hallaron, Carnegie Mellon University

Resemble answered 10/9, 2012 at 8:14 Comment(1)
In case the hyper-threading (HT) technology is used, does each logical core have its own MMU?Fasto
6

TL;DR - There is a separate MMU per CPU, but an MMU generally has several LEVELS of page tables, and these may be shared.

For instance, on an ARM the top level (the PGD, or page global directory, as it is named in Linux) covers 1MB of address space per entry. In simple systems, you can map 1MB sections directly. Normally, however, an entry points to a 2nd-level table (of PTEs, or page table entries).

One way to implement multi-CPU support efficiently is to have a separate top-level PGD per CPU. The OS code and data will be consistent between cores. Each core has its own TLB and L1 cache; the L2/L3 caches may or may not be shared. Maintenance of the data/code caches depends on whether they are VIVT or VIPT, but that is a side issue and shouldn't affect the use of the MMU with multiple cores.

The process (user) portion of the 2nd-level page tables remains the same per process; otherwise threads would see different memory, or you would need to synchronize redundant tables. Individual cores may have different sets of 2nd-level page tables (via different top-level page table pointers) when they run different processes. If a process is multi-threaded and running on two CPUs, then each CPU's top-level table may contain the same 2nd-level page table entries for that process. In fact, the entire top-level page table may be identical (though in different memory) when two CPUs run the same process. If thread-local data were implemented with the MMU, a single entry could differ. However, thread-local data is usually implemented in other ways because of TLB and cache issues (flushing/coherency).

The image below may help. The CPU, PGD, and PTE entries in the diagram are sort of like pointers.

Multi-cpu MMU

The dashed line is the only difference between running different processes and running the same process (the multi-threaded case) with the MMU; it is an alternative to the solid line running from the CPU2 PGD to process B's PTE or 2nd-level page table. The kernel itself is always, in effect, a multi-threaded application shared by all CPUs.

When a virtual address is translated, different bit fields of the address are used as indexes into each table. If a virtual address misses in the TLB, the CPU must do a table walk (fetching from each level of the table in memory). So a single read of process memory can result in three memory accesses (two table lookups plus the data itself) when the TLB misses.
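The bit-slicing can be sketched in a few lines (a toy two-level walk with a 12/8/12 split, roughly like ARM's 4096-entry top level over 1MB sections with 4KB pages; the dicts stand in for tables in memory, so each dict lookup models one memory access):

```python
PGD_SHIFT, PTE_SHIFT = 20, 12       # 1 MB per PGD entry, 4 KB pages
PTE_MASK = (1 << (PGD_SHIFT - PTE_SHIFT)) - 1   # 8 index bits
OFFSET_MASK = (1 << PTE_SHIFT) - 1              # 12 offset bits

def walk(pgd, vaddr):
    """Two-level table walk; each table lookup is one memory access."""
    pte_table = pgd[vaddr >> PGD_SHIFT]                 # access 1: PGD
    frame = pte_table[(vaddr >> PTE_SHIFT) & PTE_MASK]  # access 2: PTE
    return frame | (vaddr & OFFSET_MASK)                # access 3 reads data

# Map virtual 0x00123456: PGD index 0x001, PTE index 0x23, offset 0x456
pgd = {0x001: {0x23: 0x9A000}}
assert walk(pgd, 0x00123456) == 0x9A456
```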

The access permissions of the kernel code/data are obviously different. In fact, there will probably be other issues, such as device memory, etc. However, I think the diagram should make it obvious how the MMU manages to keep multi-threaded memory the same.

It is entirely possible for an entry in the 2nd-level table to differ per thread. However, this would incur a cost when switching threads on the same CPU, so normally the data for all 'thread locals' is mapped, and some other way to select the data is used. Normally, thread-local data is found via a pointer or index register (a special per-CPU register) that points to data inside the 'process' or user memory. 'Thread-local data' is not isolated from other threads, so a memory overwrite in one thread can clobber another thread's data.
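A toy illustration of the 'index register' scheme and its lack of isolation (all names and sizes here are made up; a real CPU would keep the per-thread base in a dedicated register such as ARM's TPIDRURO or the x86 fs/gs base):

```python
# 'Thread-local' storage carved out of one shared process address space,
# selected by a per-thread index rather than by separate MMU mappings.
process_memory = bytearray(64)    # one address space shared by all threads
TLS_SLOT = 16                     # bytes reserved per thread (made-up size)

def tls_addr(thread_id, offset):
    # Models the per-thread base register: base = thread_id * TLS_SLOT.
    return thread_id * TLS_SLOT + offset

process_memory[tls_addr(0, 0)] = 0xAA   # thread 0 writes its own slot
process_memory[tls_addr(1, 0)] = 0xBB   # thread 1 writes its own slot

# No isolation: thread 0 can reach past its slot into thread 1's data.
process_memory[tls_addr(0, TLS_SLOT)] = 0xCC
assert process_memory[tls_addr(1, 0)] == 0xCC
```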

Mitis answered 20/10, 2015 at 14:18 Comment(5)
Different CPUs have different MMU structures with different 'levels'. The concept of at least one level being different for processes and the same for threads will be universal.Mitis
In many CPUs, a limited number of 'process ids' (or PIDs) can be tagged in the MMU table (domains on an ARM MMU). So a PTE entry may have an id. The id register is changed on a task switch to enable/disable access. Here the TLB doesn't need flushing (nor the cache). Many RTOSes might use this mechanism, but processes are usually limited to 64-2k in number. Another scheme is for the 'PID' to form part of the address (also limited in number). Linux uses both, with the above as a fall-back when the limit is breached.Mitis
Does ARM PTE really have "process" tag? There is ASID, but only in TLB and only as optimization (pages.cs.wisc.edu/~remzi/OSTEP/vm-tlbs.pdf#page=9 "To reduce this overhead .. address space identifier (ASID) field in the TLB"). Different processes have separate memory mapping tables (trees), and on context switch some register is set (this is actually register of MMU, but changed as special CPU regiser ... infocenter.arm.com/help/topic/com.arm.doc.ddi0500g/… TTBR? lxr.free-electrons.com/source/arch/arm64/kernel/…) to the root of new tree.Stonechat
@Stonechat Do you refer to the diagram? The "Process A" and "Process B" blocks are PTE entries (physical memory). These are pointed at differently by the main PGD directory (which is rooted by TTBR). My intent was to show the sharing of 'kernel' page entries by the top level dual CPU TTBRs. Is it really that confusing?Mitis
@Stonechat If you meant the 'PID' or process ID comment above, that was for ARMv5 (architecture) and was obsoleted by ARMv6 and better, which use/prefer the ASID mechanics. 'Another scheme' was referring to the ASID. The OPs question was about multi-CPU and an MMU and was not specific about the CPU type. Ie, x86, PowerPC, etc... So, I was trying to gloss over any gory details.Mitis
2

Sorry for the previous answer; I deleted it.

The TI PandaBoard runs on the OMAP4430, a dual-core Cortex-A9 processor. It has one MMU per core, i.e. two MMUs for its two cores.

http://forums.arm.com/index.php?/topic/15240-omap4430-panda-board-armcortex-a9-mp-core-mmu/

The above thread provides the info.

In addition, here is some more information on ARMv7:

Each core has the following features:

  1. ARM v7 CPU at 600 MHz
  2. 32 KB of L1 instruction CACHE with parity check
  3. 32 KB of L1 data CACHE with parity check
  4. Embedded FPU for single and double data precision scalar floating-point operations
  5. Memory management unit (MMU)
  6. ARM, Thumb2 and Thumb2-EE instruction set support
  7. TrustZone® security extension
  8. Program Trace Macrocell and CoreSight™ components for software debug
  9. JTAG interface
  10. AMBA® 3 AXI 64-bit interface
  11. 32-bit timer with 8-bit prescaler
  12. Internal watchdog (working also as timer)

The dual core configuration is completed by a common set of components:

  1. Snoop control unit (SCU) to manage inter-core communication, cache-to-cache and system memory transfers, and cache coherency
  2. Generic interrupt control (GIC) unit configured to support 128 independent interrupt sources with software configurable priority and routing between the two cores
  3. 64-bit global timer with 8-bit prescaler
  4. Asynchronous accelerator coherency port (ACP)
  5. Parity support to detect internal memory failures during runtime
  6. 512 KB of unified 8-way set associative L2 cache with support for parity check and ECC
  7. L2 Cache controller based on PL310 IP released by ARM
  8. Dual 64-bit AMBA 3 AXI interface with possible filtering on the second one to use a single port for DDR memory access

Though all of this is ARM-specific, it should give a general idea.

Snood answered 29/3, 2012 at 16:46 Comment(0)
1

Answers here so far seem to be unaware of the existence of the Translation Lookaside Buffer (TLB), the MMU's cache of recent translations from the virtual addresses used by a process to physical memory addresses.

Note that these days the TLB itself is a complicated beast with multiple levels of caching. Just as with a CPU's regular RAM caches (L1-L3), you wouldn't necessarily expect its state at any given instant to contain information exclusively about the currently running process; rather, entries are moved in piecemeal on demand. See the Context switch section of the Wikipedia page.

On SMP, all processors' TLBs need to keep a consistent view of the system page tables. See e.g. this section of the Linux kernel book for one way of handling it.
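One way to picture the consistency problem is a toy model like the one below (real kernels invalidate remote TLB entries with an inter-processor interrupt, commonly called a TLB shootdown; the names here are made up):

```python
# Toy model of TLB shootdown: each core caches translations privately,
# so unmapping a page must invalidate the entry on every core.
page_table = {0x1000: 0x9000}         # shared VPN -> frame mapping
tlbs = [dict(), dict()]               # one private TLB per core

def load(core, vpn):
    tlb = tlbs[core]
    if vpn not in tlb:                # TLB miss: walk the shared table
        tlb[vpn] = page_table[vpn]
    return tlb[vpn]

load(0, 0x1000)                       # both cores now cache the entry
load(1, 0x1000)

def unmap(vpn):
    del page_table[vpn]
    for tlb in tlbs:                  # the 'shootdown' step: every core
        tlb.pop(vpn, None)            # must drop its stale copy

unmap(0x1000)
assert all(0x1000 not in tlb for tlb in tlbs)
```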

Enteric answered 3/4, 2012 at 20:46 Comment(4)
Thanks for the answer. Does this mean, in a way, that there is one single MMU but "a kind of several TLBs" having a notion of processes? Is that right?Utrillo
But wait ;-) ... The TLB is not the only thing used to translate virtual addresses. What about addresses that are not in this cache? They have to be resolved through the page table provided by the OS.Utrillo
Well every core will need some sort of MMU, and the MMU will have access to some sort of TLB or TLB hierarchy. And there will be some way of ensuring consistency between multiple CPUs which may or may not include them sharing MMUs (see Tudor's comment on your question). If you're looking for the thing which there's only one of, it's probably the OS' page table en.wikipedia.org/wiki/Page_table which the MMUs and TLBs then realize in HW.Enteric
For your second question (sorry, we're overlapping responses here), see en.wikipedia.org/wiki/… Yes TLB misses can be expensive, and it's easy to contrive tests where TLB "churn" will impact performance #2876877Enteric
0

AFAIK there is a single MMU per physical processor, at least in SMP systems, so all cores share a single MMU.

In NUMA systems each core has a separate MMU, because each core has its own private memory.

Credible answered 29/3, 2012 at 16:58 Comment(4)
Thanks for the answer, but as a consequence, how does memory translation happen when two threads from different processes (thus different address spaces) are running simultaneously?Utrillo
@Manuel Selva: I'm sorry, I really don't possess sufficient knowledge to answer the question. I know what you mean, but I really have no idea how this mechanism is implemented.Credible
no problem ;-) But do you confirm your answer that there is only one MMU per physical processor? Do you have a link on that?Utrillo
@Manuel Selva: There is an article here: zone.ni.com/devzone/cda/tut/p/id/6097, talking about parallel hardware. It first shows a diagram of a multiprocessor with an MMU for each physical CPU and then further down a diagram for a multi-core with a single MMU for both cores, since they are on a single chip.Credible
0

In ARMv8, the translation table base registers have a CnP (Common not Private) bit to support a shared TLB within the Inner Shareable domain.

Villiform answered 7/6, 2020 at 7:48 Comment(0)
-1

On the question of MMUs per processor: there may be several. The assumption is that each MMU adds memory bandwidth. If DDR3-12800 memory allows 1600 mega-transfers per second on a processor with one MMU, then one with four would theoretically allow 6400. Actually delivering that bandwidth to the cores is probably quite a feat; the advertised bandwidth gets whittled away considerably in the process.

The number of MMUs on a processor is independent of the number of cores on it. The obvious examples are the 16 core CPUs from AMD, they definitely don't have 16 MMUs. A dual-core processor, on the other hand, might have two MMUs. Or just one. Or three?

Edit

Maybe I'm confusing MMUs with channels?

Uela answered 15/5, 2012 at 15:53 Comment(1)
"Maybe I'm confusing MMUs with channels?" - Yes, you did. MMU is for actual virtual-to-physical address translation (en.wikipedia.org/wiki/Memory_management_unit "hardware unit having all memory references passed through itself, primarily performing the translation of virtual memory addresses to physical addresses. It is usually implemented as part of the central processing unit (CPU)"), and the channels are part of Memory Controller (en.wikipedia.org/wiki/Memory_controller) which physically implements CPU or chipset side of one or more channels.Stonechat

© 2022 - 2024 — McMap. All rights reserved.