How to debug an aarch64 translation fault?

I am writing a simple kernel in armv8 (aarch64).

MMU config:

  • 48 VA bits (T1SZ=64-48=16)
  • 4K page size
  • All physical RAM is flat-mapped into kernel virtual memory through TTBR1_EL1. The MMU is active with TTBR0_EL1=0, so I only use addresses of the form 0xffff<addr>, all flat-mapped onto physical memory.
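
The configuration above can be sanity-checked with a little arithmetic. This is only a minimal sketch: the helper names `t1sz_for` and `ttbr1_region_base` are made up for illustration, and it assumes nothing more than that the TTBR1 range occupies the top 2^va_bits bytes of the 64-bit address space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: TCR_EL1.T1SZ value for a given number of VA bits. */
static inline unsigned t1sz_for(unsigned va_bits)
{
    assert(va_bits > 0 && va_bits < 64);
    return 64u - va_bits;
}

/* Hypothetical helper: lowest virtual address translated through TTBR1_EL1. */
static inline uint64_t ttbr1_region_base(unsigned va_bits)
{
    assert(va_bits > 0 && va_bits < 64);
    return ~0ULL << va_bits;
}
```

With 48 VA bits, `t1sz_for(48)` is 16 and `ttbr1_region_base(48)` is 0xffff000000000000, so an offset of 1<<40 into that range is the virtual address 0xffff010000000000.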

I'm mapping a new virtual region (starting at offset 1<<40 within the TTBR1 range) to some free physical region. When I try to access the address at offset 1<<40, I get an exception (of type "EL1 using SP1, synchronous"):

ESR_EL1=0x96000044
FAR_EL1=0xffff010000000000

Inspecting other registers, I have:

TTBR1_EL1=0x82000000
TTBR1_EL1[2]=0x0000000082003003

So, based on ARM Architecture Reference Manual for ARMv8 (ARMv8-A profile):

  • ESR_EL1 (exception syndrome register) decodes as: Exception Class = 0b100101 (data abort without a change in exception level), pages D7-1933 sq.; WnR = 1 (the faulting instruction is a write); DFSC = 0b000100 (translation fault at level 0), page D7-1958.
  • FAR_EL1 is the faulting address. Its high bits are all 1, so TTBR1_EL1 is used. Bits [47:39] of the VA are 0b000000010, so entry 2 of the level 0 table is used.
  • Entry 2 of the table is a next-level table descriptor (low bits 0b11) pointing at physical address 0x82003000.
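
The manual decoding in the list above can be reproduced with a few bit-field helpers. This is a minimal sketch, assuming the 4K granule and 48-bit VA described in the question; the function names are illustrative and not taken from any existing kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction for ESR_EL1 on a data abort (names are illustrative). */
static inline unsigned esr_ec(uint64_t esr)   { return (unsigned)((esr >> 26) & 0x3f); }
static inline unsigned esr_wnr(uint64_t esr)  { return (unsigned)((esr >> 6) & 0x1); }
static inline unsigned esr_dfsc(uint64_t esr) { return (unsigned)(esr & 0x3f); }

/* DFSC 0b0001LL encodes a translation fault at level LL.
 * Returns the level, or -1 for any other DFSC value. */
static inline int dfsc_translation_fault_level(unsigned dfsc)
{
    return ((dfsc & 0x3c) == 0x04) ? (int)(dfsc & 0x3) : -1;
}

/* Table index for a 4K granule and 48-bit VA: 9 bits per level,
 * level 0 uses VA bits [47:39]. */
static inline unsigned va_index(uint64_t va, unsigned level)
{
    assert(level <= 3);
    return (unsigned)((va >> (39u - 9u * level)) & 0x1ff);
}

/* At levels 0-2, low bits 0b11 mean "valid table descriptor". */
static inline int desc_is_table(uint64_t desc) { return (desc & 0x3) == 0x3; }

/* Next-level table address: bits [47:12] of the descriptor. */
static inline uint64_t desc_table_addr(uint64_t desc)
{
    return desc & 0x0000fffffffff000ULL;
}
```

Fed with the values from the question, `esr_ec(0x96000044)` gives 0x25, `va_index(0xffff010000000000, 0)` gives 2, and `desc_table_addr(0x82003003)` gives 0x82003000, which agrees with the manual reading.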

So, translation fails at level 0, where it should not.

My question is: am I doing something wrong? Am I missing some information that could explain the translation fault? And, more generally, how should one debug a translation fault?
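
One generic way to approach the "how to debug" part is to redo the table walk in software and compare the level at which it stops with the level reported in DFSC. The following is a sketch under stated assumptions, not a definitive implementation: it assumes a 4K granule with 48-bit addresses, and `read_phys_fn`, `walk`, `fake_mem` and `fake_read` are hypothetical names introduced here. In a real kernel the accessor would read the descriptor through the flat mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_VALID     (1ULL << 0)
#define DESC_TABLE     (1ULL << 1)            /* levels 0-2: table, level 3: page */
#define DESC_ADDR_MASK 0x0000fffffffff000ULL  /* address bits [47:12] */

/* Hypothetical accessor: returns the descriptor stored at physical address pa. */
typedef uint64_t (*read_phys_fn)(uint64_t pa, void *ctx);

struct walk_result {
    int      fault_level; /* -1 if the walk succeeded, otherwise the failing level */
    uint64_t desc_pa;     /* physical address of the last descriptor read */
    uint64_t desc;        /* value of that descriptor */
    uint64_t pa;          /* translated address, valid only if fault_level == -1 */
};

/* Walk a 4-level table (4K granule, 48-bit VA) step by step. */
static struct walk_result walk(uint64_t ttbr, uint64_t va, read_phys_fn rd, void *ctx)
{
    struct walk_result r = { -1, 0, 0, 0 };
    uint64_t table = ttbr & DESC_ADDR_MASK;

    for (int level = 0; level <= 3; level++) {
        unsigned shift = 39u - 9u * (unsigned)level;
        uint64_t span_mask = (1ULL << shift) - 1;

        r.desc_pa = table + 8 * ((va >> shift) & 0x1ff);
        r.desc = rd(r.desc_pa, ctx);

        int valid = (r.desc & DESC_VALID) != 0;
        int table_bit = (r.desc & DESC_TABLE) != 0;
        /* Invalid entry, block at level 0, or reserved encoding at level 3. */
        if (!valid || (!table_bit && (level == 0 || level == 3))) {
            r.fault_level = level;
            return r;
        }
        if (level == 3 || !table_bit) {   /* page or block: translation done */
            r.pa = (r.desc & DESC_ADDR_MASK & ~span_mask) | (va & span_mask);
            return r;
        }
        table = r.desc & DESC_ADDR_MASK;  /* descend to the next level */
    }
    return r; /* not reached */
}

/* Tiny sparse "physical memory" for trying the walker; other addresses read as 0. */
struct fake_mem { uint64_t pa[8]; uint64_t val[8]; size_t n; };

static uint64_t fake_read(uint64_t pa, void *ctx)
{
    const struct fake_mem *m = ctx;
    for (size_t i = 0; i < m->n; i++)
        if (m->pa[i] == pa)
            return m->val[i];
    return 0;
}
```

If such a walk, run by the CPU, gets past a level at which the hardware reports a fault, the tables as seen by the CPU are probably fine and the disagreement lies between the CPU's view and the table walker's view (TLB or cache state).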

Update:
Everything works when I write to the tables before enabling the MMU.
Whenever I write to the tables AFTER enabling the MMU (via the flat-mapped table region), the mapping never works. I wonder why this happens.

I also tried writing to the selected tables manually (to rule out any side effect of my mapping function), with the same result: when the writes are done before the MMU is on, it works; after, it fails.

I tried doing tlbi and dsb sy instructions, followed by isb, without effect. Only one CPU is running at this time so caching should not be a problem - write instructions and MMU talk to the same caches (but I will test it next).

Stratify answered 18/8, 2017 at 20:18 Comment(0)

I overlooked caching issues within a single core. The problem was that, after turning the MMU on, the CPU and the table walk unit didn't have the same view of memory. The ARMv8 Cortex-A Programming Guide states that the cache has to be cleaned/invalidated to the point of unification (same view for a single core) after modifying the tables.
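
As an illustration of that maintenance, here is a hedged sketch of what publishing a single updated descriptor could look like at EL1. The names `tlbi_va_operand` and `publish_pte` are hypothetical, the privileged part only compiles for AArch64 and must run at EL1, and the choice of `dc cvau` simply follows the "point of unification" wording above. If the table walker is configured as non-cacheable in TCR_EL1, cleaning to the point of coherency (`dc cvac`) would be the more conservative choice.

```c
#include <assert.h>
#include <stdint.h>

/* Operand for TLBI VAAE1IS: VA[55:12] placed in bits [43:0]. */
static inline uint64_t tlbi_va_operand(uint64_t va)
{
    return (va >> 12) & ((1ULL << 44) - 1);
}

#if defined(__aarch64__)
/*
 * Hypothetical helper, EL1 only. Assumes desc is the kernel virtual address
 * of the descriptor that was just written and va is the address it maps.
 */
static inline void publish_pte(const volatile uint64_t *desc, uint64_t va)
{
    /* Clean the line holding the descriptor to the point of unification. */
    __asm__ volatile("dc cvau, %0" :: "r"(desc) : "memory");
    /* Wait until the clean has completed. */
    __asm__ volatile("dsb ish" ::: "memory");
    /* Drop any cached translation for va, all ASIDs, inner shareable. */
    __asm__ volatile("tlbi vaae1is, %0" :: "r"(tlbi_va_operand(va)) : "memory");
    /* Wait for the invalidate, then resynchronize the instruction stream. */
    __asm__ volatile("dsb ish" ::: "memory");
    __asm__ volatile("isb" ::: "memory");
}
#endif
```

A caller would write the descriptor first and then call `publish_pte(&table[index], va)` before touching `va`.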

Two possibilities can explain this behavior (I don't fully understand how caches work yet):

  1. First possibility: the MMU does not have the required address in its internal walk cache.
    In this case, when updating regular data and making it available to other cores' L1, the dsb instruction simply waits for all cores to reach a synchronized state (thanks to the coherency network): other cores will know that the line has to be updated, and when they try to access it, it gets written back to L2 or migrated from the previous core's L1 to their L1.
    This does not happen with the MMU (no coherency participation), so it still sees the old value in L2.
    However, if this were the case, the same thing should happen before the MMU is turned on (because caching is activated way before), except if all memory is considered L1-non-cacheable before MMU is activated (which is possible, I'll have to double check that).
    A minimal way of fixing the problem may be to change caching policies for table pages, but the cache maintenance is still necessary to clear possible old values from the MMU.
  2. Second possibility: in all cases tested, the MMU already has the faulting address in its internal walk cache, which is not coherent with data L1 or L2.
    In that case, only an explicit invalidate can eject the old line from the MMU cache. Before the MMU is turned on, the cache contains nothing and never gets the old value (0), only the new one.
    I still think that case is unlikely because I tested many cases, and sometimes the offset between previously mapped memory (for example, entry 0 in the level 1 table) and newly mapped memory (for example, entry 128 in the same level 1 table) was greater than the cache line size (in this case, 1024 bytes, which is more than any cache line size).

So, I'm still not sure what exactly causes the problem, but cleaning/invalidating all the updated addresses works.

Stratify answered 19/8, 2017 at 10:16 Comment(4)
TLBs have to be invalidated before enabling the MMU. When the TLB holds previously resolved addresses, the MMU uses them for translation straight away rather than doing a walk (i.e. reading the new values from the newly updated tables). I doubt that it is necessary to invalidate data caches, since walking is basically a memory read. And reading those tables in fact places them in the cache like any other data when they are not already present. Rooted
Hi @maxbc, is there some Linux kernel primitive that I can invoke in the driver's kernel module? I'm hitting a similar translation fault (level 1), and I was curious to get more details beyond what was in your answer! Overleap
Sorry, I have absolutely no knowledge of how Linux handles this! I was just writing a kernel from scratch. Stratify
I am having similar issues. I'm trying to map Flash into the VA (PA: 0) but always get hit with a Data Abort for Synchronous External abort, not on translation table walk. Oddly, all my other translations work and I use the same function to edit the tables. Interestingly, if I tried to deliberately access an unmapped address I was again hit with Data Abort, but this time for Translation Fault level 2, which led me here. I considered it may be caching, but I've explicitly marked all memory (including the page tables) as non-cacheable ... and all other translations are working.Carmancarmarthen
