TL;DR
I think it's the linear address.
Keep reading for the test methodology and the test code.
It's not the effective address (aka the offset)
To test this it suffices to use a segment with a base that is not aligned.
In my test, I've used a 32-bit data segment with a base of 1.
The test is a "simple" legacy (i.e. non-UEFI) bootloader that will create said descriptor and test accessing the offsets 0x7000 and 0x7003 with DWORD width.
The former will generate an #AC, the latter won't.
This demonstrates that it's not the offset alone that is checked, because 0x7000 is an aligned offset that still faults with a base of 1.
This is expected.
I have a tradition of using a minimal output for the tests, so an explanation is mandatory.
First, six blue As are written in six consecutive rows in the VGA buffer.
Then, before executing each load, a pointer is set to the corresponding A.
The #AC handler will increment the pointed-to byte.
So, if a row contains a B, the access generated an #AC.
The first four rows are used for:
- Access using a segment with base 0 and offset 0x7000. As expected, no #AC.
- Access using a segment with base 0 and offset 0x7001. As expected, #AC.
- Access using a segment with base 1 and offset 0x7000. This does generate an #AC, thereby demonstrating that it's either the linear or the physical address that's checked.
- Access using a segment with base 1 and offset 0x7003. This doesn't generate an #AC, confirming point 3.
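The four outcomes can be reproduced with a small model. The sketch below is in Python rather than assembly purely for brevity, and the helper names (`linear_address`, `marker_for`) are made up for this illustration. The offsets are the ones used in the listing further down. It assumes the check is simply "is the checked address a multiple of the access width".

```python
DWORD = 4  # width in bytes of every load in the test

def linear_address(base: int, offset: int) -> int:
    """Segment base + offset, wrapped to 32 bits (no paging involved)."""
    return (base + offset) & 0xFFFFFFFF

def marker_for(address: int, width: int = DWORD) -> str:
    """'B' if an alignment check on `address` would fail, 'A' otherwise."""
    return "A" if address % width == 0 else "B"

# (segment base, offset, marker seen on screen)
ROWS = [
    (0, 0x7000, "A"),
    (0, 0x7001, "B"),
    (1, 0x7000, "B"),
    (1, 0x7003, "A"),
]

for base, offset, observed in ROWS:
    by_offset = marker_for(offset)
    by_linear = marker_for(linear_address(base, offset))
    print(f"base={base} offset={offset:#x} observed={observed} "
          f"offset-only={by_offset} base+offset={by_linear}")
```

Only the `base+offset` column agrees with all four observed markers; the `offset-only` column gets rows 3 and 4 wrong.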
The next two rows are used to check the linear address vs the physical address.
It's not the physical address: #AC instead of #PF
The #AC check only covers alignments up to 16 bytes, but a linear and a physical address share the same alignment up to at least 4KiB.
We would need a memory access that requires a data structure aligned on, at least, 8KiB to test if it's the physical or the linear address that's used for the check.
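To make the "same alignment up to 4KiB" point concrete, here is a rough Python model of a 4KiB page translation. `translate` and `same_alignment` are hypothetical helpers, not part of the test; the mapping used (linear page 9000h to physical 0b8000h) is the one the test code sets up for the VGA buffer.

```python
PAGE_SIZE = 4096

def translate(linear: int, frame: int) -> int:
    """4KiB paging keeps the low 12 bits and replaces the rest with the frame."""
    return (frame & ~(PAGE_SIZE - 1)) | (linear & (PAGE_SIZE - 1))

def same_alignment(linear: int, physical: int, width: int) -> bool:
    """True if both addresses have the same remainder modulo `width`."""
    return linear % width == physical % width

FRAME = 0xB8000
widths = [1 << n for n in range(1, 14)]  # 2, 4, ..., 8192

for linear in (0x9000, 0x9001, 0x9ABC):
    physical = translate(linear, FRAME)
    agree = [w for w in widths if same_alignment(linear, physical, w)]
    print(f"{linear:#x} -> {physical:#x}: same alignment up to {max(agree)} bytes")
```

With this particular mapping the two addresses stop agreeing at 8KiB, because bit 12 of the linear address is set while bit 12 of the frame is clear. An identity-mapped page would of course agree at every width, so a test would also have to pick its mapping carefully.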
Unfortunately, there is no such access (yet).
I thought I could still gather some insight by checking what exception is generated when a misaligned load targets an unmapped page.
If a #PF is generated, the CPU translates the linear address first and checks the alignment afterwards. Conversely, if an #AC is generated, the CPU checks before translating (remember that the page is not mapped).
I modified the test to enable paging, map the minimum number of pages, and handle a #PF by incrementing the byte under the pointer by two.
When a load is executed, the corresponding A will either become a B if an #AC is generated or a C if a #PF is generated.
Note that both are faults (eip on the stack points to the offending instruction) but both handlers resume from the next instruction (so each load is executed only once).
These are the meanings of the last two rows:
- Access to an unmapped page using a segment with base 1 and offset 0x8003. This generates a #PF as expected (the access is aligned so the only exception possible here is a #PF).
- Access to an unmapped page using a segment with base 1 and offset 0x8000. This generates an #AC, therefore the CPU checks the alignment before attempting to translate the address.
Point 6 seems to suggest that the CPU will perform the check on the linear address since no access to the page table is done.
In point 6 both exceptions could be generated; the fact that #PF is not generated means that the CPU hasn't attempted to translate the address when the alignment check is performed. (Or that #AC logically takes precedence. But likely the hardware wouldn't do a page walk before taking the #AC exception, even if it did probe the TLB after doing the base+offset calculation.)
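The reasoning for the last two rows can be written down as two competing models. This is only a sketch in Python with invented names (`marker`, `check_before_translate`); it models nothing but which exception gets reported first, so, as noted above, it cannot tell a genuine pipeline ordering apart from a pure priority rule.

```python
DWORD = 4
NOT_PRESENT_PAGES = {0x8}  # linear page 00008xxxh is the only one modelled as unmapped

def marker(base: int, offset: int, check_before_translate: bool) -> str:
    """Marker left on screen: 'A' no exception, 'B' #AC, 'C' #PF."""
    linear = (base + offset) & 0xFFFFFFFF
    misaligned = linear % DWORD != 0
    unmapped = (linear >> 12) in NOT_PRESENT_PAGES
    if misaligned and unmapped:
        # Both exceptions are possible; the ordering decides which one is reported.
        return "B" if check_before_translate else "C"
    if misaligned:
        return "B"
    if unmapped:
        return "C"
    return "A"

# Loads through the base-1 segment and the markers seen on screen (rows 5 and 6)
OBSERVED = {(1, 0x8003): "C", (1, 0x8000): "B"}

for (base, offset), seen in OBSERVED.items():
    first = marker(base, offset, check_before_translate=True)
    second = marker(base, offset, check_before_translate=False)
    print(f"es:{offset:#x} observed={seen} check-first={first} translate-first={second}")
```

Row 5 cannot separate the two models since both predict a C. Row 6 can: only the check-first model predicts the B that was observed.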
Test code
The code is messy and more cumbersome than one may expect.
The main hindrance is #AC only working at CPL=3.
So we need to create the CPL=3 descriptors, plus a TSS segment and a TSS descriptor.
To handle the exception we need an IDT and we also need paging.
BITS 16
ORG 7c00h
;Skip the BPB (My BIOS actively overwrites it)
jmp SHORT __SKIP_BPB__
;I eyeballed the BPB size (at least the part that may be overwritten)
TIMES 40h db 0
__SKIP_BPB__:
;Set up the segments (including CS)
xor ax, ax
mov ds, ax
mov es, ax ;REP STOSD below writes through ES:DI
mov ss, ax
xor sp, sp
jmp 0:__START__
__START__:
;Clear and set the video mode (before we switch to PM)
mov ax, 03h
int 10h
;Disable the interrupts and load the GDT and IDT
cli
lgdt [GDT]
lidt [IDT]
;Enable PM
mov eax, cr0
or al, 1
mov cr0, eax
;Write a TSS segment: zero 104h DWORDs, then set only the SS0:ESP0 fields
mov di, 7000h
mov cx, 104h
xor eax, eax ;STOSD stores all of EAX, whose upper half still holds the upper half of CR0
rep stosd
mov DWORD [7004h], 7c00h ;ESP0
mov WORD [7008h], 10h ;SS0
;Set AC in EFLAGS
pushfd
or DWORD [esp], 1 << 18
popfd
;Set AM in CR0
mov eax, cr0
or eax, 1<<18
mov cr0, eax
;OK, let's go in PM for real
jmp 08h:__32__
__32__:
BITS 32
;Set the stack and DS
mov ax, 10h
mov ss, ax
mov esp, 7c00h
mov ds, ax
;Set the #AC handler
mov DWORD [IDT+8+17*8], ((AC_handler-$$+7c00h) & 0ffffh) | 00080000h
mov DWORD [IDT+8+17*8+4], 8e00h | (((AC_handler-$$+7c00h) >> 16) << 16)
;Set the #PF handler
mov DWORD [IDT+8+14*8], ((PF_handler-$$+7c00h) & 0ffffh) | 00080000h
mov DWORD [IDT+8+14*8+4], 8e00h | (((PF_handler-$$+7c00h) >> 16) << 16)
;Set the TSS
mov ax, 30h
ltr ax
;Paging is:
;7xxx -> Identity mapped (contains code and all the stacks and system structures)
;8xxx -> Not present
;9xxx -> Mapped to the VGA text buffer (0b8xxxh)
;Note that the paging structures are at 6000h and 5000h, this is OK as these are physical addresses
;Set the Page Directory at 6000h
mov eax, 6000h
mov cr3, eax
;Set the Page Directory Entry 0 (for 00000000h-003fffffh) to point to a Page Table at 5000h
mov DWORD [eax], 5007h
;Set the Page Table Entry 7 (for 00007xxxh) to identity map and Page Table Entry 8 (for 00008xxxh) to be not present
mov eax, 5000h + 7*4
mov DWORD [eax], 7007h
mov DWORD [eax+4], 8006h
;Map page 9000h to 0b8000h
mov DWORD [eax+8], 0b801fh
;Enable paging
mov eax, cr0
or eax, 80000000h
mov cr0, eax
;Change privilege (goto CPL=3)
push DWORD 23h ;SS3
push DWORD 07a00h ;ESP3
push DWORD 1bh ;CS3
push DWORD __32user__ ;EIP3
retf
__32user__:
;
;Here we are at CPL=3
;
;Set DS to segment with base 0 and ES to one with base 1
mov ax, 23h
mov ds, ax
mov ax, 2bh
mov es, ax
;Write six As in six consecutive rows (starting from the 4th)
mov ecx, 6
mov ebx, 9000h + 80*2*3 ;Points to 4th row in the VGA text framebuffer
.init_markers:
mov WORD [ebx], 0941h
add bx, 80*2
dec ecx
jnz .init_markers
;ebx points to the first A
sub ebx, 80*2 * 6
;Base 0 + Offset 0 = 0, Should not fault (marker stays A)
mov eax, DWORD [ds:7000h]
;Base 0 + Offset 1 = 1, Should fault (marker becomes B)
add bx, 80*2
mov eax, DWORD [ds:7001h]
;Base 1 + Offset 0 = 1, Should fault (marker becomes B)
add bx, 80*2
mov eax, DWORD [es:7000h]
;Base 1 + Offset 3 = 4, Should not fault (marker stays A)
add bx, 80*2
mov eax, DWORD [es:7003h]
;Base 1 + Offset 3 = 4 but page not mapped, Should #PF (marker becomes C)
add bx, 80*2
mov eax, DWORD [es:8003h]
;Base 1 + Offset 0 = 1 but page not mapped, if #PF the marker becomes C, if #AC the marker becomes B
add bx, 80*2
mov eax, DWORD [es:8000h]
;Loop forever (cannot use HLT at CPL=3)
jmp $
;#PF handler
;Increment the byte pointed to by ebx by two
PF_handler:
add esp, 04h ;Remove the error code
add DWORD [esp], 6 ;Skip the current instruction
add BYTE [ebx], 2 ;Increment
iret
;#AC handler
;Same as the #PF handler but increment by one
AC_handler:
add esp, 04h
add DWORD [esp], 6
inc BYTE [ebx]
iret
;The GDT (entry 0 is used as the content for GDTR)
GDT dw GDT.end-GDT - 1
dd GDT
dw 0
dd 0000ffffh, 00cf9a00h ;08 Code, 32, DPL 0
dd 0000ffffh, 00cf9200h ;10 Data, 32, DPL 0
dd 0000ffffh, 00cffa00h ;18 Code, 32, DPL 3
dd 0000ffffh, 00cff200h ;20 Data, 32, DPL 3
dd 0001ffffh, 00cff200h ;28 Data, 32, DPL 3, Base = 1
dd 7000ffffh, 00cf8900h ;30 TSS, 32, DPL 0, Base = 7000h
.end:
;The IDT, to save space the entries are set dynamically
IDT dw 18*8-1
dd IDT+8
dw 0
;Signature
TIMES 510-($-$$) db 0
dw 0aa55h
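For anyone who wants to double-check the descriptors in the GDT above, the following Python sketch decodes them. It is not part of the test: `decode_descriptor` and `decode_selector` are names invented here, and only the fields relevant to this experiment are extracted. The DWORD pairs are copied from the listing.

```python
def decode_descriptor(low: int, high: int) -> dict:
    """Split the two DWORDs of a segment descriptor into its main fields."""
    access = (high >> 8) & 0xFF
    return {
        "base": (low >> 16) | ((high & 0xFF) << 16) | (high & 0xFF000000),
        "limit": (low & 0xFFFF) | (high & 0x000F0000),
        "dpl": (access >> 5) & 3,
        "present": bool(access & 0x80),
    }

def decode_selector(selector: int) -> dict:
    """Split a selector into table index, table indicator and RPL."""
    return {"index": selector >> 3, "ti": (selector >> 2) & 1, "rpl": selector & 3}

# Keyed by byte offset into the GDT; slot 0 holds the GDTR image and is left out.
GDT = {
    0x08: (0x0000FFFF, 0x00CF9A00),
    0x10: (0x0000FFFF, 0x00CF9200),
    0x18: (0x0000FFFF, 0x00CFFA00),
    0x20: (0x0000FFFF, 0x00CFF200),
    0x28: (0x0001FFFF, 0x00CFF200),
    0x30: (0x7000FFFF, 0x00CF8900),
}

for selector in (0x23, 0x2B, 0x30):
    entry = decode_selector(selector)
    fields = decode_descriptor(*GDT[entry["index"] * 8])
    print(f"{selector:#x}: index={entry['index']} rpl={entry['rpl']} "
          f"base={fields['base']:#x} dpl={fields['dpl']}")
```

The selector 2bh loaded into ES decodes to index 5 with RPL 3, and the descriptor in that slot has a base of 1, which is the unaligned base the whole test relies on.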
Does it make sense to check the linear address?
I don't think it's particularly relevant.
As noted above, a linear and a physical address share the same alignment up to 4KiB.
So, for now, it doesn't matter at all.
Right now, accesses wider than 64 bytes would still need to be performed in chunks and this limit is set deep in the microarchitectures of the x86 CPUs.