Address canonical form and pointer arithmetic
Asked Answered
D

1

6

On AMD64 compliant architectures, addresses need to be in canonical form before being dereferenced.

From the Intel manual, section 3.3.7.1:

In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros.

Now, the most significat implemented bit on current operating systems and architectures is the 47th bit. This leaves us with a 48-bit address space.

Especially when ASLR is enabled, user programs can expect to receive an address with the 47th bit set.

If optimizations such as pointer tagging are used and the upper bits are used to store information, the program must make sure the 48th to 63th bits are set back to whatever the 47th bit was before dereferencing the address.

But consider this code:

int main()
{
    int* intArray = new int[100];

    int* it = intArray;

    // Fill the array with any value.
    for (int i = 0; i < 100; i++)
    {
        *it = 20;
        it++;   
    }

    delete [] intArray;
    return 0;
}

Now consider that intArray is, say:

0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1100

After setting it to intArray and increasing it once, and considering sizeof(int) == 4, it will become:

0000 0000 0000 0000 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

The 47th bit is in bold. What happens here is that the second pointer retrieved by pointer arithmetic is invalid because not in canonical form. The correct address should be:

1111 1111 1111 1111 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

How do programs deal with this? Is there a guarantee by the OS that you will never be allocated memory whose address range does not vary by the 47th bit?

Descent answered 16/8, 2016 at 14:35 Comment(3)
Please do not post your homework problems directly without even the LEAST effort directed at disguising that fact, which would be obvious anyway. (FYI, one typically doesn't post questions for experts and start off said questions with definitions of what's canonical)Thong
a) anybody who uses pointer-tagging deserves all the pain they get. b) usually negative addresses (as in interpreted as a negative two's complement number) are reserved for the operating system and do not concern the user.Veradis
@Bruce This question comes from a discussion I had with another member of my team in a software company. I added information about canonical addressing because many are not familiar with the concept. I really don't see how you could be so sure that this was homework but you could have asked before directly accusing me of "not putting effort into disguising that fact". I wish I had this kind of homework.Descent
W
9

The canonical address rules mean there is a giant hole in the 64-bit virtual address space. 2^47-1 is not contiguous with the next valid address above it, so a single mmap won't include any of the unusable range of 64-bit addresses.

+----------+
| 2^64-1   |   0xffffffffffffffff
| ...      |
| 2^64-2^47|   0xffff800000000000
+----------+
|          |
| unusable |      not to scale: this part is 2^16 times as large
|          |
+----------+
| 2^47-1   |   0x00007fffffffffff
| ...      |
| 0        |   0x0000000000000000
+----------+

Also most kernels reserve the high half of the canonical range for their own use. e.g. x86-64 Linux's memory map. User-space can only allocate in the contiguous low range anyway so the existence of the gap is irrelevant.

Is there a guarantee by the OS that you will never be allocated memory whose address range does not vary by the 47th bit?

Not exactly. The 48-bit address space supported by current hardware is an implementation detail. The canonical-address rules ensure that future systems can support more virtual address bits without breaking backwards compatibility to any significant degree.

At most, you'd just need a compat flag to have the OS not give the process any memory regions with high bits not all the same. (Like Linux's current MAP_32BIT flag for mmap, or a process-wide setting). That could support programs that used the high bits for tags and manually redid sign-extension.

Future hardware won't need to support any kind of flag to ignore high address bits or not, because junk in the high bits is currently an error. Intel 5-level paging adds another 9 virtual address bits, widening the canonical high andd low halves. white paper.

See also Why in 64bit the virtual address are 4 bits short (48bit long) compared with the physical address (52 bit long)?


Fun fact: Linux defaults to mapping the stack at the top of the lower range of valid addresses. (Related: Why does Linux favor 0x7f mappings?)

$ gdb /bin/ls
...
(gdb) b _start
Function "_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (_start) pending.
(gdb) r
Starting program: /bin/ls

Breakpoint 1, 0x00007ffff7dd9cd0 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) p $rsp
$1 = (void *) 0x7fffffffd850
(gdb) exit

$ calc
2^47-1
              0x7fffffffffff

(Modern GDB can use starti to break before the first user-space instruction executes instead of messing around with breakpoint commands.)

Wileywilfong answered 16/8, 2016 at 19:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.