How to use "GS:" in 64-bit Windows Assembly (eg, porting TLS code)
H

7

10

How can a user-space program configure "GS:" under 64-bit Windows (currently XP-64)?
(By "configure" I mean make GS:0 refer to an arbitrary 64-bit linear address.)

I am trying to port a "JIT" environment to X86-64 that was originally developed for Win32.

One unfortunate design aspect is that identical code needs to run on multiple user-space threads (eg, "fibers"). The Win32 version of the code uses the GS selector for this, and generates the proper prefix to access the local data - "mov eax,GS:[offset]" points to the correct data for the current task. The code from the Win32 version would load a value into GS, if only it had a value that would work.

So far I've been able to find that 64-bit Windows doesn't support the LDT, so the method used under Win32 won't work. However, the X86-64 instruction set includes "SWAPGS", as well as a method to load the GS base without using legacy segmentation - but that only works in kernel space.

According to X64 manuals, even if Win64 allowed access to descriptors -- which it doesn't -- there's no way to set the high 32-bits of the segment base. The only way to set these is through GS_BASE_MSR (and corresponding FS_BASE_MSR - the other segment bases are ignored in 64-bit mode). The WRMSR instruction is Ring0, so I can't use it directly.

I am hoping for a Zw* function that allows me to change "GS:" in user space, or some other dark corner of the Windows API. I believe Windows still uses FS: for its own TLS, so some mechanism must be available?


This sample code illustrates the problem. I apologize in advance for using byte code - VS won't do inline assembly for the 64-bit compile, and I was trying to keep this as one file for illustrative purposes.

The program displays "PASS" on XP-32, and doesn't on XP-x64.


#include <windows.h>
#include <string.h>
#include <stdio.h>


unsigned char GetDS32[] = 
            {0x8C,0xD8,     // mov eax, ds
             0xC3};         // ret

unsigned char SetGS32[] =
            {0x8E,0x6C,0x24,0x04,   // mov gs, ss:[esp+4] 
             0xC3 };                // ret

unsigned char UseGS32[] = 
           { 0x8B,0x44,0x24,0x04,   // mov eax, ss:[esp+4] 
             0x65,0x8B,0x00,        // mov eax, gs:[eax] 
             0xc3 };                // ret

unsigned char SetGS64[] =
            {0x8E,0xe9,             // mov gs, rcx
             0xC3 };                // ret

unsigned char UseGS64[] =       
           { 0x65,0x8B,0x01,         // mov eax, gs:[rcx]
             0xc3 };

typedef WORD(*fcnGetDS)(void);
typedef void(*fcnSetGS)(WORD);
typedef DWORD(*fcnUseGS)(LPVOID);
// ntdll exports use the stdcall (NTAPI) calling convention on x86
int (NTAPI *NtSetLdtEntries)(DWORD, DWORD, DWORD, DWORD, DWORD, DWORD);

int main( void )
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    LPVOID p = VirtualAlloc(NULL, 1024, MEM_COMMIT|MEM_TOP_DOWN,PAGE_EXECUTE_READWRITE);
    fcnGetDS GetDS = (fcnGetDS)((LPBYTE)p+16);
    fcnUseGS UseGS = (fcnUseGS)((LPBYTE)p+32);
    fcnSetGS SetGS = (fcnSetGS)((LPBYTE)p+48);
    *(DWORD *)p = 0x12345678;

    if (si.wProcessorArchitecture == PROCESSOR_ARCHITECTURE_AMD64) 
    {
        memcpy( GetDS, &GetDS32, sizeof(GetDS32));
        memcpy( UseGS, &UseGS64, sizeof(UseGS64));
        memcpy( SetGS, &SetGS64, sizeof(SetGS64));
    }
    else
    {
        memcpy( GetDS, &GetDS32, sizeof(GetDS32));
        memcpy( UseGS, &UseGS32, sizeof(UseGS32));
        memcpy( SetGS, &SetGS32, sizeof(SetGS32));
    }

    SetGS(GetDS());
    if (UseGS(p) != 0x12345678) exit(-1);

    if (si.wProcessorArchitecture == PROCESSOR_ARCHITECTURE_AMD64) 
    {
        // The gist of the question - What is the 64-bit equivalent of the following code
    }
    else
    {
        DWORD base = (DWORD)p;
        LDT_ENTRY ll;
        int ret;
        *(FARPROC*)(&NtSetLdtEntries) = GetProcAddress(LoadLibrary("ntdll.dll"), "NtSetLdtEntries");
        ll.BaseLow = base & 0xFFFF;
        ll.HighWord.Bytes.BaseMid = base >> 16;
        ll.HighWord.Bytes.BaseHi = base >> 24;
        ll.LimitLow = 400;     
        ll.HighWord.Bits.LimitHi = 0;
        ll.HighWord.Bits.Granularity = 0;
        ll.HighWord.Bits.Default_Big = 1; 
        ll.HighWord.Bits.Reserved_0 = 0;
        ll.HighWord.Bits.Sys = 0; 
        ll.HighWord.Bits.Pres = 1;
        ll.HighWord.Bits.Dpl = 3; 
        ll.HighWord.Bits.Type = 0x13; 
        ret = NtSetLdtEntries(0x80, *(DWORD*)&ll, *((DWORD*)(&ll)+1),0,0,0);
        if (ret < 0) { exit(-1);}
        SetGS(0x84);
    }
    if (UseGS(0) != 0x12345678) exit(-1);
    printf("PASS\n");
}
Hunyadi answered 25/7, 2009 at 0:32 Comment(0)
J
4

You can modify the thread context directly via the SetThreadContext API. However, you need to make sure that the thread is not running while its context is changed. Either suspend it and modify the context from another thread, or trigger a fake SEH exception and modify the thread context in the SEH handler. The OS will then apply the modified context for you and reschedule the thread.

Update:

Sample code for the second approach:

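// 32-bit MSVC only: inline __asm and CONTEXT::Eip are not available in x64 builds.
int filter(unsigned int code, struct _EXCEPTION_POINTERS *ep)
{
    ep->ContextRecord->SegGs = 23;   // selector to load into GS on resume
    ep->ContextRecord->Eip++;        // step past the one-byte int 3
    return EXCEPTION_CONTINUE_EXECUTION;
}

__try
{
    __asm int 3 // trigger fake exception
}
__except(filter(GetExceptionCode(), GetExceptionInformation()))
{
}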

The instruction in the try block basically raises a software exception. The OS then transfers control to the filter procedure which modifies the thread context, effectively telling the OS to skip the int3 instruction and to continue execution.
It's kind of a hack, but it's all documented functionality :)

Jonette answered 26/7, 2009 at 19:19 Comment(4)
I think it's simpler - "mov ax, 23h ; mov gs, ax ; " -- there are two problems: (1) how to set a linear base on descriptor 20h in 64-bit mode? and (2) using a 64-bit linear address (going through the segment register only sets the low 32 bits as far as I can tell) - Hunyadi
I see your problem now. Well, if you cannot modify the LDT (because this concept isn't used anymore), your only chance left is to modify the GDT (if that even exists in x64, since x64 is tied mostly to the flat address space model), and that one is only accessible from kernel mode. I think what you want to do is not possible from user mode (I might be wrong, though) - Jonette
Just read on Wikipedia that Windows will bugcheck if you attempt to modify the GDT on x64: en.wikipedia.org/wiki/Global_Descriptor_Table - Jonette
x64 has the alternative mechanism ("gs.base MSR"), but WRMSR is a Ring0 instruction. It would have been nice if the CONTEXT structure had GSBASE but that doesn't seem to be the case in the headers I have. - Hunyadi
C
2

Why do you need to set the GS register? Windows sets it for you, to point to TLS space.

While I haven't coded for x64, I have built a compiler that generates 32-bit x86 code that manages threads using FS. Under x64, GS replaces FS and everything else works pretty much the same. So, GS points to the thread local store. If you allocate a block of thread local variables (on Win32, we allocate 32 of 64 at offset 0), your thread has direct access to 32 storage locations to use however it wishes. You don't need to allocate working thread-specific space; Windows has done it for you.

Of course, you might want to copy what you consider your specific thread data into this space you've set aside, in whatever scheduler you've set up to run your language specific threads.

Cooney answered 28/7, 2009 at 4:12 Comment(3)
I need GS to point to the app-thread specific data - there are multiple app-threads per O/S thread, so I can't rely on the OS. - Hunyadi
With multiple app-threads per OS thread, you must be scheduling your own app threads. In my experience, there is a small amount of data that needs to be accessible fast via GS or whatever. Have the scheduler copy that data to the TLS area. Also, have the scheduler copy a pointer to "the rest" of your app-thread data to one last TLS cell. Now everything is addressable via GS: the critical stuff is in TLS, the less critical is accessible via GS and an extra load. You may not like this choice, but if you can't change GS, either you burn a GP register permanently or you do this. - Cooney
6 years after this answer was posted, somebody casts a downvote, but can't be bothered to say why. Nice. - Cooney
R
1

Why not use GetFiberData or are you trying to avoid the two extra instructions?

Rellia answered 25/7, 2009 at 0:32 Comment(0)
S
1

Haven't ever modified GS in x64 code, so I may be wrong, but shouldn't you be able to modify GS by PUSH/POP or by LGS?

Update: The Intel manuals also say that mov SegReg, Reg is permitted in 64-bit mode.

Showiness answered 26/7, 2009 at 19:13 Comment(0)
G
1

Since x86_64 has many more registers than x86, one option you may want to consider if you can't use GS is simply to dedicate one of the general purpose registers (eg, RBP) as a base pointer, and make up for the difference with the new R8-R15 registers.

Gastight answered 29/7, 2009 at 2:46 Comment(1)
While this is a plausible workaround, x86-64 didn't add a new term to the addressing - the existing code makes heavy use of scaled-index + base addressing, with the GS term as a 3rd term. I think I can use 'lea' followed by a two-register form, but I also have to find cases like "mov eax, mem", which accept a prefix but need to be completely replaced to use register-based addressing. - Hunyadi
L
1

What happens if you just move to OS threads? Is performance that bad?

You could use a single pointer-sized TLS slot to store the base of your lightweight thread's storage area. You'd just have to swap out one pointer during your context switch. Load one of the new temp registers from there whenever you need the value, and you don't have to worry about using one of the few preserved across function calls.
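
To make the single-slot idea concrete, here is a rough sketch in portable C. It is only an illustration: `AppThreadData`, `switch_app_thread` and `bump_counter` are made-up names, and C11 `_Thread_local` stands in for a real Windows TLS slot (which you would reach through TlsSetValue/TlsGetValue or a `__declspec(thread)` variable). The scheduler swaps one pointer per context switch, and generated code loads the base from the slot before addressing task data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-app-thread ("fiber") data block; the layout is illustrative. */
typedef struct AppThreadData {
    int id;
    int counter;
} AppThreadData;

/* One pointer-sized thread-local slot per OS thread.
   Assumption: C11 _Thread_local is a portable stand-in for a Windows TLS slot. */
static _Thread_local AppThreadData *current_app_thread;

/* Called by the (hypothetical) user-mode scheduler on every context switch:
   the only per-switch cost is storing one pointer. */
static void switch_app_thread(AppThreadData *next) {
    current_app_thread = next;
}

/* What generated code would do instead of "mov eax, gs:[offset]":
   load the base pointer from the slot, then address relative to it. */
static int bump_counter(void) {
    AppThreadData *base = current_app_thread;
    assert(base != NULL);   /* a task must have been switched in first */
    return ++base->counter;
}
```

After `switch_app_thread(&a)`, every `bump_counter()` call touches only `a`; switching to another block redirects the same code to that block's data without any segment register involved.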

Another supported solution would be to use the Fiber APIs to schedule your lightweight threads. You would then change the JIT to make proper calls to FlsGet/SetValue.

Sorry, it sounds like the old code is written to rely on segment prefixes for addressing and now the LDT is just not available for that sort of thing. You're going to have to fix the code generation a little bit.

> the existing code makes heavy use of scaled-index + base addressing, with the GS term as a 3rd term. I think I can use 'lea' followed by a two-register form

Sounds like a good plan.

> cases like "mov eax, mem", which accept a prefix but need completely replaced to use register-based addressing

Perhaps you could move those to address + offset addressing. The offset register could be the register holding the base of your TLS block.
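
As a rough sketch of what that code generation change could look like, the C below emits the two rewrites discussed here. It assumes, purely for illustration, that the JIT reserves R15 to hold the base of the current task's data and that the destination is EAX (so RAX is free as scratch); `emit_task_load` and `emit_task_load_abs` are hypothetical helper names, and only the low eight registers are handled as base/index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers as used in ModRM/SIB fields (low registers only). */
enum { RAX = 0, RCX = 1, RDX = 2, RBX = 3, RBP = 5, RSI = 6, RDI = 7 };

/* Store a 32-bit displacement in little-endian byte order. */
static void put_disp32(uint8_t *out, int32_t disp) {
    uint32_t u = (uint32_t)disp;
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(u >> (8 * i));
}

/* Replacement for "mov eax, gs:[base + index*scale + disp32]":
       lea rax, [base + index*scale + disp32]
       mov eax, [r15 + rax]
   Assumption: R15 holds the per-task data base. index must not be 4
   (RSP cannot be an index). Writes 12 bytes and returns that count. */
static size_t emit_task_load(uint8_t *out, unsigned base, unsigned index,
                             unsigned scale, int32_t disp) {
    unsigned ss = scale == 1 ? 0 : scale == 2 ? 1 : scale == 4 ? 2 : 3;
    assert(base < 8 && index < 8 && index != 4);
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    out[0] = 0x48;  /* REX.W */
    out[1] = 0x8D;  /* LEA r64, m */
    out[2] = 0x84;  /* ModRM: mod=10 (disp32), reg=rax, rm=100 (SIB follows) */
    out[3] = (uint8_t)((ss << 6) | (index << 3) | base);
    put_disp32(out + 4, disp);
    out[8] = 0x41;  /* REX.B: SIB base field selects r15 */
    out[9] = 0x8B;  /* MOV r32, r/m32 */
    out[10] = 0x04; /* ModRM: mod=00, reg=eax, rm=100 (SIB follows) */
    out[11] = 0x07; /* SIB: scale=1, index=rax, base=r15 */
    return 12;
}

/* Replacement for the absolute form "mov eax, gs:[disp32]":
       mov eax, [r15 + disp32]
   Writes 7 bytes and returns that count. */
static size_t emit_task_load_abs(uint8_t *out, int32_t disp) {
    out[0] = 0x41;  /* REX.B: rm field selects r15 */
    out[1] = 0x8B;  /* MOV r32, r/m32 */
    out[2] = 0x87;  /* ModRM: mod=10 (disp32), reg=eax, rm=111 (r15) */
    put_disp32(out + 3, disp);
    return 7;
}
```

For example, `emit_task_load(buf, RBX, RCX, 4, 0x10)` produces the bytes for `lea rax, [rbx+rcx*4+10h]` followed by `mov eax, [r15+rax]`. The price compared with the GS prefix is one extra instruction for the three-term case and a permanently reserved register.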

Lexie answered 31/7, 2009 at 0:11 Comment(0)
L
0

> x86-64 didn't add a new term to the addressing - the existing code makes heavy use of scaled-index + base addressing, with the GS term as a 3rd term.

I'm fairly confused by your question, but hope this assembler helps. I haven't ported it into C code yet, but will be doing so shortly:

Reading __declspec(thread) data

    mov     ecx, cs:TlsIndex ; TlsIndex is a memory location 
                             ; containing a DWORD with the value 0
    mov     rax, gs:58h
    mov     edx, 830h
    mov     rax, [rax+rcx*8]
    mov     rax, [rdx+rax]
    retn

Sorry, I don't have an example of writing data, the above is taken from some disassembled code I am reverse engineering.
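
For anyone trying to follow the pointer chain in that disassembly, here is a small portable C model of it. It does not touch GS at all: `FakeThreadBlock` is a made-up stand-in for the one field the code reads (on x64 Windows, gs:[58h] is the thread's pointer to its array of per-module TLS blocks), and `read_tls_var` / `write_tls_var` are hypothetical names. The 830h offset is the one from the listing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TLS_VAR_OFFSET = 0x830 };  /* "mov edx, 830h" in the listing */

/* Hypothetical stand-in for the thread block field read via gs:[58h]. */
typedef struct FakeThreadBlock {
    char **tls_array;   /* one pointer per module that has TLS data */
} FakeThreadBlock;

/* Mirrors the listing: pick this module's block, then load from the offset. */
static void *read_tls_var(const FakeThreadBlock *tb, uint32_t tls_index) {
    char *module_block = tb->tls_array[tls_index];  /* mov rax, [rax+rcx*8] */
    void *value;
    memcpy(&value, module_block + TLS_VAR_OFFSET, sizeof value); /* mov rax, [rdx+rax] */
    return value;
}

/* The write direction follows the same chain and stores instead of loading. */
static void write_tls_var(const FakeThreadBlock *tb, uint32_t tls_index,
                          void *value) {
    char *module_block = tb->tls_array[tls_index];
    memcpy(module_block + TLS_VAR_OFFSET, &value, sizeof value);
}
```

The point is that the variable is reached with two dependent loads (array pointer, then module block) plus a constant offset, so a per-task pointer stored in such a variable costs a few instructions to reach rather than a segment prefix.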

Update: Here is the equivalent C++ code for the above, although I didn't write it. I believe it was authored by NTAuthority and/or citizenmp.

rage::scrThread* GetActiveThread()
{
    char* moduleTls = *(char**)__readgsqword(88);

    return *reinterpret_cast<rage::scrThread**>(moduleTls + 2096);
}

And here's the same thing being written to:

void SetActiveThread(rage::scrThread* thread)
{
    char* moduleTls = *(char**)__readgsqword(88);
    *reinterpret_cast<rage::scrThread**>(moduleTls + 2096) = thread;
}
Lanchow answered 1/2, 2017 at 9:46 Comment(0)