What's the proper way of calling a Win32/64 function from LLVM?

I'm attempting to call a method from LLVM IR back to C++ code. I'm working in 64-bit Visual C++, or as LLVM describes it:

Machine CPU:      skylake
Machine info:     x86_64-pc-windows-msvc

For integer and pointer types my code works fine as-is. However, floating-point numbers seem to be handled a bit strangely.

Basically the call looks like this:

struct SomeStruct 
{
    static void Breakpoint(uint8_t* ptr) { return; } // used to set a breakpoint
    static double Set(uint8_t* ptr, double foo) { return foo * 2; }
};

and LLVM IR looks like this:

define i32 @main(i32, i8**) {
varinit:
  ; omitted here: initialize %ptr from i8**. 
  %5 = load i8*, i8** %instance0

  ; call to some method. This works - I use it to set a breakpoint
  call void @"Helper::Breakpoint"(i8* %5)

  ; this call fails:
  call void @"Helper::Set"(i8* %5, double 0xC19EC46965A6494D)
  ret i32 0
}

declare double @"SomeStruct::Callback"(i8*, double)

I figured that the problem is probably in the way the calling conventions work. So I've attempted to make some adjustments to correct for that:

// during initialization of the function
auto function = llvm::Function::Create(functionType, llvm::Function::ExternalLinkage, name, module);
function->setCallingConv(llvm::CallingConv::X86_64_Win64);
...

// during calling of the function
call->setCallingConv(llvm::CallingConv::X86_64_Win64);

Unfortunately, no matter what I try, I end up with 'invalid instruction' errors, which another user reports to be an issue with calling conventions (see "Clang producing executable with illegal instruction"). I've tried this with X86_64_Win64, Stdcall, Fastcall and no calling convention at all, all with the same result.

I've read up on https://msdn.microsoft.com/en-us/library/ms235286.aspx in an attempt to figure out what's going on. Then I looked at the assembly output that's supposed to be generated by LLVM (using the targetMachine->addPassesToEmitFile API call) and found:

    movq    (%rdx), %rsi
    movq    %rsi, %rcx
    callq   "Helper2<double>::Breakpoint"
    vmovsd  __real@c19ec46965a6494d(%rip), %xmm1
    movq    %rsi, %rcx
    callq   "Helper2<double>::Set"
    xorl    %eax, %eax
    addq    $32, %rsp
    popq    %rsi

According to MSDN, argument 2 should be in %xmm1, so that also seems correct. However, when checking in the debugger whether everything works, Visual Studio shows a lot of question marks in the disassembly (i.e. an 'illegal instruction').
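
For reference, the register assignment described on that MSDN page can be sketched as a small lookup. This is only an illustration under simplifying assumptions: win64ArgRegister is a hypothetical helper name, and it only covers plain integer/pointer and float/double parameters of a non-variadic function (no aggregates, no varargs).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of LLVM or the Windows SDK).
// In the Microsoft x64 convention the first four parameters share four
// "slots": slot i uses an integer register for integer/pointer values
// and the matching XMM register for float/double values.
std::string win64ArgRegister(std::size_t index, bool isFloatingPoint)
{
    static const char* const intRegs[] = { "rcx", "rdx", "r8", "r9" };
    static const char* const fpRegs[]  = { "xmm0", "xmm1", "xmm2", "xmm3" };

    if (index >= 4)
    {
        return "stack"; // fifth and later parameters are passed on the stack
    }
    return isFloatingPoint ? fpRegs[index] : intRegs[index];
}
```

With this, win64ArgRegister(0, false) gives rcx for the i8* argument and win64ArgRegister(1, true) gives xmm1 for the double, which matches the assembly above.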

Any feedback is appreciated.


The disassembly code:

00000144F2480007 48 B8 B6 48 B8 C8 FA 7F 00 00 mov         rax,7FFAC8B848B6h  
00000144F2480011 48 89 D1             mov         rcx,rdx  
00000144F2480014 48 89 54 24 20       mov         qword ptr [rsp+20h],rdx  
00000144F2480019 FF D0                call        rax  
00000144F248001B 48 B8 C0 48 B8 C8 FA 7F 00 00 mov         rax,7FFAC8B848C0h  
00000144F2480025 48 B9 00 00 47 F2 44 01 00 00 mov         rcx,144F2470000h  
00000144F248002F ??                   ?? ?? 
00000144F2480030 ??                   ?? ?? 
00000144F2480031 FF 08                dec         dword ptr [rax]  
00000144F2480033 10 09                adc         byte ptr [rcx],cl  
00000144F2480035 48 8B 4C 24 20       mov         rcx,qword ptr [rsp+20h]  
00000144F248003A FF D0                call        rax  
00000144F248003C 31 C0                xor         eax,eax  
00000144F248003E 48 83 C4 28          add         rsp,28h  
00000144F2480042 C3                   ret  

The disassembly view hides some of the bytes. The memory view shows them:

0x00000144F248001B 48 b8 c0 48 b8 c8 fa 7f 00 00 48 b9 00 00 47 f2 44 01 00 00 62 f1 ff 08 10 09 48 8b 4c 24 20 ff d0 31 c0 48 83 c4 28 c3 00 00 00 00 00 ...

The bytes hidden behind the question marks are '62 f1'.


Some code may help to show how I get the JIT to compile etc. I'm afraid it's a bit long, but it should give the idea, and I have no clue how to reduce it to a smaller piece of code.

    // Note: FunctionBinderBase basically holds an llvm::Function* object
    // which is bound using the above code and a name.
    llvm::ExecutionEngine* Module::Compile(std::unordered_map<std::string, FunctionBinderBase*>& externalFunctions)
    {
        //          DebugFlag = true;

#if (LLVMDEBUG >= 1)
        this->module->dump();
#endif

        // -- Initialize LLVM compiler: --
        std::string error;

        // Helper function, gets the current machine triplet.
        llvm::Triple triple(MachineContextInfo::Triplet()); 
        const llvm::Target *target = llvm::TargetRegistry::lookupTarget("x86-64", triple, error);
        if (!target)
        {
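            // Throw by value; error.c_str() would dangle once 'error' is destroyed. Needs <stdexcept>.
            throw std::runtime_error(error);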
        }

        llvm::TargetOptions Options;
        // Options.PrintMachineCode = true;
        // Options.EnableFastISel = true;

        std::unique_ptr<llvm::TargetMachine> targetMachine(
            target->createTargetMachine(MachineContextInfo::Triplet(), MachineContextInfo::CPU(), "", Options, llvm::Reloc::Default, llvm::CodeModel::Default, llvm::CodeGenOpt::Aggressive));

        if (!targetMachine.get())
        {
            throw "Could not allocate target machine!";
        }

        // Create the target machine; set the module data layout to the correct values.
        auto DL = targetMachine->createDataLayout();
        module->setDataLayout(DL);
        module->setTargetTriple(MachineContextInfo::Triplet());

        // Pass manager builder:
        llvm::PassManagerBuilder pmbuilder;
        pmbuilder.OptLevel = 3;
        pmbuilder.BBVectorize = false;
        pmbuilder.SLPVectorize = true;
        pmbuilder.LoopVectorize = true;
        pmbuilder.Inliner = llvm::createFunctionInliningPass(3, 2);
        llvm::TargetLibraryInfoImpl *TLI = new llvm::TargetLibraryInfoImpl(triple);
        pmbuilder.LibraryInfo = TLI;

        // Generate pass managers:

        // 1. Function pass manager:
        llvm::legacy::FunctionPassManager FPM(module.get());
        pmbuilder.populateFunctionPassManager(FPM);

        // 2. Module pass manager:
        llvm::legacy::PassManager PM;
        PM.add(llvm::createTargetTransformInfoWrapperPass(targetMachine->getTargetIRAnalysis()));
        pmbuilder.populateModulePassManager(PM);

        // 3. Execute passes:
        //    - Per-function passes:
        FPM.doInitialization();
        for (llvm::Module::iterator I = module->begin(), E = module->end(); I != E; ++I)
        {
            if (!I->isDeclaration())
            {
                FPM.run(*I);
            }
        }
        FPM.doFinalization();

        //   - Per-module passes:
        PM.run(*module);

        // Fix function pointers; the PM.run will ruin them, this fixes that.
        for (auto it : externalFunctions)
        {
            auto name = it.first;
            auto fcn = module->getFunction(name);
            it.second->function = fcn;
        }

#if (LLVMDEBUG >= 2)
        // -- ASSEMBLER dump code
        // 3. Code generation pass manager:

        llvm::legacy::PassManager CGP;
        CGP.add(llvm::createTargetTransformInfoWrapperPass(targetMachine->getTargetIRAnalysis()));
        pmbuilder.populateModulePassManager(CGP);

        std::string result;
        llvm::raw_string_ostream str(result);
        llvm::buffer_ostream os(str);

        targetMachine->addPassesToEmitFile(CGP, os, llvm::TargetMachine::CodeGenFileType::CGFT_AssemblyFile);

        CGP.run(*module);

        str.flush();

        auto stringref = os.str();
        std::string assembly(stringref.begin(), stringref.end());

        std::cout << "ASM code: " << std::endl << "---------------------" << std::endl << assembly << std::endl << "---------------------" << std::endl;
        // -- end of ASSEMBLER dump code.

        for (auto it : externalFunctions)
        {
            auto name = it.first;
            auto fcn = module->getFunction(name);
            it.second->function = fcn;
        }

#endif

#if (LLVMDEBUG >= 2)
        module->dump(); 
#endif

        // All done, *RUN*.

        llvm::EngineBuilder engineBuilder(std::move(module));
        engineBuilder.setEngineKind(llvm::EngineKind::JIT);
        engineBuilder.setMCPU(MachineContextInfo::CPU());
        engineBuilder.setMArch("x86-64");
        engineBuilder.setUseOrcMCJITReplacement(false);
        engineBuilder.setOptLevel(llvm::CodeGenOpt::None);

        llvm::ExecutionEngine* engine = engineBuilder.create();

        // Define external functions
        for (auto it : externalFunctions)
        {
            auto fcn = it.second;
            if (fcn->function)
            {
                engine->addGlobalMapping(fcn->function, const_cast<void*>(fcn->FunctionPointer())); // Yuck... LLVM only takes non-const pointers
            }
        }

        // Finalize
        engine->finalizeObject();

        return engine;
    }

Update (progress)

Apparently my Skylake has problems with the vmovsd instruction. When running the same code on a Haswell (server), the test succeeds. I've checked the assembly output on both - they are exactly the same.

Just to be sure: XSAVE/XRSTOR shouldn't be the problem on Win10-x64, but let's find out anyway. I've checked the CPU features with the code from https://msdn.microsoft.com/en-us/library/hskdteyh.aspx and the XSAVE/XRSTOR support with the code from https://insufficientlycomplicated.wordpress.com/2011/11/07/detecting-intel-advanced-vector-extensions-avx-in-visual-studio/. The latter runs just fine. As for the former, these are the results:

GenuineIntel
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
3DNOW not supported
3DNOWEXT not supported
ABM not supported
ADX supported
AES supported
AVX supported
AVX2 supported
AVX512CD not supported
AVX512ER not supported
AVX512F not supported
AVX512PF not supported
BMI1 supported
BMI2 supported
CLFSH supported
CMPXCHG16B supported
CX8 supported
ERMS supported
F16C supported
FMA supported
FSGSBASE supported
FXSR supported
HLE supported
INVPCID supported
LAHF supported
LZCNT supported
MMX supported
MMXEXT not supported
MONITOR supported
MOVBE supported
MSR supported
OSXSAVE supported
PCLMULQDQ supported
POPCNT supported
PREFETCHWT1 not supported
RDRAND supported
RDSEED supported
RDTSCP supported
RTM supported
SEP supported
SHA not supported
SSE supported
SSE2 supported
SSE3 supported
SSE4.1 supported
SSE4.2 supported
SSE4a not supported
SSSE3 supported
SYSCALL supported
TBM not supported
XOP not supported
XSAVE supported
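
For completeness, the AVX512F and OS-support checks boil down to a few bit tests. The following is a minimal sketch, not the code from the linked pages: the helper names (hasAvx512F, osEnablesAvx, osEnablesAvx512, hostCanRunAvx512F) are made up for illustration, the bit positions are the documented ones for CPUID leaf 7 (EBX) and XCR0, and the part that actually queries the CPU assumes Visual C++.

```cpp
#include <cstdint>

// CPUID.(EAX=7,ECX=0):EBX bit 16 reports AVX512F.
constexpr bool hasAvx512F(std::uint32_t leaf7Ebx)
{
    return ((leaf7Ebx >> 16) & 1u) != 0;
}

// XCR0 bits 1 (SSE) and 2 (AVX) must both be set before the OS
// saves/restores the YMM state, i.e. before AVX may be used.
constexpr bool osEnablesAvx(std::uint64_t xcr0)
{
    return (xcr0 & 0x06u) == 0x06u;
}

// AVX-512 additionally needs XCR0 bits 5, 6 and 7
// (opmask, upper halves of ZMM0-15, and ZMM16-31).
constexpr bool osEnablesAvx512(std::uint64_t xcr0)
{
    return (xcr0 & 0xE6u) == 0xE6u;
}

#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid, __cpuidex
#include <immintrin.h>  // _xgetbv

// Assumption: built with Visual C++ for x86/x64.
bool hostCanRunAvx512F()
{
    int regs[4] = { 0, 0, 0, 0 };

    __cpuid(regs, 0);
    if (regs[0] < 7)
    {
        return false; // CPUID leaf 7 is not available
    }

    __cpuid(regs, 1);
    const bool osxsave = ((static_cast<std::uint32_t>(regs[2]) >> 27) & 1u) != 0;
    if (!osxsave)
    {
        return false; // XGETBV must not be executed without OSXSAVE
    }

    __cpuidex(regs, 7, 0);
    return hasAvx512F(static_cast<std::uint32_t>(regs[1]))
        && osEnablesAvx512(_xgetbv(0));
}
#endif
```

On a CPU that reports "AVX512F not supported", as in the list above, hostCanRunAvx512F() would be expected to return false.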

That's weird, so I figured: why not simply emit the instruction directly?

#include <emmintrin.h> // _mm_load_sd
#include <iostream>
#include <string>

int main()
{
    const double value = 1.2;
    const double value2 = 1.3;

    auto x1 = _mm_load_sd(&value);
    auto x2 = _mm_load_sd(&value2);

    std::string s;
    std::getline(std::cin, s);
}

This code runs fine. The disassembly:

    auto x1 = _mm_load_sd(&value);
00007FF7C4833724 C5 FB 10 45 08       vmovsd      xmm0,qword ptr [value]  

    auto x1 = _mm_load_sd(&value);
00007FF7C4833729 C5 F1 57 C9          vxorpd      xmm1,xmm1,xmm1  
00007FF7C483372D C5 F3 10 C0          vmovsd      xmm0,xmm1,xmm0  

Apparently it won't load into register xmm1 here, but this still proves that the vmovsd instruction itself executes fine on this CPU.

Sedgewick answered 25/8, 2016 at 15:46 Comment(14)
It should be %xmm0. - Lulu
@Lulu Yeah, I agree, but even if that's the case I wouldn't expect it to crash. - Sedgewick
Maybe your crash has nothing to do with the calling convention. An illegal instruction emitted by clang is printed as ud2, not as question marks. You can switch to the disassembly view in Visual Studio and step by instruction to see what is really going on during the call. - Lulu
@Lulu That could be what's going on, but that's weird since I'm not doing anything strange, right? I've added a screenshot of what I see in the VS disassembly view... as you can see it's all question marks. - Sedgewick
Can you determine the content behind the question marks manually? You can use either the memory view or *(unsigned char*)(0xaddress) in the debugger. - Lulu
@Lulu I've added the information you requested. Apparently I modified some of the code, so I added a bit more information and updated the IR/ASM stuff that has changed as well. For completeness, I also added the code I use to emit the code. PS: I tried with double Get() { return 2.2; } as well; this gives the same behavior. - Sedgewick
The instruction behind ?? is vmovsd xmm1, qword ptr [rcx]. So it is the load of your double constant that failed. - Lulu
Do you by any chance use Windows Vista, Windows 7 without SP1 or Windows Server 2008 R2 without SP1? These OSes don't support AVX (vmovsd is an AVX instruction) even if the processor does. (drdobbs.com/parallel/windows-7-and-windows-server-2008-r2-ser/…) - Lulu
@Lulu I'm working here on Windows 10 x64 professional edition, so that shouldn't be the problem. Still, I did notice that LLVM reports some strange features for this CPU. Notice the AVX512 sets here; I'm pretty sure that's a mistake...: sse4a avx512bw cx16 tbm xsave fma4 avx512vl prfchw bmi2 adx xsavec fsgsbase avx avx512cd avx512pf rtm popcnt fma bmi aes rdrnd xsaves sse4.1 sse4.2 avx2 avx512er sse lzcnt pclmul avx512f f16c ssse3 mmx pku cmov xop rdseed movbe hle xsaveopt sha sse2 sse3 avx512dq. Also, I'll try to run some tests on another system in a second... - Sedgewick
@Lulu Just tested on my server (Xeon E5-2650L v3). That's a Haswell architecture. Apparently it does work there. I just checked the rest, and the emitted assembly code is exactly the same. You're right, it almost seems like AVX execution is disabled. IIRC I have some code lying around here to check that, give me a moment... - Sedgewick
@Lulu Just added the latest information. XSAVE/XRSTOR and CPUID flags seem to be fine. - Sedgewick
@Lulu I've just finished figuring out the details of the error, as you can read in the answer I whipped up. It turns out that the problem is in the SSE level detection in LLVM. I've made a bug report for the team. Still, it was your idea to check the memory bytes that got me to the answer, so if you write something down, I'd be more than happy to award you the bounty. - Sedgewick
Nah, your self-answer summarizes it perfectly. I don't mind when I can't get reputation due to site shortcomings. I enjoyed the good mystery and talking with you, and this is what really matters. - Lulu
@Lulu All right. In that case I was happy to provide a good mystery :-) - Sedgewick

I just checked on an Intel Haswell what's going on here, and found this:

0000015077F20110 C5 FB 10 08          vmovsd      xmm1,qword ptr [rax] 

Apparently a different encoding of the instruction is emitted on the Haswell than on my Skylake.

@Ha. was kind enough to point me in the right direction here. Yes, the hidden bytes do indeed encode VMOVSD, but the instruction is EVEX-encoded (the leading 0x62 byte). That's all nice and well, but the EVEX prefix / encoding is being introduced as part of AVX512, which won't be supported until Skylake Purley in 2017. In other words, this is an invalid instruction on my CPU.
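
The difference between the two encodings is visible in the first byte alone. Below is a minimal sketch of that check; classifyPrefix and VectorEncoding are hypothetical names, and it assumes 64-bit mode and that the given bytes start at the VEX/EVEX prefix (no segment or address-size prefix in front of it).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical classification of the first byte of an x86-64 instruction.
enum class VectorEncoding { Other, Vex2, Vex3, Evex };

VectorEncoding classifyPrefix(const std::uint8_t* bytes, std::size_t size)
{
    if (bytes == nullptr || size == 0)
    {
        return VectorEncoding::Other;
    }

    switch (bytes[0])
    {
    case 0xC5: return VectorEncoding::Vex2; // two-byte VEX prefix (AVX)
    case 0xC4: return VectorEncoding::Vex3; // three-byte VEX prefix (AVX)
    case 0x62: return VectorEncoding::Evex; // EVEX prefix (AVX-512)
    default:   return VectorEncoding::Other;
    }
}
```

For the two byte sequences in this post, 62 f1 ff 08 10 09 classifies as EVEX and c5 fb 10 08 as two-byte VEX.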

To check, I've put a breakpoint in X86MCCodeEmitter::EmitMemModRMByte. At some point, I do see a bool HasEVEX = [...] evaluating to true. This confirms that the codegen / emitter is producing the wrong output.

My conclusion is therefore that this has to be a bug in LLVM's target information for Skylake CPUs. That leaves only two things to do: figure out where exactly this bug is in LLVM so it can be fixed, and report it to the LLVM team...

So where is it in LLVM? That's tough to tell... x86.td.def defines the skylake features as 'FeatureAVX512', which will probably set X86SSELevel to AVX512F. That in turn will produce the wrong instructions. As a workaround, it's best to simply tell LLVM that we have an Intel Haswell instead, and all will be well:

// MCPU is used to call createTargetMachine
llvm::StringRef MCPU = llvm::sys::getHostCPUName();
if (MCPU.str() == "skylake")
{
    MCPU = llvm::StringRef("haswell");
}

Tested, and it works.
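
An untested alternative would be to keep the CPU name and switch off the AVX-512 subtarget features explicitly. LLVM feature strings are comma-separated names prefixed with + or -, so such a string can be built as sketched below. disableAvx512 is a hypothetical helper, and whether this is sufficient to avoid EVEX encodings in this LLVM version is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds a feature string such as "-avx512f,-avx512cd"
// that turns off every feature whose name starts with "avx512".
std::string disableAvx512(const std::vector<std::string>& features)
{
    std::string attrs;
    for (const std::string& feature : features)
    {
        if (feature.compare(0, 6, "avx512") != 0)
        {
            continue; // leave all non-AVX-512 features untouched
        }
        if (!attrs.empty())
        {
            attrs += ',';
        }
        attrs += '-';
        attrs += feature;
    }
    return attrs;
}
```

For example, disableAvx512({"avx2", "avx512f", "avx512cd"}) yields "-avx512f,-avx512cd", which could then be passed as the features argument of createTargetMachine (the empty string in the code from the question).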

Sedgewick answered 28/8, 2016 at 10:21 Comment(2)
This explains why I couldn't find the instruction encoding in the documentation: it is too new. I haven't seen EVEX encoding yet, so I didn't recognize it as such. Well, at least I learned something new today. - Lulu
@Lulu Same here. I read the Intel specs a year ago, and they didn't describe EVEX back then. That actually makes sense, because CPUs don't have the zmm registers yet. Also, Intel said that Skylake was going to support AVX512, and explained later that it's only late Skylake Purley (Xeon) / Knights Landing that would support AVX512 and Skylake AVX512F. So, the fact is that it won't support any AVX512 as of today. It was quite confusing imho. CPUID reflects this, but apparently the SSELevel doesn't work like that (that's just encoding). Confusing, confusing... Anyway, thanks for all the help! - Sedgewick
