Random crashes on Windows 10 64bit with ATL subclassing

Update up front: as of March 1st 2017 this is a bug confirmed by Microsoft. Read the comments at the end.

Short description:

I have random crashes in a larger application using MFC and ATL. In all such cases, after ATL subclassing was used for a window, simple actions on that window (moving, resizing, setting the focus, painting etc.) crash at a random execution address.

At first it looked like a wild pointer or heap corruption, but I narrowed the complete scenario down to a very simple application using pure ATL and only the Windows API.

Requirements / my test setup:

  • The application was created with VS 2015 Enterprise Update 3.
  • The program should be compiled as 32bit.
  • Test application uses CRT as a shared DLL.
  • The application runs under Windows 10 Build 14393.693 64bit (but we have repros under Windows 8.1 and Windows Server 2012 R2, all 64bit)
  • atlthunk.dll has version 10.0.14393.0

What the application does:

It simply creates a frame window and tries to create many static windows with the Windows API. After each static window is created, it is subclassed with the ATL CWindowImpl::SubclassWindow method. After the subclass operation a simple window message is sent.

What happens:

Not on every run, but very often, the application crashes upon SendMessage to the subclassed window. On the 257th window (or another window at a multiple of 256, plus 1) the subclass fails in some way. The ATL thunk that is created is invalid. It seems that the stored execution address of the new subclass function isn't correct. Sending any message to the window causes a crash. The call stack is always the same. The last visible and known address in the call stack is in atlthunk.dll:

atlthunk.dll!AtlThunk_Call(unsigned int,unsigned int,unsigned int,long) Unknown
atlthunk.dll!AtlThunk_0x00(struct HWND__ *,unsigned int,unsigned int,long)  Unknown
user32.dll!__InternalCallWinProc@20()   Unknown
user32.dll!UserCallWinProcCheckWow()    Unknown
user32.dll!SendMessageWorker()  Unknown
user32.dll!SendMessageW()   Unknown
CrashAtlThunk.exe!WindowCheck() Line 52 C++

The thrown exception in the debugger is shown as:

Exception thrown at 0x0BF67000 in CrashAtlThunk.exe: 
0xC0000005: Access violation executing location 0x0BF67000.

or another sample

Exception thrown at 0x2D75E06D in CrashAtlThunk.exe: 
0xC0000005: Access violation executing location 0x2D75E06D.

What I know about atlthunk.dll:

Atlthunk.dll seems to be part of 64-bit OS versions only. I found it on Win 8.1 and Win 10 systems.

If atlthunk.dll is available (all Windows 10 machines), this DLL takes care of the thunking. If the DLL isn't present, thunking is done in the standard way: allocating a block on the heap, marking it as executable, and adding some load and jump instructions.

If the DLL is present, it provides 256 predefined slots for subclassing. Once 256 subclasses have been created, the DLL loads itself a second time into memory and uses the next 256 slots available in that copy.
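The slot arithmetic described above can be sketched as a small model. This is only an illustration, assuming thunks are handed out sequentially and 256 per mapped copy as described; the names `kSlotsPerImage`, `ThunkLocation` and `locateThunk` are hypothetical and do not exist in atlthunk.dll.

```cpp
#include <cstddef>

// Assumption taken from the description above: each mapped copy of the DLL
// offers 256 thunk slots, and slots are handed out in order.
constexpr std::size_t kSlotsPerImage = 256;

struct ThunkLocation {
    std::size_t imageCopy; // 0 = first copy of the DLL, 1 = second copy, ...
    std::size_t slot;      // slot index inside that copy, 0..255
};

// Maps the zero-based index of a subclassed window to the copy and slot
// that would serve it in this model.
constexpr ThunkLocation locateThunk(std::size_t windowIndex) {
    return { windowIndex / kSlotsPerImage, windowIndex % kSlotsPerImage };
}
```

In this model the 257th window (index 256) is the first one served by the second copy, which matches the window count at which the crashes appear.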

As far as I can see, atlthunk.dll belongs to Windows 10 and isn't exchangeable or redistributable.

Things checked:

  • Antivirus system was turned off or on, no change
  • Data execution prevention doesn't matter. (With /NXCOMPAT:NO and the EXE defined as an exclusion in the system settings, it crashes too.)
  • Additional calls to FlushInstructionCache or Sleep after the subclass don't have any effect.
  • Heap integrity isn't a problem here, I rechecked it with more than one tool.
  • and a thousand more (I may have already forgotten what I tested)... ;)

Reproducibility:

The problem is somewhat reproducible. It doesn't crash all the time, it crashes randomly. I have a machine where the code crashes on every third execution.

I can repro it on two desktop stations with an i7-4770 and an i7-6700.

Other machines seem not to be affected at all (it always works on a laptop with an i3-3217, or a desktop with an i7-870).

About the sample:

For simplicity I use a SEH handler to catch the error. If you debug the application, the debugger will show the call stack mentioned above. The program can be launched with an integer on the command line. In this case the program launches itself again with the count decremented by 1. So if you launch CrashAtlThunk 100 it will launch the application 100 times. Upon an error the SEH handler will catch the error and show the text "Crash" in a message box. If the application runs without errors, it shows "Succeeded" in a message box. If the application is started without a parameter it is executed just once.

Questions:

  • Can anybody else repro this?
  • Has anybody seen similar effects?
  • Does anybody know or can imagine a reason for this?
  • Does anybody know how to get around this problem?

Notes:

2017-01-20 Support case at Microsoft opened.

The code

// CrashAtlThunk.cpp : Defines the entry point for the application.
//

// Windows Header Files:
#include <windows.h>

// C RunTime Header Files
#include <stdlib.h>
#include <malloc.h>
#include <memory.h>
#include <tchar.h>

#define _ATL_CSTRING_EXPLICIT_CONSTRUCTORS      // some CString constructors will be explicit

#include <atlbase.h>
#include <atlstr.h>
#include <atlwin.h>


// Global Variables:
HINSTANCE hInst;                                // current instance

const int NUM_WINDOWS = 1000;

//------------------------------------------------------
//    The problematic code
//        After the 256th subclass the application randomly crashes.

class CMyWindow : public CWindowImpl<CMyWindow>
{
public:
    virtual BOOL ProcessWindowMessage(_In_ HWND hWnd, _In_ UINT uMsg, _In_ WPARAM wParam, _In_ LPARAM lParam, _Inout_ LRESULT& lResult, _In_ DWORD dwMsgMapID) override
    {
        return FALSE;
    }
};

void WindowCheck()
{
    HWND ahwnd[NUM_WINDOWS];
    CMyWindow subclass[_countof(ahwnd)];

    HWND hwndFrame;
    ATLVERIFY(hwndFrame = ::CreateWindow(_T("Static"), _T("Frame"), SS_SIMPLE, 0, 0, 10, 10, NULL, NULL, hInst, NULL));

    for (int i = 0; i<_countof(ahwnd); ++i)
    {
        ATLVERIFY(ahwnd[i] = ::CreateWindow(_T("Static"), _T("DummyWindow"), SS_SIMPLE|WS_CHILD, 0, 0, 10, 10, hwndFrame, NULL, hInst, NULL));
        if (ahwnd[i])
        {
            subclass[i].SubclassWindow(ahwnd[i]);
            ATLVERIFY(SendMessage(ahwnd[i], WM_GETTEXTLENGTH, 0, 0)!=0);
        }
    }
    for (int i = 0; i<_countof(ahwnd); ++i)
    {
        if (ahwnd[i])
            ::DestroyWindow(ahwnd[i]);
    }
    ::DestroyWindow(hwndFrame);
}
//------------------------------------------------------

int APIENTRY wWinMain(_In_ HINSTANCE hInstance,
                     _In_opt_ HINSTANCE hPrevInstance,
                     _In_ LPWSTR    lpCmdLine,
                     _In_ int       nCmdShow)
{
    hInst = hInstance; 

    int iCount = _tcstol(lpCmdLine, nullptr, 10);

    __try
    {
        WindowCheck();
        if (iCount==0)
        {
            ::MessageBox(NULL, _T("Succeeded"), _T("CrashAtlThunk"), MB_OK|MB_ICONINFORMATION);
        }
        else
        {
            TCHAR szFileName[_MAX_PATH];
            TCHAR szCount[16];
            _itot_s(--iCount, szCount, 10);
            ::GetModuleFileName(NULL, szFileName, _countof(szFileName));
            ::ShellExecute(NULL, _T("open"), szFileName, szCount, nullptr, SW_SHOW);
        }
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        ::MessageBox(NULL, _T("Crash"), _T("CrashAtlThunk"), MB_OK|MB_ICONWARNING);
        return FALSE;
    }

    return 0;
}

Comment after the answer by Eugene (Feb. 24th 2017):

I don't want to change my original question, but I want to add some information on how to get this into a 100% repro.

First, change the main function to:

int APIENTRY wWinMain(_In_ HINSTANCE hInstance,
                     _In_opt_ HINSTANCE hPrevInstance,
                     _In_ LPWSTR    lpCmdLine,
                     _In_ int       nCmdShow)
{
    // Get the load address of ATLTHUNK.DLL
    // HMODULE hMod = LoadLibrary(_T("atlThunk.dll"));

    // Now allocate a page at the preferred start address
    void* pMem = VirtualAlloc(reinterpret_cast<void*>(0x0f370000), 0x10000, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    DWORD dwLastError = ::GetLastError();

    hInst = hInstance; 

    WindowCheck();

    return 0;
}
  1. Uncomment the LoadLibrary call. Compile.

  2. Run the program once and stop in the debugger. Note the address where the library was loaded (hMod).

  3. Stop the program. Now comment out the LoadLibrary call again and change the address in the VirtualAlloc call to the previous hMod value; this is the preferred load address in this Windows session.

  4. Recompile and run. CRASH!

Thanks to eugene.

Up to now, Microsoft is still investigating this. They have dumps and all the code, but I don't have a final answer. The fact is that we have a fatal bug in some 64-bit Windows versions.

I currently made the following changes to get around this:

  1. Open atlstdthunk.h of VS-2015.

  2. Comment out the complete #ifdef block that defines USE_ATL_THUNK2 (code lines 25 to 27).

  3. Recompile your program.

This enables the old thunking mechanism well known from VC-2010, VC-2013... and this works crash-free for me, as long as no other already compiled libraries are involved that may subclass or use 256 windows via ATL in any way.

Comment (Mar. 1st 2017):

  • Microsoft confirmed that this is a bug. It should be fixed in Windows 10 RS2.
  • Microsoft agrees that editing the headers in atlstdthunk.h is a workaround for the problem.

In effect this means: as long as there is no stable patch I can never use the normal ATL thunking again, because I will never know which Windows versions out in the world will run my program. Windows 8, Windows 8.1 and Windows 10 prior to RS2 will suffer from this bug.

Final Comment (Mar. 9th 2017):

  • Builds with VS-2017 are affected too; there is no difference between VS-2015 and VS-2017.
  • Microsoft decided that there will be no fix for older OS versions regarding this case.
  • Neither Windows 8.1, Windows Server 2012 R2 nor other Windows 10 builds will get a patch to fix this issue.
  • The issue is too rare and the impact for our company is too small. Also the fix on our side is too simple. Other reports of this bug are not known.
  • The case is closed.
  • The case is closed.

My advice for all programmers: change atlstdthunk.h in your Visual Studio version VS-2015 or VS-2017 (see above). I don't understand Microsoft. This bug is a serious problem in the ATL thunking. It may hit every programmer who uses a larger number of windows and/or subclassing.

We only know of a fix in Windows 10 RS2, so all older OS versions are affected! I therefore recommend disabling the use of atlthunk.dll by commenting out the define noted above.

Pulvinate answered 19/1, 2017 at 12:8 Comment(26)
You never mentioned which SEH exception is raised. Which one is it? Besides, you call ShellExecute on a thread that never initialized COM. That's not entirely prudent either. – Esker
One potential problem: you are destroying windows (::DestroyWindow), which will post messages to the window, and then letting your subclass array immediately go out of scope. This will mean that window destruction messages will have nowhere valid to be processed. Also, if there are any pending messages these will have the same problem. – Furan
@RichardCritten: Neither one is a potential issue. DestroyWindow is strictly serialized. When it returns, all messages have been sent (they aren't posted) and processed. And if there are indeed pending messages, DispatchMessage won't be able to find the destination window, and nothing will happen. – Esker
@RichardCritten: In normal cases the crash has nothing to do with the destruction phase. The crash happens in the loop, in the SendMessage line. Also it is completely safe to destroy a subclassed window. This is true for MFC and ATL subclassing. Also in my case there are no messages in any message queue... and as you can see I don't even have a message loop at all. – Pulvinate
@IInspectable: OK about the missing COM init, but anyhow the ShellExecute is only executed when the program succeeded and when we have a count. – Pulvinate
@IInspectable: Added the type of SEH exception to the question: Exception thrown at 0x0BF67000 in CrashAtlThunk.exe: 0xC0000005: Access violation executing location 0x0BF67000. – Pulvinate
I'd still recommend initializing COM on the calling thread, just to preclude that as a potential issue. What's interesting about the exception is that the program is trying to execute code at a memory page boundary. Could you have a look at the thunk code? Does it straddle memory pages? Also worth investigating: does thunk code straddle memory pages in case everything goes well? – Esker
@Pulvinate you don't need to use the legacy thunking, I would highly suggest disabling it. The legacy thunking compat was for older operating systems and is not compatible with DEP. – Kavita
@Mgetz: What do you mean by legacy thunking? I just use ATL subclassing. The rest is done by ATL, including the way it wants to subclass, AND this is not the OLD way. – Pulvinate
@IInspectable: The thunk code itself is valid. atlthunk.dll has 256 entry points, and the address and the code are valid. The problem is what AtlThunk_Call is doing... it seems that the given address to be called is wrong. – Pulvinate
I don't know whether the thunk address is right or not. The access violation SEH exception merely indicates that the program is trying to execute code in a memory page that's either not readable or not executable. This could be due to a wrong address, or invalid page protection settings. – Esker
@IInspectable: The memory address of the crash seems to be random in some way: Exception thrown at 0x2D75E06D in CrashAtlThunk.exe: 0xC0000005: Access violation executing location 0x2D75E06D. The address definitely doesn't exist. – Pulvinate
@Pulvinate better question: why do you need to subclass a window more than 255 times? This seems like a bad idea. – Kavita
@Mgetz: Bad idea? ;) Also I don't subclass 1 window 256 times. I have 256 windows that are subclassed 1 time each. Think about a large address management system. A lot of tab views. Large dialog templates. Each tab view contains x edit controls and list controls or whatever. These controls get subclassed. 256 subclassed windows in one open application with cascaded dialogs is easily reachable. – Pulvinate
@Pulvinate using GDI windows more than necessary has long been known to be an anti-pattern. While I do suggest switching to windowless controls, I understand that's not an easy undertaking. That said, it is a LOT more scalable than using windowed controls, as GDI has a lot of limits. Clearly the ATL does too. – Kavita
@Mgetz: Your suggestion has nothing to do with my question and the bug, if it is one. Yes, you are right, but even 256 windows for a standard ATL application is reached very fast with tooltips, frame, docking, tabs... I don't want to discuss modern style application design and handling older legacy applications... thank you for reading. – Pulvinate
I just opened a case at Microsoft support. Added text for this in the question. – Pulvinate
Did you pick one of the support plans, or did you post at Connect? If the latter is the case, a link to the incident would be helpful. – Esker
We have a contract, so I used this way. Connect seems to be a "no reaction / cannot repro" portal. Even in my MVP time I never had luck using it. – Pulvinate
Can you post a link to the Connect issue? We also managed to be affected by this bug. As a workaround we place a copy of atlstdthunk.h in our project tree and comment out #define USE_ATL_THUNK2 in it, so it works as in previous versions of VS. – Tetchy
I currently have only a support case. I opened it and it is still open. Currently I use the same fix... Can you repro the problem with my code? – Pulvinate
But in fact this is no solution. It is just a possibility to get around the bug. – Pulvinate
No, we were unable to reproduce it with your code. Maybe you can provide the content of the project file so we will have the same settings. – Tetchy
I'll send you the project. When you use my code, launch the program with a parameter like 1000 or 100. This will consecutively launch the program the given number of times. Email martin(dot)richter(at)grutzeck(dot)de – Pulvinate
Finally MS confirmed the bug. I added a comment to the question. – Pulvinate
Case is closed by Microsoft. There will be no fix. See my advice and a final comment at the bottom of the question. – Pulvinate

This is a bug inside atlthunk.dll. When it loads itself a second time and further, this happens manually via a MapViewOfFile call. In this case not every address relative to the module base is properly adjusted (when a DLL is loaded by LoadLibrary/LoadLibraryEx calls, the system loader does this automatically). So if the DLL was loaded at its preferred base address the first time, everything works fine, as the unchanged addresses point to similar code or data. But if not, you get a crash when the 257th subclassed window handles messages.
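The effect of a skipped relocation can be illustrated with a toy model. This is a sketch under stated assumptions, not the code of atlthunk.dll: `ImageModel`, `storedAddress`, `relocatedAddress` and `pointsInto` are hypothetical names, and the model tracks one 32-bit absolute address stored in the image.

```cpp
#include <cstdint>

// Toy model of one absolute address stored inside a DLL image.
struct ImageModel {
    std::uint32_t preferredBase; // base address the image was linked for
    std::uint32_t targetRva;     // offset of the call target inside the image
};

// The address as it is written in the file (valid only at the preferred base).
inline std::uint32_t storedAddress(const ImageModel& img) {
    return img.preferredBase + img.targetRva;
}

// The address after a base relocation fix-up for the actual load address:
// the loader adds the delta between actual and preferred base.
inline std::uint32_t relocatedAddress(const ImageModel& img, std::uint32_t actualBase) {
    return storedAddress(img) + (actualBase - img.preferredBase);
}

// True if 'address' lies inside an image of 'imageSize' bytes mapped at 'base'.
// Unsigned wrap-around makes addresses below 'base' fail the comparison.
inline bool pointsInto(std::uint32_t address, std::uint32_t base, std::uint32_t imageSize) {
    return address - base < imageSize;
}
```

With made-up values (preferred base 0x10000000, copy mapped at 0x0BF60000), the relocated address lands inside the mapped copy, while the unrelocated one still points into the preferred-base range. If nothing executable is mapped there, a call through it raises an access violation.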

Since Vista we have the "address space layout randomization" feature; this explains why your code crashes randomly. To get a crash every time, you have to find the atlthunk.dll base address on your OS (it differs between OS versions) and reserve one memory page of address space at this address using a VirtualAlloc call before the first subclass. To find the base address you can use the dumpbin /headers atlthunk.dll command or parse the PE headers manually.
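Parsing the PE headers manually could look roughly like the sketch below. It reads the ImageBase field from a file image held in memory, using the documented PE/COFF offsets; the helper names `readLE` and `preferredImageBase` are made up for this illustration. It only returns the value stored in the file header, which is not necessarily the address the DLL ends up at in a given session.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a little-endian unsigned integer of 'size' bytes at 'offset'.
// Returns false if the buffer is too short.
static bool readLE(const std::vector<std::uint8_t>& data, std::uint64_t offset,
                   std::uint64_t size, std::uint64_t& value) {
    if (offset > data.size() || size > data.size() - offset) return false;
    value = 0;
    for (std::uint64_t i = 0; i < size; ++i)
        value |= static_cast<std::uint64_t>(data[static_cast<std::size_t>(offset + i)]) << (8 * i);
    return true;
}

// Extracts the preferred base (ImageBase) from a PE file image.
// Returns false if the buffer does not look like a PE32 or PE32+ file.
bool preferredImageBase(const std::vector<std::uint8_t>& file, std::uint64_t& imageBase) {
    std::uint64_t mz = 0, peOffset = 0, signature = 0, magic = 0;
    if (!readLE(file, 0, 2, mz) || mz != 0x5A4D) return false;       // "MZ"
    if (!readLE(file, 0x3C, 4, peOffset)) return false;              // e_lfanew
    if (!readLE(file, peOffset, 4, signature) || signature != 0x00004550)
        return false;                                                // "PE\0\0"
    const std::uint64_t optionalHeader = peOffset + 4 + 20;          // signature + COFF header
    if (!readLE(file, optionalHeader, 2, magic)) return false;
    if (magic == 0x10B) return readLE(file, optionalHeader + 28, 4, imageBase); // PE32
    if (magic == 0x20B) return readLE(file, optionalHeader + 24, 8, imageBase); // PE32+
    return false;
}
```

To use it, read the whole DLL file into a `std::vector<std::uint8_t>` and pass it in; `dumpbin /headers` reports the same field as "image base".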

My tests show that on Windows 10 build 14393.693 the x32 version is affected but x64 is not. On Server 2012 R2 with the latest updates both versions (x32 and x64) are affected.

BTW, the atlthunk.dll code has around 10 times more CPU instructions per thunk call than the previous implementation. It may not be very significant, but it slows down message processing.

Gorges answered 23/2, 2017 at 9:34 Comment(8)
So you can repro the problem? I have the crashes only on a Windows 10 x64 machine. Or are you talking about the DLL version? – Pulvinate
Yes, I mean the x64/x32 DLL on an x64 OS. – Gorges
And yes, I slightly modified your code to reserve memory at the preferred DLL location before the first subclass. This gives 100% crash reproducibility. – Gorges
Can you send me your version? martin(dot)richter(at)grutzeck(dot)de. I don't fully understand your trick. The preferred base address for this DLL is always 10000000. I always get different addresses. – Pulvinate
It is easy. LoadLibrary to get the first load address. FreeLibrary again. Now reserve the memory with VirtualAlloc, and CRASH! – Pulvinate
This is not 100% bulletproof but should do the trick most of the time. I am glad that it helps. – Gorges
Finally MS confirmed the bug. I added a comment to the question. – Pulvinate
Case is closed by Microsoft. There will be no fix. See my advice and a final comment at the bottom of the question. – Pulvinate

A slightly more automatic form of what was already described:

// A minimum ATL program with more than 256 windows. In practise they would not be toplevel, but e.g. buttons.
// Thanks to https://www.codeguru.com/cpp/com-tech/atl/article.php/c3605/Using-the-ATL-Windowing-Classes.htm
// for helping with ATL.
// You need to be up to date, like have KB3030947 or KB3061512. Otherwise asserts will fail instead.
#undef _DEBUG
#include <atlbase.h>
ATL::CComModule _Module;
#include <atlwin.h>
#include <assert.h>
#include <string>

BEGIN_OBJECT_MAP(ObjectMap) END_OBJECT_MAP()

struct CMyWindow : CWindowImpl<CMyWindow>
{
    BEGIN_MSG_MAP(CMyWindow) END_MSG_MAP()
};

int __cdecl wmain()
{
    // Exacerbate the problem, which can happen more like if by chance.
    PROCESS_INFORMATION process = { 0 };
    {
        // Be sure another process has atlthunk loaded.
        WCHAR cmd[] = L"rundll32 atlthunk,x";
        STARTUPINFOW startup = { sizeof(startup) };
        BOOL success = CreateProcessW(0, cmd, 0, 0, 0, 0, 0, 0, &startup, &process);
        assert(success && process.hProcess);
        CloseHandle(process.hThread);
        // Get atlthunk's usual address.
        HANDLE file = CreateFileW((std::wstring(_wgetenv(L"SystemRoot")) + L"\\system32\\atlthunk.dll").c_str(), GENERIC_READ,
            FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
        assert(file != INVALID_HANDLE_VALUE);
        HANDLE mapping = CreateFileMappingW(file, 0, PAGE_READONLY | SEC_IMAGE, 0, 0, 0);
        assert(mapping);
        void* view = MapViewOfFile(mapping, 0, 0, 0, 0);
        assert(view);
        UnmapViewOfFile(view);
        VirtualAlloc(view, 1, MEM_COMMIT | MEM_RESERVE, PAGE_NOACCESS);
    }
    _Module.Init(0, 0);
    const int N = 300;
    CMyWindow wnd[N];
    for (int i = 0; i < N; ++i)
    {
        wnd[i].Create(0, CWindow::rcDefault, L"Hello", (i < N - 1) ? 0 : (WS_OVERLAPPEDWINDOW | WS_VISIBLE));
        wnd[i].DestroyWindow();
    }
    TerminateProcess(process.hProcess, 0);
    CloseHandle(process.hProcess);
    MSG msg;
    while (GetMessageW(&msg, 0, 0, 0))
    {
        TranslateMessage(&msg);
        DispatchMessageW(&msg);
    }
    _Module.Term();
}
Frigate answered 6/10, 2020 at 10:39 Comment(1)
Just a note that despite earlier postings, the bug "only" affects 32-bit code on Windows 8.1. Windows 8.1 unfortunately went out of mainstream support in 2018, thus the difficulty getting the bug fixed (yes, the bug was reported in 2017). – Frigate