For the last couple days we have seen intermittent crashes of the w3wp.exe worker process serving the main application pool for our corporate web site. Sometimes the crashes are isolated, and IIS is able to restart the worker process successfully. But if more than 5 crashes happen in 5 minutes, IIS Rapid Fail Protection kicks in and stops the application pool. Here is an example entry from the Application event log just before the crash:
An unhandled exception occurred and the process was terminated.
Application ID: /LM/W3SVC/2/ROOT
Process ID: 3640
Exception: System.Threading.ThreadAbortException
Message: Thread was being aborted.
StackTrace: at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
Immediately after the crash due to the ThreadAbortException, there is a more serious event logged:
Faulting application name: w3wp.exe, version: 8.0.9200.16384, time stamp: 0x5010885f
Faulting module name: KERNELBASE.dll, version: 6.2.9200.17366, time stamp: 0x554d16f6
Exception code: 0xe0434352
Fault offset: 0x00010192
Faulting process id: 0xe38
Faulting application start time: 0x01d100dc662652d6
Faulting application path: C:\Windows\SysWOW64\inetsrv\w3wp.exe
Faulting module path: C:\Windows\SYSTEM32\KERNELBASE.dll
Report Id: db5b0d5b-6cd0-11e5-9418-005056900458
Faulting package full name:
Faulting package-relative application ID:
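As a sanity check that the two events describe the same process, the hex fields in the second entry can be decoded. This is only a minimal Python sketch: the helper names are my own (hypothetical, not part of any Microsoft tool), and it assumes the start time is a standard Windows FILETIME (100-nanosecond ticks since 1601-01-01 UTC).

```python
from datetime import datetime, timedelta, timezone

# Windows FILETIME epoch (assumption: the event log start time uses FILETIME).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def decode_pid(hex_pid: str) -> int:
    """Convert the hex 'Faulting process id' into a decimal PID."""
    return int(hex_pid, 16)


def filetime_to_utc(hex_filetime: str) -> datetime:
    """Convert a hex FILETIME (100 ns ticks since 1601) to a UTC datetime."""
    ticks = int(hex_filetime, 16)
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


def exception_code_tag(hex_code: str) -> str:
    """Return the low three bytes of the exception code as ASCII text."""
    value = int(hex_code, 16)
    return (value & 0xFFFFFF).to_bytes(3, "big").decode("ascii")


if __name__ == "__main__":
    print(decode_pid("0xe38"))                    # 3640
    print(filetime_to_utc("0x01d100dc662652d6"))  # process start time in UTC
    print(exception_code_tag("0xe0434352"))       # CCR
```

0xe38 is 3640, the same Process ID as in the first event, and 0xe0434352 is the generic code the CLR raises for an unhandled managed exception, so the second event appears to be the native-side record of the same failure rather than a separate fault.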
Now, a ThreadAbortException should never cause w3wp.exe to crash, seeing as it is thrown every time a standard Response.Redirect() is performed. MSDN confirms this, and I also confirmed it with a simple test. However, at least one other person has recently seen a similar crash in a similar environment: "Thread.Abort in ASP.NET app causes w3wp.exe to crash". (But that may be an unrelated issue.)
Our environment:
- Corporate web site with shopping cart and partner web services; targets .NET 4.5. (900,000+ lines of custom code including business logic DLLs and in-house libraries.)
- 2 VMware web servers in a load-balanced pool using Windows NLB
- IIS 8.0 / Windows Server 2012 Standard / .NET 4.6.00081
- App pool running in 32-bit mode because we have to support a handful of classic ASP pages calling a legacy VB6 DLL.
Background:
A couple days prior to the start of crashes, we upgraded to .NET 4.6. We have the new RyuJIT enabled (the default setting) and we have installed all updates to address the critical compiler issue described here: http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx.
We had also deployed a new version of our web code (as we do several times per week). Naturally we double-checked the code changes for any potential crash vulnerabilities, but none of our changes seem vulnerable to infinite loops, recursive stack overflows, or high memory usage -- the normal culprits when w3wp.exe crashes with an unhandled exception.
Sometimes the crash affects one web server within minutes of another, but other times only one web server is affected.
Things I've tried:
- Restarted the servers and installed all Windows Updates.
- Analyzed the IIS logs to see if any suspicious/bad requests are coming in just before the crashes. I couldn't find any pattern -- all the requests look normal.
- Enabled automatic crash minidumps for w3wp.exe (as described at https://msdn.microsoft.com/en-us/library/bb787181.aspx) and analyzed them using WinDbg. Unfortunately the CLR "stack trace of interest" does not show anything useful, just a couple empty GC frames not related to our code:
> 0:026> !clrstack
> OS Thread Id: 0x1ff0 (26)
> Child SP IP Call Site
> 2321f96c 771bdf8c [GCFrame: 2321f96c]
> 2321f9a4 771bdf8c [GCFrame: 2321f9a4]
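For the IIS log step, the check amounts to pulling every request logged in a short window before each crash time. A rough Python sketch of that idea follows; the function name, the 60-second window, and the sample rows are all illustrative assumptions, and it expects W3C-format logs with a `#Fields:` header that includes `date` and `time`.

```python
from datetime import datetime, timedelta


def requests_before_crash(log_lines, crash_time, window_seconds=60):
    """Return W3C log rows stamped within window_seconds before crash_time."""
    fields = []
    hits = []
    window_start = crash_time - timedelta(seconds=window_seconds)
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns of the rows that follow it.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        if "date" not in row or "time" not in row:
            continue
        stamp = datetime.strptime(
            row["date"] + " " + row["time"], "%Y-%m-%d %H:%M:%S"
        )
        if window_start <= stamp <= crash_time:
            hits.append(row)
    return hits


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from our real logs.
    sample = [
        "#Fields: date time cs-method cs-uri-stem sc-status",
        "2015-10-07 08:40:10 GET /a.aspx 200",
        "2015-10-07 08:40:55 GET /b.aspx 302",
        "2015-10-07 08:42:00 GET /c.aspx 200",
    ]
    for row in requests_before_crash(sample, datetime(2015, 10, 7, 8, 41, 0)):
        print(row["time"], row["cs-uri-stem"], row["sc-status"])
```

Two caveats with this approach: IIS writes W3C timestamps in UTC by default while Event Viewer shows local time, so the crash time has to be converted first, and a request that was still in flight when the process died may never have been written to the log at all.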
Any ideas?
Update:
We have rolled back .NET 4.6 and recent Windows Updates on our web servers. We have been monitoring this for either 2 or 3 days, depending on when the server was rolled back, and in each case, there have been zero subsequent crashes, despite maintaining the same application code. This pretty definitively proves that either .NET 4.6 or the other Windows Updates caused the intermittent crashing, not our code, because w3wp.exe was previously crashing several times per day.
We are now trying to prove this to Microsoft Support, but it's an uphill battle because the issue was random, intermittent, and we could not reproduce it reliably. (They have provided a dump analysis but it seems to be a red herring.) We are also in the process of reapplying the updates in groups and waiting several days to watch for crashes, in an effort to isolate the faulty update. Obviously this is a tedious process.
Update #2:
We've now re-applied all the pre-.NET 4.6 Windows Updates that were removed in troubleshooting, and the servers have been running for several days without crashes. The only things left to re-apply are .NET 4.6 and its own updates, but my management is understandably reluctant to install things that will likely cause crashes in production. So I'm continuing to work with MS to analyze different crash dumps to pinpoint the problem.