Intermittent crash of w3wp.exe with ThreadAbortException after .NET 4.6 upgrade

For the last couple days we have seen intermittent crashes of the w3wp.exe worker process serving the main application pool for our corporate web site. Sometimes the crashes are isolated, and IIS is able to restart the worker process successfully. But if more than 5 crashes happen in 5 minutes, IIS Rapid Fail Protection kicks in and stops the application pool. Here is an example entry from the Application event log just before the crash:

An unhandled exception occurred and the process was terminated.
Application ID: /LM/W3SVC/2/ROOT
Process ID: 3640
Exception: System.Threading.ThreadAbortException
Message: Thread was being aborted.
StackTrace:    at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)

Immediately after the crash due to the ThreadAbortException, there is a more serious event logged:

Faulting application name: w3wp.exe, version: 8.0.9200.16384, time stamp: 0x5010885f
Faulting module name: KERNELBASE.dll, version: 6.2.9200.17366, time stamp: 0x554d16f6
Exception code: 0xe0434352
Fault offset: 0x00010192
Faulting process id: 0xe38
Faulting application start time: 0x01d100dc662652d6
Faulting application path: C:\Windows\SysWOW64\inetsrv\w3wp.exe
Faulting module path: C:\Windows\SYSTEM32\KERNELBASE.dll
Report Id: db5b0d5b-6cd0-11e5-9418-005056900458
Faulting package full name: 
Faulting package-relative application ID: 

Now, a ThreadAbortException should never cause w3wp.exe to crash, seeing as it is thrown every time a standard Response.Redirect() is performed. MSDN confirms this, and I also confirmed it with a simple test. However, at least one other person has seen a similar crash recently with a similar environment: Thread.Abort in ASP.NET app causes w3wp.exe to crash. (But that may be an unrelated issue.)
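A minimal sketch of the kind of test mentioned above. It is not the actual test, which ran inside ASP.NET. It assumes a console app targeting .NET Framework 4.x, because Thread.Abort throws PlatformNotSupportedException on .NET Core and later. The class name AbortTest and the 500 ms delay are illustrative choices.

```csharp
using System;
using System.Threading;

// Hypothetical console test; targets .NET Framework 4.x
// (Thread.Abort is not supported on .NET Core / .NET 5+).
internal static class AbortTest
{
    private static void Main()
    {
        // Worker that blocks until it is aborted; it deliberately has no try/catch,
        // so the ThreadAbortException goes unhandled on that thread.
        var worker = new Thread(() => Thread.Sleep(Timeout.Infinite));
        worker.Start();

        Thread.Sleep(500);   // give the worker time to start blocking
        worker.Abort();      // raises ThreadAbortException on the worker thread
        worker.Join();       // wait for the worker to finish unwinding

        // Reached only if the unhandled ThreadAbortException did not tear down the process.
        Console.WriteLine("Worker state: " + worker.ThreadState);
        Console.WriteLine("Process is still running.");
    }
}
```

Replacing the worker body with a plain `throw new Exception("foo");` gives the contrasting case: an ordinary unhandled exception on a manually created thread, which does terminate the process.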

Our environment:

  • Corporate web site with shopping cart and partner web services; targets .NET 4.5. (900,000+ lines of custom code including business logic DLLs and in-house libraries.)
  • 2 VMware web servers in a load-balanced pool using Windows NLB
  • IIS 8.0 / Windows Server 2012 Standard / .NET 4.6.00081
  • App pool running in 32-bit mode because we have to support a handful of classic ASP pages calling a legacy VB6 DLL.

Background:

A couple days prior to the start of crashes, we upgraded to .NET 4.6. We have the new RyuJIT enabled (the default setting) and we have installed all updates to address the critical compiler issue described here: http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx.

We had also deployed a new version of our web code (as we do several times per week). Naturally we double-checked the code changes for any potential crash vulnerabilities, but none of our changes seem vulnerable to infinite loops, recursive stack overflows, or high memory usage -- the normal culprits when w3wp.exe crashes with an unhandled exception.

Sometimes the crash affects one web server within minutes of another, but other times only one web server is affected.

Things I've tried:

  • Restarted the servers and installed all Windows Updates.
  • Analyzed the IIS logs to see if any suspicious/bad requests are coming in just before the crashes. I couldn't find any pattern -- all the requests look normal.
  • Enabled automatic crash minidumps for w3wp.exe (as described at https://msdn.microsoft.com/en-us/library/bb787181.aspx) and analyzed them using WinDbg. Unfortunately the CLR "stack trace of interest" does not show anything useful, just a couple empty GC frames not related to our code:
> 0:026> !clrstack
> OS Thread Id: 0x1ff0 (26)
> Child SP       IP Call Site
> 2321f96c 771bdf8c [GCFrame: 2321f96c]
> 2321f9a4 771bdf8c [GCFrame: 2321f9a4]

Any ideas?

Update:

We have rolled back .NET 4.6 and recent Windows Updates on our web servers. We have been monitoring this for either 2 or 3 days, depending on when the server was rolled back, and in each case, there have been zero subsequent crashes, despite maintaining the same application code. This pretty definitively proves that either .NET 4.6 or the other Windows Updates caused the intermittent crashing, not our code, because w3wp.exe was previously crashing several times per day.

We are now trying to prove this to Microsoft Support, but it's an uphill battle because the issue was random, intermittent, and we could not reproduce it reliably. (They have provided a dump analysis but it seems to be a red herring.) We are also in the process of reapplying the updates in groups and waiting several days to watch for crashes, in an effort to isolate the faulty update. Obviously this is a tedious process.

Update #2:

We've now re-applied all the pre-.NET 4.6 Windows Updates that were removed in troubleshooting, and the servers have been running for several days without crashes. The only things left to re-apply are .NET 4.6 and its own updates, but my management is understandably reluctant to install things that will likely cause crashes in production. So I'm continuing to work with MS to analyze different crash dumps to pinpoint the problem.

Brietta answered 7/10, 2015 at 19:55 Comment(17)
Are you manually starting any threads in your site code? - Couchman
@Couchman Yes, our code has been doing that to parallelize certain medium-length API calls for several years now. But it's never been a problem, and that part of the code hasn't changed recently. - Brietta
An exception in any thread not associated with an HTTP request will tear down the process. I bet it has nothing to do with .NET 4.6, that may be a coincidence. You shouldn't spin up your own threads. Depending on how long the tasks are you may be able to use Task-based Asynchronous Programming, or move to some other method of running that code in the background. See Phil Haack and Scott Hanselman's blog posts. - Couchman
@Couchman In general I agree that we shouldn't spin up our own threads. But we have a use case where we want to simultaneously call multiple different APIs and strictly control the number of threads used (one per partner, typically just a few dozen at once) and the duration (around 30 seconds). So for this we like the fine control that manual threading gives us vs. thread-pool-backed implementations like Tasks. In any case, if one of our user threads was being manually aborted, wouldn't the crash dump stack trace show that? I guess I will try to reproduce this scenario. - Brietta
@Couchman Actually, according to this MSDN reference, if the exception is a ThreadAbortException, the process will not terminate, and the CLR will just terminate the thread gracefully: msdn.microsoft.com/en-us/library/ms228965%28v=vs.110%29.aspx. And I verified this with a small test app: If the exception in the thread was just a normal unhandled new Exception("foo"), it crashed w3wp.exe. But if it was a ThreadAbortException caused by a manual Thread.Abort(), the process didn't crash. - Brietta
I wouldn't really trust your test, as there's likely something you're leaving out. Instead, read over the information in Phil Haack's blog about what you can do to associate a thread with an HTTP request, or find a better way than spinning up your own threads. - Couchman
@Couchman Happy to continue this in chat, but those blog posts don't really contradict MSDN. They're just leaving out a special case of ThreadAbortExceptions on spawned Threads, which do not crash the process. Here is a very simple test proving that MSDN is correct: pastebin.com/dtzkE3gG. So in reference to my question: 1) ThreadAbortExceptions on spawned threads should not be causing the issue; and 2) Even if they were, I should see a stack trace pointing at user code, which I don't. - Brietta
Have you tried to disable the RyuJIT on the whole machine? We've had funny problems with this. - Shalloon
There is an error in 4.6, do not use it if possible. nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46 - Legislate
@SimonMourier Have you seen those problems even after installing the update that is supposed to fix the RyuJIT issue? The update is KB3083184/5/6 depending on the version of Windows you're running. - Brietta
Absolutely. We've seen a specific problem after all known updates were installed. Note it was only with optimisations on (release compilation). Disabling the RyuJIT fixed it immediately. It's just a registry key to set/unset to test this. - Shalloon
@SimonMourier Wow, interesting. I'm aware of that registry key and I will ask our admins if they can disable it on one of our web servers. Then we can see if that web server still randomly crashes. - Brietta
Disabling the RyuJIT didn't help :( - Brietta
Hmm, just realized I overlooked that your w3wp.exe is running as 32-bit process (syswow64) and your crash dump offsets show 32-bit mem addresses... so why were you using RyuJIT? And are you sure you analyzed your crash dump correctly? Do you have a place you can post it so we could analyze it? (I understand if security reasons preclude this). Can you run your web app as 64-bit, and does that help? (Uncheck enable 32-bit applications in IIS app pool) - Interfluve
@Interfluve Technically I said RyuJIT was enabled (based on the default .NET 4.6 behavior) not that we were actively using it. Actually I didn't realize that RyuJIT only works for 64-bit processes, so thank you for pointing that out. In any case, we can't disable 32-bit because we have a handful of classic ASP pages calling a legacy VB6 DLL in the same app pool. - Brietta
@Interfluve I'm using WinDbg to analyze the crash dumps, but I don't know the tool very well. I sent the dumps to MS Support, who sent back an analysis of a single dump that pointed to infinite recursion happening in the markup rendering (not code-behind) of a Master page being used by a specific page on the site. But that page takes no user input, and the URL worked fine whenever we hammered it. I believe the problem is not just that page but is randomly affecting other pages that also have no custom recursive code. MS is not providing a similar detailed analysis of any other dump... - Brietta
Here is a link to the dump analysis that MS gave us: pastebin.com/G74sxhT3. - Brietta

@Jordan Rieger, this bug should be fixed in .NET 4.6.1. Can you please confirm whether the problem is fixed in the new framework, or whether it still persists? Thanks.

Orchitis answered 9/3, 2016 at 22:31 Comment(1)
It appears that .NET 4.6.1 did address the issue as we have had it installed for several weeks without encountering this problem. Rolling back from .NET 4.6 to 4.5 also fixed it for us temporarily, but I am happy to now be on the latest stable version. - Brietta

You didn't show any code, but the evidence suggests this is an issue with your application code, and not with .NET 4.6 or with ThreadAbortException specifically.

Basic troubleshooting steps here: you said there were code changes AND environment changes; so rule one of them out.

  • Test the app on a VM with .NET 4.5 installed. If you do not get the error, .NET 4.6 may be the cause.

  • Test an older version of your app on the same server. If the issue does not appear, the code change is the likely cause.

  • Test the app on a machine with VS.NET installed, attach to the w3wp.exe process, and debug it (Tools > Attach to Process). Catch the ThreadAbortException and trace through it.

  • If you don't already, you should be logging the event that your w3wp.exe process stops, though this obviously will not handle all exceptions. Google this, but this guy describes a solution that I also use.

  • If you don't already, define an Application_Error handler in Global to log exceptions. Microsoft demonstrates this. Create a System.Web.Configuration option that you can toggle in your web.config file to enable different levels of logging, including writing to a local file and writing to the Windows event logs, for example. You can also install a logging handler tool like Elmah.

  • Create a barebones web app and test Response.Redirect to verify whether it crashes w3wp.exe (the worker process) with .NET 4.6. I did this, and it didn't, so I suspect your code. Or possibly a weird emergent issue at the server/patch level. These steps should help you pinpoint it.
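The two logging steps above could look roughly like this. It is a sketch under assumptions, not a drop-in implementation: the class name Global and the use of System.Diagnostics.Trace as the log sink are placeholders for whatever logging the application already has.

```csharp
using System;
using System.Diagnostics;
using System.Web;

// Hypothetical Global.asax.cs code-behind; the class name and the
// Trace-based logging are assumptions made for illustration.
public class Global : HttpApplication
{
    protected void Application_Start(object sender, EventArgs e)
    {
        // Last-chance logging for exceptions on threads that are not tied to a request.
        // This only records the failure; it cannot stop the process from terminating.
        AppDomain.CurrentDomain.UnhandledException += (s, args) =>
            Trace.TraceError("Unhandled exception (terminating: {0}): {1}",
                args.IsTerminating, args.ExceptionObject);
    }

    protected void Application_Error(object sender, EventArgs e)
    {
        // Exceptions raised while processing a request arrive here.
        Exception ex = Server.GetLastError();
        if (ex != null)
        {
            Trace.TraceError("Request error: {0}", ex);
        }
    }
}
```

Application_Error only sees request-bound exceptions, so the AppDomain.UnhandledException subscription is what would capture a failure on a manually started thread before the worker process exits.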

Side note: Even though it shouldn't affect the app process, I recommend fixing the Response.Redirect() issues. We did this recently in an Enterprise app, and yes it was a change of wide scope, but we no longer get the ThreadAbortExceptions. The fix is simple: pass false as the second argument, i.e. Response.Redirect(url, false); and then make sure that no code runs after that call (call return, for example). This post explains.
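A minimal sketch of that redirect pattern in a Web Forms code-behind. The page name CheckoutPage, the authentication check, and the target URL are hypothetical. The CompleteRequest() call is an addition that is commonly paired with this overload, not something the answer above prescribes.

```csharp
using System;
using System.Web.UI;

// Hypothetical code-behind; the page name, the condition, and the URL are illustrative.
public partial class CheckoutPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        if (!User.Identity.IsAuthenticated)
        {
            // endResponse: false avoids the Response.End() call that throws ThreadAbortException.
            Response.Redirect("~/Login.aspx", false);

            // Ask ASP.NET to bypass the remaining pipeline events and go to EndRequest.
            Context.ApplicationInstance.CompleteRequest();
            return; // nothing else in this handler should run after the redirect
        }

        // Normal page logic continues here.
    }
}
```

Because the response is no longer ended by an exception, later page lifecycle code may still run after Page_Load returns, so any such code needs to tolerate a redirect already having been issued.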

Interfluve answered 14/10, 2015 at 15:25 Comment(10)
We reverted to .NET 4.5 on one of our web servers yesterday (but still using our latest code). So far, that server has not crashed -- a strong indication that .NET 4.6 is to blame, but I can't say for sure until it goes longer without a crash, because the crash is random and impossible to reproduce on demand. We have provided crash dumps to Microsoft Support, but their analysis seems unhelpful so far. Response.Redirect() is likely not related to our issue because the CLR stack trace in the dumps is pointing at an infinite loop in control rendering code. - Brietta
Do you have a recursive function in one of your applications? I experienced the same problem a few months back and found that the actual problem was in my own code, not the server (i.e. an unexpected condition resulted in an endless loop). nothingisnecessary's answer seems to be correct. - Putrescible
Our web application, including business logic DLLs and in-house libraries, is over 900,000 lines of code. It does contain a small amount of recursive code for certain specific tasks, but that code is well tested, has not changed recently, and is not running randomly on every web request. - Brietta
So after rolling back .NET 4.6 (along with a slew of unrelated Windows Updates) on all servers, there have been zero crashes. The servers have been rolled back for between 2 and 3 days. Our application code is unchanged. This pretty definitively proves that either .NET 4.6 or the other Windows Updates caused the intermittent crashing, and not our code, because w3wp.exe was previously crashing several times per day. We are now trying to prove this to Microsoft Support, but it's an uphill battle because the issue was random, intermittent, and we could not reproduce it reliably. - Brietta
Thanks for the detailed info. Agree that your evidence may suggest an emergent issue with that combination of environments (.NET 4.6 and Server 2012). We're also looking to move to RyuJIT and 4.6 for a 64-bit web app and so I wanted to be aware of the cause of this issue beforehand. However, we have not noticed this problem in our test environments, and so I wonder if introducing more load will trigger it. I'll keep you posted. - Interfluve
I'm running into something similar on Server 2008R2. It's not just on ThreadAbortExceptions as far as I can tell; any time any unhandled exception is thrown by the application (DotNetNuke in this case), it's bringing down the entire AppPool. - Waverley
Sounds like a bug in DotNetNuke; an unhandled exception in the thread belonging to a Request should not terminate the Application. Try debugging the Global Application_Error handler to make sure DNN is not doing something stupid in this case. Or post in DotNetNuke bug forums. - Interfluve
@Waverley have you tried rolling back .NET 4.6 to .NET 4.5.x? Or have you tried reproducing the problem by just manually throwing an exception in a specific page? You may be able to isolate it further. (In my case, which I believe is different from yours, it did not reproduce.) Or maybe there is an IIS/ASP.NET setting that controls this type of exception handling behavior. - Brietta
@Jordan I was hesitant to revert since this would involve manually reinstalling 4.5.x again, which if I'm not mistaken would have taken the websites offline for the duration. There was an MS hotfix that fixed the issue for me. See my answer here: #32335520 - Waverley
@Waverley Interesting. I will see if we can schedule a time to install and test this hotfix. It seems to be a slightly different issue than mine, although possibly related, because in my case the app pool is 32-bit and therefore not running RyuJIT. Also, in my case a simple ThreadAbortException does not trigger the issue; in fact I cannot reproduce it at all except in production. - Brietta

4.6 is unstable ( http://nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46/ ), revert back to 4.5.x if possible.

Legislate answered 10/10, 2015 at 16:33 Comment(2)
We are looking at reverting to 4.5.x or disabling RyuJIT, but according to Microsoft, they have addressed the issue found by Nick Craver and Marc Gravell, and as I mentioned in my question, we have that update installed: blogs.msdn.com/b/dotnet/archive/2015/07/28/…. - Brietta
Addressed, but there is currently no proof that 4.6 is stable, i.e. free of other issues. - Legislate
