WinDbg : Hunting exceptions that have caused a .net service to crash
Asked Answered
S

3

8

Have been having issues with an x64 dotnet service crashing intermittently on an application server. The service can run for hours, days, or weeks with no issue, but then fall over with not much info.

The service is run in a cluster (3 x service each server) across two servers - and crashes have been seen of any of the services on both servers. A replicated environment shows the same behaviour so I've 'exhausted' the idea of a configuration issue.

The error originally pulled from the application server's event log was:

Error message from event log on server XXXX

Application: MySvc.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an internal error in the .NET Runtime
at IP 000007FEEFD8CD4C (000007FEEFC70000) with exit code 80131506

This doesn't show much detail and the best pointers I've found online is to 'cross fingers'...

Application Crashes With "Internal Error In The .NET Runtime"

http://www.jamesewelch.com/2010/09/30/troubleshooting-internal-error-in-the-net-runtime/

Finally after a month of running with AdPlus debugger attached we have a string of failures and some crash dumps. Now that I have the dumps I'm having trouble gaining anything of use from them.

I've previously investigated a few 'hang' dumps with plenty of success and read much of Tess Ferrandez's blogs among others, but the 'crash' dumps I have are proving to be deadends. Most objects, exceptions etc all marked for garbage collection & only the main thread remaining - I'm probably missing something.

I'll add the detail from !analyze -v and also the dump logs - which do show exceptions.

So - the real question on this is : can someone give me some pointers of where to head from here. The exceptions in the dump log don't match what I can see in the actual dump.

DUMP 1 Log avail here: http://pastebin.com/Eg5YCqww

DUMP 1 Analyze: (I have a symbols issue I can't resolve..)

0:000> !analyze -v
***
FAULTING_IP: 
+112c9440
00000000`00000000 ??              ???

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 0000000000000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD:  00000000000011f8

PROCESS_NAME:  MySvc.exe

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid

MOD_LIST: <ANALYSIS/>

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

MANAGED_STACK: 
(TransitionMU)
000000000022EBB0 000007FEF40CB1AB System_ServiceProcess_ni!DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr)+0x3b
000000000022EC70 000007FEF40CD20D System_ServiceProcess_ni!System.ServiceProcess.ServiceBase.Run(System.ServiceProcess.ServiceBase[])+0x26d
000000000022EDA0 000007FF00170227 MySvc!Ax.Remoting.MySvc.Main()+0x107
(TransitionUM)

MANAGED_STACK_COMMAND:  _EFN_StackTrace

BUGCHECK_STR:  APPLICATION_FAULT_WRONG_SYMBOLS_FILL_PATTERN_ffffffff

PRIMARY_PROBLEM_CLASS:  WRONG_SYMBOLS_FILL_PATTERN_ffffffff

DEFAULT_BUCKET_ID:  WRONG_SYMBOLS_FILL_PATTERN_ffffffff

LAST_CONTROL_TRANSFER:  from 000007fefd8810ac to 000000007760f6fa

STACK_TEXT:  
00000000`0022e818 000007fe`fd8810ac : 00000000`007541f0 000007fe`f40ce089 00000000`0022e9c0 00000000`00000000 : ntdll!ZwWaitForSingleObject+0xa
00000000`0022e820 000007fe`fe7daffb : 00000000`ffffffff 000007fe`fe7d344c 00000000`00000000 00000000`0000032c : KERNELBASE!WaitForSingleObjectEx+0x79
00000000`0022e8c0 000007fe`fe7d9d61 : 00000000`01d47ff0 00000000`0000032c 00000000`00000000 00000000`00000000 : sechost!ScSendResponseReceiveControls+0x13b
00000000`0022e9b0 000007fe`fe7d9c16 : 00000000`0022eb18 00000000`00000000 00000000`00000000 000007fe`00000000 : sechost!ScDispatcherLoop+0x121
00000000`0022eac0 000007fe`f19017c7 : 00000000`11213890 00000000`01d635c0 00000000`00000000 00000000`00000000 : sechost!StartServiceCtrlDispatcherW+0x14e
00000000`0022eb10 000007fe`f40cb1ab : 00000000`01d63680 00000000`0022ebe8 000007fe`f40a5b50 0000bf6c`4589127e : clr!DoNDirectCall__PatchGetThreadCall+0x7b
00000000`0022ebb0 000007fe`f40cd20d : 00000000`01d63680 00000000`00000000 00000000`01d63698 00000000`00000000 : System_ServiceProcess_ni+0x2b1ab
00000000`0022ec70 000007ff`00170227 : 00000000`10ff1ac8 00000000`10ff1af0 00000000`10ff1af0 00000000`10ff1af0 : System_ServiceProcess_ni+0x2d20d
00000000`0022eda0 000007fe`f196dc54 : 00000000`0022ee80 000007fe`f1904e65 ffffffff`fffffffe 00000000`0022f3a0 : 0x7ff`00170227
00000000`0022ee30 000007fe`f196dd69 : 000007ff`000551f8 00000000`00000001 00000000`00000000 00000000`00000000 : clr!CallDescrWorker+0x84
00000000`0022ee70 000007fe`f196dde5 : 00000000`0022ef88 00000000`00000000 00000000`0022ef90 00000000`0022f168 : clr!CallDescrWorkerWithHandler+0xa9
00000000`0022eef0 000007fe`f1a214c5 : 00000000`00000000 00000000`0022f178 00000000`00000000 00000000`00000000 : clr!MethodDesc::CallDescr+0x2a1
00000000`0022f120 000007fe`f1a215fc : 00000000`000ad7c0 00000000`000ad7c0 00000000`00000000 00000000`00000000 : clr!ClassLoader::RunMain+0x228
00000000`0022f370 000007fe`f1a213b2 : 00000000`0022f970 00000000`00000200 00000000`000b7a80 00000000`00000200 : clr!Assembly::ExecuteMainMethod+0xac
00000000`0022f620 000007fe`f1ac6d66 : 00000000`00000000 00000000`10fd0000 00000000`00000000 00000000`00000000 : clr!SystemDomain::ExecuteMainMethod+0x452
00000000`0022fbd0 000007fe`f1ac6c83 : 00000000`10fd0000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!ExecuteEXE+0x43
00000000`0022fc30 000007fe`f1a2c515 : 00000000`000ad7c0 ffffffff`ffffffff 00000000`00000000 00000000`00000000 : clr!CorExeMainInternal+0xc4
00000000`0022fca0 000007fe`f8973309 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`0022fc88 : clr!CorExeMain+0x15
00000000`0022fce0 000007fe`f8a05b21 : 000007fe`f1a2c500 000007fe`f89732c0 00000000`00000000 00000000`00000000 : mscoreei!CorExeMain+0x41
00000000`0022fd10 00000000`773bf56d : 000007fe`f8970000 00000000`00000000 00000000`00000000 00000000`00000000 : mscoree!CorExeMain_Exported+0x57
00000000`0022fd40 00000000`775f2cc1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`0022fd70 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d


STACK_COMMAND:  ~0s; .ecxr ; kb

FOLLOWUP_IP: 
sechost!ScSendResponseReceiveControls+13b
000007fe`fe7daffb 85c0            test    eax,eax

SYMBOL_STACK_INDEX:  2

SYMBOL_NAME:  sechost!ScSendResponseReceiveControls+13b

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: sechost

IMAGE_NAME:  sechost.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  4a5be05e

FAILURE_BUCKET_ID:  WRONG_SYMBOLS_FILL_PATTERN_ffffffff_80000003_sechost.dll!ScSendResponseReceiveControls

BUCKET_ID:  X64_APPLICATION_FAULT_WRONG_SYMBOLS_FILL_PATTERN_ffffffff_sechost!ScSendResponseReceiveControls+13b

UPDATE 1 (29 Dec):

Reconstructed one of the CLR exceptions from the dump log, call stack follows. Looks like exception occurs when calling the db (via ODAC)

clr!RaiseTheExceptionInternalOnly+0x363
clr!IL_Throw+0x146
gm.a(System.String, System.String, Int32, System.String, XXBase, Int32, XXDataParameter[])
gm.b(XXBase, XXBase, Boolean, Boolean, Boolean, Int32)
gm.b(XXBase, XXBase)
od.a(XXGridQueue, TaskStatus, ProcessResult, Int32, Int32, Int32)
od.b(XXGridQueue)
he.b(XXBaseCollection)
he.a(Boolean ByRef)
XX.MySvc.tmr_Elapsed(System.Object)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

Reconstructed the Access violation exception call stack. Error is thrown when calling the garbage collector after the ODAC library is called.

(1330.1074): Access violation - code c0000005 (first chance)
FirstChance_av_AccessViolation

clr!WKS::gc_heap::plan_phase+0x5ac
clr!WKS::gc_heap::gc1+0xbb
clr!WKS::gc_heap::garbage_collect+0x276
clr!WKS::GCHeap::GarbageCollectGeneration+0x14e
clr!WKS::gc_heap::try_allocate_more_space+0x25f
clr!WKS::GCHeap::Alloc+0x7e
clr!FastAllocatePrimitiveArray+0xc5
clr!JIT_NewArr1+0x389
System.Decimal.GetBits(System.Decimal)
Oracle.DataAccess.Types.DecimalConv.GetDecimal(IntPtr)
Oracle.DataAccess.Client.OracleDataReader.GetDecimal(Int32)
Oracle.DataAccess.Client.OracleDataReader.GetValue(Int32)
Oracle.DataAccess.Client.OracleDataReader.GetValues(System.Object[])
jr.a(System.Data.IDataReader, Boolean, ku, Boolean, DbTypeEnum, System.Type[])
ls.a(System.Data.IDataReader, Boolean, ku, Boolean, DbTypeEnum, System.Type[])
ba.a(System.String, System.Data.IDataReader, Boolean, ku, Boolean, System.Type[])
...
XX.MySvc.tmr_Elapsed(System.Object)

Possible similar issues (from new info): http://markmail.org/message/yy3mvbngula4i3mu#query:+page:1+mid:l546gn5sfxtxxm5i+state:results http://social.msdn.microsoft.com/Forums/en/clr/thread/33920b39-690c-42c8-b04a-0f1f7176835a

UPDATE 2 (23 Feb):

The ODAC components were upgraded to the correct version for Dotnet 4.0 (or that listed as compatible on the Oracle website) and the issue has still re-occurred. It is still re-occuring in a very intermittent fashion, every one or two weeks. The service that is occurs on is cycled every day.

Have some more dumps from the most recent crashes, and these still point to heap corruption - although are not full dumps (Access violation). Actually it seems creating the full dump failed.

Creating d:\dumps\2xx_Crash_Mode\FULLDUMP_FirstChance_epr_Process_Shut_Down_MySvc.exe__0344.dmp - mini user dump
WriteFullMemory.Memory.Read(0x262c000, 0x1000) failed 0x8007012b, ABORT.
Dump creation failed, Win32 error 0n299
    "Only part of a ReadProcessMemory or WriteProcessMemory request was completed."

Additionally a custom managed (dotnet) library is loaded into the application, and this too seems to be throwing an exception, although it is only a 'first chance', and doesn't seem to cause process failure (I'm guessing it may be a factor though). It's actually our library too, so I'm able to verify that it's not calling managed code. The error is:

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 000007fefcffaa7d (KERNELBASE!RaiseException+0x0000000000000039)
ExceptionCode: c0000006 (In-page I/O error)
ExceptionFlags: 00000000
NumberParameters: 3
Parameter[0]: 0000000000000000
Parameter[1]: 000000006d34aca0
Parameter[2]: 00000000c00000c4
Inpage operation failed at 000000006d34aca0, due to I/O error 00000000c00000c4

PROCESS_NAME:  MySvc.exe

ERROR_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

EXCEPTION_OBJECT: !pe 1a8106a8
Exception object: 000000001a8106a8
Exception type:   System.Runtime.InteropServices.SEHException
Message:          External component has thrown an exception.
InnerException:   <none>
StackTrace (generated):
SP               IP               Function
000000002C77B980 0000000000000000 ...
000000002C77BA50 000007FF01DCBA51 ...

StackTraceString: <none>
HResult: 80004005

MANAGED_OBJECT: !dumpobj 148306f8
Name:        System.String
MethodTable: 000007feed9a6870
EEClass:     000007feed52ed58
Size:        112(0x70) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      External component has thrown an exception.
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000000000000000  4000103        8         System.Int32  1 instance               43 m_stringLength
0000000000000000  4000104        c          System.Char  1 instance               45 m_firstChar
000007feed9a6870  4000105       10        System.String  0   shared           static Empty
                             >> Domain:Value  00000000002a69f0:NotInit  000000000dd738d0:NotInit  <<

EXCEPTION_MESSAGE:  External component has thrown an exception.

MANAGED_OBJECT_NAME:  System.Runtime.InteropServices.SEHException

MANAGED_STACK_COMMAND:  !pe 1a8106a8

LAST_CONTROL_TRANSFER:  from 000007fef47e8fc1 to 000007fefcffaa7d

ADDITIONAL_DEBUG_TEXT:  Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[PSEUDO_THREAD] ; Followup set based on attribute [ip_is_call_value_Arch_si] from Frame:[23] on thread:[162c]

FAULTING_THREAD:  ffffffffffffffff

BUGCHECK_STR:  APPLICATION_FAULT__SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_APPLICATION_FAULT_CALL

PRIMARY_PROBLEM_CLASS:  _SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_CALL

DEFAULT_BUCKET_ID:  _SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_CALL

STACK_TEXT:  
00000000`2c77b980 00000000`00000000 ...
00000000`2c77ba50 00000000`ffffffff ...

Any one with any ideas of how to pursue this further in an expedient fashion. Am keen to get some more full dumps - but of course need to find answers sooner than the next failure!!

Sasha answered 29/12, 2011 at 13:3 Comment(6)
Tess's blog is the place to look for information. Did you not get anything from it?Roosevelt
Are you using any component/DLL with unmanaged code ?Earnest
This is almost certainly going to be unmanaged code screwing up your memory space. It'll probably be a right pain to track down I'm afraid. Have you updated a dll from a third party, or altered any of your own unmanaged code recently? Does it run happily on a 32bit system?Sub
You may find that engaging directly with Microsoft Support may help: after all, this is where Tess used to work, so there's a fair chance of reaching someone equally skilled. It could be ... (shudder) a bug in the runtime.Astrix
Hi all - thanks for the comments. This is what I feared - I've been recreating some of the stack traces on exceptions per Tess's blog but it not making much sense...Sasha
@Sasha from what I gather I would bet that some unsafe / native code is playing havoc on your memory management... see blogs.msdn.com/b/tess/archive/2010/01/14/… for steps to take to narrow it down... another angle would be some interaction with bad drivers (esp. devices using DMA).Earnest
S
0

The cause of your crash (break point hit) indicates a heap corruption in the process. Heap corruption failures are reported by heap management functions by issuing a debug break.

Judging from the logged errors, .net runtime is unprepared to handle these (i might be wrong and there might be a better explanation). The usual way to track heap corruptions are enabling (full) page heap which helps locate the offending component by crashing the process closer to the point of corruption.

Hunting down a heap corruption is a real pain to say the least, but if the memory consumption allows it, i'd go with full page heap being the most effective for apps with moderate memory requirements.

Hope it helps.

Sandpit answered 29/12, 2011 at 15:52 Comment(2)
thank you. Additionally there is a good blog on this here: blogs.msdn.com/b/tess/archive/2006/02/09/…Sasha
Interesting - use ODAC 11.1.0.7 / x64 / .NET 4 - looks like from the documentation this should be the newer 11.2.0.1.2 version - but can only find one example (above link) of this causing similar errors.Sasha
M
0

The GC of x64 .NET 4.0 has a bug. It could be that you are affected by this. MS recommends to disable concurrent GC until they get a hotfix out. Alternatively you could enable server GC to get one GC thread per core which is possible if you have more than one core.

Otherwise the server gc flag will have no effect.

Here is the link to the KB article.

Mcavoy answered 8/3, 2012 at 7:40 Comment(0)
C
0

Couple of things 1. Make sure you are running latest version of clr 2. For native heap corruption pageheap is a good option and for managed may be you can try GCStress How to turn GCStress on in Windows 7? 3.To validate heap corruption on managed heap you can use verifyheap which is part of SOS https://msdn.microsoft.com/en-us/library/bb190764(v=vs.110).aspx "VerifyHeap Checks the garbage collector heap for signs of corruption and displays any errors found. Heap corruptions can be caused by platform invoke calls that are constructed incorrectly. "

Commend answered 26/4, 2017 at 6:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.