I'm instrumenting my .NET 4.5 applications to emit ETW events using the EventSource
class. The goal is to be able to capture some of these events (the Error level events) for error logging.
After doing some reading and testing, I am concerned about the reliability of this approach to error logging, specifically regarding the possibility of dropped or missing events. If my error logging isn't working I need the application to shut down (in my case it is unsafe for it to run with unreported errors). When using ETW and EventSource
, how can I be certain that my errors are getting properly recorded?
Obviously part of the answer will depend on what is listening to the events. In my case I plan to use the "Semantic Logging Application Block" from the latest MS Enterprise Library.
Here is one source where Microsoft talks about the possible causes of missed events: About Event Tracing
There they list these possible causes of Missing Events
The total event size is greater than 64K. This includes the ETW header plus the data or payload. A user has no control over these missing events since the event size isconfigured by the application.
The ETW buffer size is smaller than the total event size. A user has no control over these missing events since the event size is configured by the application logging the events.
For real-time logging, the real-time consumer is not consuming events fast enough or is not present altogether and then the backing file is filling up. This can result if the Event Log service is stopped and started when events are being logged. A user has no control over these missing events.
When logging to a file, the disk is too slow to keep up with the logging rate.
To see if these concerns were somehow mitigated using the EventSource class (like, does it truncate large payloads somehow) I did some testing. I tried to log long strings and it failed for me between 30,000 and 35,000 characters (right in line with the 64KB maximum event payload). It just silently does nothing from what I can tell for the too-large strings, no events at all in my Semantic Logging Application Block log. Events before and after were written like usual.
So anytime I have a string in my payload I have to pass it through some truncator? Will I need to manually avoid generating events "too fast" as well (and how would that be possible)?
Microsoft Patterns and Practices are supposed to lead us to good ... patterns and practices ... so maybe I'm just missing something here.
Update:
Well apparently there is some notice in the consuming application for the "Events too fast" condition. I received this today for the first time:
Level : Warning, Message : Some events will be lost because of buffer overruns or schema synchronization delays in trace session: Microsoft-SemanticLogging-Etw-svcRuntime
And then when closing the session:
Level : Warning, Message : The loss of 1 events was detected in trace session 'Microsoft-SemanticLogging-Etw-svcRuntime'.
Update2:
The Enterprise Library Developers Guide describes the behavior I just mentioned.
You should monitor the log messages generated by the Semantic Logging Application Block for any indication that the have buffers overflowed and that you have lost messages. For example, log messages with event ids 900 and 901 indicate that a sink’s internal buffers have overflowed; in the out-of-process scenario, event ids 806 and 807 indicate that the ETW buffers have overflowed. You can modify the buffering configuration options for the sinks to reduce the chance that the buffers overflow with your typical workloads.
My question remains, can I use semantic logging while ensuring my application does not run if errors are being dropped? Normal tracing events could be dropped ...
My current thought is to log "critical" errors with a separate class using old-fashioned logging techniques, and keep less critical errors (as well as debug type events) going through the ETW pipeline. That wouldn't be too bad really ... I might post that as a solution if I can't find a better suggestion.
Update 3:
The "missing events" warning that I received had nothing to do with buffer overruns, turns out this is the message you get if you pass a null string
as a payload value.
AggregateException
from aParallel.Foreach
, for example, but even that's unlikely. At this point I am logging "errors" to two sources ... out of process ETW and a traditional logging pipeline that is in process and blocks until the message is written. Altogether this is fine ... and I might be able to do all of that with the Semantic Logging block (using an out of process component for various logging/tracing, and an in-process event listener for actual Errors). – Exert