Risk of missing events from ETW logging with EventSource

Asked 22/6, 2013 at 0:48 Answered 21/3, 2016 at 16:42

I'm instrumenting my .NET 4.5 applications to emit ETW events using the EventSource class. The goal is to be able to capture some of these events (the Error level events) for error logging.

After doing some reading and testing, I am concerned about the reliability of this approach to error logging, specifically regarding the possibility of dropped or missing events. If my error logging isn't working I need the application to shut down (in my case it is unsafe for it to run with unreported errors). When using ETW and EventSource, how can I be certain that my errors are getting properly recorded?

Obviously part of the answer will depend on what is listening to the events. In my case I plan to use the "Semantic Logging Application Block" from the latest MS Enterprise Library.

Here is one source where Microsoft talks about the possible causes of missed events: About Event Tracing

There they list these possible causes of Missing Events

The total event size is greater than 64K. This includes the ETW header plus the data or payload. A user has no control over these missing events since the event size isconfigured by the application.

The ETW buffer size is smaller than the total event size. A user has no control over these missing events since the event size is configured by the application logging the events.

For real-time logging, the real-time consumer is not consuming events fast enough or is not present altogether and then the backing file is filling up. This can result if the Event Log service is stopped and started when events are being logged. A user has no control over these missing events.

When logging to a file, the disk is too slow to keep up with the logging rate.

To see if these concerns were somehow mitigated using the EventSource class (like, does it truncate large payloads somehow) I did some testing. I tried to log long strings and it failed for me between 30,000 and 35,000 characters (right in line with the 64KB maximum event payload). It just silently does nothing from what I can tell for the too-large strings, no events at all in my Semantic Logging Application Block log. Events before and after were written like usual.

So anytime I have a string in my payload I have to pass it through some truncator? Will I need to manually avoid generating events "too fast" as well (and how would that be possible)?

Microsoft Patterns and Practices are supposed to lead us to good ... patterns and practices ... so maybe I'm just missing something here.

Update:

Well apparently there is some notice in the consuming application for the "Events too fast" condition. I received this today for the first time:

Level : Warning, Message : Some events will be lost because of buffer overruns or schema synchronization delays in trace session: Microsoft-SemanticLogging-Etw-svcRuntime

And then when closing the session:

Level : Warning, Message : The loss of 1 events was detected in trace session 'Microsoft-SemanticLogging-Etw-svcRuntime'.

Update2:

The Enterprise Library Developers Guide describes the behavior I just mentioned.

You should monitor the log messages generated by the Semantic Logging Application Block for any indication that the have buffers overflowed and that you have lost messages. For example, log messages with event ids 900 and 901 indicate that a sink’s internal buffers have overflowed; in the out-of-process scenario, event ids 806 and 807 indicate that the ETW buffers have overflowed. You can modify the buffering configuration options for the sinks to reduce the chance that the buffers overflow with your typical workloads.

My question remains, can I use semantic logging while ensuring my application does not run if errors are being dropped? Normal tracing events could be dropped ...

My current thought is to log "critical" errors with a separate class using old-fashioned logging techniques, and keep less critical errors (as well as debug type events) going through the ETW pipeline. That wouldn't be too bad really ... I might post that as a solution if I can't find a better suggestion.

Update 3:

The "missing events" warning that I received had nothing to do with buffer overruns, turns out this is the message you get if you pass a null string as a payload value.

Exert answered 22/6, 2013 at 0:48 Comment(5)

Isn't it better to log only small messages, and send larger messages to a file using some other mechanism, of your own creation? It seems that "large messsages" and "high performance" are contradictory goals. I was wondering what kind of errors need such large messages. – Bugger 11/7, 2013 at 17:43

Probably nothing would be that big ... I mean I have seen "pretty big" stack traces, especially with an AggregateException from a Parallel.Foreach, for example, but even that's unlikely. At this point I am logging "errors" to two sources ... out of process ETW and a traditional logging pipeline that is in process and blocks until the message is written. Altogether this is fine ... and I might be able to do all of that with the Semantic Logging block (using an out of process component for various logging/tracing, and an in-process event listener for actual Errors). – Exert 11/7, 2013 at 22:46

I should say too that I ultimately probably need a non ETW in-process component for error handling to properly respond to transient error conditions and implement re-try and cooldown logic. So its not too much extra burden to have some extra error logging in that context as well. It means some errors are created and maintained twice in code (as ETW events and also in my other Error Logging classes) but oh well. – Exert 11/7, 2013 at 22:52

I think you are correct in concluding that the verbose messages are best in ETW, where you can use interesting tools to analyze them, while the less common, critical errors must be at least duplicated in a reliable log. The first class of messages might be considered performance or extra deep analysis messages, while the second are errors. It does seem the ETW could be used for both, although not in the same channel. – Bugger 11/7, 2013 at 22:57

Your thinking has helped me in my own project... I have a similar situation. – Bugger 11/7, 2013 at 23:0

The EventSource class comes in two versions, one included with the .NET Framework and another in the NuGet package Microsoft EventSource Library. I assume that you use the NuGet package because it contains newer code.

The constructor for the EventSource base class has an overload that takes a boolean argument throwOnEventWriteErrors with the following documentation (NuGet package version 1.0.26.0):

By default calling the 'WriteEvent' methods do NOT throw on errors (they silently discard the event). This is because in most cases users assume logging is not 'precious' and do NOT wish to have logging failures crash the program. However for those applications where logging is 'precious' and if it fails the caller wishes to react, setting 'throwOnEventWriteErrors' will cause an exception to be thrown if WriteEvent fails. Note the fact that EventWrite succeeds does not necessarily mean that the event reached its destination only that operation of writing it did not fail.

Unfortunately, the last sentence contains a caveat emptor but if you look into the source code for the EventSource you can see that the underlying return codes from the OS calls are used to throw different exceptions for NoFreeBuffers and EventTooBig (and other errors).

So if you turn on throwOnEventWriteErrors you will get exceptions if the EventSource class is unable to deliver the event to ETW. However, if ETW fails for another reason you wont get any exception but if you ensure that your ETW channels are configured correctly that should rarely if ever happen. However, as you cannot tolerate losing any error events you should probably test extreme error cases to make sure that ETW behaves as you expect.

Yurikoyursa answered 21/8, 2014 at 7:13 Comment(3)

Thanks Martin ... I was not using the nuget package at the time because there were no differences... I was not aware that the eventsource class was updated with this parameter, and will start using the nuget package and be satisfied with this error logging for most situations ... perhaps a few exceptional situations will require extra attention but this should be more to my liking for usual logging ... I consider logging to be pretty precious indeed in almost everything I do hah ... maybe because so much of it is unattended windows services and the like – Exert 21/8, 2014 at 17:7

@TCC: If you start using the NuGet package you will perhaps also use the EventRegister tool to create the ETW manifest during build. Unfortunately, the size parameters for the channels are not really suitable for high volume/big packet scenarios where you will lose data. I have added an XSLT transformation to my build to modify the channels in the machine generated manifest because writing a manifest by hand is tedious and error prone. – Yurikoyursa 21/8, 2014 at 17:47

Note that the EventSource protected constructor overload with the throwOnEventWriteErrors is available without the NuGet package as of .NET 4.5.1. – Essary 15/6, 2015 at 20:48

One thing that there are two important points that are not made clear in the discussion above.

ALL the issues associated with dropped events have to do with ETW (Event Tracing for Windows), not EventSource. That is logically EventSOurces talk to EventListeners, and there is built-in listener that forwards to ETW. Obviously when you talk about dropped events, the constraint of ANY link in the chain will affect data flowing through the chain. Thus one way of guaranteeing complete reliability is to use an EventListener that does not use ETW but goes directly to wherever you want the data to go. I believe that the (Semantic Logging Application Block) has such listener.
ETW has been successfully used to reliably forward events, but you do have to live within the constraints mentioned above (the size of the events has to be kept < 64K and you have to keep the event rate under control. Note that if the rate is too high you will know this because the WriteEvent will fail, so you can retry (after pausing), and thus make something that fully reliable (at the expense of slowing down the program). Note that this kind of data loss is just not an interesting issue if you are truly talking about errors (which should not be happening at a huge rate, and if they are happening at a high rate, they are likely to be redundant (the same thing firing rapidly).

So in conclusion EventSource supports reliable events by default, ETW does not support it by default but can be made to support it, but often, ETW's defaults are more than fine.

Flam answered 21/3, 2016 at 16:42 Comment(1)

Thanks for an authoritative viewpoint and the clarification, Vance. I haven't had any problems with ETW or EventSource in production code with their default settings. – Exert 22/3, 2016 at 4:7

-1

Try to take a look at Semantic Log (MS Enterprise Library 6) http://msdn.microsoft.com/en-us/library/dn440729(v=pandp.60).aspx

You can use Event Source and create listener to record your log to the event viewer or a file or db ( or create a custom solution )

Update: I catch Event ID 806 / 807 even on an IoC scenario. In an interceptor, there was a part of code that instantiates my EventSource class: if you miss the reference of first instance all the others failed on constructor and raise event ids 806 / 807 when write events

For logging big data it is possible to apply message splitting tecniques

Peale answered 17/1, 2014 at 10:30 Comment(1)

yes i specifically mention in my question that i am using that library – Exert 17/1, 2014 at 16:45

Recommended topics

Hot tags