.NET Core - FileSystemWatcher on a network drive crashes in unsafe Win32 API code

I have a small .NET Core program (3.1.8) with some FileSystemWatchers. They watch folders on a network drive. Under some load (200 to 250 files at most), the program crashes unexpectedly. These files arrive at the same time, moved by another process on another server via a BizTalk app. I don't think that's relevant here, but I wanted to mention it.

The FileSystemWatcher initialization:

private void InitializeInnerFilewatcher(List<string> filters)
{
    _watcher = new FileSystemWatcher(WatchPath);
    _watcher.InternalBufferSize = 65536;
    if (filters.Count > 1)
    {
        _watcher.Filter = FILTER_ALL; // *.*
        _customFilters = filters;
    }
    else
        _watcher.Filter = filters.First();
    _watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName;
    _watcher.Changed += new FileSystemEventHandler(FileCreatedOrChanged);
    _watcher.Created += new FileSystemEventHandler(FileCreatedOrChanged);
    _watcher.Renamed += new RenamedEventHandler(FileRenamed);
    _watcher.Error += Watcher_Error;
    _watcher.EnableRaisingEvents = true;
}

And here is the "process" part that runs for each event raised by the watcher:

private void TryHandle(FileSystemEventArgs arg)
{
    if (!File.Exists(arg.FullPath))
        return;

    if (!_customFilters.Any() || _customFilters.Any(x => PatternMatcher.MatchPattern(x, arg.Name)))
        _memoryCache.AddOrGetExisting(arg.FullPath, arg, _cacheItemPolicy);
}

I tried to avoid doing any real work inside the file system event handlers, so I only push the file path into the MemoryCache; later I send it to a ServiceBus queue so that any consumer can process the file.
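To make that debounce idea concrete, here is a minimal sketch, assuming the cache is `System.Runtime.Caching.MemoryCache` (which the `AddOrGetExisting` call suggests). The `PathDebouncer` class, the `quietPeriod` parameter and the `onSettled` callback are hypothetical names for illustration, not the actual program; the ServiceBus send would go inside the callback.

```csharp
using System;
using System.Runtime.Caching;

// Hypothetical helper: collapses repeated events for the same path into one callback.
public sealed class PathDebouncer
{
    private readonly MemoryCache _cache = new MemoryCache("pending-files");
    private readonly TimeSpan _quietPeriod;
    private readonly Action<string> _onSettled;

    public PathDebouncer(TimeSpan quietPeriod, Action<string> onSettled)
    {
        _quietPeriod = quietPeriod;
        _onSettled = onSettled;
    }

    // Returns true if the path was newly queued, false if it was already pending.
    public bool Push(string fullPath)
    {
        var policy = new CacheItemPolicy
        {
            // The entry expires once the quiet period has elapsed.
            AbsoluteExpiration = DateTimeOffset.UtcNow.Add(_quietPeriod),
            RemovedCallback = args =>
            {
                // Only forward entries that expired, not ones removed explicitly.
                if (args.RemovedReason == CacheEntryRemovedReason.Expired)
                    _onSettled(args.CacheItem.Key);
            }
        };

        // AddOrGetExisting returns null when the key was not present yet.
        return _cache.AddOrGetExisting(fullPath, fullPath, policy) == null;
    }
}
```

A watcher handler would then just call `debouncer.Push(arg.FullPath)`. Note that MemoryCache evicts expired entries on its own schedule, so the callback may fire noticeably later than the quiet period.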

All of this seems to work fine throughout the day, with no high CPU or memory usage. We already log all the application metrics in Application Insights.

It's a "real" crash, so we don't have any logs, only a bare event in the Event Viewer and a dump file.

Event Viewer: Faulting module name: coreclr.dll, version: 4.700.20.41105, time stamp: 0x5f3397ec

Thanks to dotnet-dump, we can see the error caught in the dump file:

> clrstack
OS Thread Id: 0xfd4c (27)
        Child SP               IP Call Site
00000022D55BE150 00007ffccc46789f [FaultingExceptionFrame: 00000022d55be150]
00000022D55BE650 00007FFC6D7A49D4 System.IO.FileSystemWatcher.ParseEventBufferAndNotifyForEach(Byte[]) [/_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.Win32.cs @ 249]
00000022D55BE6F0 00007FFC6D7A48E6 System.IO.FileSystemWatcher.ReadDirectoryChangesCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [/_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.Win32.cs @ 242]
00000022D55BE750 00007FFC6D6F189C System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00000022D55BE7C0 00007FFC6D7359B5 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [/_/src/System.Private.CoreLib/src/System/Threading/Overlapped.cs @ 59]
00000022D55BE8F0 00007ffccc336ba3 [GCFrame: 00000022d55be8f0]
00000022D55BEAB0 00007ffccc336ba3 [DebuggerU2MCatchHandlerFrame: 00000022d55beab0]
> pe
Exception object: 000001e580001198
Exception type:   System.ExecutionEngineException
Message:          <none>
InnerException:   <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 80131506

As you can see, the error seems to happen directly inside FileSystemWatcher, in the code that parses the Win32 event buffer. I can't reproduce it, and it happens only in our production environment, so needless to say I'm in "emergency mode".

WinDbg is maybe a bit more detailed:

[Screenshot of the WinDbg dump analysis, not reproduced here]

Rules answered 14/9, 2020 at 9:26 Comment(12)
Now that Microsoft is coding everything unsafe, "real crashes" will be much more frequent ... github.com/dotnet/runtime/blob/… You should post an issue to dotnet/runtime. - Multifold
@SimonMourier That .NET Core implementation is broadly the same as the .NET Framework implementation, which also uses unsafe code. - Canoe
@MatthewWatson It seems to be. But the code is a bit different (I actually made the comparison), and I have the same code (the NuGet package targets .NET Standard) used in a .NET Framework app that has been running for more than a year without any crash. - Rules
@SimonMourier Thanks, I already did. I'm just trying SO as another avenue; I tried everything on my side and didn't find anything. - Rules
It does look like a bug in the library - error 80131506 is a generic "something went wrong" error! - Canoe
@MatthewWatson I actually expected this kind of answer. I think it is too, but I don't know which way to go to "fix" it. Do I have to rebuild this app on .NET Framework? That would be a bit sad. Do I have to build my own FileSystemWatcher (polling with timers and handling everything I can to reproduce the watcher's behavior)? Any suggestions? - Rules
If I were you, and couldn't wait for Microsoft, then since the source code is available I would definitely recompile this class and check what's really going wrong. Maybe it's your context that causes this error, and this could help you find out whether you can do something about it. - Multifold
@SimonMourier As I said, unfortunately I can't reproduce it in my non-production environments, even if I try with 1000+ files at the same time. That's also the point of this thread: if one of you has an idea for a workflow to "force" this error, I'll be very happy to try it. - Rules
@Rules You may look at the .NET Core sources to see what happens in the linked .cs files. - Sahara
@PavelAnikhouski Already tried. I read all the source code but can't find the reason for my issue. It seems to be the file events buffer that triggers an AccessViolationException or something like it. But I can't understand how that buffer can become "unavailable" from my app. PS: the .NET Core source browser seems to be broken; it doesn't display the .Win32.cs version. - Rules
@Rules ExecutionEngineException can be thrown by the FailFast method. From my experience, it might be invoked from an Assert method or something like that, when the CLR detects something really wrong. - Sahara
@PavelAnikhouski Thanks for the tips. The stack trace seems to show the loop that enumerates all available file system events in the buffer. I just added a WinDbg dump analysis that shows the "AccessViolationException". The only possibility I can imagine is that the server has its memory swapped out for some weird reason, and the buffer allocated at the start of the FileSystemWatcher is no longer "available". - Rules

Just a quick update, since I'm still working on a fix.

I opened a Microsoft Support case. After many attempts, we finally managed to reproduce it. We had to "play" with the network and simulate some "disturbances". It seems the FileSystemWatcher events were not delivered as they should have been (they are sent over TCP, via SMB). Our team is still trying to find out how that can happen.

MS agreed that this shouldn't crash the FileSystemWatcher inside unsafe code, whether or not there was a real network issue. So they just opened a PR to add some safeguards around it.

I'm still following the PR, but it should be fixed in .NET 5 and backported to .NET Core 3.1(.9).
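As a stopgap until a patched runtime is deployed, the polling alternative raised in the comments could be sketched roughly as follows. This is only an illustration, not what we run in production: `PollingScanner` and `Scan` are hypothetical names, and the sketch only detects new or rewritten files (no rename or delete events).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical polling fallback: each Scan() reports files that are new
// or whose last write time changed since the previous scan.
public sealed class PollingScanner
{
    private readonly string _path;
    private readonly string _filter;
    private Dictionary<string, DateTime> _seen =
        new Dictionary<string, DateTime>(StringComparer.OrdinalIgnoreCase);

    public PollingScanner(string path, string filter = "*.*")
    {
        _path = path;
        _filter = filter;
    }

    public IReadOnlyList<string> Scan()
    {
        var changed = new List<string>();
        var current = new Dictionary<string, DateTime>(StringComparer.OrdinalIgnoreCase);

        foreach (var file in Directory.EnumerateFiles(_path, _filter))
        {
            var stamp = File.GetLastWriteTimeUtc(file);
            current[file] = stamp;

            // New file, or same file written again since the last scan.
            if (!_seen.TryGetValue(file, out var previous) || previous != stamp)
                changed.Add(file);
        }

        // Replacing the snapshot also forgets files that have disappeared.
        _seen = current;
        return changed;
    }
}
```

`Scan()` would typically be called from a `System.Threading.Timer` callback, with each returned path going through the same cache-and-queue step as the watcher events. The trade-off is latency and extra directory listings over SMB, in exchange for never entering the native event buffer code path.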

Thanks for the help.

Rules answered 21/9, 2020 at 7:35 Comment(0)

This issue has been fixed in master (6.0) and backported to 5.0 and 3.1.

Joelynn answered 25/9, 2020 at 18:31 Comment(0)
