Edit: I started asking this as a PowerShell / .Net question and couldn't find any reference to it on the internet. With feedback, it appears to be a WINAPI issue so this is an edit/rewrite/retag, but much of the tests and background reference .Net for that reason.
Summary
WINAPI ping function IcmpSendEcho2
appears to have a timing bug if the ping timeout parameter is set below 1000ms (1 second). This causes it to return intermittent false timeout errors. Rather than a proportional "lower timeout = more fails" behaviour, it appears to be a cutoff of >=1000ms+ is expected behaviour, <=999ms triggers false timeouts, often in an alternating success/fail/success/fail pattern.
I call them false timeouts because I have WireShark packet capture showing a reply packet coming back well within the timeout, and partly because the 1ms change shouldn't be a significant amount of time when the replies normally have 500-800ms of headroom, and partly because I can run two concurrent sets of pings with different timeouts and see different behavior between the two.
In the comments of my original .Net question, @wOxxOm has:
- located the Open Source .Net code where System.Net.NetworkInformation.Ping() wraps the WinAPI and there appears to be no specific handling of timeouts there, it's passed down to the lower layer directly - possibly line 675 with a call to
UnsafeNetInfoNativeMethods.IcmpSendEcho2()
and @Lieven Keersmaekers has investigated and found things beyond my skill level to interpret:
"I can second this being an underlying WINAPI problem. Following a success and timedout call into
IPHLPAPI!IcmpSendEcho2Ex
: the000003e7
parameter is on the stack, both set up an event and both return intoIPHLPAPI!IcmpEchoRequestComplete
with the difference of the success call's eax register containing00000000
and the timedout call's eax register containing00000102
(WAIT_TIMEOUT)"Compiling a 64bit C# version, there's no more calls into
IPHLPAPI
. A consistent thing that shows up isclr.dll GetLastError()
returningWSA_QOS_ADMISSION_FAILURE
for timeouts. Also consistent in my sample is the order of thread executions between a success and a timeout call being slightly different."- This StackOverflow question hints that the
WSA_QOS_ADMISSION_FAILURE
might be a misleading error, and is actuallyIP_REQ_TIMED_OUT
.
- This StackOverflow question hints that the
Testing steps:
Pick a distant host and set some continuous pings running. (The IP in my examples belongs to Baidu.cn (China) and has ping replies around ~310ms to me). Expected behaviour for all of them: almost entirely ping replies, with occasional drops due to network conditions.
PowerShell / .Net, with 999ms timeout, actual result is bizarre reply/drop/reply/drop patterns, far more drops than expected:
$Pinger = New-Object -TypeName System.Net.NetworkInformation.Ping
while($true) {
$Pinger.Send('111.13.101.208', 999)
start-sleep -Seconds 1
}
command prompt ping.exe with 999ms timeout, actual result is more reliable (edit: but later findings call this into question as well):
ping 111.13.101.208 -t -w 999
PowerShell / .Net, with 1000ms timeout, actual result is as expected:
$Pinger = New-Object -TypeName System.Net.NetworkInformation.Ping
while($true) {
$Pinger.Send('111.13.101.208', 1000)
start-sleep -Seconds 1
}
It's repeatable with C# as well, but I've edited that code out now it seems to be a WinAPI problem.
Example screenshot of these running side by side
- On the left, .Net with 999ms timeout and 50% failure.
- Center, command prompt, almost all replies.
- On the right, .Net with 1000ms timeout, almost all replies.
Earlier investigations / Background
I started with a 500ms timeout, and the quantity of fake replies seems to vary depending on the ping reply time of the remote host:
- pinging something 30ms away reports TimedOut around 1 in 10 pings
- pinging something 100ms away reports TimedOut around 1 in 4 pings
- pinging something 300ms away reports TimedOut around 1 in 2 pings
From the same computer, on the same internet connection, sending the same amount of data (32 byte buffer) with the same 500ms timeout setting, with no other heavy bandwidth use. I run no antivirus networking filter outside Windows 10 defaults, two other people I know have confirmed this frequent TimedOut behaviour on their computers (now two more have in the comments), with more or fewer timeouts, so it ought not to be my network card or drivers or ISP.
WireShark packet capture of ping reply which is falsely reported as a timeout
I ran the ping by hand four times to a ~100ms away host, with a 500ms timeout, with WireShark capturing network traffic. PowerShell screenshot:
WireShark screenshot:
Note that the WireShark log records 4 requests leaving, 4 replies coming back, each with a time difference of around 0.11s (110 ms) - all well inside the timeout, but PowerShell wrongly reports the last one as a timeout.
Related questions
Googling shows me heaps of issues with System.Net.NetworkInformation.Ping
but none which look the same, e.g.
- System.Net.NetworkInformation.Ping crashing - it crashes if allocated/destroyed in a loop in .Net 3.5 because its internals get wrongly garbage collected. (.Net 4 here and not allocating in a loop)
- Blue screen when using Ping - 6+ years of ping being able to BSOD Windows (not debugging an Async ping here)
- https://github.com/dotnet/corefx/issues/15989 - it doesn't timeout if you set a timeout to 1ms and a reply comes back in 20ms, it still succeeds. False positive, but not false negative.
The documentation for Ping() warns about the low-timeout-might-still-say-success but I can't see it warns that timeout might falsely report failure if set below 1 second.
Edit: Now that I'm searching for ICMPSendEcho2, I have found exactly this problem documented before in a blog post in May 2015 here: https://www.frameflow.com/ping-utility-flaw-in-windows-api-creating-false-timeouts/ - finding the same behavior, but having no further explanation. They say that ping.exe
is affected, when I originally thought it wasn't. They also say:
- "In our tests we could not reproduce it on Windows Server 2003 R2 nor on Windows Server 2008 R2. However it was seen consistently in Windows Server 2012 R2 and in the latest builds of Windows 10."
Why? What's wrong with the timeout handling that makes it ignore ping repsonses coming into the network stack completely? Why is 1000ms a significant cutoff?
ping
invocation chain in the source code doesn't reveal anything suspicious so the bug is probably in the underlying WINAPI Icmp6SendEcho2. – ReliquaryIPHLPAPI!IcmpSendEcho2Ex
: the000003e7
parameter is on the stack, both set up an event and both return intoIPHLPAPI!IcmpEchoRequestComplete
with the difference of the success call'seax
register containing00000000
and the timedout call's eax register containing00000102
(WAIT_TIMEOUT) – HarcourtIPHLPAPI
. A consistent thing that shows up isclr.dll GetLastError()
returningWSA_QOS_ADMISSION_FAILURE
for timeouts. Also consistent in my sample is the order of thread executions between a success and a timeout call being slightly different. All nice and well but it brings us nothing closer to a solution or a reason why. – HarcourtIP_REQ_TIMED_OUT
but it doesn't offer a solution other than "I'm afraid that only some hardcore kernel debugging could reveal the real cause.". I now how to start a usermode debugging session and sometimes find some nuggets but kernelmode debugging (other than livekd) is beyond me. Ideally, some real WinDbg experts chime in (Thomas, Blabb, Hans, Raymond, ...) and get interested enough trying to find a cause (at least it's reproducable!). – Harcourt