Introduction
First of all, I've tried all recommendations from C# DNS-related SO threads and other internet articles - messing with ServicePointManager/ServicePoint settings, setting automatic request connection close via HTTP headers, changing connection lease times - nothing helped. It seems like all those settings are intended for fixing DNS issues in long-running processes (like web services). It even makes sense if a process would have it's own DNS cache to minimize DNS queries or OS DNS cache reading. But it's not my case.
The problem
Our production infrastructure uses HA (high availability) DNS for swapping server nodes during maintenance or functional problems. And it's built in a way that in some places we have multiple CNAME-records which in fact point to the same HA A-record like that:
- eu.site1.myprodserver.com (CNAME) > eu.ha.myprodserver.com (A)
- eu.site2.myprodserver.com (CNAME) > eu.ha.myprodserver.com (A)
The TTL of all these records is 60 seconds. So when the European node is in trouble or maintenance, the A-record switches to the IP address of some other node.
Then we have a monitoring utility which is executed once in 5 minutes and uses both site1 and site2. For it to work properly both names must point to the same DC, because data sync between DCs doesn't happen that fast. Since both CNAMEs are in fact linked to the same A-record with short TTL at a first glance it seems like nothing can go wrong. But it turns out it can.
The utility is written in C# for .NET Framework 4.7.2 and uses HttpClient class for performing requests to both sites. Yeah, it's him again.
We have noticed that when a server node switch occurs the utility often starts acting as if site1 and site2 were in different DCs. The pattern of its behavior in such moments is strictly determined, so it's not like it gets confused somewhere in the middle of the process - it incorrecly resolves one or both of these addresses from the very start.
I've made another much simpler utility which just sends one GET-request to site1 and then started intentionally switching nodes on and off and running this utility to see which DC would serve its request. And the results were very frustrating.
Despite the Windows DNS cache already being updated (checked via ipconfig
and Get-DnsClientCache
cmdlet) and despite the overall records' TTL of 60 seconds the HttpClient keeps sending requests to the old IP address sometimes for another 15-20 minutes. Even when I've completely shut down the "outdated" application server - the utility kept trying to connect to it, so even connection failures don't wake it up.
It becomes even more frustrating if you start running ipconfig /flushdns
in between utility runs. Sometimes after flushdns the utility realizes that the IP has changed. But as soon as you make another flushdns (or this is even not needed - I haven't 100% clearly figured this out) and run the utility again - it goes back to the old address! Unbelievable!
And add even more frustration. If you resolve the IP address from within the same utility using Dns.GetHostEntry method (which uses cache as per this comment) right before calling HttpClient, the resolve result would be correct... But the HttpClient would anyway make a connection to an IP address of seemengly his own independent choice. So HttpClient somehow does not seem to rely on built-in .NET Framework DNS resolving.
So the questions are:
- Where does a newly created .NET Framework process take those cached DNS results from?
- Even if there is some kind of a mystical global .NET-specific DNS cache, then why does it absolutely ignore TTL?
- How is it possible at all that it reverts to the outdated old IP address after it has already once "understood" that the address has changed?
P.S. I have worked this all around by implementing a custom HttpClientHandler which performs DNS queries on each hostname's first usage thus it's independent from external DNS caches (except for caching at intermediate DNS servers which also affects things to some extent). But that was a little tricky in terms of TLS certificates validation and the final solution does not seem to be production ready - but we use it for monitoring only so for us it's OK. If anyone is interested in this, I'll show the class code which somewhat resembles this answer's example.
Update 2021-10-08
The utility works from behind a corporate proxy. In fact there are multiple proxies for load balancing. So I am now also in process of verifying this:
- If the DNS resolving is performed by the proxies and they don't respect the TTL or if they cache (keep alive) TCP connections by hostnames - this would explain the whole problem
- If it's possible that different proxies handle HTTP requests on different runs of the utility - this would answer the most frustrating question #3
Update 2021-10-15
The answer to "Does .NET Framework has an OS-independent global DNS cache?" is NO. HttpClient class or .NET Framework in general had nothing to do with all of this. Posted my investigation results as an accepted answer.
Dns.GetHostAddresses
, which calls out the operating system. It's probably an intermediate resolver which is caching and not adhering to the TTL – PunctualityDns.GetHostEntry
method from within the utility, and it returned correct (actual) results - just like Windows utilities did. I'll have a look at whetherDns.GetHostAddresses
returns something incorrect. – PermissibleDns.GetHostAddresses
returns a single correct IP address as soon as HA-DNS switches it. Meanwhile HttpClient keeps using the old one. – Permissible