Does .NET Framework have an OS-independent global DNS cache?

Introduction

First of all, I've tried all the recommendations from C# DNS-related SO threads and other internet articles - messing with ServicePointManager/ServicePoint settings, forcing connections to close via HTTP headers, changing connection lease times - and nothing helped. All those settings seem to be intended for fixing DNS issues in long-running processes (like web services). It would even make sense for such a process to have its own DNS cache to minimize DNS queries or reads of the OS DNS cache. But that's not my case.

The problem

Our production infrastructure uses HA (high availability) DNS for swapping server nodes during maintenance or functional problems. It's built in such a way that in some places we have multiple CNAME records which in fact point to the same HA A-record, like this:

  • eu.site1.myprodserver.com (CNAME) > eu.ha.myprodserver.com (A)
  • eu.site2.myprodserver.com (CNAME) > eu.ha.myprodserver.com (A)

The TTL of all these records is 60 seconds. So when the European node is in trouble or maintenance, the A-record switches to the IP address of some other node.

Then we have a monitoring utility which is executed once every 5 minutes and uses both site1 and site2. For it to work properly both names must point to the same DC, because data sync between DCs doesn't happen that fast. Since both CNAMEs are in fact linked to the same A-record with a short TTL, at first glance it seems like nothing can go wrong. But it turns out it can.

The utility is written in C# for .NET Framework 4.7.2 and uses the HttpClient class for performing requests to both sites. Yeah, it's that one again.

We have noticed that when a server node switch occurs the utility often starts acting as if site1 and site2 were in different DCs. The pattern of its behavior in such moments is strictly determined, so it's not like it gets confused somewhere in the middle of the process - it incorrectly resolves one or both of these addresses from the very start.

I've made another much simpler utility which just sends one GET-request to site1 and then started intentionally switching nodes on and off and running this utility to see which DC would serve its request. And the results were very frustrating.

Despite the Windows DNS cache already being updated (checked via ipconfig and the Get-DnsClientCache cmdlet) and despite the 60-second TTL of all the records, HttpClient keeps sending requests to the old IP address, sometimes for another 15-20 minutes. Even when I completely shut down the "outdated" application server, the utility kept trying to connect to it, so even connection failures don't wake it up.

It becomes even more frustrating if you start running ipconfig /flushdns in between utility runs. Sometimes after flushdns the utility realizes that the IP has changed. But as soon as you run flushdns again (or maybe that isn't even needed - I haven't figured this out with 100% certainty) and run the utility again - it goes back to the old address! Unbelievable!

And to add even more frustration: if you resolve the IP address from within the same utility using the Dns.GetHostEntry method (which uses the cache as per this comment) right before calling HttpClient, the resolve result is correct... But HttpClient still connects to an IP address of seemingly its own independent choice. So HttpClient somehow does not seem to rely on the built-in .NET Framework DNS resolving.

So the questions are:

  1. Where does a newly created .NET Framework process take those cached DNS results from?
  2. Even if there is some kind of a mystical global .NET-specific DNS cache, then why does it absolutely ignore TTL?
  3. How is it possible at all that it reverts to the outdated old IP address after it has already once "understood" that the address has changed?

P.S. I have worked around all of this by implementing a custom HttpClientHandler which performs DNS queries on each hostname's first usage, so it's independent of external DNS caches (except for caching at intermediate DNS servers, which also affects things to some extent). But that was a little tricky in terms of TLS certificate validation, and the final solution does not seem to be production ready - but we use it for monitoring only, so for us it's OK. If anyone is interested in this, I'll show the class code, which somewhat resembles this answer's example.
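For readers who want the general shape of such a workaround, here is a minimal Python sketch (not the C# handler mentioned above). It resolves the name once, connects to the chosen IP, and still passes the original hostname to TLS so that SNI and certificate validation keep working. The function names are hypothetical, and for brevity the lookup goes through the OS resolver; a truly cache-independent version would need to query a DNS server directly.

```python
import socket
import ssl


def resolve_ipv4(hostname: str, port: int = 443) -> list[str]:
    """Return the IPv4 addresses the OS resolver gives for hostname, without duplicates."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses: list[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def connect_pinned(hostname: str, ip: str, port: int = 443,
                   timeout: float = 5.0) -> ssl.SSLSocket:
    """Open a TLS connection to a specific IP while validating the cert for hostname."""
    raw = socket.create_connection((ip, port), timeout=timeout)
    try:
        context = ssl.create_default_context()
        # server_hostname drives both SNI and hostname verification,
        # so connecting by IP does not weaken certificate checks.
        return context.wrap_socket(raw, server_hostname=hostname)
    except Exception:
        raw.close()
        raise
```

Resolving once and reusing the same IP for both site names within one run would be one way to guarantee that both requests hit the same DC.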

Update 2021-10-08

The utility works from behind a corporate proxy. In fact there are multiple proxies for load balancing. So I am now also in the process of verifying this:

  • If the DNS resolving is performed by the proxies and they don't respect the TTL or if they cache (keep alive) TCP connections by hostnames - this would explain the whole problem
  • If it's possible that different proxies handle HTTP requests on different runs of the utility - this would answer the most frustrating question #3

Update 2021-10-15

The answer to "Does .NET Framework have an OS-independent global DNS cache?" is NO. The HttpClient class and .NET Framework in general had nothing to do with any of this. I've posted my investigation results as an accepted answer.

Permissible answered 6/10, 2021 at 11:26 Comment(10)
DNS is generally handled by the OS, not by .NET. - Abortionist
@Abortionist Oh, I wish this was always true. As I said in the post, I did verify the Windows DNS client cache and it worked as expected. At least that's the way ipconfig and Get-DnsClientCache showed it to me: the TTL of records in the Windows cache has always been <= the TTL at the DNS servers. And as soon as the DNS servers changed the IP address, Windows immediately fetched the changes. - Permissible
If you follow the code down, you get to Dns.GetHostAddresses, which calls out to the operating system. It's probably an intermediate resolver which is caching and not adhering to the TTL. - Punctuality
@Punctuality as I said in the post, I have also tested calling the Dns.GetHostEntry method from within the utility, and it returned correct (actual) results - just like the Windows utilities did. I'll have a look at whether Dns.GetHostAddresses returns something incorrect. - Permissible
@Punctuality checked it. Dns.GetHostAddresses returns a single correct IP address as soon as HA-DNS switches it. Meanwhile HttpClient keeps using the old one. - Permissible
This is a nice post with great detail of the problem, but I don't see how your comment about DNS resolution happening by a proxy would be correct! However, in my limited experience, I have seen proxies messing up a lot of things and creating unnecessary problems like the one you listed above. - Sportswear
@Sportswear thanks for your kind words! In fact my main guess is the second one - that the proxies keep TCP connections alive and then reuse them, and that they key this cache by hostname. If that's true, then DNS resolution only takes place when a connection is established, which obviously happens less often than once per minute. - Permissible
@StanislavBakharev - Exactly! I am almost convinced that there is a possibility of the proxy appliance maintaining HTTP keep-alive connections as you said. In my opinion, these types of problems definitely have nothing to do with the implementation of DNS in a Windows environment. Is it possible for you to try reproducing this problem by somehow removing the proxy from the scenario? - Sportswear
@Sportswear yes, I did try it without the proxy and the issue does not reproduce. I've even found some indirect evidence of the proxy servers messing things up. And I am now actively cooperating with our proxy server admins to figure out what exactly is going on there. I'll post an update (and hopefully a detailed accepted answer to my own post) as soon as something shows up. - Permissible
This is a truly interesting question, one I had never thought of; I have not yet encountered a real-world problem like this. - Adon

HttpClient, please forgive me! It was not your fault!

Well, this investigation was huge. And I'll have to split the answer into two parts, since there turned out to be two unrelated problems.

1. The proxy server problem

As I said, the utility was being tested from behind a corporate proxy. In case you didn't know (I didn't until recently): when you use a proxy server, it's not your machine that performs DNS queries for the target host - the proxy server does it for you.

I've made some measurements to understand how long the utility keeps connecting to the wrong DC after the DNS record switch. And the answer was a fantastically exact 30 minutes. This experiment has also clearly shown that the local Windows DNS cache has nothing to do with it: those 30 minutes started exactly at the point when the proxy server woke up (finally started sending HTTP requests to the right DC).
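The reason is visible in the HTTP messages themselves. The Python sketch below (the helper names are made up for illustration) builds the request a client sends directly to a server, the same request as sent to a forward proxy, and the CONNECT request used to tunnel HTTPS. In both proxy cases the client hands over the hostname, so the name-to-IP lookup for the target happens on the proxy.

```python
def direct_request(host: str, path: str = "/") -> bytes:
    """Request sent straight to the server: the client resolved host itself."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def proxied_request(host: str, path: str = "/") -> bytes:
    """Plain-HTTP request sent to a forward proxy: absolute URL in the request line."""
    return (
        f"GET http://{host}{path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def connect_tunnel_request(host: str, port: int = 443) -> bytes:
    """HTTPS through a proxy: the client asks the proxy to open a tunnel to host:port."""
    return (
        f"CONNECT {host}:{port} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n\r\n"
    ).encode("ascii")


if __name__ == "__main__":
    for build in (direct_request, proxied_request, connect_tunnel_request):
        print(build("eu.site1.myprodserver.com").decode("ascii"))
```

The client only needs to resolve the proxy's own name; the local DNS cache entry for the target site is never consulted for the actual connection.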

The exact number of 30 minutes has helped one of our administrators to finally figure out that the proxy servers have a configuration parameter of minimal DNS TTL which is set to 1800 seconds by default. So the proxies have their own DNS cache. These are hardware Cisco proxies and the admin has also noted that this parameter is "hidden quite deeply" and is not even mentioned in the user manual.

As soon as the minimal proxies' DNS TTL was changed from 1800 seconds to 1 second (yeah, admins have no mercy) the issue stopped reproducing on my machine.
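To see why a minimal-TTL setting turns a 60-second record into a 30-minute one, here is a small Python simulation. It is a sketch, not the Cisco implementation: the class name, the clamping rule `max(record_ttl, min_ttl)` and the placeholder addresses are assumptions made for illustration.

```python
from dataclasses import dataclass

OLD_ADDRESS = "192.0.2.1"      # placeholder for the old DC
NEW_ADDRESS = "198.51.100.1"   # placeholder for the new DC
RECORD_TTL = 60                # TTL published for the A-record, in seconds


@dataclass
class CacheEntry:
    address: str
    expires_at: int


class MinTtlCache:
    """DNS cache that never keeps an entry for less than min_ttl seconds."""

    def __init__(self, min_ttl: int) -> None:
        self.min_ttl = min_ttl
        self.entries: dict[str, CacheEntry] = {}

    def resolve(self, name: str, now: int, authoritative) -> str:
        entry = self.entries.get(name)
        if entry is None or now >= entry.expires_at:
            address, ttl = authoritative(now)
            # The assumed clamping rule: short TTLs are stretched to min_ttl.
            entry = CacheEntry(address, now + max(ttl, self.min_ttl))
            self.entries[name] = entry
        return entry.address


def stale_seconds(min_ttl: int, switch_at: int = 10, poll_every: int = 10,
                  horizon: int = 4000):
    """Seconds after the switch until the cache first returns the new address."""

    def authoritative(now: int):
        return (NEW_ADDRESS if now >= switch_at else OLD_ADDRESS, RECORD_TTL)

    cache = MinTtlCache(min_ttl)
    for now in range(0, horizon, poll_every):
        if cache.resolve("eu.ha.myprodserver.com", now, authoritative) == NEW_ADDRESS:
            return now - switch_at
    return None


if __name__ == "__main__":
    print(stale_seconds(min_ttl=1800))  # entry cached just before the switch
    print(stale_seconds(min_ttl=1))     # the record's own 60 s TTL applies
```

In this model an entry cached 10 seconds before the switch stays stale for 1790 seconds with the 1800-second default, and for 50 seconds once the minimum is lowered to 1.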

But what about "forgetting" the just-understood correct IP address and falling back to the old one?

Well. As I also said there are several proxies. There is a single corporate proxy DNS name, but if you run nslookup for it - it shows multiple IPs behind it. Each time the proxy server's IP address is resolved (for example when local cache expires) - there's quite a bit of a chance that you'll jump onto another proxy server.

And that's exactly what ipconfig /flushdns has been doing to me. As soon as I started playing around with proxy servers using their direct IP addresses instead of their common DNS name I found that different proxies may easily route identical requests to different DCs. That's because some of them have those 30-minutes-cached DNS records while others have to perform resolving.
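The "falls back to the old address" effect can be reproduced with a toy Python model. Everything here is hypothetical (two proxies named proxy-a and proxy-b, the timestamps, and the order in which the client lands on them); the only figure taken from the text is the 1800-second minimal TTL.

```python
SWITCH_AT = 0     # moment the HA A-record was switched, in seconds
MIN_TTL = 1800    # minimal DNS TTL configured on the proxies

# When each proxy last resolved the target name (None = not cached yet).
# proxy-a cached the record 30 seconds before the switch.
last_resolved = {"proxy-a": -30, "proxy-b": None}


def datacenter_seen_by(proxy: str, now: int) -> str:
    """Return which DC a request routed through the given proxy ends up in."""
    resolved_at = last_resolved[proxy]
    if resolved_at is None or now - resolved_at >= MIN_TTL:
        # Cache miss or expired entry: the proxy performs a fresh lookup.
        resolved_at = last_resolved[proxy] = now
    return "new DC" if resolved_at >= SWITCH_AT else "old DC"


# Three utility runs; each time the proxy DNS name resolved to a different proxy.
runs = [(60, "proxy-a"), (120, "proxy-b"), (180, "proxy-a")]
observed = [datacenter_seen_by(proxy, now) for now, proxy in runs]

if __name__ == "__main__":
    print(observed)
```

The second run sees the new DC and the third one goes back to the old DC, even though nothing changed on the client between them.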

Unfortunately, after the proxy theory had been proven, another piece of news came in: the production monitoring servers are placed outside of the corporate network and do not use any proxy servers. So here we go...

2. The short TTL and public DNS servers problem

The monitoring servers are configured to use Google's public DNS servers, 8.8.8.8 and 8.8.4.4. The responses these servers give for our short-lived DNS records are somewhat weird:

  • The returned TTL of the CNAME records hovers around the 1 hour mark. It gradually decreases for several minutes and then jumps back to 3600 seconds - and so on.
  • The returned TTL of the root A-record is almost always exactly 60 seconds. I was occasionally receiving various numbers less than 60, but there was no obvious humanly perceivable logic to them. So it seems like these IP addresses in fact point to balancers that distribute requests between multiple similar DNS servers which are not synced with each other (and each of them has its own cache).
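TTL values like these can be observed without involving the OS cache by sending a raw DNS query to a chosen server and reading the TTL field of every answer record. Below is a minimal Python sketch using only the standard library; the function names are made up, error handling (truncation, retries, response ID checks) is omitted, and it is not the tooling used for the measurements above.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a DNS query packet (qtype 1 = A) with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def _skip_name(message: bytes, offset: int) -> int:
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = message[offset]
        if length & 0xC0 == 0xC0:   # compression pointer: 2 bytes, ends the name
            return offset + 2
        if length == 0:             # root label: end of name
            return offset + 1
        offset += 1 + length


def answer_ttls(message: bytes) -> list[tuple[int, int]]:
    """Return (record_type, ttl) for each answer record (type 5 = CNAME, 1 = A)."""
    _id, _flags, qdcount, ancount, _ns, _ar = struct.unpack("!HHHHHH", message[:12])
    offset = 12
    for _ in range(qdcount):
        offset = _skip_name(message, offset) + 4   # skip QTYPE and QCLASS
    result = []
    for _ in range(ancount):
        offset = _skip_name(message, offset)
        rtype, _rclass, ttl, rdlength = struct.unpack(
            "!HHIH", message[offset:offset + 10]
        )
        result.append((rtype, ttl))
        offset += 10 + rdlength
    return result


def query_ttls(server: str, name: str, timeout: float = 3.0) -> list[tuple[int, int]]:
    """Send one UDP query to server:53 and return the TTLs of the answer records."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        response, _addr = sock.recvfrom(4096)
    return answer_ttls(response)
```

Calling `query_ttls("8.8.8.8", "eu.site1.myprodserver.com")` in a loop with a short sleep would show how the CNAME and A-record TTLs move from one response to the next.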

Windows is not stupid and according to my experiments it doesn't care about CNAME's TTL and only cares about the root A-record TTL, so its client cache even for CNAME records is never assigned a TTL higher than 60 seconds.

But due to the inconsistency (or in some sense over-consistency?) of the A-record TTL which Google's servers return (unpredictable 0-60 seconds) the Windows local cache gets confused. There were two facts which demonstrated it:

  • Multiple calls to Resolve-DnsName for site1 and site2 over several minutes, with random pauses between them, eventually led to Get-DnsClientCache showing that the local cache TTLs of the two site names had diverged by up to 15 seconds. This is a big enough difference to sometimes mess things up. And that's just my short experiment, so I'm quite sure that it might actually get bigger.
  • Executing Invoke-WebRequest against each of the sites, one right after another, once every 3-5 seconds while switching the DNS records has twice led to a situation where the requests went to different DCs.

The latter experiment had one strange detail I can't explain. Calling Get-DnsClientCache after Invoke-WebRequest shows that no records appear in the local cache for the just-requested site names. But anyway the problem has clearly been reproduced.
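A rough Python model shows how a small offset between two cache entries turns into a window where the two names point to different DCs. The assumptions are loud ones: both entries have a 60-second TTL and are refreshed the instant they expire (continuous demand), and the function names and the switch time are made up.

```python
def cached_address(first_cached_at: int, ttl: int, switch_at: int, now: int) -> str:
    """Address held in the cache at time now, assuming refresh right on expiry."""
    periods_elapsed = (now - first_cached_at) // ttl
    last_refresh = first_cached_at + periods_elapsed * ttl
    return "new" if last_refresh >= switch_at else "old"


def mismatch_seconds(offset: int, ttl: int = 60, switch_at: int = 130,
                     horizon: int = 600) -> int:
    """Count the seconds during which site1 and site2 resolve to different DCs.

    site1 is first cached at t=0 and site2 at t=offset.
    """
    return sum(
        cached_address(0, ttl, switch_at, now)
        != cached_address(offset, ttl, switch_at, now)
        for now in range(offset, horizon)
    )


if __name__ == "__main__":
    print(mismatch_seconds(offset=15))                  # switch at t=130
    print(mismatch_seconds(offset=15, switch_at=140))   # switch at t=140
    print(mismatch_seconds(offset=0))                   # entries in lockstep
```

With a 15-second offset the mismatch window in this model is either 45 or 15 seconds, depending on where the switch falls relative to the two refresh cycles, and it disappears only when the entries expire in lockstep.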

Conclusion?

It will take time to see whether my workaround with real-time DNS resolving brings any improvement. Unfortunately, I don't believe it will - the DNS servers used in production (which would eventually be used by the monitoring utility for real-time IP resolving) are Google's public DNS servers, which are not reliable in my case.

And one thing which is worse than an intermittently failing monitoring utility is that real-world users also rely on public DNS servers, and they definitely do face problems during our maintenance work or significant failures.

So have we learned anything out of all this?

  • Maybe a short DNS TTL is generally a bad practice?
  • Maybe we should install additional routers, assign them static IPs, attach the DNS names to them and then route traffic internally between our DCs to finally stop relying on DNS records changing?
  • Or maybe public DNS servers are doing a bad job?
  • Or maybe the technological singularity is closer than we think?

I have no idea. But it's quite possible that "yes" is the right answer to all of these questions.

However, there is one thing we surely have learned: network hardware manufacturers should write their documentation better.

Permissible answered 14/10, 2021 at 21:26 Comment(0)
