Clock synchronization quality on Windows Azure?

Asked 26/5, 2011 at 13:5 Answered 31/12, 2017 at 8:14

I am looking for quantitative estimates of clock offsets between VMs on Windows Azure - assuming that all VMs are hosted in the same datacenter. I am guesstimating that the average clock offset between one VM and another is below 10 seconds, but I am not even sure it's a guaranteed property of the Azure cloud.

Does anybody have quantitative measurements on that matter?

Palecek answered 26/5, 2011 at 13:5 Comment(8)

+1 - Good question. I'd "expect" the answer to be much closer than 10 seconds - within a data center NTP should be able to provide <0.01 second, but I haven't seen any numbers quoted anywhere. – Fatherly 26/5, 2011 at 16:5

Windows Azure is useless for any serious application because of this (and not only Azure... all "real cloud hosting" platforms I've tried have this issue). What I seriously do not understand is how this came to be and when "we'll provide you with a hosting platform on which you have >10sec time drift between servers" became a valid option. I mean, it would be as if I developed a website in which you can log in only during odd seconds and said - it's not a bug, it's meant to work that way for security purposes. – Ardehs 14/12, 2011 at 18:24

@kape123: I don't understand what the problem is. Any distributed system I can think of from the last 40 years has had to cope with clock drift in one manner or another. A few seconds of clock drift doesn't preclude the existence of serious solutions. – Centrality 1/2, 2015 at 8:51

@GregD Distributed systems from 40 years ago are much different from distributed systems of today in terms of concurrent users and accessibility (thanks, Internet). Take bidding applications for example - if you have multiple web servers (for handling huge load), imagine if their clocks are off by several seconds - you won't be able to tell which bid came when. Obviously the problem can be somewhat solved by keeping the time-critical parts of your system on a single server (the database, for example), but that doesn't solve the problem robustly (any important DateTime.Now code needs to be on that server). – Ardehs 2/2, 2015 at 8:0

Yet somehow globe-spanning distributed industrial control systems exist and work today in spite of clock drift being a problem that needs to be solved. I'm not saying that the public cloud is the solution for every problem, that would be silly. It's equally silly to claim that no serious application can exist in the public cloud because of clock drift, though. – Centrality 2/2, 2015 at 20:27

@GregD If you want to nitpick and twist my words - sure, let's kick off a discussion on how much of an impact clock drift has on "serious" applications. I say that without something like NetTime, any "serious" application dependent on DateTime.Now is impossible to run properly. Can you build a search engine and host it on multiple servers without caring much about clock drift? Sure. Can you build a bidding system? Of course not. And next time you choose to criticize a 4-year-old comment because of one word, keep in mind that comments can't be edited. – Ardehs 13/2, 2015 at 4:42

@kape123 You're trolling, right? You don't think "bidding sites" like eBay have had to deal with the problems you describe? – Infect 9/4, 2015 at 13:15

@EoinCampbell I've built a few bidding sites myself, and I was using that as an obvious example of something you can't easily host on Azure because of clock drift. eBay - same - it doesn't use Azure... so what you say is irrelevant. Read again what I wrote. – Ardehs 9/4, 2015 at 16:15

I finally decided to run some experiments of my own.

A few facts concerning the experiment protocol:

Instead of measuring the offset to a reference clock, I simply checked the clock differences between Azure VMs and Azure Storage.
The clock time of Azure Storage was retrieved using the HTTP hack pasted below.
Measurements were done within the North Europe datacenter of Azure with 250 small VMs.
Latency between storage and VMs, measured with Stopwatch, was always lower than 1ms for minimalistic unauthenticated requests (basically, the HTTP requests came back with 400 errors, but still with Date: available in the HTTP headers).

Results:

About 50% of the VMs have a clock offset to the storage greater than 1s.
About 5% of the VMs have a clock offset to the storage greater than 2s.
Fewer than 1% of observations show clock offsets close to 3s.
A handful of outliers are close to 4s.
The clock offset between a single VM and the storage typically varies by +1/-1s from one request to the next.

So technically, we are not too far from the 2s tolerance target, although even for intra-datacenter sync you don't have to push the experiment far to observe offsets close to 4s. If we assume a normal (i.e. Gaussian) distribution for the clock offsets, then I would say that relying on any clock threshold lower than 6s is bound to lead to scheduling issues.

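using System;
using System.Globalization;
using System.IO;
using System.Net.Sockets;

/// <summary>
/// Substitute for proper NTP (Network Time Protocol)
/// when UDP is not available, as on Windows Azure.
/// </summary>
public class HttpTimeChecker
{
    public static DateTime GetUtcNetworkTime(string server)
    {
        // HACK: we can't use WebClient here, because we get a faulty HTTP response.
        // We don't care about the HTTP error; the only thing that matters is the
        // presence of the 'Date:' HTTP header.
        string response;
        using (var tc = new TcpClient())
        {
            tc.Connect(server, 80);

            using (var ns = tc.GetStream())
            {
                var sw = new StreamWriter(ns);
                var sr = new StreamReader(ns);

                // HTTP requires CRLF line endings; the empty line ends the headers.
                string req = "";
                req += "GET / HTTP/1.0\r\n";
                req += "Host: " + server + "\r\n";
                req += "\r\n";

                sw.Write(req);
                sw.Flush();

                // With HTTP/1.0 the server closes the connection after responding,
                // so ReadToEnd returns once the whole response has arrived.
                response = sr.ReadToEnd();
            }
        }

        foreach (var line in response.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // HTTP header names are case-insensitive.
            if (line.StartsWith("Date: ", StringComparison.OrdinalIgnoreCase))
            {
                // Parse with the invariant culture so the result does not depend on the VM's locale.
                return DateTime.Parse(line.Substring(6), CultureInfo.InvariantCulture).ToUniversalTime();
            }
        }

        throw new ArgumentException("No date to be retrieved among HTTP headers.", "server");
    }
}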
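
For illustration only, here is a minimal sketch of how a single offset reading could be taken with a time source such as the one above. It is not the exact code used for the experiment: the `ClockOffsetProbe` class and its `Measure` method are hypothetical names, and the sketch assumes the remote timestamp was produced roughly halfway through the round trip. Keep in mind that the HTTP Date header only has a resolution of one second, so a single reading cannot be more precise than that.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of the original experiment code).
public static class ClockOffsetProbe
{
    /// <summary>
    /// Estimates (remote clock - local clock) from one request.
    /// Assumption: the remote timestamp was taken halfway through the round trip.
    /// </summary>
    public static TimeSpan Measure(Func<DateTime> remoteUtcNow, Func<DateTime> localUtcNow)
    {
        var sw = Stopwatch.StartNew();
        DateTime localBefore = localUtcNow();
        DateTime remote = remoteUtcNow();   // e.g. a call to HttpTimeChecker.GetUtcNetworkTime
        sw.Stop();

        // Local time at the estimated midpoint of the round trip.
        DateTime localMid = localBefore + TimeSpan.FromTicks(sw.Elapsed.Ticks / 2);
        return remote - localMid;
    }
}
```

A call might look like `ClockOffsetProbe.Measure(() => HttpTimeChecker.GetUtcNetworkTime(host), () => DateTime.UtcNow)`, where `host` is whatever storage endpoint you are probing.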

Palecek answered 1/6, 2011 at 11:55 Comment(1)

Great work! Can we draw any conclusions from this about the maximum expected clock corrections? For example, does a 2-second offset also imply that the clock will be corrected by 2 seconds at some point? – Repugnant 22/6, 2020 at 7:43

I've been in conversation with someone from the Azure product team regarding clock synchronisation recently, more out of interest than anything else. The most recent reply I've received is:

The VMs and services take their time directly from the underlying Hyper-V platform upon boot and from that point forward the clock is maintained by the service. In order to have true time sync across a distributed system you will need to do this at the application layer and/or with a service referencing a singular time server.

Sacrilegious answered 25/7, 2014 at 9:54 Comment(0)

This is the classic problem of both distributed systems and virtual machines - clock skew.

One possible solution would be to use the Azure scheduler to ping an endpoint on each of your VMs that would reset its clock - or at least tell you what the diff would be. That way, your skew would not grow, and you may even be able to calculate an offset for the communication delay. This way, you'd get to within milliseconds and not seconds.

Of course, you could also go the other way, and have a service on the VM that periodically manages the clock by pinging out to some time server. I'm not sure if the hypervisor will let you mess with its clock, but all you really need is an offset for your apps to consume.

Overall... never trust the clock on a VM, and certainly not over a distributed system. Note that this clock issue is the subject of active research in many universities, e.g. https://scholar.google.com/scholar?hl=en&q=distributed+system+clock&btnG=&as_sdt=1%2C48&as_sdtp=
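
As a rough sketch of the "offset for your apps to consume" idea: the class below keeps an offset and applies it to DateTime.UtcNow instead of touching the system clock. The `OffsetClock` name and its members are hypothetical. The offset estimate uses the usual four-timestamp formula known from NTP, which assumes the network delay is about the same in both directions; how you obtain the two server-side timestamps is left open here.

```csharp
using System;

// Hypothetical application-level clock: the VM clock is left alone,
// and callers read a corrected time instead of DateTime.UtcNow.
public sealed class OffsetClock
{
    private readonly object _gate = new object();
    private TimeSpan _offset = TimeSpan.Zero;

    /// <summary>
    /// t0: local time when the request was sent
    /// t1: server time when the request was received
    /// t2: server time when the reply was sent
    /// t3: local time when the reply was received
    /// Assumes a symmetric network delay.
    /// </summary>
    public static TimeSpan EstimateOffset(DateTime t0, DateTime t1, DateTime t2, DateTime t3)
    {
        long ticks = ((t1 - t0).Ticks + (t2 - t3).Ticks) / 2;
        return TimeSpan.FromTicks(ticks);
    }

    /// <summary>Stores the latest estimate of (server clock - local clock).</summary>
    public void Update(TimeSpan offset)
    {
        lock (_gate) { _offset = offset; }
    }

    /// <summary>Local UTC time corrected by the last stored offset.</summary>
    public DateTime UtcNow
    {
        get { lock (_gate) { return DateTime.UtcNow + _offset; } }
    }
}
```

A periodic job would call `EstimateOffset` and `Update`, and the rest of the application would read `UtcNow` from a shared instance.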

Alva answered 11/3, 2015 at 1:19 Comment(0)

Based on my experience, I would not rely on the system clock of the Azure VMs for anything critical. I have occasionally seen differences up to several minutes, which does fly in the face of what you'd expect.

Hebetate answered 29/5, 2011 at 21:13 Comment(3)

Interesting :-) That's precisely this sort of situations that I would like to understand better. – Palecek 30/5, 2011 at 13:30

Unfortunately, this is also what I am experiencing. I have an open API where I rely on DateTime.UtcNow, and on the 3 instances I use (extra small), there is a difference of up to 10 seconds. This is unacceptable; considering all my dev servers are in sync, why can't Microsoft have their instances in sync? – Fidelafidelas 5/11, 2011 at 1:29

A follow-up to my last comment: it would appear that after a warm-up period of approx. 1 hour, the instances start to sync up. As of now, they are within +1/-1s, as Joannes also points out. Still odd that it has to take this long. – Fidelafidelas 5/11, 2011 at 1:33

I've tried to search for an answer to this specific question - but haven't succeeded!

Some references I have found about the "Windows Time Service" - W32Time - state that the service is designed to target a tolerance of 2 seconds.

In practice within the Azure network I expect that the synchronisation achieved should be much better than this - but my search turned up no referenced guarantees on this.

Fatherly answered 27/5, 2011 at 6:52 Comment(0)

You can never trust clock synchronization if you are building a distributed system, unless special hardware measures are used, as for example in Google Spanner. Even there, a special algorithm is used to resolve possible clock skew conflicts. However, there are many algorithms that solve this problem in distributed systems: logical clocks, vector clocks and Lamport timestamps, to name a few. See the classic book "Distributed Systems: Principles and Paradigms" by Andrew Tanenbaum.
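
To make the Lamport timestamp idea a bit more concrete, here is a minimal sketch; the `LamportClock` class and its method names are hypothetical. Each node keeps a counter, increments it for every local event or message sent, and on receiving a message jumps to one past the larger of its own counter and the received timestamp. The resulting numbers order causally related events without relying on wall-clock time.

```csharp
using System;
using System.Threading;

// Hypothetical minimal Lamport clock, one instance per node.
public sealed class LamportClock
{
    private long _counter;

    /// <summary>Call for a local event or before sending a message; returns the new timestamp.</summary>
    public long Tick()
    {
        return Interlocked.Increment(ref _counter);
    }

    /// <summary>Call when a message carrying remoteTimestamp arrives; returns the new timestamp.</summary>
    public long Receive(long remoteTimestamp)
    {
        while (true)
        {
            long current = Interlocked.Read(ref _counter);
            long next = Math.Max(current, remoteTimestamp) + 1;

            // Retry if another thread changed the counter in the meantime.
            if (Interlocked.CompareExchange(ref _counter, next, current) == current)
            {
                return next;
            }
        }
    }
}
```

For the bidding example discussed in the comments, such timestamps only establish an order between events that are linked by messages; two bids placed independently on different servers still need a tie-breaking rule.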

Narrow answered 31/12, 2017 at 8:14 Comment(0)
