Is mfence for rdtsc necessary on x86_64 platform?

Asked 22/1, 2017 at 3:13 Answered 22/1, 2017 at 14:56

unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
    "mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);

Is the mfence in the above code necessary?

Based on my test, cpu reorder is not found.

The fragment of test code is included below.

inline uint64_t clock_cycles() {
    unsigned int lo = 0;
    unsigned int hi = 0;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

uint64_t t1 = clock_cycles();  /* keep all 64 bits; unsigned would truncate the counter */
uint64_t t2 = clock_cycles();
assert(t2 > t1);

Bean answered 22/1, 2017 at 3:13 Comment(1)

mfence should be issued after rdtsc for proper utilization. – Luce 22/1, 2017 at 3:20

What you need to perform a sensible measurement with rdtsc is a serializing instruction.

As is well known, a lot of people use cpuid before rdtsc.
rdtsc needs to be serialized from above and below (read: all instructions before it must be retired and it must be retired before the test code starts).

Unfortunately the second condition is often neglected because cpuid is a very bad choice for this task (it clobbers the output of rdtsc).
When looking for alternatives people think that instructions that have a "fence" in their names will do, but this is also untrue. Straight from Intel:

MFENCE does not serialize the instruction stream.

An instruction that is almost serializing and will do in any measurement where previous stores don't need to complete is lfence.

Simply put, lfence makes sure that no new instructions start before any prior instruction completes locally. See this answer of mine for a more detailed explanation on locality.
It also doesn't drain the Store Buffer like mfence does and doesn't clobber the registers like cpuid does.

So lfence / rdtsc / lfence is a better crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdtsc is executed!).
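As a minimal sketch of the lfence / rdtsc / lfence sequence described above, assuming GCC or Clang inline assembly on x86-64. The names rdtsc_fenced and time_loop are hypothetical helpers made up for illustration, not part of any library:

```c
#include <stdint.h>

/* Hypothetical helper: read the TSC with an lfence on each side.
   Assumes GCC/Clang extended asm on x86-64. */
static inline uint64_t rdtsc_fenced(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "lfence\n\t"   /* earlier instructions complete locally first */
        "rdtsc\n\t"    /* counter is returned in edx:eax */
        "lfence"       /* later instructions do not start before rdtsc is done */
        : "=a"(lo), "=d"(hi)
        :
        : "memory"     /* compiler-level barrier only, says nothing to the CPU */
    );
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical usage: TSC ticks elapsed across a small loop. */
static uint64_t time_loop(unsigned n) {
    volatile unsigned sink = 0;   /* volatile keeps the loop from being removed */
    uint64_t start = rdtsc_fenced();
    for (unsigned i = 0; i < n; i++) {
        sink += i;
    }
    uint64_t end = rdtsc_fenced();
    return end - start;
}
```

The result is in TSC ticks, which on recent CPUs advance at a constant reference rate rather than at the current core clock, so treat it as a relative measure.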

If your test to detect reordering is assert(t2 > t1) then I believe you will test nothing.
Leaving aside the return and the call, which may or may not prevent the CPU from seeing the second rdtsc in time for a reorder, it is unlikely (though possible!) that the CPU will reorder two rdtsc instructions even if one comes right after the other.

Imagine we have a rdtsc2 that is exactly like rdtsc but writes ecx:ebx¹.

Executing

rdtsc
rdtsc2

makes it highly likely that ecx:ebx > edx:eax, because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering; it means looking for another instruction to execute when the current one cannot be executed yet.
But rdtsc has no dependency on any previous instruction, so it's unlikely to be delayed when encountered by the OoO core.
However, peculiar internal micro-architectural details may invalidate my thesis, hence the word "likely" in my previous statement.

¹ We don't need this altered instruction: register renaming will do it, but in case you are not familiar with it, this will help.

Likable answered 22/1, 2017 at 14:56 Comment(1)

It turns out mfence is serializing like lfence on most CPUs, unfortunately. Are loads and stores the only instructions that gets reordered?. See also Hadi's answer on clflush to invalidate cache line via C function for some actual experiments using lfence; rdtsc; lfence. – Beamon 18/8, 2018 at 14:51

mfence is there to force serialization in CPU before rdtsc.

Usually you will find cpuid there (which is also a serializing instruction).

A quote from the Intel manuals about using rdtsc makes this clearer:

Starting with the Intel Pentium processor, most Intel CPUs support out-of-order execution of the code. The purpose is to optimize the penalties due to the different instruction latencies. Unfortunately this feature does not guarantee that the temporal sequence of the single compiled C instructions will respect the sequence of the instruction themselves as written in the source C file. When we call the RDTSC instruction, we pretend that that instruction will be executed exactly at the beginning and at the end of code being measured (i.e., we don’t want to measure compiled code executed outside of the RDTSC calls or executed in between the calls themselves). The solution is to call a serializing instruction before calling the RDTSC one. A serializing instruction is an instruction that forces the CPU to complete every preceding instruction of the C code before continuing the program execution. By doing so we guarantee that only the code that is under measurement will be executed in between the RDTSC calls and that no part of that code will be executed outside the calls.

TL;DR version - without a serializing instruction before rdtsc you have no idea when that instruction started to execute, which can make the measurements incorrect.

HINT - use rdtscp when possible.
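A rough sketch of what using rdtscp for the closing timestamp could look like, assuming GCC or Clang on an x86-64 CPU that supports the instruction. The name rdtscp_end is hypothetical. Note that rdtscp waits for earlier instructions to execute but does not stop later ones from starting early, which is why the sketch follows it with lfence:

```c
#include <stdint.h>

/* Hypothetical helper: closing timestamp for a measured region.
   Assumes GCC/Clang extended asm and a CPU with rdtscp support. */
static inline uint64_t rdtscp_end(void) {
    uint32_t lo, hi, aux;
    __asm__ __volatile__ (
        "rdtscp\n\t"   /* waits for prior instructions, then reads the TSC */
        "lfence"       /* keeps following instructions from starting early */
        : "=a"(lo), "=d"(hi), "=c"(aux)   /* rdtscp also writes ecx (IA32_TSC_AUX) */
        :
        : "memory"     /* compiler barrier only */
    );
    (void)aux;         /* the auxiliary value is not needed for timing here */
    return ((uint64_t)hi << 32) | lo;
}
```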

Based on my test, cpu reorder is not found.

That is still no guarantee that it cannot happen - that's why the original code had "memory" to indicate a possible memory clobber, which prevents the compiler from reordering memory accesses across the asm statement.
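To illustrate what the "memory" clobber does on its own, here is a small sketch assuming GCC or Clang. The names compiler_barrier, shared_value and store_then_read are hypothetical. The empty asm emits no instruction at all, so it constrains only the compiler and not the CPU:

```c
/* Hypothetical helper: a compiler-only barrier. No machine instruction is
   emitted; the "memory" clobber just tells the compiler that memory may
   have changed, so it must not cache or move memory accesses across it. */
static inline void compiler_barrier(void) {
    __asm__ __volatile__ ("" : : : "memory");
}

static int shared_value;   /* hypothetical variable used for illustration */

static int store_then_read(int v) {
    shared_value = v;       /* store is emitted before the barrier */
    compiler_barrier();
    return shared_value;    /* value is reloaded from memory after the barrier */
}
```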

Marentic answered 22/1, 2017 at 4:13 Comment(4)

As is mentioned above, mfence does not force serialization. Also note that the "memory" prevents compiler reordering, but does not say anything about CPU reordering. – Superorder 24/1, 2017 at 14:39

@MattG mfence does the same and more as lfence. Neither are full serialization instructions; but it seems unlikely that mfence won't wait till all previous instructions have finished locally as well. On top of that, it waits longer until also the stores have flushed to memory. Indeed it seems that the manual doesn't repeat the remark that lfence waits for all instructions to be finished locally for mfence; but it would be extremely odd if that wasn't the case. Normally one doesn't NEED to wait till stores are flushed to memory to continue here though as that is only relevant for other cores. – Gigigigli 13/5, 2018 at 16:38

@CarloWood: That's only an implementation detail for mfence, but guaranteed by Intel's manuals for lfence and thus is future-proof. There's some evidence that Skylake used to run mfence similar to a locked instruction, only being a memory barrier without being an out-of-order execution barrier. But after an erratum was discovered, a ucode update strengthened mfence to include lfence. See Are loads and stores the only instructions that gets reordered?. So once Intel gets this right, mfence won't block rdtsc. – Beamon 18/8, 2018 at 14:48

@PeterCordes I think my remark was just speculation. In the meantime I wrote a full-blown benchmark lib for myself (ok, maybe not just for myself; this assembly stuff made its way into cwds, see benchmark::Stopwatch) and I wouldn't dream about using anything else than lfence, just because. Your remarks are of course correct. – Gigigigli 18/8, 2018 at 22:4
