CPUID is serializing, preventing out-of-order execution of RDTSC.
These days you can safely use LFENCE instead. It's documented as serializing on the instruction stream (but not stores to memory) on Intel CPUs, and now also on AMD after their microcode update for Spectre.
https://hadibrais.wordpress.com/2018/05/14/the-significance-of-the-x86-lfence-instruction/ explains more about LFENCE.
See also https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf for a way to use RDTSCP that keeps CPUID (or LFENCE) out of the timed region:
LFENCE ; (or CPUID) Don't start the timed region until everything above has executed
RDTSC ; EDX:EAX = timestamp
mov ebx, eax ; low 32 bits of start time
code under test
RDTSCP ; built-in one way barrier stops it from running early
LFENCE ; (or CPUID) still use a barrier after to prevent anything weird
sub eax, ebx ; low 32 bits of end-start
See also Get CPU cycle count? for more about RDTSC caveats, like constant_tsc and nonstop_tsc.
As a bonus, RDTSCP gives you a core ID. You could use RDTSCP for the start time as well, if you want to check for core migration. But if your CPU has the constant_tsc
features, all cores in the package should have their TSCs synced so you typically don't need this on modern x86.
You could get the core ID from CPUID instead, as @Tony's answer points out.