Getting TSC rate from x86 kernel
I have an embedded Linux system running on an Atom, which is a new enough CPU to have an invariant TSC (time stamp counter), whose frequency the kernel measures on startup. I use the TSC in my own code to keep time (avoiding kernel calls), and my startup code measures the TSC rate, but I'd rather just use the kernel's measurement. Is there any way to retrieve this from the kernel? It's not in /proc/cpuinfo anywhere.

Bloem answered 1/2, 2016 at 5:13 Comment(1)
Related problem: obtaining the TSC value preciselyImpiety
I
20

BPFtrace

As root, you can retrieve the kernel's TSC rate with bpftrace:

# bpftrace -e 'BEGIN { printf("%u\n", *kaddr("tsc_khz")); exit(); }' 2>/dev/null | grep '^[1-9]'

(tested it on CentOS 7 and Fedora 37)

That is the value that is defined, exported and maintained/calibrated in arch/x86/kernel/tsc.c.

Drgn

Another way is to use drgn - a Python-programmable debugger:

# python -c 'import drgn; p=drgn.program_from_kernel(); print(p["tsc_khz"].value_())'

Or using the drgn shell:

# cat tsc_khz.py 
#!/usr/bin/env drgn

print(prog["tsc_khz"].value_())
# drgn tsc_khz.py

This also requires root, unless you use drgn on a core file.

GDB

Alternatively, also as root, you can read it from /proc/kcore, e.g.:

# gdb /dev/null /proc/kcore -ex 'x/uw 0x'$(grep '\<tsc_khz\>' /proc/kallsyms \
    | cut -d' ' -f1) -batch 2>/dev/null | tail -n 1 | cut -f2

(tested it on CentOS 7 and Fedora 29)

SystemTap

If the system has neither bpftrace nor a debugger available but does have SystemTap, you can get it like this (as root):

# cat tsc_khz.stp 
#!/usr/bin/stap -g

function get_tsc_khz() %{ /* pure */
    THIS->__retvalue = tsc_khz;
%}
probe oneshot {
    printf("%u\n", get_tsc_khz());
}
# ./tsc_khz.stp

Kernel Module

Of course, you can also write a small kernel module that provides access to tsc_khz via the /sys pseudo file system. Even better, somebody already did that and a tsc_freq_khz module is available on GitHub. With that the following should work:

# modprobe tsc_freq_khz
$ cat /sys/devices/system/cpu/cpu0/tsc_freq_khz

(tested on Fedora 29, reading the sysfs file doesn't require root)
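Since the question is about using the value from your own code, here is a minimal C sketch for reading such a sysfs file. The helper name read_khz_file is made up for illustration, and the sketch assumes the file holds a single decimal kHz value followed by a newline, as the cat example above suggests.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Read one unsigned decimal value (kHz) from a sysfs-style text file.
 * Returns the value, or 0 if the file is missing or malformed.
 * read_khz_file is a hypothetical helper name, not an existing API. */
static unsigned long read_khz_file(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;

    char buf[64];
    unsigned long khz = 0;
    if (fgets(buf, sizeof(buf), f) != NULL) {
        char *end;
        errno = 0;
        unsigned long v = strtoul(buf, &end, 10);
        /* accept the value only if at least one digit was consumed */
        if (errno == 0 && end != buf)
            khz = v;
    }
    fclose(f);
    return khz;
}
```

With the module loaded, read_khz_file("/sys/devices/system/cpu/cpu0/tsc_freq_khz") should then yield the kernel's value, and a return of 0 tells the caller to fall back to its own calibration.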

Kernel Messages

If none of the above is an option, you can parse the TSC rate from the kernel logs. But this gets ugly fast because you see different kinds of messages on different hardware and kernels, e.g. on a Fedora 29 i7 system:

$ journalctl -k --grep '^tsc:'  | cut -d' ' -f5-
kernel: tsc: Detected 2800.000 MHz processor
kernel: tsc: Detected 2808.000 MHz TSC

But on a Fedora 29 Intel Atom just:

kernel: tsc: Detected 2200.000 MHz processor

While on a CentOS 7 i5 system:

kernel: tsc: Fast TSC calibration using PIT
kernel: tsc: Detected 1895.542 MHz processor
kernel: tsc: Refined TSC clocksource calibration: 1895.614 MHz
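If you have to go this route from C, a rough sketch of the per-line parsing could look like the following. The function name is hypothetical, and it only knows the message formats quoted above, assuming the figure is always printed with exactly three fractional digits. The caller is assumed to feed it the log lines in order and keep the last non-zero result, so that a "MHz TSC" or "Refined" line overrides the earlier "MHz processor" line.

```c
#include <stdio.h>
#include <string.h>

/* Extract a "<digits>.<3 digits> MHz" figure from one kernel "tsc:" log line
 * and return it in kHz; returns 0 if the line carries no such figure.
 * tsc_log_line_to_khz is a hypothetical helper name. */
static unsigned long tsc_log_line_to_khz(const char *line)
{
    /* only the message formats shown in the logs above are recognized */
    static const char *const prefixes[] = {
        "tsc: Refined TSC clocksource calibration: ",
        "tsc: Detected ",
    };
    for (size_t i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
        const char *p = strstr(line, prefixes[i]);
        if (p == NULL)
            continue;
        p += strlen(prefixes[i]);
        unsigned long mhz, frac;
        /* assumes exactly three fractional digits, i.e. frac is in kHz */
        if (sscanf(p, "%lu.%3lu MHz", &mhz, &frac) == 2)
            return mhz * 1000 + frac;
    }
    return 0;
}
```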

Perf Values

The Linux kernel doesn't yet provide an API to read the TSC rate. But it does provide one for getting the mult and shift values that can be used to convert TSC counts to nanoseconds. Those values are derived from tsc_khz, also in arch/x86/kernel/tsc.c, where tsc_khz is initialized and calibrated. And they are shared with userspace.

Example program that uses the perf API and accesses the shared page:

#include <asm/unistd.h>
#include <inttypes.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}

The actual code:

int main(int argc, char **argv)
{
    struct perf_event_attr pe = {
        .type = PERF_TYPE_HARDWARE,
        .size = sizeof(struct perf_event_attr),
        .config = PERF_COUNT_HW_INSTRUCTIONS,
        .disabled = 1,
        .exclude_kernel = 1,
        .exclude_hv = 1
    };
    int fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open failed");
        return 1;
    }
    void *addr = mmap(NULL, 4*1024, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }
    struct perf_event_mmap_page *pc = addr;
    if (pc->cap_user_time != 1) {
        fprintf(stderr, "Perf system doesn't support user time\n");
        return 1;
    }
    printf("%16s   %5s\n", "mult", "shift");
    printf("%16" PRIu32 "   %5" PRIu16 "\n", pc->time_mult, pc->time_shift);
    close(fd);
}

Tested on Fedora 29; it also works for non-root users.

Those values can be used to convert a TSC count to nanoseconds with a function like this one:

static uint64_t mul_u64_u32_shr(uint64_t cyc, uint32_t mult, uint32_t shift)
{
    __uint128_t x = cyc;
    x *= mult;
    x >>= shift;
    return x;
}
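If you want a rate rather than a conversion, the relation can be inverted. The following is only a sketch with a made-up function name, assuming mult and shift satisfy ns = (cycles * mult) >> shift as in the function above. Because mult is a rounded integer, the result is an approximation of tsc_khz, not necessarily the exact value.

```c
#include <stdint.h>

/* Estimate the TSC rate in kHz from perf's time_mult/time_shift.
 * Since ns = (cycles * mult) >> shift, one millisecond (10^6 ns) corresponds
 * to about (10^6 << shift) / mult cycles, which is the rate in kHz.
 * tsc_khz_from_mult_shift is a hypothetical helper name. */
static uint64_t tsc_khz_from_mult_shift(uint32_t mult, uint16_t shift)
{
    /* reject values that would divide by zero or overflow 64 bits */
    if (mult == 0 || shift > 32)
        return 0;
    uint64_t num = (uint64_t)1000000 << shift;   /* fits: 10^6 < 2^20 */
    return (num + mult / 2) / mult;              /* round to nearest */
}
```

For example, mult = 766958446 with shift = 31 maps back to 2800000 kHz.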

CPUID/MSR

Another way to obtain the TSC rate is to follow DPDK's lead.

DPDK on x86_64 basically uses the following strategy:

  1. Read the 'Time Stamp Counter and Nominal Core Crystal Clock Information Leaf' via cpuid intrinsics (doesn't require special privileges), if available
  2. Read it from the MSR (requires the rawio capability and read permissions on /dev/cpu/*/msr), if possible
  3. Calibrate it in userspace by other means, otherwise

FWIW, a quick test shows that the cpuid leaf doesn't seem to be that widely available, e.g. an i7 Skylake and a Goldmont Atom don't have it. Moreover, as can be seen from the DPDK code, using the MSR requires a bunch of intricate case distinctions.

However, in case the program already uses DPDK, getting the TSC rate, getting TSC values or converting TSC values is just a matter of using the right DPDK API.


In case one is targeting only relatively recent CPUs, then something equivalent to the following might be sufficient:

cpuid --one-cpu |
    awk '/TSC\/clock ratio/ { gsub("/", " "); num=$5; den=$6; }
         /nominal core crystal clock/ { art_hz=$6; }
         END { if (!art_hz) {
                   print "Could not find ART frequency!";
                   exit 1;
               }
               tsc_khz=art_hz/1000*num/den;
               print tsc_khz;
         }'
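The same computation can be done directly in C without the cpuid tool. This is a sketch, not DPDK's code: it assumes GCC or Clang (for the cpuid.h helpers) and uses made-up function names. Leaf 0x15 returns the ratio denominator in EAX, the numerator in EBX and the crystal frequency in Hz in ECX. On CPUs that leave any of these fields at 0 the sketch simply gives up and returns 0.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header */
#endif

/* TSC Hz = crystal_hz * numerator / denominator (CPUID.15H: ECX, EBX, EAX).
 * Returns 0 if any field is not enumerated.
 * Both function names here are hypothetical. */
static uint64_t tsc_hz_from_leaf15(uint32_t den, uint32_t num,
                                   uint32_t crystal_hz)
{
    if (den == 0 || num == 0 || crystal_hz == 0)
        return 0;
    return (uint64_t)crystal_hz * num / den;
}

/* Query the running CPU; returns 0 when leaf 0x15 is absent or incomplete. */
static uint64_t tsc_hz_from_cpuid(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_max(0, NULL) < 0x15)
        return 0;
    __cpuid_count(0x15, 0, eax, ebx, ecx, edx);
    (void)edx;   /* reserved in this leaf */
    return tsc_hz_from_leaf15(eax, ebx, ecx);
#else
    return 0;    /* not an x86 build */
#endif
}
```

For instance, a 24 MHz crystal with a ratio of 234/2 gives 2808000000 Hz. A return value of 0 means you need one of the other methods.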


Impiety answered 7/9, 2019 at 16:39 Comment(4)
With speed step disabled, is "performance" as the scaling_governor for all the cores, I have 2 processes communicating over localhost TCP/IP stack send messages to each other. The msgs travel from main thread process 1 to main thread process 2. both main threads have single different cores as affinity and no other processes/threads run on those cores. We measure TSC on core 1, and TSC on core2. Is there any mechanism by which we can calculate "true" nanoseconds elapsed time, by ignoring effects like NTP adjustments, and obviously core frequency changes? Xeon gold 6154Uninspired
@Uninspired well, since the TSC should always run at a constant frequency (i.e. independent of CPU frequency changes and the system clock) and should be synchronized between the different cores, you can calculate TSC_core_2 - TSC_core_1 and convert the result (i.e. the number of TSC ticks) to ns once you have obtained the true TSC frequency (or rate).Impiety
but the 2 cores may not have the same initial value of TSC at CPU_RESET?Uninspired
@Uninspired on modern systems the TSC should be synced between the different cores on the same CPU package and even other packages (in a multi socket system). Perhaps related and worth mentioning in that context, Linux tries to run on the TSC clocksource, by default - if the TSC looks like it has all the nice properties and after that monitors the TSC for conflicts. Cf. 'invariant TSC' and e.g. https://mcmap.net/q/19854/-rdtsc-accuracy-across-cpu-cores.Impiety
A
3

I had a brief look and there doesn't seem to be a built-in way to directly get this information from the kernel.

However, the symbol tsc_khz (which I'm guessing is what you want) is exported by the kernel. You could write a small kernel module that exposes a sysfs interface and use that to read out the value of tsc_khz from userspace.

If writing a kernel module is not an option, it may be possible to use some Dark Magic™ to read out the value directly from the kernel memory space. Parse the kernel binary or System.map file to find the location of the tsc_khz symbol and read it from /dev/{k}mem. This is, of course, only possible provided that the kernel is configured with the appropriate options.

Lastly, from reading the kernel source comments, it looks like there's a possibility that the TSC may be unstable on some platforms. I don't know much about the inner workings of the x86 arch but this may be something you want to take into consideration.

Algetic answered 1/2, 2016 at 5:33 Comment(2)
TSC varies with current CPU clock speed if processor has speed step and does not have "constant_tsc". All modern x86 processors have constant_tsc [as OP mentioned his does]. BTW, the necessary information is [as outlined in my answer] in /proc/cpuinfoInfestation
I like the idea of writing a kernel module, but since I've never done that, I think I'm inclined to trust the bogomips/2 value. I can recheck what that produces if I switch to a different motherboard.Bloem
I
2

The TSC rate is directly related to "cpu MHz" in /proc/cpuinfo. Actually, the better number to use is "bogomips". The reason is that while the freq for TSC is the max CPU freq, the current "cpu Mhz" can vary at time of your invocation.

The bogomips value is computed at boot. You'll need to adjust this value by the number of cores and the processor count (i.e. the number of hyperthreads). That gives you [fractional] MHz. That is what I use to do what you want to do.

To get the processor count, look for the last "processor: " line. The processor count is <value> + 1. Call it "cpu_count".

To get number of cores, any "cpu cores: " works. number of cores is <value>. Call it "core_count".

So, the formula is:

smt_count = cpu_count;
if (core_count)
    smt_count /= core_count;
cpu_freq_in_khz = (bogomips * scale_factor) / smt_count;

That is extracted from my actual code, which is below.


Here's the actual code I use. You won't be able to use it directly because it relies on boilerplate I have, but it should give you some ideas, particularly with how to compute

// syslgx/tvtsc -- system time routines (RDTSC)

#include <tgb.h>
#include <zprt.h>

tgb_t systvinit_tgb[] = {
    { .tgb_val = 1, .tgb_tag = "cpu_mhz" },
    { .tgb_val = 2, .tgb_tag = "bogomips" },
    { .tgb_val = 3, .tgb_tag = "processor" },
    { .tgb_val = 4, .tgb_tag = "cpu_cores" },
    { .tgb_val = 5, .tgb_tag = "clflush_size" },
    { .tgb_val = 6, .tgb_tag = "cache_alignment" },
    TGBEOT
};

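// forward declaration (the helper is defined after its first use)
static syskhz_t _systvinitkhz(char *str);

// _systvinit -- get CPU speed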
static void
_systvinit(void)
{
    const char *file;
    const char *dlm;
    XFIL *xfsrc;
    int matchflg;
    char *cp;
    char *cur;
    char *rhs;
    char lhs[1000];
    tgb_pc tgb;
    syskhz_t khzcpu;
    syskhz_t khzbogo;
    syskhz_t khzcur;
    sysmpi_p mpi;

    file = "/proc/cpuinfo";

    xfsrc = fopen(file,"r");
    if (xfsrc == NULL)
        sysfault("systvinit: unable to open '%s' -- %s\n",file,xstrerror());

    dlm = " \t";

    khzcpu = 0;
    khzbogo = 0;

    mpi = &SYS->sys_cpucnt;
    SYSZAPME(mpi);

    // (1) look for "cpu MHz : 3192.515" (preferred)
    // (2) look for "bogomips : 3192.51" (alternate)
    // FIXME/CAE -- on machines with speed-step, bogomips may be preferred (or
    // disable it)
    while (1) {
        cp = fgets(lhs,sizeof(lhs),xfsrc);
        if (cp == NULL)
            break;

        // strip newline
        cp = strchr(lhs,'\n');
        if (cp != NULL)
            *cp = 0;

        // look for symbol value divider
        cp = strchr(lhs,':');
        if (cp == NULL)
            continue;

        // split symbol and value
        *cp = 0;
        rhs = cp + 1;

        // strip trailing whitespace from symbol
        for (cp -= 1;  cp >= lhs;  --cp) {
            if (! XCTWHITE(*cp))
                break;
            *cp = 0;
        }

        // convert "foo bar" into "foo_bar"
        for (cp = lhs;  *cp != 0;  ++cp) {
            if (XCTWHITE(*cp))
                *cp = '_';
        }

        // match on interesting data
        matchflg = 0;
        for (tgb = systvinit_tgb;  TGBMORE(tgb);  ++tgb) {
            if (strcasecmp(lhs,tgb->tgb_tag) == 0) {
                matchflg = tgb->tgb_val;
                break;
            }
        }
        if (! matchflg)
            continue;

        // look for the value
        cp = strtok_r(rhs,dlm,&cur);
        if (cp == NULL)
            continue;

        zprt(ZPXHOWSETUP,"_systvinit: GRAB/%d lhs='%s' cp='%s'\n",
            matchflg,lhs,cp);

        // process the value
        // NOTE: because of Intel's speed step, take the highest cpu speed
        switch (matchflg) {
        case 1:  // genuine CPU speed
            khzcur = _systvinitkhz(cp);
            if (khzcur > khzcpu)
                khzcpu = khzcur;
            break;

        case 2:  // the consolation prize
            khzcur = _systvinitkhz(cp);

            // we've seen some "wild" values
            if (khzcur > 10000000)
                break;

            if (khzcur > khzbogo)
                khzbogo = khzcur;
            break;

        case 3:  // remember # of cpu's so we can adjust bogomips
            mpi->mpi_cpucnt = atoi(cp);
            mpi->mpi_cpucnt += 1;
            break;

        case 4:  // remember # of cpu cores so we can adjust bogomips
            mpi->mpi_corecnt = atoi(cp);
            break;

        case 5:  // cache flush size
            mpi->mpi_cshflush = atoi(cp);
            break;

        case 6:  // cache alignment
            mpi->mpi_cshalign = atoi(cp);
            break;
        }
    }

    fclose(xfsrc);

    // we want to know the number of hyperthreads
    mpi->mpi_smtcnt = mpi->mpi_cpucnt;
    if (mpi->mpi_corecnt)
        mpi->mpi_smtcnt /= mpi->mpi_corecnt;

    zprt(ZPXHOWSETUP,"_systvinit: FINAL khzcpu=%d khzbogo=%d mpi_cpucnt=%d mpi_corecnt=%d mpi_smtcnt=%d mpi_cshalign=%d mpi_cshflush=%d\n",
        khzcpu,khzbogo,mpi->mpi_cpucnt,mpi->mpi_corecnt,mpi->mpi_smtcnt,
        mpi->mpi_cshalign,mpi->mpi_cshflush);

    if ((mpi->mpi_cshalign == 0) || (mpi->mpi_cshflush == 0))
        sysfault("_systvinit: cache parameter fault\n");

    do {
        // use the best reference
        // FIXME/CAE -- with speed step, bogomips is better
#if 0
        if (khzcpu != 0)
            break;
#endif

        khzcpu = khzbogo;
        if (mpi->mpi_smtcnt)
            khzcpu /= mpi->mpi_smtcnt;
        if (khzcpu != 0)
            break;

        sysfault("_systvinit: unable to obtain cpu speed\n");
    } while (0);

    systvkhz(khzcpu);

    zprt(ZPXHOWSETUP,"_systvinit: EXIT\n");
}

// _systvinitkhz -- decode value
// RETURNS: CPU freq in khz
static syskhz_t
_systvinitkhz(char *str)
{
    char *src;
    char *dst;
    int rhscnt;
    char bf[100];
    syskhz_t khz;

    zprt(ZPXHOWSETUP,"_systvinitkhz: ENTER str='%s'\n",str);

    dst = bf;
    src = str;

    // get lhs of lhs.rhs
    for (;  *src != 0;  ++src, ++dst) {
        if (*src == '.')
            break;
        *dst = *src;
    }

    // skip over the dot (if there is one; don't step past the terminator)
    if (*src == '.')
        ++src;

    // get rhs of lhs.rhs and determine how many rhs digits we have
    rhscnt = 0;
    for (;  *src != 0;  ++src, ++dst, ++rhscnt)
        *dst = *src;

    *dst = 0;

    khz = atol(bf);
    zprt(ZPXHOWSETUP,"_systvinitkhz: PRESCALE bf='%s' khz=%d rhscnt=%d\n",
        bf,khz,rhscnt);

    // scale down (e.g. we got xxxx.yyyy)
    for (;  rhscnt > 3;  --rhscnt)
        khz /= 10;

    // scale up (e.g. we got xxxx.yy--bogomips does this)
    for (;  rhscnt < 3;  ++rhscnt)
        khz *= 10;

    zprt(ZPXHOWSETUP,"_systvinitkhz: EXIT khz=%d\n",khz);

    return khz;
}

UPDATE:

Sigh. Yes.

I was using "cpu MHz" from /proc/cpuinfo prior to the introduction of processors with "speed step" technology, so I switched to "bogomips" and the algorithm was derived empirically based on that. When I derived it, I only had access to hyperthreaded machines. However, I've found an old one that is not and the SMT thing isn't valid.

However, it appears that bogomips is always 2x the [maximum] CPU speed. See http://www.clifton.nl/bogo-faq.html That hasn't always been my experience on all kernel versions over the years [IIRC, I started with 0.99.x], but it's probably a reliable assumption these days.
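Under that assumption (bogomips = 2x the frequency, which is an empirical observation from this thread and not a documented interface), the computation reduces to something like this sketch. The function name is made up, and it expects a /proc/cpuinfo line of the form "bogomips : 6999.82" with two fractional digits.

```c
#include <stdio.h>
#include <string.h>

/* Turn one "bogomips : xxxx.yy" line from /proc/cpuinfo into kHz, assuming
 * bogomips is twice the frequency in MHz (empirical, see text above).
 * Returns 0 for any other line. bogomips_line_to_khz is a hypothetical name. */
static unsigned long bogomips_line_to_khz(const char *line)
{
    if (strncmp(line, "bogomips", 8) != 0)
        return 0;
    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return 0;
    unsigned long whole, frac;
    /* assumes exactly two fractional digits, as in "6999.82" */
    if (sscanf(colon + 1, "%lu.%2lu", &whole, &frac) != 2)
        return 0;
    /* xxxx.yy MHz -> kHz, then halve */
    return (whole * 1000 + frac * 10) / 2;
}
```

A line reporting 3200.00 bogomips comes out as 1600000 kHz.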

With "constant TSC" [which all newer processors have], denoted by constant_tsc in the flags: field in /proc/cpuinfo, the TSC rate is the maximum CPU frequency.

Originally, the only way to get the frequency information was from /proc/cpuinfo. Now, however, in more modern kernels, there is another way that may be easier and more definitive [I had code coverage for this in other software of mine, but had forgotten about it]:

/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq

The content of this file is the maximum CPU frequency in kHz. There are analogous files for the other CPU cores. The files should be identical for most sane motherboards (e.g. ones that are composed of the same model chip and don't try to mix [say] i7s and atoms). Otherwise, you'd have to keep track of the info on a per-core basis and that would get messy fast.

The given directory also has other interesting files. For example, if your processor has "speed step" [and some of the other files can tell you that], you can force maximum performance by writing performance to the scaling_governor file. This will disable use of speed step.

If the processor did not have constant_tsc, you'd have to disable speed step [and run the cores at maximum rate] to get accurate measurements.

Infestation answered 1/2, 2016 at 5:25 Comment(5)
On my quad core 1.6GHz Atom, TSC counts at 1.6GHz but bogomips says 3.2GHz. On my quad core 3.5GHz i7-4770K, TSC counts at 3.5GHz but bogomips says 7GHz.Bloem
I've added more explanation [as well as the code]. Try it with the adjustments. I've been using the code snippet for 2 decades, so it's [usually :-)] correct. You need the "cores" line and the last "processor" line value to calc the SMT [hyperthread] count. Bogomips is based on hyperthreading (e.g. with clock freq of 3 and 2 hyperthreads, bogo will be 3*2 or 6), so we need to divide bogo by hyperthread count to get freqInfestation
That's a nice theory, but my quad core atom doesn't do hyperthreading, and still shows a bogomips value that is twice the clock rate.Bloem
On a "Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz" system, /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq contains the max turbo speed, not the processor base frequency, which is what the TSC appears to be running at. I've fallen back to parsing (from /proc/cpuinfo) the model name string, but it would certainly be nice to have somewhere unambiguous to grab this other than dmesg, where "tsc: Refined TSC clocksource calibration: 2399.997 MHz" is reported... SL7 / 3.10.0-327.18.2.el7.x86_64Glossal
You write 'With "constant TSC" the TSC rate is the maximum CPU frequency' and this is incorrect, in general. I have checked several constant_tsc systems (Skylake i7, Haswell i5, AMD Phenom 2), and the TSC rate as determined and used by the Linux kernel was lower than the CPU base clock rate, on all of them.Impiety
