It is possible to find a general solution which gets the operating frequency correctly for one thread or many threads. This does not need admin/root privileges or access to model-specific registers. I have tested this on Linux and Windows on Intel processors including Nehalem, Ivy Bridge, and Haswell, from one socket up to four sockets (40 threads). The results all deviate less than 0.5% from the correct answers. Before I show you how to do this, let me show the results (from GCC 4.9 and MSVC 2013):
Linux: E5-1620 (Ivy Bridge) @ 3.60GHz
1 thread: 3.789, 4 threads: 3.689 GHz: (3.8-3.789)/3.8 = 0.3%, (3.7-3.689)/3.7 = 0.3%
Windows: E5-1620 (Ivy Bridge) @ 3.60GHz
1 thread: 3.792, 4 threads: 3.692 GHz: (3.8-3.792)/3.8 = 0.2%, (3.7-3.692)/3.7 = 0.2%
Linux: 4xE7-4850 (Nehalem) @ 2.00GHz
1 thread: 2.390, 40 threads: 2.125 GHz: (2.4-2.390)/2.4 = 0.4%, (2.133-2.125)/2.133 = 0.4%
Linux: i5-4250U (Haswell) CPU @ 1.30GHz
1 thread: within 0.5% of 2.6 GHz, 2 threads: within 0.5% of 2.3 GHz
Windows: 2xE5-2667 v2 (Ivy Bridge) @ 3.3 GHz
1 thread: 4.000 GHz, 16 threads: 3.601 GHz: (4.0-4.0)/4.0 = 0.0%, (3.6-3.601)/3.6 = 0.0%
I got the idea for this from this link:
http://randomascii.wordpress.com/2013/08/06/defective-heat-sinks-causing-garbage-gaming/
To do this, you first do what people did 20 years ago: write some code with a loop where you know the latency, and time it. Here is what I used:
static int inline SpinALot(int spinCount)
{
    __m128 x = _mm_setzero_ps();
    for(int i=0; i<spinCount; i++) {
        x = _mm_add_ps(x,_mm_set1_ps(1.0f));
    }
    return _mm_cvt_ss2si(x);
}
This has a loop-carried dependency, so the CPU can't reorder the operations to reduce the latency: it always takes 3 clock cycles per iteration. The OS won't migrate the thread to another core because we will bind the threads.
Then you run this function on each physical core. I did this with OpenMP. The threads must be bound for this. On Linux with GCC you can use export OMP_PROC_BIND=true to bind the threads, and assuming you have ncores physical cores, also do export OMP_NUM_THREADS=ncores. If you want to programmatically bind the threads and find the number of physical cores for Intel processors, see programatically-detect-number-of-physical-processors-cores-or-if-hyper-threading and thread-affinity-with-windows-msvc-and-openmp.
void sample_frequency(const int nsamples, const int n, float *max, int nthreads) {
    *max = 0;
    volatile int x = 0;
    double min_time = DBL_MAX;
    #pragma omp parallel reduction(+:x) num_threads(nthreads)
    {
        double dtime, min_time_private = DBL_MAX;
        for(int i=0; i<nsamples; i++) {
            #pragma omp barrier
            dtime = omp_get_wtime();
            x += SpinALot(n);
            dtime = omp_get_wtime() - dtime;
            if(dtime<min_time_private) min_time_private = dtime;
        }
        #pragma omp critical
        {
            if(min_time_private<min_time) min_time = min_time_private;
        }
    }
    *max = 3.0f*n/min_time*1E-9f;
}
Finally, run the sampler in a loop and print the results:
int main(void) {
    int ncores = getNumCores();
    printf("num_threads %d, num_cores %d\n", omp_get_max_threads(), ncores);
    while(1) {
        float max1, max2;
        sample_frequency(1000, 1000000, &max2, ncores);
        sample_frequency(1000, 1000000, &max1, 1);
        printf("1 thread: %.3f, %d threads: %.3f GHz\n", max1, ncores, max2);
    }
}
I have not tested this on AMD processors. I think AMD processors with modules (e.g. Bulldozer) will have to bind to each module, not to each AMD "core". This could be done with export GOMP_CPU_AFFINITY with GCC. You can find a full working example at https://bitbucket.org/zboson/frequency which works on Windows and Linux on Intel processors, correctly finds the number of physical cores for Intel processors (at least since Nehalem), and binds one thread to each physical core (without using OMP_PROC_BIND, which MSVC does not support).
This method has to be modified a bit for modern processors due to different frequency scaling for SSE, AVX, and AVX512.
Here is a new table I get after modifying my method (see the code after the table) with four Xeon 6142 processors (16 cores per processor).
        sums   1-thread   64-threads
SSE        1        3.7          3.3
SSE        8        3.7          3.3
AVX        1        3.7          3.3
AVX        2        3.7          3.3
AVX        4        3.6          2.9
AVX        8        3.6          2.9
AVX512     1        3.6          2.9
AVX512     2        3.6          2.9
AVX512     4        3.5          2.2
AVX512     8        3.5          2.2
These numbers agree with the frequencies in this table
https://en.wikichip.org/wiki/intel/xeon_gold/6142#Frequencies
The interesting thing is that I now need to do at least 4 parallel sums to achieve the lower frequencies. The latency of addps on Skylake is 4 clock cycles, and the adds can go to two ports (with AVX512, ports 0 and 1 fuse to form one AVX512 port, and the other AVX512 operations go to port 5).
Here is how I did eight parallel sums.
static int inline SpinALot(int spinCount) {
    __m512 x1 = _mm512_set1_ps(1.0);
    __m512 x2 = _mm512_set1_ps(2.0);
    __m512 x3 = _mm512_set1_ps(3.0);
    __m512 x4 = _mm512_set1_ps(4.0);
    __m512 x5 = _mm512_set1_ps(5.0);
    __m512 x6 = _mm512_set1_ps(6.0);
    __m512 x7 = _mm512_set1_ps(7.0);
    __m512 x8 = _mm512_set1_ps(8.0);
    __m512 one = _mm512_set1_ps(1.0);
    for(int i=0; i<spinCount; i++) {
        x1 = _mm512_add_ps(x1,one);
        x2 = _mm512_add_ps(x2,one);
        x3 = _mm512_add_ps(x3,one);
        x4 = _mm512_add_ps(x4,one);
        x5 = _mm512_add_ps(x5,one);
        x6 = _mm512_add_ps(x6,one);
        x7 = _mm512_add_ps(x7,one);
        x8 = _mm512_add_ps(x8,one);
    }
    __m512 t1 = _mm512_add_ps(x1,x2);
    __m512 t2 = _mm512_add_ps(x3,x4);
    __m512 t3 = _mm512_add_ps(x5,x6);
    __m512 t4 = _mm512_add_ps(x7,x8);
    __m512 t6 = _mm512_add_ps(t1,t2);
    __m512 t7 = _mm512_add_ps(t3,t4);
    __m512 x = _mm512_add_ps(t6,t7);
    return _mm_cvt_ss2si(_mm512_castps512_ps128(x));
}