It is possible to find a general solution which gets the operating frequency correctly for one thread or many threads. This does not need admin/root privileges or access to model-specific registers. I have tested this on Linux and Windows on Intel processors including Nehalem, Ivy Bridge, and Haswell, from one socket up to four sockets (40 threads). The results all deviate less than 0.5% from the correct answers. Before I show you how to do this, let me show the results (from GCC 4.9 and MSVC 2013):
Linux: E5-1620 (Ivy Bridge) @ 3.60GHz
1 thread: 3.789, 4 threads: 3.689 GHz: (3.8-3.789)/3.8 = 0.3%, (3.7-3.689)/3.7 = 0.3%
Windows: E5-1620 (Ivy Bridge) @ 3.60GHz
1 thread: 3.792, 4 threads: 3.692 GHz: (3.8-3.792)/3.8 = 0.2%, (3.7-3.692)/3.7 = 0.2%
Linux: 4xE7-4850 (Nehalem) @ 2.00GHz
1 thread: 2.390, 40 threads: 2.125 GHz: (2.4-2.390)/2.4 = 0.4%, (2.133-2.125)/2.133 = 0.4%
Linux: i5-4250U (Haswell) CPU @ 1.30GHz
1 thread: within 0.5% of 2.6 GHz, 2 threads: within 0.5% of 2.3 GHz
Windows: 2xE5-2667 v2 (Ivy Bridge) @ 3.3 GHz
1 thread: 4.000 GHz, 16 threads: 3.601 GHz: (4.0-4.0)/4.0 = 0.0%, (3.6-3.601)/3.6 = 0.0%
I got the idea for this from this link:
http://randomascii.wordpress.com/2013/08/06/defective-heat-sinks-causing-garbage-gaming/
To do this, you first do what people did 20 years ago: write some code with a loop where you know the latency, and time it. Here is what I used:
static int inline SpinALot(int spinCount)
{
    __m128 x = _mm_setzero_ps();
    for(int i=0; i<spinCount; i++) {
        x = _mm_add_ps(x,_mm_set1_ps(1.0f));
    }
    return _mm_cvt_ss2si(x);
}
This has a loop-carried dependency, so the CPU can't reorder the operations to reduce the latency: it always takes 3 clock cycles per iteration. The OS won't migrate the thread to another core because we will bind the threads.
Then you run this function on each physical core. I did this with OpenMP. The threads must be bound for this. On Linux with GCC you can use export OMP_PROC_BIND=true to bind the threads, and assuming you have ncores physical cores, also do export OMP_NUM_THREADS=ncores. If you want to programmatically bind the threads and find the number of physical cores for Intel processors, see programatically-detect-number-of-physical-processors-cores-or-if-hyper-threading and thread-affinity-with-windows-msvc-and-openmp.
void sample_frequency(const int nsamples, const int n, float *max, int nthreads) {
    *max = 0;
    volatile int x = 0;
    double min_time = DBL_MAX;
    #pragma omp parallel reduction(+:x) num_threads(nthreads)
    {
        double dtime, min_time_private = DBL_MAX;
        for(int i=0; i<nsamples; i++) {
            #pragma omp barrier
            dtime = omp_get_wtime();
            x += SpinALot(n);
            dtime = omp_get_wtime() - dtime;
            if(dtime<min_time_private) min_time_private = dtime;
        }
        #pragma omp critical
        {
            if(min_time_private<min_time) min_time = min_time_private;
        }
    }
    *max = 3.0f*n/min_time*1E-9f;
}
Finally, run the sampler in a loop and print the results:
int main(void) {
    int ncores = getNumCores();
    printf("num_threads %d, num_cores %d\n", omp_get_max_threads(), ncores);
    while(1) {
        float max1, max2;
        sample_frequency(1000, 1000000, &max2, ncores);
        sample_frequency(1000, 1000000, &max1, 1);
        printf("1 thread: %.3f, %d threads: %.3f GHz\n", max1, ncores, max2);
    }
}
I have not tested this on AMD processors. I think AMD processors with modules (e.g. Bulldozer) will have to bind to each module, not to each AMD "core". This could be done with export GOMP_CPU_AFFINITY with GCC. You can find a full working example at https://bitbucket.org/zboson/frequency which works on Windows and Linux on Intel processors, correctly finds the number of physical cores for Intel processors (at least since Nehalem), and binds one thread to each physical core (without using OMP_PROC_BIND, which MSVC does not support).
This method has to be modified a bit for modern processors due to different frequency scaling for SSE, AVX, and AVX512.
Here is a new table I get after modifying my method (see the code after the table) with four Xeon 6142 processors (16 cores per processor).
        sums   1-thread   64-threads
SSE        1        3.7          3.3
SSE        8        3.7          3.3
AVX        1        3.7          3.3
AVX        2        3.7          3.3
AVX        4        3.6          2.9
AVX        8        3.6          2.9
AVX512     1        3.6          2.9
AVX512     2        3.6          2.9
AVX512     4        3.5          2.2
AVX512     8        3.5          2.2
These numbers agree with the frequencies in this table
https://en.wikichip.org/wiki/intel/xeon_gold/6142#Frequencies
The interesting thing is that I now need to do at least 4 parallel sums to achieve the lower frequencies. The latency of addps on Skylake is 4 clock cycles, and the adds can go to two ports (with AVX512, ports 0 and 1 fuse to form one AVX512 port, and the other AVX512 operations go to port 5).
Here is how I did eight parallel sums.
static int inline SpinALot(int spinCount) {
    __m512 x1 = _mm512_set1_ps(1.0);
    __m512 x2 = _mm512_set1_ps(2.0);
    __m512 x3 = _mm512_set1_ps(3.0);
    __m512 x4 = _mm512_set1_ps(4.0);
    __m512 x5 = _mm512_set1_ps(5.0);
    __m512 x6 = _mm512_set1_ps(6.0);
    __m512 x7 = _mm512_set1_ps(7.0);
    __m512 x8 = _mm512_set1_ps(8.0);
    __m512 one = _mm512_set1_ps(1.0);
    for(int i=0; i<spinCount; i++) {
        x1 = _mm512_add_ps(x1,one);
        x2 = _mm512_add_ps(x2,one);
        x3 = _mm512_add_ps(x3,one);
        x4 = _mm512_add_ps(x4,one);
        x5 = _mm512_add_ps(x5,one);
        x6 = _mm512_add_ps(x6,one);
        x7 = _mm512_add_ps(x7,one);
        x8 = _mm512_add_ps(x8,one);
    }
    __m512 t1 = _mm512_add_ps(x1,x2);
    __m512 t2 = _mm512_add_ps(x3,x4);
    __m512 t3 = _mm512_add_ps(x5,x6);
    __m512 t4 = _mm512_add_ps(x7,x8);
    __m512 t6 = _mm512_add_ps(t1,t2);
    __m512 t7 = _mm512_add_ps(t3,t4);
    __m512 x = _mm512_add_ps(t6,t7);
    return _mm_cvt_ss2si(_mm512_castps512_ps128(x));
}