Why does execution time differ when running the same program multiple times?

Consider a Node.js CPU-bound program that generates prime numbers:

// generatePrimes.js
// long running / CPU-bound calculation

function generatePrimes(start, range) {

  const primes = []
  let isPrime = true
  let end = start + range

  // naive trial division over [start, end)
  for (let i = start; i < end; i++) {
    // test i against every candidate divisor below sqrt(end)
    // (sqrt(end) rather than sqrt(i) does extra work, but the results stay correct)
    for (let j = start; j < Math.sqrt(end); j++) {
      if (i !== j && i % j === 0) {
        isPrime = false
        break
      }
    }
    if (isPrime) {
      primes.push(i)
    }
    isPrime = true
  }

  return primes
}


function main() {

  const min = 2
  const max = 1e7

  console.log( generatePrimes(min, max) )

}  


if (require.main === module) 
  main()

module.exports = { generatePrimes }
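
Since the module exports generatePrimes, the computation alone can also be timed from inside Node, excluding interpreter startup and console printing; a quick sketch (run from the same directory as generatePrimes.js):

$ node -e '
  const { generatePrimes } = require("./generatePrimes")
  const t0 = process.hrtime.bigint()
  generatePrimes(2, 1e7)
  // elapsed nanoseconds, converted to seconds
  console.log(Number(process.hrtime.bigint() - t0) / 1e9, "seconds")
'

This helps rule out per-run differences in Node startup as the source of the variance.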

My HW/OS configuration: a laptop running Ubuntu 20.04.2 LTS, with 8 cores:

$ inxi -C -M
Machine:   Type: Laptop System: HP product: HP Laptop 17-by1xxx v: Type1ProductConfigId serial: <superuser/root required> 
           Mobo: HP model: 8531 v: 17.16 serial: <superuser/root required> UEFI: Insyde v: F.32 date: 12/14/2018 
CPU:       Topology: Quad Core model: Intel Core i7-8565U bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 700 MHz min/max: 400/4600 MHz Core speeds (MHz): 1: 700 2: 700 3: 700 4: 700 5: 700 6: 700 7: 700 8: 700     
$ echo "CPU threads: $(grep -c processor /proc/cpuinfo)"
CPU threads: 8

Now, let's measure the elapsed time:

$ /usr/bin/time -f "%e" node generatePrimes.js 
[
    2,   3,   5,   7,  11,  13,  17,  19,  23,  29,  31,  37,
   41,  43,  47,  53,  59,  61,  67,  71,  73,  79,  83,  89,
   97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151,
  157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223,
  227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
  283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359,
  367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433,
  439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503,
  509, 521, 523, 541,
  ... 664479 more items
]
7.99

OK: running the program once, the elapsed time is ~8 seconds.
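
Aside: recording CPU time next to wall-clock time can help narrow things down. If the slowdown comes from frequency scaling, user CPU time grows together with elapsed time (the same work takes more CPU-seconds at a lower clock), whereas pure scheduler contention leaves per-process CPU time roughly constant while elapsed time grows. A sketch using GNU time's other format specifiers:

$ /usr/bin/time -f "elapsed: %e s  user: %U s  sys: %S s" node generatePrimes.js > /dev/null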


But now consider the bash script testGeneratePrimes.sh, which runs the same program N times, first in sequence and then in parallel, measuring each elapsed time:

#!/bin/bash 

# Get number of runs, as command line argument
if [ $# -eq 0 ]
  then
    echo
    echo -e "  run n processes, in sequence and in parallel."
    echo
    echo -e "  usage:"
    echo -e "    $0 <number of runs>"
    echo
    echo -e "  examples:"
    echo -e "    $0 6"
    echo -e "    run 6 times"
    echo
    exit 1
fi

numRuns=$1

# run a single instance of the process 'node generatePrimes'
runProcess() {
  /usr/bin/time -f "%e" node generatePrimes.js > /dev/null
}

echo
echo "SEQUENCE TEST: running generatePrimes, $numRuns successive sequential times" 
echo

for i in $(seq $numRuns); do
  runProcess
done

echo
echo "PARALLEL TEST: running generatePrimes, $numRuns in parallel (background processes)" 
echo

for i in $(seq $numRuns); do
  runProcess &
done

# wait for all background processes to finish
wait

Running the script (6 processes):

$ ./testGeneratePrimes.sh 6

SEQUENCE TEST: running generatePrimes, 6 successive sequential times

8.16
9.09
11.44
11.57
12.93
12.00

PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)

25.99
26.16
30.51
30.64
31.60
31.60

I see that:

  • in the sequence test, elapsed times increase with each run, from ~8 seconds to ~12 seconds ?!
  • in the parallel test, elapsed times increase from ~25 seconds to ~31 seconds ?!

That's insane.

I do not understand why! Maybe it's a Linux scheduler limitation? A CPU hardware limitation/issue?

I also tried:

  • nice -20 /usr/bin/time -f "%e" node generatePrimes.js (note: with GNU coreutils, the obsolete nice -20 form means an adjustment of +20, i.e. lower priority; raising priority would be sudo nice -n -20), and
  • taskset -c 0-7 /usr/bin/time -f "%e" node generatePrimes.js

but neither made any significant difference in the described behavior.
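
Another thing worth checking is which cpufreq governor is active and what the cores are actually clocked at while a test runs; a minimal sketch using the standard sysfs cpufreq interface:

# active frequency-scaling governor of each logical CPU
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# sample the live frequency of every logical CPU once per second
$ watch -n 1 'grep "cpu MHz" /proc/cpuinfo'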

Questions:

  1. Why do the elapsed times vary so much?
  2. Is there any way to configure Linux so that it does not limit per-process CPU usage?

BTW, a related question: running multiple Node.js worker threads: why such a large overhead/latency?

UPDATE (MORE TESTS)

Following Nate Eldredge's suggestion (see comments), here is some info and settings gathered with the cpupower and sensors commands:

$ sudo cpupower -c all info 
analyzing CPU 0:
perf-bias: 6
analyzing CPU 1:
perf-bias: 6
analyzing CPU 2:
perf-bias: 6
analyzing CPU 3:
perf-bias: 6
analyzing CPU 4:
perf-bias: 6
analyzing CPU 5:
perf-bias: 6
analyzing CPU 6:
perf-bias: 6
analyzing CPU 7:
perf-bias: 6

# set perf-bias to max performance
$ sudo cpupower set -b 0

$ sudo cpupower -c all info -b
analyzing CPU 0:
perf-bias: 0
analyzing CPU 1:
perf-bias: 0
analyzing CPU 2:
perf-bias: 0
analyzing CPU 3:
perf-bias: 0
analyzing CPU 4:
perf-bias: 0
analyzing CPU 5:
perf-bias: 0
analyzing CPU 6:
perf-bias: 0
analyzing CPU 7:
perf-bias: 0
$ sudo cpupower monitor
    | Nehalem                   || Mperf              || Idle_Stats                                                   
 CPU| C3   | C6   | PC3  | PC6   || C0   | Cx   | Freq  || POLL | C1   | C1E  | C3   | C6   | C7s  | C8   | C9   | C10   
   0|  0,04|  2,53|  0,00|  0,00||  4,23| 95,77|   688||  0,00|  0,00|  0,08|  0,10|  2,38|  0,00| 24,65|  0,78| 67,90
   4|  0,04|  2,53|  0,00|  0,00||  3,68| 96,32|   675||  0,00|  0,00|  0,02|  0,06|  0,76|  0,00| 10,83|  0,03| 84,75
   1|  0,03|  1,94|  0,00|  0,00||  6,39| 93,61|   656||  0,00|  0,00|  0,04|  0,06|  1,88|  0,00| 16,19|  0,00| 75,63
   5|  0,04|  1,94|  0,00|  0,00||  1,35| 98,65|   689||  0,00|  0,02|  1,19|  0,04|  0,40|  0,33|  3,89|  0,82| 92,02
   2|  0,56| 25,49|  0,00|  0,00|| 12,88| 87,12|   673||  0,00|  0,00|  0,84|  0,74| 28,61|  0,03| 34,48|  3,44| 19,81
   6|  0,56| 25,48|  0,00|  0,00||  4,30| 95,70|   676||  0,00|  0,00|  0,03|  0,09|  1,48|  0,00| 22,66|  1,11| 70,52
   3|  0,19|  3,61|  0,00|  0,00||  3,67| 96,33|   658||  0,00|  0,00|  0,02|  0,07|  1,36|  0,00| 14,85|  0,03| 80,16
   7|  0,19|  3,60|  0,00|  0,00||  6,21| 93,79|   679||  0,00|  0,00|  0,28|  0,19|  3,48|  0,76| 31,10|  1,50| 56,75

$ sudo cpupower monitor ./testGeneratePrimes.sh 6
[sudo] password for giorgio: 

SEQUENCE TEST: running generatePrimes, 6 successive sequential times

8.18
9.06
11.66
11.29
11.30
11.21

PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)

20.83
20.95
28.42
28.42
28.47
28.52
./testGeneratePrimes.sh took 91,26958 seconds and exited with status 0
    | Nehalem                   || Mperf              || Idle_Stats                                                   
 CPU| C3   | C6   | PC3  | PC6   || C0   | Cx   | Freq  || POLL | C1   | C1E  | C3   | C6   | C7s  | C8   | C9   | C10   
   0|  0,20|  1,98|  0,00|  0,00|| 33,88| 66,12|  2008||  0,00|  0,04|  0,18|  0,18|  1,83|  0,02| 16,67|  0,22| 47,04
   4|  0,20|  1,98|  0,00|  0,00|| 10,46| 89,54|  1787||  0,00|  0,09|  0,32|  0,37|  4,01|  0,03| 25,21|  0,19| 59,37
   1|  0,26|  2,40|  0,00|  0,00|| 24,52| 75,48|  1669||  0,00|  0,06|  0,18|  0,20|  2,17|  0,00| 14,90|  0,17| 57,84
   5|  0,26|  2,40|  0,00|  0,00|| 32,07| 67,93|  1662||  0,00|  0,07|  0,19|  0,14|  1,40|  0,02|  9,31|  0,53| 56,33
   2|  0,93| 13,33|  0,00|  0,00|| 31,31| 68,69|  2025||  0,00|  0,05|  0,43|  1,00| 18,21|  0,01| 26,18|  1,74| 21,23
   6|  0,93| 13,33|  0,00|  0,00|| 11,98| 88,02|  1711||  0,00|  0,19|  0,31|  0,22|  2,63|  0,03| 18,87|  0,76| 65,04
   3|  0,15|  0,98|  0,00|  0,00|| 47,38| 52,62|  2627||  0,00|  0,07|  0,17|  0,13|  1,35|  0,01|  7,88|  0,10| 42,80
   7|  0,15|  0,98|  0,00|  0,00|| 59,25| 40,75|  2235||  0,00|  0,06|  0,18|  0,18|  1,58|  0,00|  9,31|  0,63| 28,91


$ sensors && ./testGeneratePrimes.sh 6 && sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +43.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +44.0°C  (high = +100.0°C, crit = +100.0°C)

BAT0-acpi-0
Adapter: ACPI interface
in0:          12.89 V  
curr1:         0.00 A  

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       65.49 V  
edge:        +511.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:        1.07 kW (cap =  30.00 W)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1:        +42.0°C  

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +47.0°C  (crit = +120.0°C)
temp2:        +53.0°C  (crit = +127.0°C)


SEQUENCE TEST: running generatePrimes, 6 successive sequential times

8.36
9.76
11.35
11.38
11.22
11.24

PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)

21.06
21.14
28.50
28.55
28.62
28.65

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +54.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +51.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +50.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +50.0°C  (high = +100.0°C, crit = +100.0°C)

BAT0-acpi-0
Adapter: ACPI interface
in0:          12.89 V  
curr1:         0.00 A  

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       65.49 V  
edge:        +511.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:        1.07 kW (cap =  30.00 W)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1:        +46.0°C  

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +55.0°C  (crit = +120.0°C)
temp2:        +57.0°C  (crit = +127.0°C)
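
To correlate the slowdown with clock speed and temperature over time, a simple sampling loop can log both while the tests run; a rough sketch (logFreqTemp.sh is a hypothetical helper; it assumes the lm-sensors package and the sysfs cpufreq interface):

#!/bin/bash
# logFreqTemp.sh: sample per-core frequency and package temperature
# twice a second, until interrupted
while true; do
  echo "--- $(date '+%T') ---"
  # current frequency (kHz) of each logical CPU
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
  # package temperature as reported by coretemp
  sensors | grep 'Package id 0'
  sleep 0.5
done

Running it in a second terminal (or in the background, redirected to a file) while ./testGeneratePrimes.sh 6 executes should show whether the frequency drops as the temperature climbs.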

Manvell asked 22/4/2021 at 10:13 — Comments (20):
I can't replicate those results (unless I switch away from that window and start using the browser, for example): 10.64 10.68 10.42 10.51 10.59 10.46 10.48 10.65 10.49 10.47. The variance in this run is IMHO marginal and can be easily explained by a multi-user, multi-tasking OS ... Mind you, I also redirected the output of the numbers to /dev/null, so another possibility is that the terminal window/scroll-buffer handling makes the difference. — Abm
Yes, your values are reasonable/expected. In my case instead, I'm starting to think the problem is not in the software or the operating system, but maybe related to hardware: I fear it is a CPU/BIOS "power saving" setting that does not enable full core usage. Does that make sense? Unfortunately, entering the BIOS config, I can't tune any related parameter. — Manvell
Just out of curiosity: what does the timing on your machine do if you don't actually output the numbers to screen? /usr/bin/time -f "%e" nodejs generatePrimes.js >/dev/null — Abm
Tried it. The difference is minimal/imperceptible, e.g. 8.35 with redirection to /dev/null ?! and 8.30 when keeping the stdout print. I think stdio doesn't make a real difference. — Manvell
Even if you do it for all instances in the script? For a single execution, screen handling/scroll-back buffers might be negligible ... — Abm
No problem. I modified the script to avoid stdout, and reported the updated test results. — Manvell
Cool - and a pity that my suggestion didn't have the desired impact. Back to the drawing board. But I still think it's not Linux per se doing this. — Abm
Yes, I tentatively agree that it is not a Linux problem. I just tested the script on a virtual machine (with 2 vCPUs) and got pretty much the same CONSTANT elapsed times (~15 secs) in all cases! That makes me think of a HW limitation of my current HP laptop ( :( ), probably due to some power-saving setting; unfortunately I can't tune it in the BIOS... — Manvell
Uh-huh ... funny little aside; I just ran the test on my workstation at work, too, and the results are almost identical to the ones at home: 10.94 10.93 10.93 10.91 10.90 10.91 10.92 10.91 10.91 10.92 ... despite completely different CPU types. At home I have an 8-core Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz; the work machine is a 6-core Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz. — Abm
On my end, everything runs at around 7s. $ sudo inxi -C -M Machine: Type: Desktop System: Gigabyte product: Z370P D3 v: N/A serial: N/A Mobo: Gigabyte model: Z370P D3-CF v: x.x serial: N/A UEFI [Legacy]: American Megatrends v: F14 date: 08/13/2019 CPU: Info: 8-Core model: Intel Core i7-9700K bits: 64 type: MCP L2 cache: 12 MiB Speed: 800 MHz min/max: 800/4900 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 4: 800 5: 800 6: 800 7: 800 8: 800. node --version: v12.18.3 — Clavicembalo
thx Luis, tink. You are both running these last tests on desktop machines, right? Your tests confirm the evidence that the issue is probably related to some CPU "power saving" setting of my laptop. But I can't see how to avoid that power saving :( — Manvell
Can you log CPU frequency and temperature several times a second while the test runs? One possibility is that after the first few seconds of running, the CPU gets too hot and has to reduce the clock speed to cool down. — Wretch
Hmmm, makes sense. Do you know how I can measure, on the Linux CLI, CPU frequency/temperature while the program is running? Anyway, if your thesis is true, I guess there is no solution. But that's weird: I have 8 cores but I can't really use them... — Manvell
@GiorgioRobino: For temperature, the lm-sensors package. For CPU frequency, /proc/cpuinfo or cpupower. — Wretch
@NateEldredge thx. I updated the question, adding some tests using the cpupower and sensors commands. Using cpupower I set max performance; nevertheless I didn't get the expected improvement :( — Manvell
Oh! Your machine has not 8 cores but 4. So presumably you have 8 logical processors due to hyperthreading. This would explain the parallel numbers pretty well: two of the jobs get their own cores, so they are fast. The other four run on the remaining two cores, two to a core, and are slower because they have to share their resources. It doesn't help explain the sequential test, though. — Wretch
As far as I know, from the programming perspective, we count the cores with, for example: echo "CPU threads: $(grep -c processor /proc/cpuinfo)". So in my case I have 8 cores, and the reason I run 6 processes (in parallel) is just to avoid "saturating" the number of cores, hopefully avoiding any OS queueing/saturation limit. A common-sense/safe rule is to limit parallel multithreading to num_cores - 1 (reasonable, but maybe not a golden rule). Am I wrong? — Manvell
The point of hyperthreading is that you can program as if you truly had 8 physical cores, but you won't get the same performance as if they were all truly independent cores. It's a compromise: as you can see, the jobs that share a physical core run slower, but not twice as slow (which is what you would see without hyperthreading, if they had to be scheduled in alternation). — Wretch
/proc/cpuinfo lists logical processors; you can use the other fields to see how they are distributed across physical cores. See unix.stackexchange.com/questions/57920/… — Wretch
OK, I have 4 real cores. Now, if in testGeneratePrimes.sh I limit the number of processes to 4, avoiding any hyperthreading ambiguity, I still get: 1. sequential: 8.15 9.15 11.52 11.36; 2. parallel: 20.06 20.12 20.14 20.21. I think nothing changed. I suspect a "CPU power saving" mechanism inhibits full core usage. — Manvell
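
As the comments above establish, the machine has 4 physical cores with 2 hyperthreads each. Two quick checks for the physical/logical split, plus the power-saving knob most plausibly involved (a sketch; frequency-set needs root, and the available governors depend on the cpufreq driver):

# logical CPUs vs. the physical-core topology
$ grep -c ^processor /proc/cpuinfo
$ lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

# switch every core's scaling governor to 'performance', then verify
$ sudo cpupower frequency-set -g performance
$ cpupower frequency-info | grep -i governor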
