Consider a Node.js CPU-bound program that generates prime numbers:
// generatePrimes.js
// long-running / CPU-bound calculation
function generatePrimes(start, range) {
  const primes = []
  let isPrime = true
  let end = start + range
  for (let i = start; i < end; i++) {
    // trial division: test every candidate divisor up to sqrt(end)
    for (let j = start; j < Math.sqrt(end); j++) {
      if (i !== j && i % j === 0) {
        isPrime = false
        break
      }
    }
    if (isPrime) {
      primes.push(i)
    }
    isPrime = true
  }
  return primes
}

function main() {
  const min = 2
  const max = 1e7
  console.log( generatePrimes(min, max) )
}

if (require.main === module)
  main()

module.exports = { generatePrimes }
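For what it's worth, the computation alone can also be timed from inside Node, excluding process startup and the console printing of ~664k primes. A minimal sketch using the built-in process.hrtime.bigint() (the file name timePrimes.js is made up):

// timePrimes.js: hypothetical helper, times only the computation,
// excluding Node startup and the cost of printing the primes
const { generatePrimes } = require('./generatePrimes')

const t0 = process.hrtime.bigint()
const primes = generatePrimes(2, 1e7)
const elapsedSec = Number(process.hrtime.bigint() - t0) / 1e9
console.log(`found ${primes.length} primes in ${elapsedSec.toFixed(2)} s`)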
My HW/OS configuration: laptop with Linux Ubuntu 20.04.2 LTS desktop environment, with 8 hardware threads (a quad-core CPU with hyper-threading):
$ inxi -C -M
Machine: Type: Laptop System: HP product: HP Laptop 17-by1xxx v: Type1ProductConfigId serial: <superuser/root required>
Mobo: HP model: 8531 v: 17.16 serial: <superuser/root required> UEFI: Insyde v: F.32 date: 12/14/2018
CPU: Topology: Quad Core model: Intel Core i7-8565U bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 700 MHz min/max: 400/4600 MHz Core speeds (MHz): 1: 700 2: 700 3: 700 4: 700 5: 700 6: 700 7: 700 8: 700
$ echo "CPU threads: $(grep -c processor /proc/cpuinfo)"
CPU threads: 8
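(Aside: the inxi output above reports a quad-core CPU of type MT, i.e. the 8 "CPU threads" are hardware threads on 4 physical cores. lscpu, from util-linux, shows this split explicitly; an extra check, output omitted:)

$ lscpu | grep -E 'Thread|Core|Socket'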
Now, let's measure the elapsed time:
$ /usr/bin/time -f "%e" node generatePrimes.js
[
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89,
97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151,
157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223,
227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359,
367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433,
439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503,
509, 521, 523, 541,
... 664479 more items
]
7.99
OK: if I run the program once, the elapsed time is ~8 seconds.
But now consider the bash script testGeneratePrimes.sh, which measures the elapsed times of running the same program 6 times, first in sequence and then in parallel:
#!/bin/bash
# Get number of runs, as command line argument
if [ $# -eq 0 ]; then
  echo
  echo -e " run n processes, in sequence and in parallel."
  echo
  echo -e " usage:"
  echo -e " $0 <number of runs>"
  echo
  echo -e " examples:"
  echo -e " $0 6"
  echo -e " run 6 times"
  echo
  exit 1
fi

numRuns=$1

# run a single instance of the process 'node generatePrimes'
runProcess() {
  /usr/bin/time -f "%e" node generatePrimes.js > /dev/null
}

echo
echo "SEQUENCE TEST: running generatePrimes, $numRuns successive sequential times"
echo
for i in $(seq $numRuns); do
  runProcess
done

echo
echo "PARALLEL TEST: running generatePrimes, $numRuns in parallel (background processes)"
echo
for i in $(seq $numRuns); do
  runProcess &
done

# wait for all background jobs to finish
# (plain 'wait' waits for every child; 'wait < <(jobs -p)' only redirects stdin)
wait
Running the script (6 processes):
$ ./testGeneratePrimes.sh 6
SEQUENCE TEST: running generatePrimes, 6 successive sequential times
8.16
9.09
11.44
11.57
12.93
12.00
PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)
25.99
26.16
30.51
30.64
31.60
31.60
I see that:
- in the sequence test, the elapsed times increase from run to run, from ~8 seconds up to ~12 seconds?!
- in the parallel test, the elapsed times climb from ~25 seconds to ~31 seconds?!
That's insane. I do not understand why! Maybe it's a Linux scheduler limitation? A CPU hardware limitation/issue?
I also tried the commands:

nice -20 /usr/bin/time -f "%e" node generatePrimes.js

and

taskset -c 0-7 /usr/bin/time -f "%e" node generatePrimes.js

but neither made any significant difference to the described behavior.
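One more knob that may matter here is the cpufreq scaling governor. A quick check/set sketch (the sysfs path and the cpupower frequency-set subcommand are standard Linux interfaces, but I have not verified that they change the results above):

# show the active governor of each logical CPU
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# switch all CPUs to the 'performance' governor (cpupower is in linux-tools)
sudo cpupower frequency-set -g performance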
Questions:
- Why do the elapsed times vary so much?
- Is there any way to configure Linux so that it does not limit per-process CPU usage?

BTW, related question: running multiple nodejs worker threads: why such a large overhead/latency?
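For reference, a minimal worker-threads analogue of the parallel test above. This is a sketch, not one of my test programs: the file name workerPrimes.js is made up, and each worker repeats the identical full computation so its elapsed time is comparable to one background process:

// workerPrimes.js: hypothetical sketch, the worker-threads analogue of
// the parallel bash test; every worker runs the identical full computation
const { Worker, isMainThread, parentPort } = require('worker_threads')
const { generatePrimes } = require('./generatePrimes')

if (isMainThread) {
  const numWorkers = 6
  for (let w = 0; w < numWorkers; w++) {
    const t0 = Date.now()
    const worker = new Worker(__filename)
    // each worker posts back only the prime count, to avoid the cost of
    // structured-cloning ~664k numbers between threads
    worker.on('message', (count) => {
      console.log(`worker ${w}: ${count} primes in ${(Date.now() - t0) / 1000} s`)
    })
  }
} else {
  parentPort.postMessage(generatePrimes(2, 1e7).length)
}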
UPDATE (MORE TESTS)
Following Nate Eldredge's suggestion (see comments), here is some info gathered, and some settings applied, using cpupower and sensors:
$ sudo cpupower -c all info
analyzing CPU 0:
perf-bias: 6
analyzing CPU 1:
perf-bias: 6
analyzing CPU 2:
perf-bias: 6
analyzing CPU 3:
perf-bias: 6
analyzing CPU 4:
perf-bias: 6
analyzing CPU 5:
perf-bias: 6
analyzing CPU 6:
perf-bias: 6
analyzing CPU 7:
perf-bias: 6
# set perf-bias to max performance
$ sudo cpupower set -b 0
$ sudo cpupower -c all info -b
analyzing CPU 0:
perf-bias: 0
analyzing CPU 1:
perf-bias: 0
analyzing CPU 2:
perf-bias: 0
analyzing CPU 3:
perf-bias: 0
analyzing CPU 4:
perf-bias: 0
analyzing CPU 5:
perf-bias: 0
analyzing CPU 6:
perf-bias: 0
analyzing CPU 7:
perf-bias: 0
$ sudo cpupower monitor
| Nehalem || Mperf || Idle_Stats
CPU| C3 | C6 | PC3 | PC6 || C0 | Cx | Freq || POLL | C1 | C1E | C3 | C6 | C7s | C8 | C9 | C10
0| 0,04| 2,53| 0,00| 0,00|| 4,23| 95,77| 688|| 0,00| 0,00| 0,08| 0,10| 2,38| 0,00| 24,65| 0,78| 67,90
4| 0,04| 2,53| 0,00| 0,00|| 3,68| 96,32| 675|| 0,00| 0,00| 0,02| 0,06| 0,76| 0,00| 10,83| 0,03| 84,75
1| 0,03| 1,94| 0,00| 0,00|| 6,39| 93,61| 656|| 0,00| 0,00| 0,04| 0,06| 1,88| 0,00| 16,19| 0,00| 75,63
5| 0,04| 1,94| 0,00| 0,00|| 1,35| 98,65| 689|| 0,00| 0,02| 1,19| 0,04| 0,40| 0,33| 3,89| 0,82| 92,02
2| 0,56| 25,49| 0,00| 0,00|| 12,88| 87,12| 673|| 0,00| 0,00| 0,84| 0,74| 28,61| 0,03| 34,48| 3,44| 19,81
6| 0,56| 25,48| 0,00| 0,00|| 4,30| 95,70| 676|| 0,00| 0,00| 0,03| 0,09| 1,48| 0,00| 22,66| 1,11| 70,52
3| 0,19| 3,61| 0,00| 0,00|| 3,67| 96,33| 658|| 0,00| 0,00| 0,02| 0,07| 1,36| 0,00| 14,85| 0,03| 80,16
7| 0,19| 3,60| 0,00| 0,00|| 6,21| 93,79| 679|| 0,00| 0,00| 0,28| 0,19| 3,48| 0,76| 31,10| 1,50| 56,75
$ sudo cpupower monitor ./testGeneratePrimes.sh 6
[sudo] password for giorgio:
SEQUENCE TEST: running generatePrimes, 6 successive sequential times
8.18
9.06
11.66
11.29
11.30
11.21
PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)
20.83
20.95
28.42
28.42
28.47
28.52
./testGeneratePrimes.sh took 91,26958 seconds and exited with status 0
| Nehalem || Mperf || Idle_Stats
CPU| C3 | C6 | PC3 | PC6 || C0 | Cx | Freq || POLL | C1 | C1E | C3 | C6 | C7s | C8 | C9 | C10
0| 0,20| 1,98| 0,00| 0,00|| 33,88| 66,12| 2008|| 0,00| 0,04| 0,18| 0,18| 1,83| 0,02| 16,67| 0,22| 47,04
4| 0,20| 1,98| 0,00| 0,00|| 10,46| 89,54| 1787|| 0,00| 0,09| 0,32| 0,37| 4,01| 0,03| 25,21| 0,19| 59,37
1| 0,26| 2,40| 0,00| 0,00|| 24,52| 75,48| 1669|| 0,00| 0,06| 0,18| 0,20| 2,17| 0,00| 14,90| 0,17| 57,84
5| 0,26| 2,40| 0,00| 0,00|| 32,07| 67,93| 1662|| 0,00| 0,07| 0,19| 0,14| 1,40| 0,02| 9,31| 0,53| 56,33
2| 0,93| 13,33| 0,00| 0,00|| 31,31| 68,69| 2025|| 0,00| 0,05| 0,43| 1,00| 18,21| 0,01| 26,18| 1,74| 21,23
6| 0,93| 13,33| 0,00| 0,00|| 11,98| 88,02| 1711|| 0,00| 0,19| 0,31| 0,22| 2,63| 0,03| 18,87| 0,76| 65,04
3| 0,15| 0,98| 0,00| 0,00|| 47,38| 52,62| 2627|| 0,00| 0,07| 0,17| 0,13| 1,35| 0,01| 7,88| 0,10| 42,80
7| 0,15| 0,98| 0,00| 0,00|| 59,25| 40,75| 2235|| 0,00| 0,06| 0,18| 0,18| 1,58| 0,00| 9,31| 0,63| 28,91
$ sensors && testGeneratePrimes.sh 6 && sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +46.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +45.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +43.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +45.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +44.0°C (high = +100.0°C, crit = +100.0°C)
BAT0-acpi-0
Adapter: ACPI interface
in0: 12.89 V
curr1: 0.00 A
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx: 65.49 V
edge: +511.0°C (crit = +104000.0°C, hyst = -273.1°C)
power1: 1.07 kW (cap = 30.00 W)
pch_cannonlake-virtual-0
Adapter: Virtual device
temp1: +42.0°C
acpitz-acpi-0
Adapter: ACPI interface
temp1: +47.0°C (crit = +120.0°C)
temp2: +53.0°C (crit = +127.0°C)
SEQUENCE TEST: running generatePrimes, 6 successive sequential times
8.36
9.76
11.35
11.38
11.22
11.24
PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)
21.06
21.14
28.50
28.55
28.62
28.65
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +54.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +54.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +50.0°C (high = +100.0°C, crit = +100.0°C)
BAT0-acpi-0
Adapter: ACPI interface
in0: 12.89 V
curr1: 0.00 A
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx: 65.49 V
edge: +511.0°C (crit = +104000.0°C, hyst = -273.1°C)
power1: 1.07 kW (cap = 30.00 W)
pch_cannonlake-virtual-0
Adapter: Virtual device
temp1: +46.0°C
acpitz-acpi-0
Adapter: ACPI interface
temp1: +55.0°C (crit = +120.0°C)
temp2: +57.0°C (crit = +127.0°C)
Comments:

- … /dev/null, so another possibility is that the terminal window/scroll-buffer handling makes the difference. – Abm
- /usr/bin/time -f "%e" nodejs generatePrimes.js >/dev/null – Abm
- 10.94 10.93 10.93 10.91 10.90 10.91 10.92 10.91 10.91 10.92 … despite completely different CPU types. At home I have an 8-core Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz; the work machine is a 6-core Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz. – Abm
- $ sudo inxi -C -M Machine: Type: Desktop System: Gigabyte product: Z370P D3 v: N/A serial: N/A Mobo: Gigabyte model: Z370P D3-CF v: x.x serial: N/A UEFI [Legacy]: American Megatrends v: F14 date: 08/13/2019 CPU: Info: 8-Core model: Intel Core i7-9700K bits: 64 type: MCP L2 cache: 12 MiB Speed: 800 MHz min/max: 800/4900 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 4: 800 5: 800 6: 800 7: 800 8: 800; node --version v12.18.3 – Clavicembalo
- … the lm-sensors package. For CPU frequency, /proc/cpuinfo or cpupower. – Wretch
- echo "CPU threads: $(grep -c processor /proc/cpuinfo)": so in my case I have 8 cores, and the reason I run 6 processes (in parallel) is just to avoid "saturating" the number of cores, possibly avoiding any OS enqueue/saturation limit. A common-sense/safe rule is to limit parallel multithreading to num_cores - 1 (reasonable, but maybe not a golden rule). Am I wrong? – Manvell
- /proc/cpuinfo lists logical processors; you can use the other fields to see how they are distributed across physical cores. See unix.stackexchange.com/questions/57920/… – Wretch
- If in testPrimeNumbers.sh I limit the number of processes (threads) to 4, avoiding any hyperthreading ambiguity, I still get: 1. sequential: 8.15 9.15 11.52 11.36; 2. parallel: 20.06 20.12 20.14 20.21. I think nothing changed. I suspect a "CPU power saving" mechanism inhibits full core usage. – Manvell