Cannot avoid context-switches on a process launched alone on a CPU

I am investigating how to run a process on a dedicated CPU in order to avoid context switches. On my Ubuntu machine, I isolated two CPUs using the kernel parameters "isolcpus=3,7" and "irqaffinity=0-2,4-6". I have checked that they are correctly taken into account:

$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.8.0-27-generic root=UUID=58c66f12-0588-442b-9bb8-1d2dd833efe2 ro quiet splash isolcpus=3,7 irqaffinity=0-2,4-6 vt.handoff=7

After a reboot, I can check that everything works as expected. On a first console I run

$ stress -c 24
stress: info: [31717] dispatching hogs: 24 cpu, 0 io, 0 vm, 0 hdd

And on a second one, using "top" I can check the usage of my CPUs:

top - 18:39:07 up 2 days, 20:48, 18 users,  load average: 23,15, 10,46, 4,53
Tasks: 457 total,  26 running, 431 sleeping,   0 stopped,   0 zombie
%Cpu0  :100,0 us,  0,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  : 98,7 us,  1,3 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu2  : 99,3 us,  0,7 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  0,0 us,  0,0 sy,  0,0 ni,100,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu4  : 95,7 us,  4,3 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu5  : 98,0 us,  2,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu6  : 98,7 us,  1,3 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu7  :  0,0 us,  0,0 sy,  0,0 ni,100,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  7855176 total,   385736 free,  5891280 used,  1578160 buff/cache
KiB Swap: 15624188 total, 10414520 free,  5209668 used.   626872 avail Mem 

CPUs 3 and 7 are free while the 6 other ones are fully busy. Fine.


For the rest of my test, I will use a small application that does almost pure processing

  1. It uses two int buffers of the same size
  2. It reads one-by-one all the values of the first buffer
    • each value is a random index in the second buffer
  3. It reads the value at the index in the second buffer
  4. It sums all the values taken from the second buffer
  5. It repeats all the previous steps for bigger and bigger buffers
  6. At the end, it prints the number of voluntary and involuntary CPU context switches

I am now studying my application when I launch it:

  1. on a non-isolated CPU
  2. on an isolated CPU

I do it via the following command lines:

$ ./TestCpuset              ### launch on any non-isolated CPU
$ taskset -c 7 ./TestCpuset ### launch on isolated CPU 7
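
The pinning that taskset performs can also be requested from inside the program. The following is only a minimal sketch, assuming Linux with glibc and g++ (which defines _GNU_SOURCE by default, as the CPU_* macros require); pinToCpu is a hypothetical helper name and is not part of TestCpuset.

```cpp
#include <sched.h>    // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET (Linux/glibc)
#include <cerrno>
#include <cstring>
#include <iostream>

// Hypothetical helper: restrict the calling thread to a single CPU.
// Returns true on success, false if the kernel rejected the request.
bool pinToCpu(int cpu)
{
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(cpu, &mask);

  // A pid of 0 means "the calling thread"
  if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
  {
    std::cerr << "sched_setaffinity(" << cpu << ") failed: "
              << std::strerror(errno) << std::endl;
    return false;
  }
  return true;
}
```

Calling pinToCpu(7) at the start of main() should have the same effect on the process as "taskset -c 7", with the difference that the program starts on whatever CPU the scheduler first picked and is migrated afterwards.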

When launched on any CPU, the number of context switches varies from 20 to... thousands.

When launched on an isolated CPU, the number of context switches is almost constant (between 10 and 20), even if I launch a "stress -c 24" in parallel. (This looks quite normal.)

But my question is: why isn't it exactly 0? When a process is switched out, it is in order to replace it with another process, isn't it? But in my case there is no other process to replace it with!

My hypothesis is that the "isolcpus" option isolates a CPU from any process (unless the process is given a CPU affinity, as "taskset" does) but not from kernel tasks. However, I found no documentation about it.

I would appreciate any help in reaching 0 context switches.

FYI, this question is close to another one I previously opened: Cannot allocate exclusively a CPU for my process

Here is the code of the program I am using:

#include <limits.h>
#include <cstdlib>   // rand()
#include <iostream>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

const unsigned int BUFFER_SIZE = 4096;

using namespace std;


class TimedSumComputer
{

public:
  TimedSumComputer() :
    sum(0),
    bufferSize(0),
    valueBuffer(0),
    indexBuffer(0)
  {}


public:
  virtual ~TimedSumComputer()
  {
    resetBuffers();
  }


public:
  void init(unsigned int bufferSize)
  {
    this->bufferSize = bufferSize;
    resetBuffers();
    initValueBuffer();
    initIndexBuffer();
  }


private:
  void resetBuffers() 
  {
    delete [] valueBuffer;
    delete [] indexBuffer;
    valueBuffer = 0;
    indexBuffer = 0;
  }


  void initValueBuffer()
  {
    valueBuffer = new unsigned int[bufferSize];
    for (unsigned int i = 0 ; i < bufferSize ; i++)
    {
      valueBuffer[i] = randomUint();
    }
  }


  static unsigned int randomUint()
  {
    // rand() never returns a negative value, so the conversion is lossless
    return static_cast<unsigned int>(rand());
  }


protected:
  void initIndexBuffer()
  {
    indexBuffer = new unsigned int[bufferSize];
    for (unsigned int i = 0 ; i < bufferSize ; i++)
    {
      indexBuffer[i] = rand() % bufferSize;
    }
  }


public:
  unsigned int getSum() const
  {
    return sum;
  }


  unsigned int computeTimeInMicroSeconds()
  {
    struct timeval startTime, endTime;

    gettimeofday(&startTime, NULL);
    computeSum();   // the result is stored in the member "sum"
    gettimeofday(&endTime, NULL);

    return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
  }


  unsigned int computeSum()
  {
    sum = 0;

    for (unsigned int i = 0 ; i < bufferSize ; i++)
    {
      unsigned int index = indexBuffer[i];
      sum += valueBuffer[index];
    }

    return sum;
  }


protected:
  unsigned int sum;
  unsigned int bufferSize;
  unsigned int * valueBuffer;
  unsigned int * indexBuffer;

};



unsigned int runTestForBufferSize(TimedSumComputer & timedComputer, unsigned int bufferSize)
{
  timedComputer.init(bufferSize);

  unsigned int timeInMicroSec = timedComputer.computeTimeInMicroSeconds();
  cout << "bufferSize = " << bufferSize << " - time (in micro-sec) = " << timeInMicroSec << endl;
  return timedComputer.getSum();
}



void runTest(TimedSumComputer & timedComputer)
{
  unsigned int result = 0;

  for (unsigned int i = 1 ; i < 10 ; i++)
  {
    result += runTestForBufferSize(timedComputer, BUFFER_SIZE * i);
  }

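  // Buffer sizes grow by powers of ten: BUFFER_SIZE * 10 up to BUFFER_SIZE * 100000.
  // The last run allocates two buffers of 409,600,000 unsigned ints each
  // (about 1.6 GB per buffer with 4-byte unsigned int).
  unsigned int factor = 1;
  for (unsigned int i = 2 ; i <= 6 ; i++)
  {
    factor *= 10;
    result += runTestForBufferSize(timedComputer, BUFFER_SIZE * factor);
  }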

  cout << "result = " << result << endl;
}



void printPid()
{
  cout << "###############################" << endl;
  cout << "Pid = " << getpid() << endl;
  cout << "###############################" << endl;
}



void printNbContextSwitch()
{
  struct rusage usage;
  getrusage(RUSAGE_THREAD, &usage);
  cout << "Number of voluntary context switch:   " << usage.ru_nvcsw << endl;
  cout << "Number of involuntary context switch: " << usage.ru_nivcsw << endl;
}



int main()
{
  printPid();

  TimedSumComputer timedComputer;
  runTest(timedComputer);

  printNbContextSwitch();

  return 0;
}
Lawanda answered 23/11, 2016 at 21:11 Comment(2)
Where does your data come from? Are you using more memory than your machine physically has? I would expect that accessing a paged-out section of memory is going to force an increase in the context switch counter when the process is suspended while awaiting the paging operation. - Proviso
The program I am using is just a simple test program. It only accesses buffers that are initialized with random values (cf. the rand() function). - Lawanda
L
7

Today, I obtained more clues regarding my problem. I realized that I had to investigate in depth what was happening in the kernel scheduler. I found these two pages:

I enabled scheduler tracing while my application was running, like this:

# sudo bash
# cd /sys/kernel/debug/tracing
# echo 1 > options/function-trace ; echo function_graph > current_tracer ; echo 1 > tracing_on ; echo 0 > tracing_max_latency ; taskset -c 7 [path-to-my-program]/TestCpuset ; echo 0 > tracing_on
# cat trace

As my program was launched on CPU 7 (taskset -c 7), I have to filter the "trace" output

# grep " 7)" trace

I can then search for transitions, from one process to another one:

# grep " 7)" trace | grep "=>"
 ...
 7)  TestCpu-4753  =>  kworker-5866 
 7)  kworker-5866  =>  TestCpu-4753 
 7)  TestCpu-4753  =>   watchdo-26  
 7)   watchdo-26   =>  TestCpu-4753 
 7)  TestCpu-4753  =>  kworker-5866 
 7)  kworker-5866  =>  TestCpu-4753 
 7)  TestCpu-4753  =>  kworker-5866 
 7)  kworker-5866  =>  TestCpu-4753 
 7)  TestCpu-4753  =>  kworker-5866 
 7)  kworker-5866  =>  TestCpu-4753 
 ...

Bingo! It seems that the context switches I am tracking are transitions to:

  • kworker
  • watchdog
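
To see how often each task preempts the program, the "=>" lines can be tallied. The following is a rough sketch, assuming the four-field line layout shown in the grep output above ("7)", source task, "=>", destination task); stripPid and countSwitchTargets are hypothetical helper names.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Drop the trailing "-<pid>" from a task label such as "kworker-5866".
std::string stripPid(const std::string & label)
{
  std::string::size_type dash = label.rfind('-');
  return dash == std::string::npos ? label : label.substr(0, dash);
}

// For every "A => B" line whose A starts with fromTask, count B (without its pid).
std::map<std::string, unsigned int> countSwitchTargets(std::istream & trace,
                                                       const std::string & fromTask)
{
  std::map<std::string, unsigned int> counts;
  std::string line;

  while (std::getline(trace, line))
  {
    std::istringstream fields(line);
    std::string cpu, from, arrow, to;

    // Expected fields: "7)", "TestCpu-4753", "=>", "kworker-5866"
    if (!(fields >> cpu >> from >> arrow >> to) || arrow != "=>")
    {
      continue;   // not a task-switch line
    }
    if (from.compare(0, fromTask.size(), fromTask) == 0)
    {
      counts[stripPid(to)]++;
    }
  }
  return counts;
}
```

Fed with the ten lines listed above and fromTask set to "TestCpu", this yields 4 switches to kworker and 1 to watchdo (the task name is truncated in the trace).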

I now have to find:

  • What exactly are these processes/threads? (It seems that they are handled by the kernel.)
  • Can I prevent them from running on my dedicated CPUs?

Of course, once again I would appreciate any help :-P

Lawanda answered 27/11, 2016 at 8:29 Comment(2)
I found this really interesting page: avoid daemon running in dedicated cpu cores - Lawanda
It seems that the watchdog can be disabled using the Linux kernel option nowatchdog - Lawanda
M
3

Potentially any syscall could involve a context switch. Accessing paged-out memory may increase the context switch count too. To reach 0 context switches you would need to force the kernel to keep all the memory your program uses mapped in its address space, and you would need to be sure that none of the syscalls you invoke entails a context switch. I believe it may be possible on kernels with RT patches, but it is probably hard to achieve on a standard distro kernel.
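
One way to ask the kernel to keep the program's memory resident is mlockall(). The following is only a sketch, assuming Linux; lockAllMemory and majorPageFaults are hypothetical helper names. The call can fail when the process lacks CAP_IPC_LOCK or exceeds RLIMIT_MEMLOCK, which is likely with buffers as large as the ones in the question.

```cpp
#include <sys/mman.h>       // mlockall, MCL_CURRENT, MCL_FUTURE
#include <sys/resource.h>   // getrusage
#include <cerrno>

// Hypothetical helper: lock all current and future pages into RAM.
// Returns 0 on success, otherwise the errno value set by mlockall().
int lockAllMemory()
{
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
  {
    return errno;
  }
  return 0;
}

// Hypothetical helper: number of page faults that required I/O so far.
// If this stays constant during the measurement, paging is not the cause.
long majorPageFaults()
{
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
  return usage.ru_majflt;
}
```

Comparing majorPageFaults() before and after the computation gives a quick check of whether any paged-out memory was touched, independently of whether the lock succeeded.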

Marvellamarvellous answered 27/11, 2016 at 0:3 Comment(2)
Thank you very much for that answer. I am almost sure that in my simple example (whose code is provided above) all the memory my program uses remains mapped => no page-out. - Lawanda
Moreover, I deliberately make almost no syscalls, except: 1) those introduced by new / delete, 2) getrusage(), 3) cout. I may be wrong, but I think the context switches entailed by syscalls are recorded as "voluntary context switches", whereas my main problem is with the "involuntary" ones. - Lawanda
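
To separate the switches that happen during the computation from those caused by allocation and printing, the counters can be sampled right before and right after the loop. This is a minimal sketch, assuming Linux and g++ (RUSAGE_THREAD is a Linux extension); SwitchCounts, readSwitchCounts and switchesSince are hypothetical names.

```cpp
#include <sys/resource.h>   // getrusage, RUSAGE_THREAD (Linux-specific)

// Voluntary and involuntary context switch counters of the calling thread
struct SwitchCounts
{
  long voluntary;
  long involuntary;
};

// Snapshot of the counters at the time of the call
SwitchCounts readSwitchCounts()
{
  struct rusage usage;
  getrusage(RUSAGE_THREAD, &usage);

  SwitchCounts counts = { usage.ru_nvcsw, usage.ru_nivcsw };
  return counts;
}

// Switches that occurred since the given snapshot was taken
SwitchCounts switchesSince(const SwitchCounts & before)
{
  SwitchCounts now = readSwitchCounts();
  SwitchCounts delta = { now.voluntary - before.voluntary,
                         now.involuntary - before.involuntary };
  return delta;
}
```

Taking a snapshot just before computeSum() and calling switchesSince() just after it reports only the switches that interrupted the summing loop itself.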
J
3

For the sake of those finding this via Google (like me): /sys/devices/virtual/workqueue/cpumask controls where the kernel may run work items queued with WORK_CPU_UNBOUND (i.e. "don't care which CPU"). As of writing this answer, it is not set by default to the same mask as the one isolcpus manipulates.

Once I changed it to exclude my isolated CPUs, I saw significantly fewer (but not zero) context switches on my critical threads. I assume that the work items that still ran on my isolated CPUs must have requested them specifically, for example by using schedule_on_each_cpu.
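
The value to write is a hexadecimal bitmask with one bit per CPU. The following sketch computes it, assuming at most 64 CPUs (larger systems use comma-separated 32-bit groups, which this does not handle); housekeepingMask, toHex and readWorkqueueCpumask are hypothetical helper names.

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Bitmask with one bit per CPU in [0, cpuCount), minus the isolated CPUs.
// Sketch only: limited to 64 CPUs.
std::uint64_t housekeepingMask(unsigned int cpuCount,
                               const std::vector<unsigned int> & isolatedCpus)
{
  std::uint64_t mask = (cpuCount >= 64) ? ~std::uint64_t(0)
                                        : ((std::uint64_t(1) << cpuCount) - 1);
  for (std::size_t i = 0 ; i < isolatedCpus.size() ; i++)
  {
    if (isolatedCpus[i] < 64)
    {
      mask &= ~(std::uint64_t(1) << isolatedCpus[i]);
    }
  }
  return mask;
}

// Hexadecimal text without a "0x" prefix, e.g. 0x77 -> "77"
std::string toHex(std::uint64_t mask)
{
  std::ostringstream out;
  out << std::hex << mask;
  return out.str();
}

// Current mask as text, or an empty string if the file cannot be read
std::string readWorkqueueCpumask()
{
  std::ifstream file("/sys/devices/virtual/workqueue/cpumask");
  std::string value;
  std::getline(file, value);
  return value;
}
```

For the 8-CPU machine in the question with CPUs 3 and 7 isolated, housekeepingMask(8, ...) gives 0x77, so "77" would be the value to write to the file as root.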

Joslyn answered 18/10, 2017 at 11:21 Comment(0)
