Impossible Cpu time for process / thread with rusage
Asked Answered
J

1

6

I am calculating the delta between the cpu time a process / thread is using, between the calling of a certain function.

I am getting impossible value for the process cpu time, some time's the next call is returning smaller cpu time, which is impossible.

this is the code i am using to calculate the cpu time:

  u64 CpuTime::calculateCpuTime(bool a_thread)
  {
    struct rusage l_rusage;
    int retVal;
    if(a_thread)
    {
      retVal = getrusage(1, &l_rusage); //1 = RUSAGE_THREAD
    }
    else
    {
      retVal = getrusage(0, &l_rusage); //0 = RUSAGE_SELF
    }
    ASSERT(retVal==0);

    u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
      (static_cast<u64>(l_rusage.ru_utime.tv_usec));

    u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*1000000)+
      (static_cast<u64>(l_rusage.ru_stime.tv_usec));

    if(a_thread)
    {
      return userUSeconds + systemUSeconds;
    }
    return (userUSeconds + systemUSeconds) / ACE_OS::num_processors_online();
  }

Also the function calculating retrieving this is async, no more then 1 thread can access it at a time.

this is an example of the impossible output (process):

2016/03/23 13:39:05.187 - #2 - 15471/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 46756013 ProcessCpuTime = 39820759 deltaProcessCpuTime= 250

2016/03/23 13:39:05.386 - #2 - 15470/15055 - ThreadDeltaCpuTime = 5000 ThreadCpuTime = 46663013 ProcessCpuTime = 39699759 deltaProcessCpuTime = 18446744073709430616

The delta calculation is a simple newValue - oldValue, with a unsigned int, it leads to those numbers.

Example for a thread impossible output:

2016/03/23 13:38:59.183 - #2 - 15472/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 47541012 ProcessCpuTime = 39658134 deltaProcessCpuTime = 5250

2016/03/23 13:38:59.382 - #2 - 15472/15055 - ThreadDeltaCpuTime = 1844 6744073708557617 ThreadCpuTime = 46547013 ProcessCpuTime = 39663009 deltaProcessCpuTime = 4625

i have bolded the thread id, also you can see the function has been executed 200ms after the first 1.

I am looking for a clue for why this is happening, i have tried using the ACE_OS rusage as well, with same results (make sense), i could solve this by ignoring "bad" values, but i am wondering why this is happening.

EDIT:

This is where i do the actual printing:

  u64 computeCpuTime()
  {
    u64 deltaCpuTime= CpuTime::getThreadDeltaCpuTime();
    u64 CpuTime= getThreadCpuTime();
    u64 ProcessCpuTime= getProcessCpuTime();
    u64 deltaProcessCpuTime= CpuTime::getProcessDeltaCpuTime();
    Log<<value(deltaCpuTime)<<value(CpuTime)<<value(ProcessCpuTime)<<value(deltaProcessCpuTime)<<endlog;
    return deltaCpuTime;
  }

And the rest of the functions used in this calculation:

  u64 CpuTime::getThreadDeltaCpuTime()
  {
    pid_t thisThread = (pid_t) syscall (SYS_gettid);
    u64& val = m_threadsCpuMap[thisThread];
    u64 oldValue =val;
    val = calculateCpuTime(true);
    return val - oldValue;
  }

  u64 CpuTime::getProcessDeltaCpuTime()
  {
    u64 oldValue = m_processCpu;
    m_processCpu = calculateCpuTime(false);

    return m_processCpu - oldValue;
  }

  u64 getThreadCpuTime()
  {
    return CpuTime::calculateCpuTime(true);
  }

  u64 getProcessCpuTime()
  {
    return CpuTime::calculateCpuTime(false);
  }

The object doing the system call ("m_cpuTime") is a singleton, protected with a mutex, only 1 thread can access it at a time.

it holds the map of the threads cpu usage, m_threadsCpuMap (used for the delta), and also the last process cpu usage, m_processCpu.

ANOTHER EDIT:

I have simplified it with a simple test, working with a single thread, removed the division by active cpu's, and am only checking the Process Cpu time, still with impossible results.

Here is the updated functions:

Test:

for(int i = 0; i < 100000 ; i++)
{
  for(int k = 0; k < 1000000; k++)
    m = k % i;
    cpuTime = CpuTime::instance()->getProcessDeltaCpuTime();
}

Functions used in test:

  u64 CpuTime::getProcessDeltaCpuTime()
  {
    u64 oldValue = m_processCpu;
    m_processCpu = calculateCpuTime(eThisProcess);
   Log<<value(oldValue)<<value(m_processCpu)<value( m_processCpu - oldValue)<<endlog;
    return m_processCpu - oldValue;
  }

  u64 CpuTime::calculateCpuTime(int a_type)
  {
    struct rusage l_rusage;
    int retVal;
    if(a_type == eThisThread)
    {
      retVal = /*ACE_OS::*/getrusage(1, &l_rusage);
    }
    else
    {
      retVal = /*ACE_OS::*/getrusage(0, &l_rusage);
    }

    u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
      (static_cast<u64>(l_rusage.ru_utime.tv_usec));

    u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*100000)+
      (static_cast<u64>(l_rusage.ru_stime.tv_usec));

    if(a_type == eThisThread)
    {
      return userUSeconds + systemUSeconds;
    }
    return (userUSeconds + systemUSeconds)/* / ACE_OS::num_processors_online()*/;

This is an example of the logs, i have removed all the times the delta was ~0

2016/03/29 08:07:05.198 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14797750 m_processCpu = 14798749 m_process Cpu - oldValue = 999

2016/03/29 08:07:05.199 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14798749 m_processCpu = 14799749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.200 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14799749 m_processCpu = 14800749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.201 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14800749 m_processCpu = 14801749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.202 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14801749 m_processCpu = 14802749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.203 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14802749 m_processCpu = 13903748 m_process Cpu - oldValue = 18446744073708652615

2016/03/29 08:07:05.204 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13903748 m_processCpu = 13904748 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.205 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13904748 m_processCpu = 13905748 m_process Cpu - oldValue = 1000

This time there was only 1 time of bad values, out of 100000 iterations. And after it all the following calculation were also smaller, e.g. it went like this: 1 2 3 4 5 6 7 8 9 10 bad value, 6 7 8 9 10... As if the process lost count, went back, and then continued normally.

YET ANOUTHER EDIT:

I have separated the code, completely, compiled it as simple as possible. and the problem did not occur, it could be a matter of probability (it was 1/100000 before). this is the code:

int main()
{
  long int n = 0;
  long int oldValue = 0;
  long int newValue = 0;
  unsigned long int deltaValue = 0;

  for(int i = 0; i < 1000000; i++)
  {
    for(long int m = 0; m <10000; m++)
      n = m + i;
    struct rusage l_rusage;
    int retVal;
    retVal = getrusage(0, &l_rusage);
    if(retVal != 0) return 0;

    long int userUSeconds = l_rusage.ru_utime.tv_sec*1000000 + l_rusage.ru_utime.tv_usec;

    long int systemUSeconds = l_rusage.ru_stime.tv_sec*100000 + l_rusage.ru_stime.tv_usec;
    oldValue = newValue;
    newValue = userUSeconds + systemUSeconds;
    deltaValue = newValue - oldValue;
    if(deltaValue != 0)
      std::cout<<"Old value: "<< oldValue <<" New Value: "<< newValue <<" Delta value: "<<deltaValue<<"\n";


  }
  std::cout<<n;
  return 0;
}

And from further investigating the earlier test, (which use our unittest framework), the problem occurs always at around the same CPU time. Which mean this is happening for a reason, i just cant figure out what happens at that exact time which could cause some thing like this. Or how some thing like this could be caused.

I am not familiar enough with linux kernel to understand how this calculation is made.

Another insight - when using clock() to get the process. it does not happen, even more - it seems like until that moment the values were around the same, after this point they are not:

2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 14598780 m_processCpu - oldValue = 0 oldClockTime = 14580000 m_clockTime = 14580000 m_clockTime - oldClockTime = 0

2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 13699779 m_processCpu - oldValue = 18446744073708652615 oldClockTime = 14580000 m_clockTime = 14590000 m_clockTime - oldClockTime = 10000

2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 13699779 m_processCpu = 13699779 m_processCpu - oldValue = 0 oldClockTime = 14590000 m_clockTime = 14590000 m_clockTime - oldClockTime = 0

Which leads me to a different question, how is rusage calculated, and how is clock()? what could lead to the difference between them?

I have worked around the problem, measuring it using clock_gettime, here is the code if any 1 is interested:

  u64 CpuMeasure::calculateCpuTime(int a_type)
  {
    struct timespec ts;

    if(a_type == eThisThread)
    {
      if (ACE_OS::clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) == 0)
      {
         return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000; //From nano to mili seconds
      }
    }
    else
    {
      if (ACE_OS::clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) == 0)
      {
         return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000;
      }
    }
    ASSERT(false); //we cant be here, unless some thing bad happened
    return 0;
}

I am using ACE_OS for portability, but it works with the default functions from sys/time.h

Yet, i am still in question about rusage strange behavior , what could cause it to give those values.

Jiggerypokery answered 23/3, 2016 at 14:41 Comment(23)
You mention multi-threading and async, is it possible that the returned values are being processed out of order? I'd like to see the context in which this function is called and the result calculated.Discount
Your output is not very useful without a glimpse of the code in which you compute and print out those values.Rosenarosenbaum
I notice you are not applying the same math to compute the user and system time (missing *100000) in the system time computation..Blowfly
I will add the rest, wanted to keep it a bit shorter. about the *1000000 , r_utime is in seconds, and r_stime is in microseconds, as i am returning the value in micro, i only need to multiply the secondsJiggerypokery
it doesn't look like you are invoking m_cpuTime.calculateCpuTime() each time, some times you don't pass through the singleton that should guarantee mutual exclusion, why is it?Rosenarosenbaum
Moreover, collecting 4 different values while at the same time acquiring/releasing a lock might be a source of issues, any thread could kick in and delay the execution of your current thread. This might or might not be a problem for you, I didn't really go very deep in what you're doing with the values, once you collected them :-PRosenarosenbaum
i am calling it, it just removed some of the code to make it more readable, it passes thru 2 objects in the way there, i am invoking the singleton every time. the values are just for debugging this exact problem, i was only working with thread delta, when i have noticed the error. Thank you tho :)Jiggerypokery
It's a bit hard to help you if you're not showing us the real code, lol.Rosenarosenbaum
You are right, i added some more of the calling bits, i cant put the whole thing, it would make the post unreadble and too long. The only thing i left out is the Lock, it is sychronized , the whole thing.Jiggerypokery
What type is your map m_threadsCpuMap object. I see you get a reference to the value in it and assume it is the actual value in the map, so when it gets updated you do not need to re-set the value in the map.. is this a valid assumption? it could be returning a copy to you and you are referencing/updating the copy not the original? Or is it a simple array :)?Blowfly
its is a std::map<pid_t, u64> I am doing it for the exact reason you mentioned, so after updating i dont need to re-set the value of the map.Jiggerypokery
If the key is not good (does not exist in the map) then the operator [] will create that element (with garbage value) and return that to you, this may be what you are seeing.. is it possible the pid id key is no good or you are not initializing the map properly?Blowfly
I belive it creates a value with deafult ctr, 0 for u64, however i did try anouther version, where i was checking if the key exists, if not init it with 0. That was not it :( i am lost hereJiggerypokery
Are you able to reproduce the issue in a much simpler program, one that would be short enough to post in its entirety?Escapee
Yes, i have simplified it and reproduced it, adding this to the main post.Jiggerypokery
Are you running ACE on Windows or some Unix-ish OS? If under Windows, I'll take a guess that you are interpreting ru_utime and ru_stime incorrectly (on Windows, FILETIME is a 64-bit value of 100-nanosecond intervals).Jacynth
Unix-ish, to be exact : Linux A02T444 3.0.56-k2 #1 SMP Tue Jan 1 16:05:59 IST 2013 x86_64 GNU/LinuxJiggerypokery
Please provide a minimal but complete example. Without that, the cause could be anywhere in the code you don't show. Also your question is off-topic without it.Rebellious
Interestingly, 18446744073709430616d = 1999999999996A55h. You must have a conversion problem.Grimsley
Ah, I just saw your minimal example which (surprise?) didn't show these effects. That's exactly the reason why you're supposed to extract this before even asking...Rebellious
looks like uint 64 overflowOrts
The over flow is the symptom, after the getrusage returns lower CPU usage values for the later call. which cause the newValue to be smaller then oldValue, and the overflow in the delta.Jiggerypokery
I cant do that, sadly, aside from the fact it is gigantic, it is related to a feature i am developing for a product i do not own.Jiggerypokery
S
5

Check carefully these lines: you are multiplying the system time seconds by 100 thousand instead of 1 million:

u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
  (static_cast<u64>(l_rusage.ru_utime.tv_usec));

u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*100000)+
  (static_cast<u64>(l_rusage.ru_stime.tv_usec));

They are there in two of three examples you have shown (not in the first one).

However this would explain the odd sequence:

2016/03/29 08:07:05.201 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14800749 m_processCpu = 14801749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.202 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14801749 m_processCpu = 14802749 m_process Cpu - oldValue = 1000

2016/03/29 08:07:05.203 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14802749 m_processCpu = 13903748 m_process Cpu - oldValue = 18446744073708652615

2016/03/29 08:07:05.204 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13903748 m_processCpu = 13904748 m_process Cpu - oldValue = 1000

at every iteration it adds 1000 us. But I guess that as soon as the system time surpass one second, only 1/10 of second is added, resulting in -0.9 seconds shift that is exactly 14.8 -> 13.9 seconds.

in fact:

 18446744073708652615 - 2^64 + 1 = -899000

that is -900000 (0.9 seconds) + 1000 us

Sharice answered 29/3, 2016 at 15:37 Comment(1)
Thank you, cant believe i actually missed that, its always the simplest mistakes.Jiggerypokery

© 2022 - 2024 — McMap. All rights reserved.