I am calculating the delta between the cpu time a process / thread is using, between the calling of a certain function.
I am getting impossible value for the process cpu time, some time's the next call is returning smaller cpu time, which is impossible.
this is the code i am using to calculate the cpu time:
u64 CpuTime::calculateCpuTime(bool a_thread)
{
struct rusage l_rusage;
int retVal;
if(a_thread)
{
retVal = getrusage(1, &l_rusage); //1 = RUSAGE_THREAD
}
else
{
retVal = getrusage(0, &l_rusage); //0 = RUSAGE_SELF
}
ASSERT(retVal==0);
u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
(static_cast<u64>(l_rusage.ru_utime.tv_usec));
u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*1000000)+
(static_cast<u64>(l_rusage.ru_stime.tv_usec));
if(a_thread)
{
return userUSeconds + systemUSeconds;
}
return (userUSeconds + systemUSeconds) / ACE_OS::num_processors_online();
}
Also the function calculating retrieving this is async, no more then 1 thread can access it at a time.
this is an example of the impossible output (process):
2016/03/23 13:39:05.187 - #2 - 15471/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 46756013 ProcessCpuTime = 39820759 deltaProcessCpuTime= 250
2016/03/23 13:39:05.386 - #2 - 15470/15055 - ThreadDeltaCpuTime = 5000 ThreadCpuTime = 46663013 ProcessCpuTime = 39699759 deltaProcessCpuTime = 18446744073709430616
The delta calculation is a simple newValue - oldValue, with a unsigned int, it leads to those numbers.
Example for a thread impossible output:
2016/03/23 13:38:59.183 - #2 - 15472/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 47541012 ProcessCpuTime = 39658134 deltaProcessCpuTime = 5250
2016/03/23 13:38:59.382 - #2 - 15472/15055 - ThreadDeltaCpuTime = 1844 6744073708557617 ThreadCpuTime = 46547013 ProcessCpuTime = 39663009 deltaProcessCpuTime = 4625
i have bolded the thread id, also you can see the function has been executed 200ms after the first 1.
I am looking for a clue for why this is happening, i have tried using the ACE_OS rusage as well, with same results (make sense), i could solve this by ignoring "bad" values, but i am wondering why this is happening.
EDIT:
This is where i do the actual printing:
u64 computeCpuTime()
{
u64 deltaCpuTime= CpuTime::getThreadDeltaCpuTime();
u64 CpuTime= getThreadCpuTime();
u64 ProcessCpuTime= getProcessCpuTime();
u64 deltaProcessCpuTime= CpuTime::getProcessDeltaCpuTime();
Log<<value(deltaCpuTime)<<value(CpuTime)<<value(ProcessCpuTime)<<value(deltaProcessCpuTime)<<endlog;
return deltaCpuTime;
}
And the rest of the functions used in this calculation:
u64 CpuTime::getThreadDeltaCpuTime()
{
pid_t thisThread = (pid_t) syscall (SYS_gettid);
u64& val = m_threadsCpuMap[thisThread];
u64 oldValue =val;
val = calculateCpuTime(true);
return val - oldValue;
}
u64 CpuTime::getProcessDeltaCpuTime()
{
u64 oldValue = m_processCpu;
m_processCpu = calculateCpuTime(false);
return m_processCpu - oldValue;
}
u64 getThreadCpuTime()
{
return CpuTime::calculateCpuTime(true);
}
u64 getProcessCpuTime()
{
return CpuTime::calculateCpuTime(false);
}
The object doing the system call ("m_cpuTime") is a singleton, protected with a mutex, only 1 thread can access it at a time.
it holds the map of the threads cpu usage, m_threadsCpuMap (used for the delta), and also the last process cpu usage, m_processCpu.
ANOTHER EDIT:
I have simplified it with a simple test, working with a single thread, removed the division by active cpu's, and am only checking the Process Cpu time, still with impossible results.
Here is the updated functions:
Test:
for(int i = 0; i < 100000 ; i++)
{
for(int k = 0; k < 1000000; k++)
m = k % i;
cpuTime = CpuTime::instance()->getProcessDeltaCpuTime();
}
Functions used in test:
u64 CpuTime::getProcessDeltaCpuTime()
{
u64 oldValue = m_processCpu;
m_processCpu = calculateCpuTime(eThisProcess);
Log<<value(oldValue)<<value(m_processCpu)<value( m_processCpu - oldValue)<<endlog;
return m_processCpu - oldValue;
}
u64 CpuTime::calculateCpuTime(int a_type)
{
struct rusage l_rusage;
int retVal;
if(a_type == eThisThread)
{
retVal = /*ACE_OS::*/getrusage(1, &l_rusage);
}
else
{
retVal = /*ACE_OS::*/getrusage(0, &l_rusage);
}
u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+
(static_cast<u64>(l_rusage.ru_utime.tv_usec));
u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*100000)+
(static_cast<u64>(l_rusage.ru_stime.tv_usec));
if(a_type == eThisThread)
{
return userUSeconds + systemUSeconds;
}
return (userUSeconds + systemUSeconds)/* / ACE_OS::num_processors_online()*/;
This is an example of the logs, i have removed all the times the delta was ~0
2016/03/29 08:07:05.198 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14797750 m_processCpu = 14798749 m_process Cpu - oldValue = 999
2016/03/29 08:07:05.199 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14798749 m_processCpu = 14799749 m_process Cpu - oldValue = 1000
2016/03/29 08:07:05.200 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14799749 m_processCpu = 14800749 m_process Cpu - oldValue = 1000
2016/03/29 08:07:05.201 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14800749 m_processCpu = 14801749 m_process Cpu - oldValue = 1000
2016/03/29 08:07:05.202 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14801749 m_processCpu = 14802749 m_process Cpu - oldValue = 1000
2016/03/29 08:07:05.203 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 14802749 m_processCpu = 13903748 m_process Cpu - oldValue = 18446744073708652615
2016/03/29 08:07:05.204 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13903748 m_processCpu = 13904748 m_process Cpu - oldValue = 1000
2016/03/29 08:07:05.205 - #2 - 24011/24011 - CpuTime::getProcessDeltaCpuTime: oldValue = 13904748 m_processCpu = 13905748 m_process Cpu - oldValue = 1000
This time there was only 1 time of bad values, out of 100000 iterations. And after it all the following calculation were also smaller, e.g. it went like this: 1 2 3 4 5 6 7 8 9 10 bad value, 6 7 8 9 10... As if the process lost count, went back, and then continued normally.
YET ANOUTHER EDIT:
I have separated the code, completely, compiled it as simple as possible. and the problem did not occur, it could be a matter of probability (it was 1/100000 before). this is the code:
int main()
{
long int n = 0;
long int oldValue = 0;
long int newValue = 0;
unsigned long int deltaValue = 0;
for(int i = 0; i < 1000000; i++)
{
for(long int m = 0; m <10000; m++)
n = m + i;
struct rusage l_rusage;
int retVal;
retVal = getrusage(0, &l_rusage);
if(retVal != 0) return 0;
long int userUSeconds = l_rusage.ru_utime.tv_sec*1000000 + l_rusage.ru_utime.tv_usec;
long int systemUSeconds = l_rusage.ru_stime.tv_sec*100000 + l_rusage.ru_stime.tv_usec;
oldValue = newValue;
newValue = userUSeconds + systemUSeconds;
deltaValue = newValue - oldValue;
if(deltaValue != 0)
std::cout<<"Old value: "<< oldValue <<" New Value: "<< newValue <<" Delta value: "<<deltaValue<<"\n";
}
std::cout<<n;
return 0;
}
And from further investigating the earlier test, (which use our unittest framework), the problem occurs always at around the same CPU time. Which mean this is happening for a reason, i just cant figure out what happens at that exact time which could cause some thing like this. Or how some thing like this could be caused.
I am not familiar enough with linux kernel to understand how this calculation is made.
Another insight - when using clock() to get the process. it does not happen, even more - it seems like until that moment the values were around the same, after this point they are not:
2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 14598780 m_processCpu - oldValue = 0 oldClockTime = 14580000 m_clockTime = 14580000 m_clockTime - oldClockTime = 0
2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 13699779 m_processCpu - oldValue = 18446744073708652615 oldClockTime = 14580000 m_clockTime = 14590000 m_clockTime - oldClockTime = 10000
2016/03/29 12:36:19.158 - #2 - 20544/20544 - CpuTime::getProcessDeltaCpuTime: oldValue = 13699779 m_processCpu = 13699779 m_processCpu - oldValue = 0 oldClockTime = 14590000 m_clockTime = 14590000 m_clockTime - oldClockTime = 0
Which leads me to a different question, how is rusage calculated, and how is clock()? what could lead to the difference between them?
I have worked around the problem, measuring it using clock_gettime, here is the code if any 1 is interested:
u64 CpuMeasure::calculateCpuTime(int a_type)
{
struct timespec ts;
if(a_type == eThisThread)
{
if (ACE_OS::clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) == 0)
{
return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000; //From nano to mili seconds
}
}
else
{
if (ACE_OS::clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) == 0)
{
return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000;
}
}
ASSERT(false); //we cant be here, unless some thing bad happened
return 0;
}
I am using ACE_OS for portability, but it works with the default functions from sys/time.h
Yet, i am still in question about rusage strange behavior , what could cause it to give those values.
m_cpuTime.calculateCpuTime()
each time, some times you don't pass through the singleton that should guarantee mutual exclusion, why is it? – Rosenarosenbaum