What could cause my program to not use all cores after a while?

I have written a program that captures and displays video from three video cards. For every frame, I spawn a thread that compresses the frame to JPEG and then puts it in a queue for writing to disk. Other threads read these files back and decode them. Usually this works fine; it is a fairly CPU-intensive program that uses about 70-80 percent of all six CPU cores. But after a while the encoding suddenly slows down, the program can't handle the video fast enough, and it starts dropping frames. If I check the CPU utilization, I can see that one core (usually core 5) is not doing much anymore.

When this happens, it doesn't matter if I quit and restart my program. Core 5 still shows low utilization and the program starts dropping frames immediately. Deleting all saved video doesn't have any effect either. Restarting the computer is the only thing that helps. Oh, and if I set the affinity of my program to use all but the semi-idling core, it works until the same thing happens to another core. Here is my setup:

  • AMD X6 1055T (Cool & Quiet OFF)
  • GA-790FX-UD5 motherboard
  • 4 GB RAM, unganged, 1333 MHz
  • Blackmagic Decklink DUO capture cards (x2)
  • Linux - Ubuntu x64 10.10 with kernel 2.6.32.29

My app uses:

  • libjpeg-turbo
  • posix threads
  • decklink api
  • Qt
  • Written in C/C++
  • All libraries linked dynamically

It seems to me like this is some kind of problem with the way Linux schedules threads on the cores. Or is there some way my program can mess things up so badly that restarting the program doesn't help?

Thank you for reading, any and all input is welcome. I'm stuck :)

Erdei answered 2/10, 2011 at 12:59 Comment(15)
Sounds to me like you've hit an IO bottleneck with your disk accesses.Leclaire
Can you tell what the semi-idle core is doing? Maybe it's pinned handling the disk I/O or video card interrupts?Nous
@Nous I don't know a good way of finding out what the idle core is up to. Do you have any good tips on that?Erdei
@awoodland Would that cause my JPEG encoding to slow down? (The encoding only accesses RAM.)Erdei
@Erdei - I thought you were writing the results to disk? If some OS level write buffer is filling up then it would be possible to see behaviour like stalling threads.Leclaire
I would have thought that you swap to disc but the program restart theory rules that out. Could you be doing something bad to the devices?Owe
@awoodland OK, yes, I write them to disk (another thread handles the writing). So is there some way to monitor the OS I/O buffers? BTW, I should probably mention that I am using SSDs and the size of each frame is about 320 KB (25 fps). Each of the three input videos has its own drive (though all are on the same SATA controller).Erdei
I think you can use sar to monitor disk activity. Encoding might be fast while you have spare RAM for the kernel to use as a write cache, and then slower once it's delayed by real I/O?Nous
@Owe Yeah, RAM usage is pretty constant at about 1.8GiB so it shouldn't be swapping. All I'm doing with the disks is writing and reading using fread/fwrite. And it's weird that it can function great for hours and then suddenly this happens.Erdei
@Nous Thanks for the tip. I will try and monitor with sar and see if I can find anything weird.Erdei
@Nicoreh The hardware is alright temperature wise?Owe
@Owe From what I can measure, everything is in order. Aftermarket cooler and no overclocking. Good ventilation in the chassis. Next time it happens, I will let the computer sit idle for a couple of minutes to make sure everything has cooled down a bit and see if it helps.Erdei
Hmm, this seems interesting. Most probably there is some threading bug in your code, for example: 1) the queue can be a bottleneck; 2) are you monitoring the memory growth of the queue? 3) is there any shared-data access problem with the queue? I will watch this space for answers from experts :)Utoaztecan
@AronMu Yes, I've been careful in designing the queue system. Access to the queues is of course protected by a mutex, but each queue just holds pointers to dynamically allocated memory, and the only access is a quick push or pop. I don't believe that's where the problem lies, but I will keep an eye on it.Erdei
Hello everyone. I have done some more testing: I moved the disks to different SATA controllers, and the problem is not happening as often now. It seems the theories on I/O bottlenecks were on to something.Erdei
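One way to see what the semi-idle core is doing, as asked in the comments above, is to look at the per-core counters in /proc/stat. The following is a minimal sketch, not a finished tool: the names `cpu_times`, `parse_cpu_line`, and `print_cpu_times` are made up for illustration, and only the first seven counters of each line are read. The values are cumulative ticks since boot, so two samples taken a few seconds apart have to be subtracted to get a rate.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* First seven per-core counters from /proc/stat, in clock ticks since boot. */
struct cpu_times {
    int cpu;
    unsigned long long user, nice, system, idle, iowait, irq, softirq;
};

/* Parse one per-core line such as "cpu5 10 0 20 300 40 5 6 ...".
 * Returns 1 on success, 0 for any other line (including the
 * aggregate "cpu  ..." line, which has no core number). */
int parse_cpu_line(const char *line, struct cpu_times *t)
{
    if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
        return 0;
    return sscanf(line, "cpu%d %llu %llu %llu %llu %llu %llu %llu",
                  &t->cpu, &t->user, &t->nice, &t->system,
                  &t->idle, &t->iowait, &t->irq, &t->softirq) == 8;
}

/* Print the counters for every core. Returns the number of cores
 * printed, or -1 if /proc/stat could not be opened. */
int print_cpu_times(FILE *out)
{
    char line[512];
    struct cpu_times t;
    int cores = 0;
    FILE *f = fopen("/proc/stat", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (parse_cpu_line(line, &t)) {
            fprintf(out, "cpu%d user=%llu system=%llu idle=%llu "
                         "iowait=%llu irq=%llu softirq=%llu\n",
                    t.cpu, t.user, t.system, t.idle,
                    t.iowait, t.irq, t.softirq);
            cores++;
        }
    }
    fclose(f);
    return cores;
}
```

Calling `print_cpu_times(stdout)` twice, a few seconds apart, and comparing the lines for the slow core should show whether its time goes to `iowait` (waiting on disk) or to `irq`/`softirq` (interrupt handling) rather than to user code.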

First of all, make sure it's not your program. Maybe you are running into a convoluted concurrency bug, even though that's not very likely given your program architecture and the fact that only a reboot helps. I've found that post-mortem debugging is usually a good approach: compile with debugging symbols, send the program SIGSEGV (kill -SEGV) when it is behaving strangely, and examine the core dump with gdb.
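One thing to check before relying on this: on many Linux systems the core-file size limit defaults to 0, in which case the signal kills the program without leaving a dump. A minimal sketch of raising the limit from inside the program follows; the function name `enable_core_dumps` is made up for illustration, and whether it is needed depends on how the system is configured.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit so that a
 * fatal signal such as SIGSEGV can produce a core dump.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }
    lim.rlim_cur = lim.rlim_max;   /* soft limit := hard limit */
    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Call it once at the start of `main()`. If the hard limit itself is 0, this changes nothing and the limit has to be raised outside the program. Where the core file ends up depends on the system configuration.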

Plumcot answered 2/10, 2011 at 13:11 Comment(3)
I will look into doing that. I am not very experienced in debugging, but maybe now is the time to start digging deep :) Thank youErdei
@Erdei : Please update in case you find an answer. Just out of curiosity ;)Utoaztecan
@Nioreh: Can you check your queue size growth and any multiple-thread access problems with it? A queue in a multithreaded environment can also be a bottleneck if it is not designed properly.Utoaztecan

I would pick a core round-robin each time a new frame-processing thread is spawned and pin the thread to that core. Keep statistics on how long each thread takes to run. If this is in fact a bug in the Linux scheduler, your threads will take roughly the same time to run on any core. If the core is actually busy with something else, the threads pinned to that core will get less CPU time.
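A rough sketch of that idea, under some assumptions: all names here (`struct job`, `next_core`, `pin_self_to_core`, `record_job`, `average_seconds`, `timed_job`, `MAX_CORES`) are made up for illustration, the timing is wall-clock time, and pinning uses the Linux-specific `sched_setaffinity`, where a pid of 0 means the calling thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>

#define MAX_CORES 64   /* assumption: upper bound on the core count */

/* Per-core statistics: total wall-clock seconds and number of jobs. */
static double core_seconds[MAX_CORES];
static long   core_jobs[MAX_CORES];
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

/* One unit of work (for example, encoding one frame) and its target core. */
struct job {
    int core;
    void (*work)(void *);
    void *arg;
};

/* Pick the next core in round-robin order. Not thread-safe:
 * call it only from the single thread that spawns the workers. */
int next_core(int ncores)
{
    static int next = 0;
    int core;

    assert(ncores > 0 && ncores <= MAX_CORES);
    core = next % ncores;
    next = (core + 1) % ncores;
    return core;
}

/* Pin the calling thread to one core. Returns 0 on success, -1 on error. */
int pin_self_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Add one finished job to the statistics of its core. */
void record_job(int core, double seconds)
{
    assert(core >= 0 && core < MAX_CORES);
    pthread_mutex_lock(&stats_lock);
    core_seconds[core] += seconds;
    core_jobs[core]++;
    pthread_mutex_unlock(&stats_lock);
}

/* Average job duration on a core, or 0.0 if no job has run there yet. */
double average_seconds(int core)
{
    double avg;

    assert(core >= 0 && core < MAX_CORES);
    pthread_mutex_lock(&stats_lock);
    avg = core_jobs[core] ? core_seconds[core] / (double)core_jobs[core] : 0.0;
    pthread_mutex_unlock(&stats_lock);
    return avg;
}

/* Thread entry point: pin to the chosen core, run the work, record the time. */
void *timed_job(void *p)
{
    struct job *j = p;
    double start;

    pin_self_to_core(j->core);   /* best effort: on failure the thread runs unpinned */
    start = now_seconds();
    j->work(j->arg);
    record_job(j->core, now_seconds() - start);
    return NULL;
}
```

To use it, fill in a `struct job` with `core = next_core(6)` and the encoding function, then pass it to `pthread_create` with `timed_job` as the start routine. Comparing `average_seconds()` across cores after a while shows whether jobs on the suspect core really take longer.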

Kelwunn answered 2/10, 2011 at 18:24 Comment(0)
