CPU underutilized. Due to blocking I/O?
Asked Answered
D

2

11

I am trying to find where lies the bottleneck of a C# server application which underutilize CPU. I think this may be due to poor disk I/O performance and has nothing to do with the application itself but I am having trouble making a fact out of this supposition.

The application reads messages from a local MSMQ queue, does some processing on each messages and after processing the messages, sends out response messages to another local MSMQ queue.

I am using an async loop to read messages from queue, dequeuing them as fast as possible and dispatching them for processing using Task.Run to launch the processing of each messages (and do not await on this Task.Run .. just attaching a continuation only faulted on it to log error). Each messages is processed concurrently, i.e there is no need to wait for a message to be fully processed before processing the next one.

At the end of the processing of a message, I am using the Send method of MessageQueue (somehow asynchronous but not really because it has to wait on disk write before returning -see System.Messaging - why MessageQueue does not offer an asynchronous version of Send).

For the benchmarks I am queuing 100K messages in the queue (approx 100MB total size for 100K messages) and then I launch the program. On two of my personal computers (SSD HD on one and SATA2 HD on the other with i7 CPU quadcores -8 logical proc-) I reach ~95% CPU usage for the duration of the program lifecyle (dequeuing the 100K messages, processing them and sending responses). Messages are dequeued as fast a possible, processed as fast as possible (CPU involved here) and then response for each message sent to different local queue.

Now on a virtual machine running non HT dual core CPU (have no idea what is the underlying disk but seems far less performant than mines... during benchmark, with Perfmon I can see avg disk sec/write arround 10-15 ms on this VM, whereas it is arround 2ms on my personal machines) when I am running the same bench, I only reach ~55% CPU (when I am running the same bench on the machine without sending response messages to queue I reach ~90% CPU).

I don't really understand what is the problem here. Seems clear that sending message to the queue is the problem and slows down the global processing of the program (and dequeuing of messages to be processed), but why would that be considering that I am using Task.Run to launch processing of each dequeued message and ultimately response sending, I would not expect CPU to be underutilized. Unless when one thread is sending a message it blocks other threads to run on the same core while it waits for the return (disk write) in which case it would maybe make sense considering latency is much higher than on my personal computers, but a thread waiting for I/O should not prevent other threads from running.

I am really trying to understand why I am not reaching at least 95% cpu usage on this machine. I am blindly saying this is due to poorer disk i/o performance, but still I don't see why it would lead to CPU underutilization considering I am running processing concurrently using Task.Run. It could also be some system problem completely unrelated to disk, but considering that MessageQueue.Send seems to be the problem and that this method ultimately writes messages to a memory mapped file + disk, I don't see where the performance issue could come from other than disk.

It is of course for sure a system performance issue as the program maximize CPU usage on my own computers, but I need to find what the bottleneck is exactly on the VM system, and why exactly it is affecting the concurrency / speed of my application.

Any idea ?

Deserted answered 2/11, 2013 at 23:32 Comment(9)
Because you are probably reaching near 100% utilization on one or more of your disk volumes.Fuel
Task.Run() uses a application-global threadpool that has a maximum number of threads. I would guess that with your processing load all of those threads can become blocked. To check whether that happens because you're disk is not fast enough or because you don't have enough threads, maybe raise the maximum number of threads and measure again. But like RBarryYoung I'd bet on IO being maxed out.Yttriferous
@Fuel : Does not seems like it. Total execution takes arround 60 seconds on that machine, so 100MB in 60sec, I don't think disk reach 100% utilization (actually using perfmon I can see that disk is idle most of the time).Deserted
@confusopoly: Hmm interesting, the app runs with 22 logical threads on that machine when I take a look to perf counters. Anyway I thought that in .NET 4.5 the maximum number of threads in the thread pool was set by default to an insanely high value and that it was anyway impossible to reach that max value. Do you have link to some doc about what you state (low maximum number of threads by default in Thread pool) ?Deserted
Easy enough to test @confusopoly's theory (I've run into that before): ThreadPool.SetMinThreads. The default number of threads isn't that high.Jumpy
Have you tried simply profiling the process involved? Grab the trial of this. You could see some bottlenecks fairly quickly (e.g. 25% of time spent in WriteBytes). Also, the thread pool moderates the number of threads so Task.Run * 100 != 100 threads. The number of threads allocated ramps up and down as necessary, but there is a lag time for this. You could try a dedicated thread pool.Actuate
@Deserted If I recall, the default maximum number of threads in the thread pool is in part a function of the CPU cores. So it's something like 12 * Cores. Your i7 would have many more maximum threads than the VM, without any configuration.Actuate
Just tried out ThreadPool.GetMinThreads and ThreadPool.GetMaxThreads in a console app. It returns 4 min threads and 1023 max threads (workers and completion ports) on my quad core. So as I thought max thread value is pretty high already. SetMinThreads won't change anything as it is only to set the numbers of threads to create immediately on demand before switching to an internal thread pool algorithm creating a thread every half second or so instead of immediately on demand. So I don't think problem is related at all to reaching a max number of threads or not creating threads fast enoughDeserted
I don't suppose the VM has been given two cores with some % of the host's? sourceforge.net/apps/mediawiki/freedos/nfs/project/f/fr/freedos/…Actuate
A
4

To examine poor disc and or cpu utilization there is only one tool: Windows Performance Toolkit. For an example how to use it see here. You should get the latest one from the Windows 8.1 SDK (requires .NET 4.5.1) which gives you most capabilities but the one from the Windows 8 SDK is also fine.

There you get graphs % CPU Utilization and % Disc Utilization. If either one is at 100% and the other one is low then you have found the bottleneck. Since it is a system wide profiler you can check if the msmq service is using the disc badly or you or someone else (e.g. virus scanner is a common issue).

You can directly get to your call stacks and check which process and thread did wake your worker thread up which is supposed to run at full speed. Then you can jump to the readying thread and process and check what it did do before it could ready your thread. That way you can directly verify what was hindering it so long.

No more guessing. You can really see what the system is doing.

To analyze further enable in the CPU Usage Precise view the following columns:

  • NewProcess
  • NewThreadId
  • NewThreadStack(Frame Tags)
  • ReadyingProcess
  • ReadyingThreadId
  • Ready(us) Sum
  • Wait(us) Sum
  • Wait(us)
  • %CPU Usage

Then drill down for a call stack in your process to see where high Wait(us) times do occur in a thread that is supposed to run at full speed.. You can drill down to one single event until you can go no further. Then you will see values in Readying Process and ReadyingThreadId. Go to that process/thread (it can be your own) and repeat the process until you end up in some blocking operation which does either involve disc IO or sleeps or a long running device driver call (e.g virus scanner or the vm driver).

Atropos answered 3/11, 2013 at 0:3 Comment(2)
Thanks Alois. Already installed it on the machine yesterday and gave it a try. Disk utilization was not showing anything crazy, once again disk seems to sit idle most of the time. Still the tool is a wonderful one, I was amazed, and spent a few hours on it, but the learning curve is pretty steep (at least for me) and I did not really know where to look at to find the bottleneck. It's not like the program says : "Here is your bottleneck", you have so many ETW counters that I don't really know where to look at. And no idea if the value I am reading is acceptable or if it indicates a bottleneck.Deserted
Do you have acces to the real machine where the VM does run on? It may also be that the VM does page like crazy causing slowdowns in your app which you possibly cannot see in the hosted OS.Atropos
P
0

If the Disk I/O performance counters don't look abnormally high, I'd look next at the hypervisor level. Assuming you're running the exact same code, using a VM adds latency to the entire stack (CPU, RAM, Disk). You can perhaps tweak CPU Scheduling at the hypervisor level and see if this will increase CPU utilization.

I'd also consider using a RAMDisk temporarily for performance testing. This would eliminate the Disk/SAN latency and you can see if that fixes your problem.

Punkie answered 6/11, 2013 at 16:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.