Asked 9/12, 2010 at 3:36 Answered 15/12, 2010 at 19:29

Solved c++ qt optimization profiling profiler

A Hacker's Tale

The date is 12/02/10. The days before Christmas are dripping away and I've pretty much hit a major roadblock as a Windows programmer. I've been using AQTime, I've tried Sleepy, Shiny, and Very Sleepy, and as we speak, VTune is installing. I've tried to use the VS2008 profiler, and it's been positively punishing as well as often insensible. I've used the random pause technique. I've examined call-trees. I've fired off function traces. But the sad painful fact of the matter is that the app I'm working with is over a million lines of code, with probably another million lines worth of third-party apps.

I need better tools. I've read the other topics. I've tried out each profiler listed in each topic. There simply has to be something better than these junky and expensive options, or ludicrous amounts of work for almost no gain. To further complicate matters, our code is heavily threaded, and runs a number of Qt Event loops, some of which are so fragile that they crash under heavy instrumentation due to timing delays. Don't ask me why we're running multiple event loops. No one can tell me.

Are there any options more along the lines of Valgrind in a Windows environment?
Is there anything better than the long swath of broken tools I've already tried?
Is there anything designed to integrate with Qt, perhaps with a useful display of events in queue?

A full list of the tools I tried, with the ones that were really useful in italics:

AQTime: Rather good! Has some trouble with deep recursion, but the call graph is correct in these cases, and can be used to clear up any confusion you might have. Not a perfect tool, but worth trying out. It might suit your needs, and it certainly was good enough for me most of the time.
Random Pause attack in debug mode: Not enough information enough of the time. A good tool but not a complete solution.
Parallel Studios: The nuclear option. Obtrusive, weird, and crazily powerful. I think you should hit up the 30 day evaluation, and figure out if it's a good fit. It's just darn cool, too.
AMD Codeanalyst: Wonderful, easy to use, very crash-prone, but I think that's an environment thing. I'd recommend trying it, as it is free.
Luke Stackwalker: Works fine on small projects, it's a bit trying to get it working on ours. Some good results though, and it definitely replaces Sleepy for my personal tasks.
PurifyPlus: No support for Win-x64 environments, most prominently Windows 7. Otherwise excellent. A number of my colleagues in other departments swear by it.
VS2008 Profiler: Produces output in the 100+gigs range in function trace mode at the required resolution. On the plus side, produces solid results.
GProf: Requires GCC to be even moderately effective.
VTune: VTune's W7 support borders on criminal. Otherwise excellent.
PIN: I'd need to hack up my own tool, so this is sort of a last resort.
Sleepy\VerySleepy: Useful for smaller apps, but failing me here.
EasyProfiler: Not bad if you don't mind a bit of manually injected code to indicate where to instrument.
Valgrind: *nix only, but very good when you're in that environment.
OProfile: Linux only.
Proffy: They shoot wild horses.

Suggested tools that I haven't tried:

XPerf:
Glowcode:
Devpartner:

Notes: Intel environment at the moment. VS2008, boost libraries. Qt 4+. And the wretched humdinger of them all: Qt/MFC integration via trolltech.

Now: Almost two weeks later, it looks like my issue is resolved. Thanks to a variety of tools, including almost everything on the list and a couple of my personal tricks, we found the primary bottlenecks. However, I'm going to keep testing, exploring, and trying out new profilers as well as new tech. Why? Because I owe it to you guys, because you guys rock. It does slow the timeline down a little, but I'm still very excited to keep trying out new tools.

Synopsis
Among many other problems, a number of components had recently been switched to the incorrect threading model, causing serious hang-ups due to the fact that the code underneath us was suddenly no longer multithreaded. I can't say more because it violates my NDA, but I can tell you that this would never have been found by casual inspection or even by normal code review. Without profilers, callgraphs, and random pausing in conjunction, we'd still be screaming our fury at the beautiful blue arc of the sky. Thankfully, I work with some of the best hackers I've ever met, and I have access to an amazing 'verse full of great tools and great people.

Gentlefolk, I appreciate this tremendously, and only regret that I don't have enough rep to reward each of you with a bounty. I still think this is an important question to get a better answer to than the ones we've got so far on SO.

As a result, each week for the next three weeks, I'll be putting up the biggest bounty I can afford, and awarding it to the answer with the nicest tool that I think isn't common knowledge. After three weeks, we'll hopefully have accumulated a definitive profile of the profilers, if you'll pardon my punning.

Take-away
Use a profiler. They're good enough for Ritchie, Kernighan, Bentley, and Knuth. I don't care who you think you are. Use a profiler. If the one you've got doesn't work, find another. If you can't find one, code one. If you can't code one, or it's a small hang up, or you're just stuck, use random pausing. If all else fails, hire some grad students to bang out a profiler.

A Longer View
So, I thought it might be nice to write up a bit of a retrospective. I opted to work extensively with Parallel Studios, in part because it is actually built on top of the PIN Tool. Having had academic dealings with some of the researchers involved, I felt that this was probably a mark of some quality. Thankfully, I was right. While the GUI is a bit dreadful, I found IPS to be incredibly useful, though I can't comfortably recommend it for everyone. Critically, there's no obvious way to get line-level hit counts, something that AQT and a number of other profilers provide, and I've found very useful for examining rate of branch-selection among other things. In net, I've enjoyed using AQTime as well, and I've found their support to be really responsive. Again, I have to qualify my recommendation: A lot of their features don't work that well, and some of them are downright crash-prone on Win7x64. XPerf also performed admirably, but is agonizingly slow for the sampling detail required to get good reads on certain kinds of applications.

Right now, I'd have to say that I don't think there's a definitive option for profiling C++ code in a W7x64 environment, but there are certainly options that simply fail to perform any useful service.

Microphone answered 9/12, 2010 at 3:36 Comment(15)

So that you don't get a lot of "have you tried" answers, could you please update your question to include what you have tried? – Unpack 9/12, 2010 at 3:42

Have you looked into getting a different job? :) – Butterfish 9/12, 2010 at 3:43

Where else would I get to solve puzzles this hard? I guess I could go back to kernel hacking, but that doesn't pay as well. – Microphone 9/12, 2010 at 3:47

How come there's no gprof in your list? – Vestment 9/12, 2010 at 14:35

@Vestment I think for gprof to be of any use you have to be using the gcc toolset and compiling with -pg, otherwise it doesn't produce the gmon.out file. In the OP's case it sounds like he's using msvc, which rules out using gprof. Then again, I don't think gprof would fare any better for him if the others on the list are failing his needs – Dealer 9/12, 2010 at 15:39

I'm not sure if this was with AQTime or not, so take this with a grain of salt. I seem to remember the ability to exclude parts of the code from the profile so you're not overwhelmed with noise. I.E. I excluded the code in a serial port based app because 90% of the samples showed the serial port read as the time hog. Just a thought... – Volding 10/12, 2010 at 1:40

That'd be the area mechanism, which I'm using extensively. I'ven't gotten to trying to use triggers for profiling yet, and frankly I'm a bit worried about doing so, but I think it's the next step. – Microphone 10/12, 2010 at 1:54

Just gotta suggest, did you try the VS2010 profiler? – Villatoro 15/12, 2010 at 19:44

Can't use VS2010 right now, for a variety of reasons that I am not at liberty to post here. It'd violate my NDA. – Microphone 15/12, 2010 at 20:37

So what was the problem? Care to give a brief synopsis? – Hurleigh 16/12, 2010 at 2:7

@Jake - re why this is wiki - there is an inbuilt flip associated with continual editing; this has been through quite a few iterations. – Convexoconvex 16/12, 2010 at 22:35

@Marc Gravell That's fair enough, I suppose.... It seems an odd heuristic to me, that the most well-maintained posts pass abruptly into the community domain, effectively producing a situation where the more you update and maintain your question or answer, the less you get out of that maintenance in the eyes of the community at large. Should I take this to meta? – Microphone 17/12, 2010 at 2:28

@Jake - note that this has been discussed before: meta.stackexchange.com/questions/8654/… - and I am reliably informed that the constraints for auto-wiki have been gradually relaxed (i.e. it is now less eager to do this than it was on day-zero) – Convexoconvex 17/12, 2010 at 7:1

That's a pretty cohesive discussion. I think I have nothing of deep worth to add at the moment. :) – Microphone 17/12, 2010 at 18:5

Does anyone want a retrospective, given what I now know about profilers? – Microphone 25/1, 2011 at 19:24

First:

Time sampling profilers are more robust than CPU sampling profilers. I'm not extremely familiar with Windows development tools so I can't say which ones are which. Most profilers are CPU sampling.

A CPU sampling profiler grabs a stack trace every N instructions.
This technique will reveal portions of your code that are CPU bound. Which is awesome if that is the bottleneck in your application. Not so great if your application threads spend most of their time fighting over a mutex.

A time sampling profiler grabs a stack trace every N microseconds.
This technique will zero in on "slow" code, whether the cause is CPU-bound work, blocking I/O, mutex contention, or cache thrashing. In short, whatever piece of code is slowing your application will stand out.

So use a time sampling profiler if at all possible especially when profiling threaded code.
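To make the distinction concrete, here is a minimal sketch (an illustration only, not output from any of the profilers mentioned) that measures both elapsed wall-clock time and CPU time around a piece of work. The names `Timing`, `measure`, and `blocked_work` are made up for the example. It assumes a POSIX-style `std::clock` that reports processor time; MSVC's `clock` reports elapsed wall time instead, so on Windows something like `GetThreadTimes` would have to stand in for it.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Wall-clock and CPU time consumed while running a callable.
struct Timing {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time charged to the process
};

// Runs `work` once and reports both clocks.
// Assumption: std::clock() returns CPU time (true on POSIX, not on MSVC).
template <typename Fn>
Timing measure(Fn&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Stand-in for a thread stuck on a mutex or on blocking I/O:
// it takes real time but burns almost no CPU.
inline void blocked_work() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}
```

Calling `measure(blocked_work)` should report roughly 200 ms of wall time and close to zero CPU time. A profiler that only samples while the CPU is busy sees almost nothing of that wait, while one that samples on a wall-clock timer lands in it most of the time.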

Second:

Sampling profilers generate gobs of data. The data is extremely useful, but there is often too much to be easily useful. A profile data visualizer helps tremendously here. The best tool I've found for profile data visualization is gprof2dot. Don't let the name fool you, it handles all kinds of sampling profiler output (AQtime, Sleepy, XPerf, etc). Once the visualization has pointed out the offending function(s), jump back to the raw profile data to get better hints on what the real cause is.

The gprof2dot tool generates a dot graph description that you then feed into a graphviz tool. The output is basically a callgraph with functions color coded by their impact on the application.

A few hints to get gprof2dot to generate nice output.

I use a --skew of 0.001 on my graphs so I can easily see the hot code paths. Otherwise the int main() dominates the graph.
If you're doing anything crazy with C++ templates you'll probably want to add --strip. This is especially true with Boost.
I use OProfile to generate my sampling data. To get good output I need to configure it to load the debug symbols from my 3rd party and system libraries. Be sure to do the same, otherwise you'll see that CRT is taking 20% of your application's time when what's really going on is malloc is trashing the heap and eating up 15%.

Churchly answered 15/12, 2010 at 19:29 Comment(5)

While I don't know that this is the full answer to my problems, gprof2dot has entered my vast arsenal, and is rapidly assuming a favorite spot. I think that's worth a bounty! – Microphone 15/12, 2010 at 20:36

I asked this question: Linux time sample based profiler. OProfile is supposed to get time-based sampling eventually. They produce very high quality output, so once they add that feature I'll use them. Other than that I had a friend hack together a gdb + backtrace solution for profiling. Very hacky, but it did find the bottleneck. – Churchly 3/2, 2012 at 4:17

@deft_code: "hack together a gdb + backtrace solution for profiling. Very hacky, but it did find the bottleneck." You're confirming my constant rant :) Some people want profiling to be pretty, but if results are what you need, go with what works, not what's pretty. – Credits 23/10, 2014 at 18:44

I agree with Mike Dunlavey. Things like XPerf/WPA look very pretty and powerful, but figuring out how to use these tools takes a while, and at the end of the day random pausing is so easy and provides better information to solve the problem. More automated solutions seem to more often than not filter out critical information needed to solve the bottleneck. – Realist 24/10, 2014 at 17:38

Visual studio comes with a time sampling profiler in the box. learn.microsoft.com/en-us/visualstudio/profiling/… – Blinker 30/11, 2020 at 17:48

What happened when you tried random pausing? I use it all the time on a monster app. You said it did not give enough information, and you've suggested you need high resolution. Sometimes people need a little help in understanding how to use it.

What I do, under VS, is configure the stack display so it doesn't show me the function arguments, because that makes the stack display totally unreadable, IMO.

Then I take about 10 samples by hitting "pause" during the time it's making me wait. I use ^A, ^C, and ^V to copy them into notepad, for reference. Then I study each one, to try to figure out what it was in the process of trying to accomplish at that time.

If it was trying to accomplish something on 2 or more samples, and that thing is not strictly necessary, then I've found a live problem, and I know roughly how much fixing it will save.

There are things you don't really need to know, like precise percents are not important, and what goes on inside 3rd-party code is not important, because you can't do anything about those. What you can do something about is the rich set of call-points in code you can modify displayed on each stack sample. That's your happy hunting ground.

Examples of the kinds of things I find:

During startup, it can be about 30 layers deep, in the process of trying to extract internationalized character strings from DLL resources. If the actual strings are examined, it can easily turn out that the strings don't really need to be internationalized, like they are strings the user never actually sees.
During normal usage, some code innocently sets a Modified property in some object. That object comes from a super-class that captures the change and triggers notifications that ripple throughout the entire data structure, manipulating the UI, creating and destroying objects in ways hard to foresee. This can happen a lot - the unexpected consequences of notifications.
Filling in a worksheet row-by-row, cell-by-cell. It turns out if you build the row all at once, from an array of values, it's a lot faster.

P.S. If you're multi-threaded, when you pause it, all threads pause. Take a look at the call stack of each thread. Chances are, only one of them is the real culprit, and the others are idling.
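The bookkeeping behind "seen on 2 or more samples" can be sketched in a few lines. This is only an illustration of the counting, assuming the pasted stacks have already been turned into lists of frame names by hand or by a script; `StackSample`, `samples_containing`, and `suspects` are hypothetical names, and the frame names in any input are whatever your own stacks show.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// One paused call stack: a list of frame names, outermost first.
using StackSample = std::vector<std::string>;

// For each frame, count the number of samples it appears on.
// A frame that recurses within one sample still counts once for that sample.
std::map<std::string, std::size_t> samples_containing(
    const std::vector<StackSample>& samples) {
    std::map<std::string, std::size_t> counts;
    for (const StackSample& sample : samples) {
        const std::set<std::string> unique(sample.begin(), sample.end());
        for (const std::string& frame : unique) {
            ++counts[frame];
        }
    }
    return counts;
}

// Frames seen on at least `min_hits` samples, mapped to the fraction of
// samples they were on. That fraction is a rough estimate of the time
// that could be saved if the call turned out to be unnecessary.
std::map<std::string, double> suspects(const std::vector<StackSample>& samples,
                                       std::size_t min_hits = 2) {
    std::map<std::string, double> result;
    if (samples.empty()) {
        return result;
    }
    const std::map<std::string, std::size_t> counts = samples_containing(samples);
    for (const auto& entry : counts) {
        if (entry.second >= min_hits) {
            result[entry.first] = static_cast<double>(entry.second) /
                                  static_cast<double>(samples.size());
        }
    }
    return result;
}
```

With ten samples, a call that shows up on three of them is costing very roughly 30% of the wait. The number is coarse, but as said above, precise percents are not the point; the frames in code you can modify are.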

Credits answered 9/12, 2010 at 14:20 Comment(26)

I've done all those things, and normally that's been enough in conjunction with AQTime. In fact, it's what led me to my deep suspicions regarding the QT event loop, but now I need to get a better idea of the shape and reason for the 10-15 signals we send every few milliseconds, as well as figure out how exactly the event loop is eating 80% of our time while we're in processor bound tasks. I even scribbled up a couple VS macros to do some of the pausing, but I just can't pause often enough. – Microphone 9/12, 2010 at 18:18

@Jake: If the event loop is using 80% of time (CPU bound or not), then you should spot it doing that on 80% of pauses, no matter how often or seldom you take them. Or, is it the case that that whole activity does not occur very often, and you're trying to "stab" it? In that case, what I sometimes do is either 1) wrap a loop around the fast code, to "magnify" it, or 2) setup a data-change-breakpoint, and that slows down execution (speeds me up :) by some orders of magnitude. – Credits 9/12, 2010 at 18:49

The event loop shows, but not what it is processing. It seems that QT does some pretty quirky stuff in a desperate attempt to pretend to be smalltalk, which is making it difficult to get a finer reading on what's actually costly about the event loop. I hadn't thought of actually using the Breakpoint Blues to my advantage though! :) – Microphone 9/12, 2010 at 18:52

@Jake: I see, so they're making it difficult to figure out the request-chain that gets you to where you are, by passing messages (hopefully with continuations). That means when you pause, you gotta trace back the imitation stack that they have, in order to figure out the composite reason for doing what it's doing. I know that's not simple, but you shouldn't have to do it very much. (I used to read hex dumps. Not fun, but it got the job done.) – Credits 9/12, 2010 at 19:9

I'd rather read hex dumps! :) I've been trying to trace back the stack of QT-hatespew that landed us in this situation, and I was just sort of hoping and praying for a tool to do it, since I expect this won't be the only time I get badgered into this sort of madcap hunt. – Microphone 9/12, 2010 at 19:22

@Jake: Another trick is, when you pause it, then you single-step forward from that point until it "returns", hopefully bringing you back to the code that it came from, and keep doing that until it exits a few layers. Again, they're not making it easy (bless their hearts), but a tough coder should be able to figure the real reason why it's doing what it's doing at the time of the pause. – Credits 9/12, 2010 at 19:28

@Jake: I'm not QT-literate, but there's got to be something that functions as a stacktrace in their weird world. – Credits 9/12, 2010 at 19:31

I've gone ahead and manually instrumented the call backs. :| Stepping hasn't helped that much. :S – Microphone 9/12, 2010 at 19:35

@Jake: Let me know how it works. As I see it, when I take a pause, the key point is to understand "why am I here", because if the reason is not very good, that's a chance to get some speedup. If you can't find time being spent for poor reasons, you can't get the speedup. – Credits 9/12, 2010 at 20:30

The curious part is that I can't really discern why it's spending time in there in the first place. It's a fairly intricate piece of code, even by my standards. – Microphone 9/12, 2010 at 21:19

@Jake: I don't suppose they put any comments in there? When I try to understand somebody else's code, I sorta have to have a general idea what problem they're solving and how. (Even so, it sometimes takes me months to get into it, sadly.) Even so, when you finally see what they're doing, I bet you find it's way overcomplicated. That's my experience. – Credits 9/12, 2010 at 22:27

Comments? Comments? THIS IS SPARTA! I... Sorry, don't know where that came from. No, the code makes Klingon Opera look readable, and it's about as well documented. Actually, I think it's far less documented.... Oh god. – Microphone 9/12, 2010 at 23:10

@Jake: Yeah, there's no profiler gonna be much help on this. I fall back on good old single-stepping through the code as it does certain things, like painting a window or handling a mouse click. After wasting a bunch of time like this, some order rises out of the fog. I wish I could be more help. – Credits 10/12, 2010 at 0:36

I've been doing that, and I think I pinned it down. The QTMFC integration causes there to be at least two event loops at any given point, meaning that time spent in an event loop gets counted twice by most profilers. I'm angry, frustrated, and feeling somewhat betrayed. – Microphone 10/12, 2010 at 1:18

QTMFC integration? Oh great, you've got complicated and evil, and you haven't even gotten to the application-specific code yet. – Clausen 10/12, 2010 at 2:48

QT/MFC? Shouldn't that produce mutant children with 3 heads that rock back and forth while calling every idea they hear the stupidest idea ever? Errr... I digress... If you're using any of the MFC Socket classes, immediately rewrite your socket code and then profile. There are a LOT of places in the CSocket code that use the message loop version of WaitForSingleObject, which I've found to kill performance. For the life of me I cannot remember the name of the wait func... :/ – Volding 10/12, 2010 at 3:8

Oh god, trust me, it is exactly as screwy as you think. – Microphone 10/12, 2010 at 16:51

@Jake: Not much comfort, but that's the glory of Turing universality. Any language, no matter how high or low level, is equivalent in its unbounded ability to be misused. – Credits 10/12, 2010 at 17:10

I'm not sure that point is perfectly defensible. Some DSLs have a much more limited capacity for havoc by their very nature. An interesting thing to think about. Thanks for all the help, Mike. I'm still holding out for a good profiling mechanism as well, but my optimism has faded to a crumbly husk. – Microphone 10/12, 2010 at 19:30

@Jake: At least you're in a nice place to go outside and maybe forget the pile of goo for a while. – Credits 10/12, 2010 at 19:53

It's definitely true. I routinely take a walk to the post office. It's rather lovely. Incidentally, if you know the area, you now know where I work. :) – Microphone 10/12, 2010 at 19:55

@Ben: @Jake: I'm afraid I still have the impression that Jake has a reverse-engineering problem, not a profiling problem. Even if he knew method or line-level timings, counts, percentages, etc., it would still be incomprehensible. Once I wrote a reverse-engineering tool for Cobol apps, that could help figure out what it was doing. (It never saw daylight.) Basically, what it would do is take the source code and boil it down, keeping track of variables & their cases, expanding calls, stuff like that, and let user make notes. – Credits 13/12, 2010 at 22:2

Interestingly, it turned out to basically be a profiling problem in the end. And random pausing was, as always, useful. It wasn't enough on its own, not by a wide margin, but it was useful. – Microphone 16/12, 2010 at 4:42

@Jake: glad you got it figured out. – Credits 16/12, 2010 at 13:45

This comment will get buried, since there are so many, but for this type of profiling, on Windows, the cdb debugger would work great - you can tell it to dump all the thread stacks with one command (that's easily repeated), and there's no need to copy/paste each time - just set it up to copy its console output to a log file. – Klink 5/1, 2011 at 6:9

@Michael Burr: Thanks for the tip! I'll look into it. – Credits 5/1, 2011 at 13:57

I've had some success with AMD CodeAnalyst.

Toland answered 9/12, 2010 at 4:5 Comment(6)

Intel environment, at the moment. I'll keep it in mind though! :) – Microphone 9/12, 2010 at 4:8

@Jake: I'm not sure what you mean there. AMD CodeAnalyst doesn't require AMD chips, it should work on most x86 or x64 (aka x86-64/Intel 64/AMD64) chips, including Intel chips. – Toland 9/12, 2010 at 4:13

Apparently, I'm illiterate! That's wonderful news. I'll try it out tomorrow and update the question. – Microphone 9/12, 2010 at 4:14

So far, it's very unstable when sampling at the resolutions I need. – Microphone 9/12, 2010 at 5:5

@Adam: i tried code analyst on a intel pentium IV machine recently, and it only offered time-based sampling, with no information about thread usage, nor thread related information whatsoever... the amount of information i got was really mediocre.. additionally it caused crashes in the qt integration of visual studio.. i was not satisfied :( – Phelips 9/12, 2010 at 20:24

Crashed and burned every single time, even on wider sampling resolutions like 10 ms. – Microphone 10/12, 2010 at 2:9

Do you have an MFC OnIdle function? In the past I had a near real-time app I had to fix that was dropping serial packets when set at 19.2K speed which a PentiumD should have been able to keep up with. The OnIdle function was what was killing things. I'm not sure if QT has that concept, but I'd check for that too.

Volding answered 11/12, 2010 at 21:44 Comment(2)

We actually do have an OnIdle, and thanks to our QTMFC integration, it's flowing through the QT ev..e...eve...event loop. Oh G'd. – Microphone 12/12, 2010 at 19:30

Turns out that this led directly to our solution, so while it's not a perfect answer to the question, I think the question is unanswerable. – Microphone 17/12, 2010 at 21:6

Re the VS Profiler -- if it's generating such large files, perhaps your sampling interval is too frequent? Try lowering it, as you probably have enough samples anyway.

And ideally, make sure you're not collecting samples until you're actually exercising the problem area. So start with collection paused, get your program to do its "slow activity", then start collection. You only need at most 20 seconds of collection. Stop collection after this.

This should help reduce your sample file sizes, and only capture what is necessary for your analysis.

Hurleigh answered 9/12, 2010 at 4:27 Comment(1)

I'll give this a shot tomorrow. – Microphone 9/12, 2010 at 18:16

I have successfully used PurifyPlus for Windows. Although it is not cheap, IBM provides a trial version that is slightly crippled. All you need for profiling with Quantify are pdb files and linking with /FIXED:NO. Only drawback: No support for Win7/64.

Pedantry answered 14/12, 2010 at 11:46 Comment(2)

Unfortunately, our primary target is Win7. I'll add that info to the main post. – Microphone 14/12, 2010 at 18:46

The current version of PurifyPlus supports Win7/64. – Pedantry 16/10, 2013 at 14:20

Easyprofiler - I haven't seen it mentioned here yet so not sure if you've looked at it already. It takes a slightly different approach in how it gathers metric data. A drawback to using its compile-time profile approach is you have to make changes to the code-base. Thus you'll need to have some idea of where the slow might be and insert profiling code there.

Going by your latest comments though, it sounds like you're at least making some headway. Perhaps this tool might provide some useful metrics for you. If nothing else it has some really purdy charts and pictures :P
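For a feel of what manually injected instrumentation looks like, here is a hand-rolled sketch. It is not EasyProfiler's API; `ProfileRegistry`, `ScopeTimer`, `PROFILE_SCOPE`, `DISABLE_PROFILING`, and the `sum_to` workload are all invented for the example. It only shows the general shape: a scoped timer you drop into the functions you suspect, with a macro so the instrumentation can be compiled out.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Accumulated cost of one instrumented scope.
struct ProfileEntry {
    unsigned long calls = 0;
    double total_ms = 0.0;
};

// Process-wide table of results, guarded by a mutex so that
// instrumented code may run on several threads.
class ProfileRegistry {
public:
    static ProfileRegistry& instance() {
        static ProfileRegistry registry;
        return registry;
    }
    void add(const std::string& name, double ms) {
        std::lock_guard<std::mutex> lock(mutex_);
        ProfileEntry& entry = entries_[name];
        ++entry.calls;
        entry.total_ms += ms;
    }
    ProfileEntry get(const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_[name];
    }

private:
    std::mutex mutex_;
    std::map<std::string, ProfileEntry> entries_;
};

// Times the enclosing scope and reports it when the scope exits.
class ScopeTimer {
public:
    explicit ScopeTimer(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}
    ~ScopeTimer() {
        const auto end = std::chrono::steady_clock::now();
        ProfileRegistry::instance().add(
            name_, std::chrono::duration<double, std::milli>(end - start_).count());
    }
    ScopeTimer(const ScopeTimer&) = delete;
    ScopeTimer& operator=(const ScopeTimer&) = delete;

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Define DISABLE_PROFILING to compile the instrumentation out entirely.
// Limitation of this sketch: at most one PROFILE_SCOPE per scope.
#ifndef DISABLE_PROFILING
#define PROFILE_SCOPE(name) ScopeTimer profile_scope_timer_(name)
#else
#define PROFILE_SCOPE(name) ((void)0)
#endif

// Example of a manually instrumented function (hypothetical workload).
inline long sum_to(long n) {
    PROFILE_SCOPE("sum_to");
    long total = 0;
    for (long i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```

After a run, `ProfileRegistry::instance().get("sum_to")` holds the call count and total time for that scope. The cost is exactly the drawback described above: you only learn about the places you thought to instrument.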

Dealer answered 10/12, 2010 at 23:6 Comment(0)

Check out XPerf

This is a free, non-invasive, and extensible profiler offered by MS. It was developed by Microsoft to profile Windows.

Lim answered 11/12, 2010 at 22:14 Comment(0)

Two more tool suggestions.

Luke Stackwalker has a cute name (even if it's trying a bit hard for my taste), it won't cost you anything, and you get the source code. It claims to support multi threaded programs, too. So it is surely worth a spin.

http://lukestackwalker.sourceforge.net/

Also Glowcode, which I've had pointed out to me as worth using:

http://www.glowcode.com/

Unfortunately I haven't done any PC work for a while, so I haven't tried either of these. I hope the suggestions are of help anyway.

Theatheaceous answered 12/12, 2010 at 0:27 Comment(0)

If you're suspicious of the event loop, could you override QCoreApplication::notify() and do some manual profiling (one or two maps of senders/events to counts/time)?

I'm thinking that you first log the frequency of event types, then examine those events more carefully (which object sends it, what does it contain, etc). Signals across threads are queued implicitly, so they end up in the event loop (as do explicitly queued connections, obviously).

We've done it to trap and report exceptions in our event handlers, so really, every event goes through there.

Just an idea.
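As a rough sketch of the "maps of events to counts/time" part, here is the bookkeeping on its own, written without Qt so it stands alone. `EventStats`, `EventKey`, and `timed` are hypothetical names; the event type is a plain `int` and the receiver a class-name string, standing in for what `QEvent::type()` and `QObject::metaObject()->className()` would supply. The commented-out override at the bottom shows where it would plug in, and is an assumption about your application class, not tested code.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Key: (event type id, receiver class name).
using EventKey = std::pair<int, std::string>;

struct EventStat {
    unsigned long count = 0;
    double total_ms = 0.0;
};

// Counts and times events per key. notify() can be reached from any
// thread that runs an event loop, so the table is guarded by a mutex.
class EventStats {
public:
    void record(int type, const std::string& receiver, double ms) {
        std::lock_guard<std::mutex> lock(mutex_);
        EventStat& stat = stats_[EventKey(type, receiver)];
        ++stat.count;
        stat.total_ms += ms;
    }

    // Runs `deliver` (the real event delivery), records how long it took,
    // and passes its result through unchanged.
    template <typename Deliver>
    bool timed(int type, const std::string& receiver, Deliver&& deliver) {
        const auto start = std::chrono::steady_clock::now();
        const bool handled = deliver();
        const auto end = std::chrono::steady_clock::now();
        record(type, receiver,
               std::chrono::duration<double, std::milli>(end - start).count());
        return handled;
    }

    // Copy of the table, safe to inspect while events keep flowing.
    std::map<EventKey, EventStat> snapshot() {
        std::lock_guard<std::mutex> lock(mutex_);
        return stats_;
    }

private:
    std::mutex mutex_;
    std::map<EventKey, EventStat> stats_;
};

// In a Qt application the call site would be roughly this
// (ProfiledApp is a hypothetical QApplication subclass; not compiled here):
//
//   bool ProfiledApp::notify(QObject* receiver, QEvent* event) {
//       return stats_.timed(event->type(),
//                           receiver->metaObject()->className(),
//                           [&] { return QApplication::notify(receiver, event); });
//   }
```

Dumping `snapshot()` sorted by count gives the frequency pass described above; sorting by `total_ms` shows which event/receiver pairs are worth examining more carefully.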

Hardiness answered 13/12, 2010 at 18:32 Comment(3)

That's a lovely idea! I'm not accustomed to a QT environment, having done most of my work with pyGTK here-to-fore. Thank you! – Microphone 13/12, 2010 at 18:47

Do you have a recommended way of sourcing and resolving the nature of given signals? – Microphone 13/12, 2010 at 19:30

I've only done it for signals with QStateMachine::SignalEvent, which doesn't seem to be the same. The source should still be the QObject* object parameter. Maybe MetaCall is the type for all signals (seems likely), but I'm not sure. This goes a bit beyond my experience, but peeking into the Qt source might glean some truth. (Or, ask a more pointed question w.r.t. queued signal invocations here on SO .. :) – Hardiness 13/12, 2010 at 20:25

I use xperf/ETW for all of my profiling needs. It has a steep learning curve but is incredibly powerful. If you are profiling on Windows then you must know xperf. I frequently use this profiler to find performance problems in my code and in other people's code.

In the configuration that I use it:

xperf grabs CPU samples from every core that is executing code every ms. The sampling rate can be increased to 8 kHz and the samples include user-mode and kernel code. This allows finding out what a thread is doing while it is running.
xperf records every context switch (allowing for perfect reconstruction of how much time each thread uses), plus call stacks for when threads are switched in, plus call stacks for what thread readied another thread, allowing tracing of wait chains and finding out why a thread is not running
xperf records all file I/O from all processes
xperf records all disk I/O from all processes
xperf records what window is active, the CPU frequency, CPU power state, UI delays, etc.
xperf can also record all heap allocations from one process, all virtual allocations from all processes, and much more.

That's a lot of data, all on one timeline, for all processes. No other profiler on Windows can do that.

I have blogged extensively about how to use xperf/ETW. These blog posts, and some professional-quality training videos, can be found here: http://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/

If you want to find out what might happen if you don't use xperf read these blog posts: http://randomascii.wordpress.com/category/investigative-reporting/ These are tales of performance problems I have found in other people's code, that should have been found by the developers. This includes mshtml.dll being loaded into the VC++ compiler, a denial of service in VC++'s find-in-files, thermal throttling in a surprising number of customer machines, slow single-stepping in Visual Studio, a 4 GB allocation in a hard-disk driver, a powerpoint performance bug, and more.

Disillusionize answered 9/12, 2010 at 3:36 Comment(0)

I can tell you what I use everyday.

a) AMD Code Analyst

It is easy, and it will give you a quick overview of what is happening. It will be ok for most of the time.
With AMD CPUs, it will tell you info about the CPU pipeline, but you need this only if you have heavy loops, like in graphics engines, video codecs, etc.

b) VTune.

It is very well integrated in VS2008.
After you know the hotspots, you need to sample not only time, but other things like cache misses and memory usage. This is very important. Set up a sampling session, and edit the properties. I always sample for time, memory read/write, and cache misses (three different runs).

But more than the tool, you need to get experience with profiling. And that means understanding how the CPU/Memory/PCI works... so, this is my 3rd option

c) Unit testing

This is very important if you are developing a big application that needs huge performance. If you cannot split the app in some pieces, it will be difficult to track cpu usage. I dont test all the cases and classes, but I have hardcoded executions and input files with important features.

My advice is to use random sampling in several small tests, and to try to standardise a profiling strategy.

Peta answered 9/12, 2010 at 3:36 Comment(1)

AMD Code Analyst is unstable in my dev environment, and VTune explicitly does not support it. :S – Microphone 16/12, 2010 at 4:26

Edit: I see now you mentioned this in your first post. Dammit, I never thought I'd be that guy.

You can use Pin to instrument your code with finer granularity. I think Pin would let you create a tool to count how many times you enter a function or how many clockticks you spend there, roughly emulating something like VTune or CodeAnalyst. Then you could strip down which functions get instrumented until your timing issues go away.

Burgas answered 15/12, 2010 at 17:48 Comment(1)

Actually, PIN was what I first reached for. There's actually something called PIN Play that would be perfect, but it's not for release outside Intel. I'm not sure I remember enough about using PIN to bodge together something really good, but... – Microphone 15/12, 2010 at 18:14

Just to throw it out, even though it's not a full-blown profiler: if all you're after is hung event loops that take too long processing an event, an ad-hoc tool is a simple matter in Qt. That approach could easily be expanded to keep track of how long each event took to process, what those events were, and so on. It's not a universal profiler, but an event-loop-centric one.

In Qt, all cross-thread signal-slot calls are delivered via the event loop, as are timers, network and serial port notifications, and all user interaction. Thus, observing the event loops is a big step towards understanding where the application is spending its time.
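To make the idea concrete, here is a minimal sketch of per-event timing using only the standard library. It is not Qt code: `Event`, `EventStats`, and `TimedEventLoop` are hypothetical names invented for illustration, and the toy queue stands in for a real event loop. In an actual Qt application the same measurement would typically go into an override of `QCoreApplication::notify`, which this sketch deliberately avoids so it stays self-contained.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for a queued event: a label plus the work to run.
struct Event {
    std::string name;
    std::function<void()> handler;
};

// Accumulated timing for one kind of event.
struct EventStats {
    long long count = 0;
    std::chrono::nanoseconds total{0};
    std::chrono::nanoseconds worst{0};
};

// Toy event loop that measures how long every dispatch takes.
class TimedEventLoop {
public:
    void post(Event e) { queue_.push(std::move(e)); }

    // Drain the queue, recording the duration of each handler by event name.
    void run() {
        using Clock = std::chrono::steady_clock;
        while (!queue_.empty()) {
            // Take the event out first so a handler may safely post new events.
            Event e = std::move(queue_.front());
            queue_.pop();

            const auto start = Clock::now();
            e.handler();
            const auto elapsed =
                std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - start);

            EventStats& s = stats_[e.name];
            ++s.count;
            s.total += elapsed;
            if (elapsed > s.worst) s.worst = elapsed;
        }
    }

    const std::map<std::string, EventStats>& stats() const { return stats_; }

private:
    std::queue<Event> queue_;
    std::map<std::string, EventStats> stats_;
};
```

Posting a few events, calling `run()`, and then printing `stats()` sorted by `worst` would show which kind of event stalls the loop longest. Keeping only a count, a total, and a worst case per event name keeps the overhead small, which matters when the loops are timing-sensitive.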

Smelly answered 9/12, 2010 at 3:36 Comment(0)

I just finished the first usable version of CxxProf, a portable, manually instrumented profiling library for C++.

It fulfills the following goals:

Easy integration
Easily remove the lib during compile time
Easily remove the lib during runtime
Support for multithreaded applications
Support for distributed systems
Keep impact to a minimum

These points were ripped from the project wiki; have a look there for more details.

Disclaimer: I'm the main developer of CxxProf
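For readers unfamiliar with manual instrumentation, the general technique can be sketched roughly as follows. This is not CxxProf's API: `ScopeTimer`, `PROFILE_SCOPE`, `PROFILE_DISABLED`, `profileSnapshot`, and `sumTo` are all hypothetical names made up for this illustration. The sketch shows an RAII timer that records into a mutex-protected table, plus a macro that compiles to nothing when profiling is switched off.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Hypothetical per-label totals: number of hits and accumulated nanoseconds.
struct ScopeTotals {
    long long hits = 0;
    long long nanoseconds = 0;
};

// Hypothetical global registry, guarded by a mutex so threads can share it.
struct ProfileRegistry {
    std::mutex mutex;
    std::map<std::string, ScopeTotals> totals;
};

inline ProfileRegistry& profileRegistry() {
    static ProfileRegistry registry;
    return registry;
}

// Returns a copy of the totals taken under the lock.
inline std::map<std::string, ScopeTotals> profileSnapshot() {
    ProfileRegistry& r = profileRegistry();
    std::lock_guard<std::mutex> lock(r.mutex);
    return r.totals;
}

// RAII timer: starts on construction, records into the registry on destruction.
class ScopeTimer {
public:
    explicit ScopeTimer(const char* label)
        : label_(label), start_(std::chrono::steady_clock::now()) {}

    ~ScopeTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        ProfileRegistry& r = profileRegistry();
        std::lock_guard<std::mutex> lock(r.mutex);
        ScopeTotals& t = r.totals[label_];
        ++t.hits;
        t.nanoseconds +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    }

    ScopeTimer(const ScopeTimer&) = delete;
    ScopeTimer& operator=(const ScopeTimer&) = delete;

private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};

// Building with PROFILE_DISABLED defined removes all instrumentation.
#ifdef PROFILE_DISABLED
#define PROFILE_SCOPE(label) ((void)0)
#else
#define PROFILE_CONCAT_INNER(a, b) a##b
#define PROFILE_CONCAT(a, b) PROFILE_CONCAT_INNER(a, b)
#define PROFILE_SCOPE(label) ScopeTimer PROFILE_CONCAT(profileScope_, __LINE__)(label)
#endif

// Example of an instrumented function.
inline long long sumTo(int n) {
    PROFILE_SCOPE("sumTo");
    long long total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}
```

Taking the lock in every destructor is the simplest correct choice but adds contention in hot paths; a real library would more likely buffer per thread and merge later.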

Ecclesiastes answered 9/12, 2010 at 3:36 Comment(0)

I use Orbit profiler: easy, open source, and powerful! https://orbitprofiler.com/

Exogamy answered 9/12, 2010 at 3:36 Comment(0)

There are lots of profilers listed here and I've tried a few of them myself - however I ended up writing my own based on this:

http://code.google.com/p/high-performance-cplusplus-profiler/

It does of course require that you modify the code base, but it's perfect for narrowing down bottlenecks and should work on all x86 CPUs. It uses rdtsc, which could be a problem on multi-core boxes, but the timing is purely indicative anyway, so I find it sufficient for my needs.
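As a rough illustration of rdtsc-based timing (not code from the linked project), the sketch below reads the time-stamp counter through the `__rdtsc()` compiler intrinsic and measures a callable in raw ticks. `readTicks`, `ticksFor`, and the `SKETCH_HAVE_RDTSC` macro are hypothetical names, and the `std::chrono` fallback for non-x86 builds is an addition of mine so the example compiles everywhere.

```cpp
#include <chrono>
#include <cstdint>

// Pick the header that provides __rdtsc() on x86 compilers.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_HAVE_RDTSC 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <x86intrin.h>
#define SKETCH_HAVE_RDTSC 1
#else
#define SKETCH_HAVE_RDTSC 0
#endif

// Read a cheap tick counter: the TSC on x86, steady_clock ticks elsewhere.
inline std::uint64_t readTicks() {
#if SKETCH_HAVE_RDTSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Run fn once and return the elapsed ticks. The result is indicative only:
// nothing here pins the thread to a core or serializes the instruction stream.
template <typename Fn>
std::uint64_t ticksFor(Fn&& fn) {
    const std::uint64_t begin = readTicks();
    fn();
    const std::uint64_t end = readTicks();
    return end - begin;
}
```

Ticks are not seconds, and readings taken on different cores may not be comparable on older hardware, so the numbers are best used to rank code paths against each other rather than as absolute timings.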

Creolacreole answered 9/12, 2010 at 3:36 Comment(1)

Copy of the project at github.com/michael-mayes/high-performance-cplusplus-profiler with description at floodyberry.wordpress.com/2009/10/07/… – Boo 9/5, 2017 at 20:6

Even though your OS is Win7, can't the program run under XP? How about profiling it under XP? The result should be a hint for Win7.

Carbonado answered 9/12, 2010 at 3:36 Comment(1)

Certainly, it could, but that would require buying a license for a product that may never support your desired dev env well, or may take years to do so. 1.5k is a lot of money to bet, plus the costs in time of imaging and deploying an xp box. – Microphone 17/12, 2010 at 19:5

DevPartner, originally developed by NuMega and now distributed by MicroFocus, was once the solution of choice for profiling and code analysis (memory and resource leaks, for example). I haven't tried it recently, so I cannot assure you it will help you, but I once had excellent results with it, so it is an alternative I am considering reinstalling in our code quality process (they provide a 14-day trial).

Vanderpool answered 12/12, 2010 at 14:31 Comment(0)
