One could use a profiler, but why not just halt the program? [closed]

If something is making a single-thread program take, say, 10 times as long as it should, you could run a profiler on it. You could also just halt it with a "pause" button, and you'll see exactly what it's doing.

Even if it's only 10% slower than it should be, if you halt it more times, before long you'll see it repeatedly doing the unnecessary thing. Usually the problem is a function call somewhere in the middle of the stack that isn't really needed. This doesn't measure the problem, but it sure does find it.

Edit: The objections mostly assume that you only take 1 sample. If you're serious, take 10. Any line of code causing some percentage of wastage, like 40%, will appear on the stack on that fraction of samples, on average. Bottlenecks (in single-thread code) can't hide from it.

EDIT: To show what I mean: many objections are of the form "there aren't enough samples, so what you see could be entirely spurious" - vague appeals to chance. But if something of any recognizable description - not just being in a routine, or the routine being active - is in effect for 30% of the time, then the probability of seeing it on any given sample is 30%.

Then suppose only 10 samples are taken. The number of times the problem will be seen in 10 samples follows a binomial distribution, and the probability of seeing it 0 times is .028. The probability of seeing it 1 time is .121. For 2 times, the probability is .233, and for 3 times it is .267, after which it falls off. Since the probability of seeing it fewer than two times is .028 + .121 = .149, the probability of seeing it two or more times is 1 - .149 = .851. The general rule is: if you see something you could fix on two or more samples, it is worth fixing.

In this case, the chance of seeing it on two or more of 10 samples is 85%. If you're in the 15% who don't see it, just take more samples until you do. (If the number of samples is increased to 20, the chance of seeing it two or more times rises above 99%.) So it hasn't been precisely measured, but it has been precisely found, and it's important to understand that it could easily be something a profiler could not actually find, such as something involving the state of the data rather than the program counter.
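The binomial arithmetic here is easy to check with a few lines of Python (a sketch; it assumes the samples are independent, which is exactly what random pausing gives you):

```python
from math import comb

def p_seen_at_least(k, n, p):
    """Probability that a problem active for fraction p of the time
    shows up on at least k of n independent stack samples."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A 30% problem, 10 samples: seen on two or more samples ~85% of the time.
print(round(p_seen_at_least(2, 10, 0.30), 3))  # 0.851
# With 20 samples the chance rises above 99%.
print(round(p_seen_at_least(2, 20, 0.30), 3))  # 0.992
```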

Schaerbeek answered 5/11, 2008 at 19:49 Comment(4)
Does "halting the program" work in multi-threaded applications?Licit
Sadly no, that's more of a challenge. I usually concentrate on the code in each thread by itself. If there are messages between processes, I use a logging technique. Not easy, but it works.Schaerbeek
You may be getting downvoted for two reasons. 1) "Why isn't it better known?" is hardly a question, and can't be answered. 2) You present an argumentative case for your method. "My way is great, why aren't you all on board yet?" isn't a good social tactic - it doesn't elicit a thoughtful response.Headroom
Also, who doesn't try doing this before breaking out the profiler?Hoagland

On Java servers it's always been a neat trick to do 2-3 quick Ctrl-Breaks in a row and get 2-3 thread dumps of all running threads. Simply looking at where all the threads "are" can pinpoint your performance problems extremely quickly.

This technique can reveal more performance problems in 2 minutes than any other technique I know of.
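The same trick can be scripted in other runtimes. As a sketch (Python, using the CPython-specific `sys._current_frames()`), the following takes a few "thread dumps" and tallies every line that appears on any thread's stack:

```python
import collections
import sys
import time
import traceback

def snapshot_threads(n_snapshots=3, interval=0.1):
    """Take a few "thread dumps" of every running thread and count how
    often each (file, line, function) appears anywhere on any stack."""
    counts = collections.Counter()
    for _ in range(n_snapshots):
        for frame in sys._current_frames().values():
            for entry in traceback.extract_stack(frame):
                counts[(entry.filename, entry.lineno, entry.name)] += 1
        time.sleep(interval)
    return counts
```

Lines that show up in most snapshots are where the program "is"; with two or three snapshots a dominant bottleneck usually stands out, just as with Ctrl-Break.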

Illogicality answered 25/11, 2008 at 12:23 Comment(14)
There's still the issue of tracking down asynchronous problems, but for that other methods are needed.Schaerbeek
Have you experimented by doing this programmatically with the Java 6 additions allowing for stack trace introspection?Hildehildebrand
I'm not sure we're thinking about the same thing; a thread dump gives you a view of where all your threads are at any given point in time. If you're going to do this inside your code, you'll need to know where to put it, right ?Illogicality
Note that this is exactly how the "cpu=samples" feature of hprof works. It will wake up regularly (default 10 ms) and record the current stack trace of every thread. Handy if you find it difficult to press Ctrl-Break 100 times a second ;-).Shirl
@sleske: That's great that hprof samples the stack. That's a massive improvement over PC-only samplers. Some points: 1) can it be disabled while waiting for user input? 2) Does it tell you by line (not just by function) % of samples containing that line? 3) What do 1000s of samples tell you that 10 or 20 don't? 4) Does it let you examine individual samples, so you can see why the program spends its time?Schaerbeek
@Mike: To the best of my knowledge: 1) no, I don't think so; 2) yes; 3) It helps if the code is sufficiently optimized that you no longer have one really big hotspot (plus you can always take less samples if you want); 4) I don't understand that question :-/ . Anyway, check out hprof, it's a nice tool (even though you may not need it).Shirl
@sleske: 1) That's a problem because it dilutes with irrelevant samples. 2) Good. 3) i.e. If your biggest problem is very small, why bother optimizing? My experience is when you think there are no big problems, there actually are (#926766). 4) See point 3 of (https://mcmap.net/q/15226/-alternatives-to-gprof-closed/…). I'm sure hprof is a nice tool. So is Zoom. I need effective, not nice.Schaerbeek
@MikeDunlavey 1) it’s not really a problem, as the profile contains information about the CPU time and the elapsed time, so you can distinguish between waiting time and calculation time while analyzing, in case it matters. Sometimes, the problem is avoidable waiting time. 3) ten samples can happily miss the relevant part, or point to the wrong part. Mind that the ten samples are misleading as soon as only four point to the right code and six point to the irrelevant part. And by asking for 1) you named another reason: a program may alternate between different operations. And I’m lazy…Inamorata
@Holger: I know this may be hard to understand, but the statistics are simple. Whatever the speed-bug is, it takes a certain fraction of time, like from 1% to 99%, and that fraction is the most time-reduction you can get by removing it. If it's only 1%, why bother? If it's 99%, a single sample will almost certainly spot it. If it's 50% each sample has a 50% chance of spotting it, so the chance that 10 samples won't spot it is 1/1024. Seeing it on 1 sample doesn't mean much, but seeing it on 2 or more samples nails it. (Not measure, nail.)Schaerbeek
@MikeDunlavey your mistake is thinking that seeing it implies understanding that you’ve seen it. If you take two samples and these two samples differ, what tells you which of the two is the hotspot? Seeing it on two samples out of ten isn’t nailing it, as there are eight other samples pointing somewhere else and if only three of them point to the same wrong location by chance, you already have a misguiding result. But anyway, samples may differ and still identify the same hotspot, as each sample may be a stack trace containing between 30 and 50 entries at a different depth than the others.Inamorata
@Holger: Here's what you ask on each sample. Why is it spending this moment? Typical kind of answer: "It's doing I/O to read a DLL to look up a string translation." OK, why is it doing that? What is the actual string? "foobar"? Is that a string that needs translation? No? OK, we don't actually need to do this. Get another sample. If we get another sample where it's doing I/O to translate another string that's not necessary, Bingo! We've nailed a speed bug! Now, timing functions will accomplish none of this. I know this is not what kids are taught (I was a professor), but...Schaerbeek
@Holger: ... no problem will escape this method, while problems can easily escape from other profilers. The problem with profilers is that they assume you need a large number of samples, and since you can't look at all those samples they assume you need summaries of various kinds, none of which answer the why question. If this is hard to understand, here's a short video. There's also plenty more material on stackoverflow explaining the statistics behind it, and why no problem can escape it,Schaerbeek
@MikeDunlavey first of all, there’s nothing wrong with the ideas of your question. Sampling is a viable method, which I prefer over Instrumentation approaches. Contrary to the pushback of some older comments and answers, this has been understood and profiling tools offer that feature nowadays. I also agree that sometimes, a single sample may already help, though in that case, I rather hit “suspend” in the debugger, which helps even more. But even getting a thread dump equivalent to CTRL+BREAK is offered by the same tools, at least with VisualVM.Inamorata
@MikeDunlavey so it has all you want and is not harder to use than pressing CTRL+BREAK in a console. The fun starts, when searching for the relevant, even in single thread dump with large stack traces and/or lots of threads. And you’re also talking about doing it repeatedly. So what’s wrong with a tool that does exactly that automatically? And since that tool can handle one or ten or thousand samples with ease, why should I restrict myself to ten, instead of getting more precision and confidence about what is actually relevant to the overall performance, with no additional effort?Inamorata

Because sometimes it works, and sometimes it gives you completely wrong answers. A profiler has a far better record of finding the right answer, and it usually gets there faster.

Licit answered 5/11, 2008 at 19:52 Comment(12)
Beg to differ. If a statement shows up on X% of N samples, then if you can get rid of it you'll save X% of time. Of course, if N is big, you'll know X with more precision, but who cares?Schaerbeek
@mike you'd be surprised by how many people care. people have full time jobs working on or using profiling software. in many firms and applications it is critical and when the landscape gets big @Paul is on the money.Peta
Timing it before and after gives plenty accuracy.Schaerbeek
In the sort of programs I'm working on, I'd have to hit pause about a thousand times before I got any meaningful results.Licit
Paul, we should get specific about what the type of program is. I've had a few bake-offs and never disappointed. If you take lots of samples while it's running (not waiting) and you can't find anything to optimize, your code is very unusual.Schaerbeek
@Mike, I'm not saying I don't find anything to optimize, I say I find the WRONG thing to optimize. First rule of Optimization Club is "Never Optimize unnecessarily."Licit
Have you actually tried this? It works for a very simple reason. If an instruction is on the call stack for X% of the time, then removing it would save exactly that much. Guaranteed. The expectation of the number of samples that shows the instruction is X%, plain and simple.Schaerbeek
... now is it clear why it CAN'T give you a wrong answer?Schaerbeek
As I implied in my comment to your original post, most of my programming is highly threaded. This method doesn't work for that, but a profiler does.Licit
Yeah, threads are harder to handle. If profilers would gen the same statistic that the halting method does, then I wouldn't have a gripe. They could, but I think only some do, and it needs to be understood that that's the important statistic.Schaerbeek
Regarding threads, this programmer #266873 uses it, no problem.Schaerbeek
@Paul, for the archives - could you describe in more detail what programs you are working on, so we get an idea of areas where we might just as well grab the profiler instead of looking at multiple stack traces?Hildehildebrand

Doing this manually can't really be called "quick" or "effective", but there are several profiling tools which do this automatically; also known as statistical profiling.

Dubenko answered 5/11, 2008 at 19:51 Comment(12)
The devil's in the details. It's not just taking samples that matters, it's what gets recorded and how it is summarized. You need to record the entire call stack on each sample. Then you need, for each statement (not function, statement) in the samples, the fraction of samples it appears on.Schaerbeek
and you need to take the samples when it matters, not when hanging for user input.Schaerbeek
"You need to record the entire call stack on each sample" - There are sampling profilers which do this, and they are very useful. One recent addition to the Windows family: Xperf (Vista required) blogs.msdn.com/pigscanfly/archive/2008/02/09/…Blende
I'm glad they are finally coming around to doing that. I hope they are also raising the visibility of the statistic that I think matters most - fraction of samples containing each statement (in the interval of interest). Too bad you need Vista.Schaerbeek
Use OS X. The Activity Viewer has this facility built right in. Very handy.Hildehildebrand
@Mike Dunlavey: This answer captures my confusion with your question perfectly. You ask: "why use a profiler, when you can just manually sample the program a few times?", and I'm thinking "but I could use my profiler and get several thousand samples in a fraction of the time". I often use the one built in to Activity Viewer, because it is always available on any OSX machine. If I want the full stack traces for every sample, I can get that from Instruments. Doing it manually with breakpoints seems intrusive, slow and ineffective by comparison.Buffybuford
@Mankarse: Trying all over again to make the point. What do you want to do: A) make precise measurements or B) find juicy bottlenecks? (I bet you assumed B requires A.) If there are bottlenecks taking 1) 50%, 2) 25%, 3) 12.5% and 4) 6.25% of the time, fixing all of them gives you 16x speedup. If you miss any one of them you get 8x or less. Now, can measuring find every single one? ...Schaerbeek
@Mankarse: ... That’s a long bet, but if you take 10 samples and study them (not just count them), bottleneck (1) will be staring you in the face on 5 of them. (So what did you need the other thousand samples for?) If you fix it, and repeat, bottleneck (2) will do the same thing. That’s how you get all the bottlenecks, not missing any.Schaerbeek
@MikeDunlavey: Sure, but if it takes less time to take 1000 samples automatically than to take 10 manually, how is it any better? The extra samples aren't costing anything, and they aren't exactly going to hurt.Buffybuford
@Mankarse: Suppose you have a bunch of stack samples, 10 or 1000. Now what? Each one is information-rich about a point in time. If you don't study them, but only make summaries, you miss opportunities, and that is critical for this reason: Typically there are multiple opportunities, like 6 as here, for a 730x speedup. Suppose most are found, but not every one. Any one missed severely reduces the final speedup factor. So the issue isn't the number of samples, it is the number you take the trouble to understand.Schaerbeek
@Mankarse: Trying to say it better: Closely examining 10 samples, treating each one as an opportunity to save time, shows you possible speedups that are not even suspected, and certainly not found when there are 1000 samples not examined closely. One might be tempted to assume the converse is also true, but that is not so. The effectiveness of the technique comes from the act of closely examining the samples, and statistics guarantees that not many are required.Schaerbeek
@MikeDunlavey: When did I say I would not examine the samples closely? All I said was that it is faster to get these samples via a tool that is designed for taking them, rather than via a manual process using a debugger. Once you have the samples, you can analyse them as you wish. Even if you only want 10 samples, it is still faster to get those samples (and another 990) using a profiler than to press 'pause' and 'resume' on your debugger 10 times.Buffybuford

Callstack sampling is a very useful technique for profiling, especially when looking at a large, complicated codebase that could be spending its time in any number of places. It has the advantage of measuring the CPU's usage by wall-clock time, which is what matters for interactivity, and getting callstacks with each sample lets you see why a function is being called. I use it a lot, but I use automated tools for it, such as Luke Stackwalker and OProfile and various hardware-vendor-supplied things.

The reason I prefer automated tools over manual sampling for the work I do is statistical power. Grabbing ten samples by hand is fine when you've got one function taking up 40% of runtime, because on average you'll get four samples in it, and always at least one. But you need more samples when you have a flat profile, with hundreds of leaf functions, none taking more than 1.5% of the runtime.

Say you have a lake with many different kinds of fish. If 40% of the fish in the lake are salmon (and 60% "everything else"), then you only need to catch ten fish to know there's a lot of salmon in the lake. But if you have hundreds of different species of fish, and each species is individually no more than 1%, you'll need to catch a lot more than ten fish to be able to say "this lake is 0.8% salmon and 0.6% trout."
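The fish-counting intuition can be made quantitative with the standard error of a proportion, sqrt(p(1-p)/n). A sketch (the error targets below are illustrative choices, not rules):

```python
def samples_needed(p, standard_error):
    """Roughly how many samples until a proportion near p is estimated
    with the given standard error (normal approximation)."""
    return round(p * (1 - p) / standard_error**2)

# Seeing that ~40% of the lake is salmon: an SE of 15% is plenty.
print(samples_needed(0.40, 0.15))    # 11 fish
# Telling a 0.8% species from a 0.6% one: the SE must be ~0.1%.
print(samples_needed(0.008, 0.001))  # 7936 fish
```

That three-orders-of-magnitude jump is the difference between hunting a fat bottleneck by hand and resolving a flat profile, which is exactly where automated samplers earn their keep.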

Similarly in the games I work on, there are several major systems, each of which calls dozens of functions in hundreds of different entities, and all of this happens 60 times a second. Some of those functions' time funnels into common operations (like malloc), but most of it doesn't, and in any case there's no single leaf that occupies more than 1000 μs per frame.

I can look at the trunk functions and see, "we're spending 10% of our time on collision", but that's not very helpful: I need to know exactly where in collision, so I know which functions to squeeze. Just "do less collision" only gets you so far, especially when it means throwing out features. I'd rather know "we're spending an average 600 μs/frame on cache misses in the narrow phase of the octree because the magic missile moves so fast and touches lots of cells," because then I can track down the exact fix: either a better tree, or slower missiles.

Manual sampling would be fine if there were a big 20% lump in, say, stricmp, but with our profiles that's not the case. Instead I have hundreds of functions that I need to get from, say, 0.6% of frame to 0.4% of frame. I need to shave 10 μs off every 50 μs function that is called 300 times per second. To get that kind of precision, I need more samples.

But at heart what Luke Stackwalker does is what you describe: every millisecond or so, it halts the program and records the callstack (including the precise instruction and line number of the IP). Some programs just need tens of thousands of samples to be usefully profiled.
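The core loop of such a tool fits in a few lines. Here is a sketch (not Luke Stackwalker itself): a minimal Python sampler that uses a SIGPROF interval timer (Unix-only, main thread only) to interrupt the program and record the full stack on every tick:

```python
import collections
import signal
import traceback

stack_counts = collections.Counter()

def _record_stack(signum, frame):
    # On every timer tick, tally each line on the interrupted stack.
    for entry in traceback.extract_stack(frame):
        stack_counts[(entry.filename, entry.lineno, entry.name)] += 1

def profile(fn, hz=1000):
    """Run fn while sampling its stack roughly hz times per CPU-second."""
    signal.signal(signal.SIGPROF, _record_stack)
    signal.setitimer(signal.ITIMER_PROF, 1.0 / hz, 1.0 / hz)
    try:
        fn()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop the timer
    return stack_counts
```

Sorting `stack_counts` by count gives the familiar "percent of samples containing each line" report.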

(We've talked about this before, of course, but I figured this was a good place to summarize the debate.)

Halfbreed answered 28/11, 2011 at 0:46 Comment(5)
+1 Thanks for the response. I assume you've seen what I posted on sourceforge. What it says to me is the money lies not in this routine or that line of code, but in some description I can make of what the samples say, that may apply in a lot of places. Anyway, it would be fun to get some exposure to the kind of software you're talking about, to see first-hand where I'm wrong. Cheers. (BTW I'm fully aware of statistical power. That's my day job - building products for biostatistics simulation and analysis.)Schaerbeek
@MikeDunlavey I know you know, this answer was more about summarizing the points of view for other readers. =)Halfbreed
@MikeDunlavey I really wish I could show you profiles from the stuff I'm working on. Unfortunately I'm working on a massive, multimillion-line app under NDA. I wonder if the bake-offs you've done had selection bias in that they were single-purpose apps with small dev teams and a few dominant hotspots. Our app is huge and has had a performance team sweep through it annually for a decade; all the big lumps have been squashed long ago. (Any new code that ate more than 5% of loop would be instantly noticed and ridiculed.) These might give a flavor of the perf work we do: bitly.com/sJndaKHalfbreed
I sympathize with your constraints. I'm also dubious when I hear all the big performance bugs have been squashed, because the "state of the art" is to use profilers, and they miss stuff, big stuff, because they have a selection bias saying that problems are in particular places - hotspots. If they say "there are no big problems" they are really saying "we didn't find any". (I'm not asserting that the big problems in there, like the choice of vector class, are necessarily easy to fix, only that they can be unambiguously identified as costing a large percent, compared to an alternative.)Schaerbeek
I just read through the ppt in the first download of that link you gave. It was very impressive, I must say, but it deals in fixing the kinds of problems you can find with the tools mentioned. Not much in the form of macro-level optimization. The very fact that these games tend to be CPU, not GPU, bound, makes me suspect there's room for improvement, possibly quite a bit.Schaerbeek

There's a difference between things that programmers actually do, and things that they recommend others do.

I know of lots of programmers (myself included) that actually use this method. It only really helps to find the most obvious of performance problems, but it's quick and dirty and it works.

But I wouldn't really tell other programmers to do it, because it would take me too long to explain all the caveats. It's far too easy to make an inaccurate conclusion based on this method, and there are many areas where it just doesn't work at all. (for example, that method doesn't reveal any code that is triggered by user input).

So just like using lie detectors in court, or the "goto" statement, we just don't recommend that you do it, even though they all have their uses.

Expeditionary answered 5/11, 2008 at 20:9 Comment(1)
I'm glad you use it. I suppose it takes some practice. It certainly takes explaining. I've never had it give me wrong information, and hardly ever obvious. On fast code, like user input, you gotta puff it up with a temporary outer loop.Schaerbeek

I'm surprised by the religious tone on both sides.

Profiling is great, and certainly more refined and precise when you can do it. Sometimes you can't, and it's nice to have a trusty back-up. The pause technique is like the manual screwdriver you use when your power tool is too far away or the batteries have run down.

Here is a short true story. An application (kind of a batch processing task) had been running fine in production for six months, when suddenly the operators called the developers because it was going "too slow". They weren't going to let us attach a sampling profiler in production! You have to work with the tools already installed. Without stopping the production process, just using Process Explorer (which the operators had already installed on the machine), we could see a snapshot of a thread's stack. You can glance at the top of the stack, dismiss it with the Enter key, and get another snapshot with another mouse click. You can easily get a sample every second or so.

It doesn't take long to see whether the top of the stack is most often in the database client library DLL (waiting on the database), in another system DLL (waiting for a system operation), or actually in some method of the application itself. In this case, if I remember right, we quickly noticed that 8 times out of 10 the application was in a system DLL call, reading or writing a network file. Sure enough, recent "upgrades" had changed the performance characteristics of a file share. Without a quick-and-dirty (and system-administrator-sanctioned) approach to see what the application was doing in production, we would have spent far more time trying to measure the issue than correcting it.
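A hand-sampling session like that can be sanity-checked with the same binomial reasoning used elsewhere in this thread. A sketch (the 50% figure below is an arbitrary skeptic's hypothesis, not from the story):

```python
from math import comb

def chance_if_really(p, seen, n):
    """If the true time fraction were only p, the probability of the
    suspect appearing on at least `seen` of n snapshots."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(seen, n + 1))

# 8 of 10 snapshots in the network-file call: if the app really spent
# only half its time there, that observation would be quite unlikely.
print(round(chance_if_really(0.5, 8, 10), 3))  # 0.055
```

So even ten casual Process Explorer glances make "the file share is eating most of our time" a statistically defensible conclusion, not a hunch.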

On the other hand, when performance requirements move beyond "good enough" to really pushing the envelope, a profiler becomes essential, so that you can try to shave cycles from all of your closely-tied top ten or twenty hot spots. Even if you are just trying to hold to a moderate performance requirement during a project, when you can get the right tools lined up to help you measure and test, and even get them integrated into your automated test process, it can be fantastically helpful.

But when the power is out (so to speak) and the batteries are dead, it's nice to know how to use that manual screwdriver.

So the direct answer: Know what you can learn from halting the program, but don't be afraid of precision tools either. Most importantly know which jobs call for which tools.

Fordo answered 28/10, 2009 at 7:13 Comment(3)
"Religious tone" - Ouch! Process Explorer - sounds great, now don't just look at the top of the stack. The "gold nuggets" are in the middle. I agree profilers are precision tools - precision of the wrong thing. They measure time with precision, but (if they take and retain stack samples) they actually know the problem location with high precision, but they don't show it to you, and that's what you're looking for.Schaerbeek
... Sorry, can't leave well enough alone. Here's a (only slightly artificial) case study: #926766 It's tempting to think that a profiler will do a better job when you're really trying to push performance, but when you get down to an actual experiment, that doesn't seem to hold. In fact, I've never seen a story where a profiler was used to really wring out a program through a series of steps, as in that example.Schaerbeek
... I don't mean to give you a hard time. Your story about the file system upgrade showing you an 8 in 10 problem is exactly what I'm talking about. Now I'm just trying to raise awareness that in big software it's really easy to get issues like that in your own code in the form of mid-stack calls, and those are not hot-spots, because the program counter is not there. (In real hot-spots, by my understanding, the memory chip actually has a spot of higher temperature.)Schaerbeek

If we take the question "Why isn't it better known?" then the answer is going to be subjective. Presumably the reason it is not better known is that profiling provides a long-term solution rather than a solution to the current problem. It isn't effective for multi-threaded applications, and isn't effective for applications like games, which spend a significant portion of their time rendering.

Furthermore, in single-threaded applications, if you have a method that you expect to consume the most run time, and you want to reduce the run time of all other methods, then it is going to be harder to determine which secondary methods to focus your efforts on first.

Your process for profiling is an acceptable method that can and does work, but profiling provides you with more information and has the benefit of showing you more detailed performance improvements and regressions.

If you have well-instrumented code then you can examine more than just how long a particular method takes; you can see all the methods.

With profiling:

  • You can then rerun your scenario after each change to determine the degree of performance improvement/regression.

  • You can profile the code on different hardware configurations to determine if your production hardware is going to be sufficient.

  • You can profile the code under load and stress testing scenarios to determine how the volume of information impacts performance

  • You can make it easier for junior developers to visualise the impacts of their changes to your code because they can re-profile the code in six months time while you're off at the beach or the pub, or both. Beach-pub, ftw.

Profiling is given more weight because enterprise code should always have some degree of profiling, because of the benefits it gives to the organisation over an extended period of time. The more important the code, the more profiling and testing you do.

Your approach is valid and is another item in the toolbox of the developer. It just gets outweighed by profiling.

Capone answered 6/11, 2008 at 14:16 Comment(3)
I agree with what you say about profilers as general health-monitoring tools. For finding bottlenecks precisely they only give clues. They don't pinpoint the problem (most of them). They find the haystack, but this method finds the needles.Schaerbeek
Profiling can give you as much info as you want, from per-component to per-statement. It gives it under a variety of scenarios and provides more long-term benefits. With AOP or a VM you don't even need to instrument your code to gain the benefits. The skill of the tool is in the hands of the ownerCapone
Thanks, Ryan. I confess I'm not a profiler expert. All I know about them is what I see from their specs. I'm in a big team, and people talk about them but don't use them. Often I just halt the code a few times and say "Did you know you're spending a lot of time doing this ...?" Oops-didn't mean to.Schaerbeek

Hitting the pause button during the execution of a program in "debug" mode might not provide the right data to perform any performance optimizations. To put it bluntly, it is a crude form of profiling.

If you must avoid using a profiler, a better bet is to use a logger, and then apply a slowdown factor to "guesstimate" where the real problem is. Profilers however, are better tools for guesstimating.

The reason hitting the pause button in debug mode may not give a real picture of application behavior is that debuggers introduce additional executable code that can slow down certain parts of the application. One can refer to Mike Stall's blog post on possible reasons for application slowdown in a debugging environment. The post sheds light on certain reasons like too many breakpoints, creation of exception objects, unoptimized code, etc. The part about unoptimized code is important - "debug" mode results in a lot of optimizations (usually code in-lining and re-ordering) being thrown out the window, to enable the debug host (the process running your code) and the IDE to synchronize code execution. Therefore, hitting pause repeatedly in "debug" mode might be a bad idea.

Tabbie answered 23/11, 2008 at 2:55 Comment(7)
The things you say are true but don't matter, because a single-thread program spends a sequence of cycles, and you need to find out if any of them are being spent for poor reasons. After you fix those, it takes less cycles, and then runs faster.Schaerbeek
In debug mode, sure there's overhead, but it goes away in release mode. The thing about inlining is that it matters in code where the program counter lives. Higher up the call stack it makes no difference, and that's where many problems are.Schaerbeek
I think the problem is confusion between measuring performance and finding performance problems. I suggest separating these goals.Schaerbeek
I've said that profilers help if they sample the entire call stack (some do) and if they tell you, for each instruction (call or non-call) what percentage of stacks it was on. The only remaining point is, for big issues, not many samples are needed.Schaerbeek
Yes, fixing issues will cause the program to run faster. But you might solve the wrong problem. Besides, you have already pointed out the real issue, which is unknown behavior of the program at runtime. The only way to optimize such an application would involve studying the codeflow.Tabbie
Thanks. The keywords are "might" and "would". If you do it, you see that any problem you find and fix is not a wrong problem because it reduces time, and it makes other problems more obvious. Studying codeflow is fine, but it will miss subtle stuff, like compiler-inserted calls.Schaerbeek
This isn't something you do just once. If there are, say, 3 performance problems, you fix 1, and that makes the other 2 take larger percentages. Then you do the 2nd, and so forth. If one of them is "right", don't worry, you'll get it, guaranteed.Schaerbeek
L
8

Sampling profilers are only useful when

  1. You are monitoring a runtime with a small number of threads. Preferably one.
  2. The call stack depth of each thread is relatively small (to reduce the incredible overhead in collecting a sample).
  3. You are only concerned about wall clock time and not other meters or resource bottlenecks.
  4. You have not instrumented the code for management and monitoring purposes (hence the stack dump requests)
  5. You mistakenly believe removing a stack frame is an effective performance improvement strategy, whether or not the inherent costs (excluding callees) are practically zero
  6. You can't be bothered to learn how to apply software performance engineering day-to-day in your job
  7. ....
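
For reference, the kind of in-process stack sampler being criticized here can be sketched in a few lines. This is a toy sketch, assuming CPython's `sys._current_frames`; `slow_helper` and `work` are made-up stand-ins for real code, and a real tool would sample out-of-process to reduce interference:

```python
import collections
import sys
import threading
import time

def sample_stacks(thread, n_samples=20, interval=0.02):
    """Count how often each function appears anywhere on the thread's stack."""
    counts = collections.Counter()
    for _ in range(n_samples):
        frame = sys._current_frames().get(thread.ident)
        seen = set()
        while frame is not None:
            seen.add(frame.f_code.co_name)
            frame = frame.f_back
        counts.update(seen)  # count each function at most once per sample
        time.sleep(interval)
    return counts

def slow_helper():
    time.sleep(0.2)  # the "bottleneck" we hope the samples expose

def work():
    for _ in range(5):
        slow_helper()

worker = threading.Thread(target=work)
worker.start()
counts = sample_stacks(worker)
worker.join()
print(counts)  # slow_helper appears on nearly every sample
```

Even this toy illustrates points 1 and 2 above: it watches a single thread, and walking a deep stack on every sample is where the collection overhead comes from.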
Ladyfinger answered 27/7, 2009 at 21:42 Comment(2)
@William: What you really need to do is decide what you care about. If the system is empirically "too slow" then wall clock time slices are the thing to sample. For each sample you need to find out why it is being spent. In a single-thread program, the stack can often tell you that, but not always, like if it's an interpreter or message-driven. If it's multi-thread, it may be even harder to determine the why, but that's what you need to determine, because to spend fewer units of the desired resource, you need to find those that have a nonessential reason.Schaerbeek
... ignoring unhelpful remarks, like 6, I just scanned your blog entry and absorbed as much as I could in 10 minutes. It seems we are solving different problems. I am less concerned with ongoing health-monitoring, and more with discovery and removal of performance problems. To that end, I don't care about the overhead of sampling, only that it is unbiased. I am not trying to remove stack frames, but unnecessary time-taking operations, which are very often method calls, and the more levels there are, the better the hunting is.Schaerbeek
D
7

These must be some trivial examples that you are working with to get useful results with your method. I can't think of a project where profiling was useful (by whatever method) that would have gotten decent results with your "quick and effective" method. The time it takes to start and stop some applications already puts your assertion of "quick" in question.

Again, with non-trivial programs the method you advocate is useless.

EDIT: Regarding "why isn't it better known"?

In my experience code reviews avoid poor quality code and algorithms, and profiling would find these as well. If you wish to continue with your method that is great - but I think for most of the professional community this is so far down on the list of things to try that it will never get positive reinforcement as a good use of time.

It appears to be quite inaccurate with small sample sets and to get large sample sets would take lots of time that would have been better spent with other useful activities.

Decontrol answered 6/11, 2008 at 14:16 Comment(5)
Actually it works better on bigger software because, since the stack is generally deeper, there are more instructions on it, so more candidates for optimizing. As far as applications taking a long time to start and stop, that's exactly when halting them will find out why.Schaerbeek
So here's the scenario: there's a big system, and it's all been done with code reviews, etc., but there's still a problem. The profiler tells you what state and county contains the problem, but stack sampling tells you the exact doorstep.Schaerbeek
Profilers could tell you this, but for some reason they don't, as I explained in my "answer" below.Schaerbeek
Um, I have used profilers that give this information.Decontrol
Are you sure? Fraction of time on call stack, per statement (not function), in time interval of interest, sorted in decreasing order? I think some can do this. Most do not, from what I read.Schaerbeek
F
7

Stack trace snapshots only allow you to see stroboscopic x-rays of your application. You may require more accumulated knowledge which a profiler may give you.

The trick is knowing your tools well and choosing the best one for the job at hand.

Firepower answered 17/10, 2009 at 9:21 Comment(13)
@Thorbjørn: Well, who can argue with your last sentence? Every tool automates a technique. My point is that the nature of this problem is that the technique of sampling (and analyzing) the stack is little-known, simple, and very effective. What's more, the ambient attitudes people have about performance need to be re-evaluated. For example, if you want to measure performance precisely, that's fine, but if you want to improve performance, measurement misses the point.Schaerbeek
... If I could add, yes you are taking stroboscopic x-rays of your application. (I think that is an excellent metaphor.) Typically there are unexpected things the app is doing that could be replaced for substantial speedup. The fraction of time that would save is the probability they will appear on each stackshot. That's why it works.Schaerbeek
And a very useful tool could be one that programmatically takes a complete application stacktrace and dumps it somewhere. jvisualvm can do it externally, but you may not always be able to attach with jvisualvm (or you may want to do it on a schedule instead of invoking it manually). This requires Java 6.Hildehildebrand
... BTW I'm not alone. There are people on SO who "get it", and so does Jon Bentley, on page 33 of this PDF: dimacs.rutgers.edu/Workshops/EAA/slides/bentley.pdfSchaerbeek
I encourage you to get a Mac, and see that the Activity Monitor allows for doing exactly this, as well as aggregate the results.Hildehildebrand
... BTW you might be interested in the comments on this answer. A user took 4 samples and got a 7x speedup. Later he took 20 samples and got a 3x speedup. (I don't know if that meant he got an overall 21x speedup, or if those were independent, but I have seen that happen.) #890722Schaerbeek
Now I've found out what nagged me about your explanation. You say you need to halt your program - that is incorrect, as what you describe is basically repeated thread dumps, which can be done WITHOUT halting the Java program. Ctrl-Break reportedly does it under Windows, and "kill -3" under Unix, and Java 6 even allows you to do it programmatically.Hildehildebrand
@Thorbjørn: Hmmm... I guess I never thought much about the distinction. The "profiler" I wrote would also take the samples on-the-fly, but IMO nothing compares to getting your fingers right in the codeSchaerbeek
I still encourage you to investigate the possibilities with OS X. You will like it.Hildehildebrand
and by the way... No Java programs are single-threaded any more - you have multiple threads running all the time, and thread pools to do things in parallel, etc. In other words, "halt the program" does not make much sense any more.Hildehildebrand
@Thorbjørn: Regarding Java threading, look at the selected answer (from krosenvold). It's not that "halting the program" doesn't make sense. They just made it harder to do because they don't understand how useful it is. If you're waiting for the program, then part of it is waiting for some other part, and so on, so there is a waiting-chain. Tracking that chain across threads is not simple, but within a thread it is easy (the call stack), and that's what needs to be done. Just because they make it difficult doesn't mean you don't need to do it.Schaerbeek
@Mike Dunlavey: no, it was not a 21x speedup. The second iteration indicated there was insignificant (maybe 10%) to be gained for one kind of operation (the rest, 90%, I/O limited or code execution in a third-party COM library), but that 10% had a much larger impact for another kind of operation that I was not initially optimising.Dachy
A minor comment for anyone reading this: It is my personal opinion now, here a few years later, that the "manually look at stack traces taken some time apart" approach may be fine for small programs with only one or a few threads, but for programs with thousands of threads or more you will need additional tooling, e.g. a profiler.Hildehildebrand
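
The programmatic thread-dump idea from the comments above (jvisualvm, Ctrl-Break, "kill -3") has a rough CPython analogue. This is a sketch assuming CPython's `sys._current_frames`; `busy_worker` is a made-up stand-in for real work:

```python
import sys
import threading
import time
import traceback

def dump_all_stacks():
    """Return a thread-dump string without stopping any thread."""
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append(f"--- thread {thread_id} ---\n")
        lines.extend(traceback.format_stack(frame))
    return "".join(lines)

def busy_worker():
    while not done:
        time.sleep(0.01)  # stand-in for real work

done = False
threading.Thread(target=busy_worker, daemon=True).start()
time.sleep(0.05)
dump = dump_all_stacks()  # the program keeps running while we look
done = True
print(dump)  # busy_worker shows up in the worker thread's stack
```

As with "kill -3", nothing pauses: the dump is a snapshot of every live thread's stack, taken on a schedule or on demand.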
P
5

What if the program is in production and being used at the same time by paying clients or colleagues? A profiler allows you to observe without interfering (as much, because of course it will have a little hit too, as per the Heisenberg principle).

Profiling can also give you much richer, more detailed, and more accurate reports. This will be quicker in the long run.

Peta answered 5/11, 2008 at 19:53 Comment(8)
You're confusing the Heisenberg principle with the Observer effect: en.wikipedia.org/wiki/ConflatedDubenko
huh? check en.wikipedia.org/wiki/… wikihow.com/Optimize-Your-Program%27s-PerformanceSchaerbeek
@paul ;) like it @Dubenko no confusion; if you try to measure the effect, your act of measuring/profiling will also distort what is happening (to varying degrees, of course) and so would be different than if you were not profiling. Granted, Werner was talking about rather small particlesPeta
Only if it's hooked into the outside world. Otherwise, stopping it doesn't change its behavior.Schaerbeek
Paul is right. The Heisenberg principle has to do with measuring location versus momentum. The Observer effect has to do with the act of watching something affecting it.Counteract
I like quantum physics too, and you could be right for issues like cache misses. What I nearly always find is just dumb code, usually caused by too many layers of abstraction, and speedups of 40x are common.Schaerbeek
jargon.net/jargonfile/h/heisenbug.htmlLicit
That's a cute concept, but it's a diversion. It simply doesn't apply here.Schaerbeek
S
4

EDIT 2008/11/25: OK, Vineet's response has finally made me see what the issue is here. Better late than never.

Somehow the idea got loose in the land that performance problems are found by measuring performance. That is confusing means with ends. Somehow I avoided this by single-stepping entire programs long ago. I did not berate myself for slowing it down to human speed. I was trying to see if it was doing wrong or unnecessary things. That's how to make software fast - find and remove unnecessary operations.

Nobody has the patience for single-stepping these days, but the next best thing is to pick a number of cycles at random and ask what their reasons are. (That's what the call stack can often tell you.) If a good percentage of them don't have good reasons, you can do something about it.

It's harder these days, what with threading and asynchrony, but that's how I tune software - by finding unnecessary cycles. Not by seeing how fast it is - I do that at the end.


Here's why sampling the call stack cannot give a wrong answer, and why not many samples are needed.

During the interval of interest, when the program is taking more time than you would like, the call stack exists continuously, even when you're not sampling it.

  • If an instruction I is on the call stack for fraction P(I) of that time, removing it from the program, if you could, would save exactly that much. If this isn't obvious, give it a bit of thought.

If the instruction shows up on M = 2 or more samples, out of N, its P(I) is approximately M/N, and is definitely significant.

The only way you can fail to see the instruction is to magically time all your samples for when the instruction is not on the call stack. The simple fact that it is present for a fraction of the time is what exposes it to your probes.

So the process of performance tuning is a simple matter of picking off instructions (mostly function call instructions) that raise their heads by turning up on multiple samples of the call stack. Those are the tall trees in the forest.

Notice that we don't have to care about the call graph, or how long functions take, or how many times they are called, or recursion.

I'm against obfuscation, not against profilers. They give you lots of statistics, but most don't give P(I), and most users don't realize that that's what matters.

You can talk about forests and trees, but for any performance problem that you can fix by modifying code, you need to modify instructions, specifically instructions with high P(I). So you need to know where those are, preferably without playing Sherlock Holmes. Stack sampling tells you exactly where they are.

This technique is harder to employ in multi-threaded, event-driven, or production systems. That's where profilers, if they would report P(I), could really help.
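
The claim that few samples are needed can be checked numerically. A quick simulation (illustrative numbers: an instruction on the stack 30% of the time, 10 samples per session) estimates how often at least two samples catch it:

```python
import random

random.seed(1)

P_I = 0.30        # instruction I is on the call stack 30% of the time
N = 10            # stack samples taken per tuning session
TRIALS = 100_000  # simulated sessions

sessions_seen_twice = 0
for _ in range(TRIALS):
    # each sample independently catches I with probability P_I
    hits = sum(random.random() < P_I for _ in range(N))
    if hits >= 2:
        sessions_seen_twice += 1

print(sessions_seen_twice / TRIALS)  # close to the analytic 1 - .028 - .121 = .851
```

The simulated fraction agrees with the binomial arithmetic in the question: about 85% of 10-sample sessions expose a 30% problem at least twice.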

Schaerbeek answered 5/11, 2008 at 19:49 Comment(7)
"never"??? Man, your experience is nothing like mine. Methinks you're generalizing from a very small data set.Licit
Hardly. Been doing it 30 years. If you've had bad luck with sampling maybe you're not doing it quite right. I've done my level best to explain it: en.wikipedia.org/wiki/…Schaerbeek
Hah! I love your little link to wikipedia - that is something that you obviously wrote... So to defend your position and to lend some credence you point to another url - the contents of which you also wrote. That is amazing.Decontrol
Tim, like most people on this website, I'm just trying to be helpful. Stack sampling is a really useful idea and I'm trying to tell people about it. Ideas are tested by proof, by reason or example, not by "lending credence".Schaerbeek
Sampling works if you do it right. I've seen people take 1 sample, of a 30-level stack. It appears meaningless, so they give up, considering their skepticism justified. You gotta follow the procedure.Schaerbeek
If "takes too long" is still too short for me to be able to break into the problem, then this method fails. I am a game developer. When my game is running at 20 fps instead of 25, halting and reading callstack once will not tell me much. I need to do this many times - i.e. sampling profiler.Blende
Hi Suma. What I do in a case like that is take the code that has to run on each frame and write a loop that runs it flat out, not on a timer. That's what I take samples of in order to make it faster.Schaerbeek
S
4

Stepping through code is great for seeing the nitty-gritty details and troubleshooting algorithms. It's like looking at a tree really up close and following each vein of bark and branch individually.

Profiling lets you see the big picture, and quickly identify trouble points -- like taking a step backwards and looking at the whole forest and noticing the tallest trees. By sorting your function calls by length of execution time, you can quickly identify the areas that are the trouble points.
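
The "sort function calls by execution time" view can be had, for example, from Python's built-in cProfile and pstats modules (`hot_path`, `cold_path`, and `leaf` are made-up stand-ins for real code):

```python
import cProfile
import io
import pstats

def leaf():
    return sum(i * i for i in range(20_000))

def hot_path():            # called with 50x the work: should rank high
    return [leaf() for _ in range(50)]

def cold_path():           # one call's worth of work: should rank low
    return leaf()

profiler = cProfile.Profile()
profiler.enable()
hot_path()
cold_path()
profiler.disable()

out = io.StringIO()
# sort by cumulative time so the "tallest trees" come first
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
report = out.getvalue()
print(report)  # hot_path appears well above cold_path in the ranking
```

This is the forest-level view the answer describes: the sorted report surfaces the costly subtrees before you zoom in on any one of them.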

Schmooze answered 5/11, 2008 at 20:38 Comment(2)
If you sort function calls by (length_of_execution_time TIMES number_of_invocations), then I agree you're getting there. Even so, you may need more context to really understand if a function call could be avoided, and halting gives you that.Schaerbeek
Actually, that's tricky because of recursion. The call-stack-sampling technique does not suffer from confusion about recursion.Schaerbeek
C
3

I've typically used it on real-time programs that were overrunning their timeslice. You can't manually stop and restart code that has to run 60 times every second.

I've also used it to track down the bottleneck in a compiler I had written. You wouldn't want to try to break such a program manually, because you really have no way of knowing if you are breaking at the spot where the bottleneck is, or just at the spot after the bottleneck when the OS is allowed back in to stop it. Also, what if the major bottleneck is something you can't do anything about, but you'd like to get rid of all the other largish bottlenecks in the system? How do you prioritize which bottlenecks to attack first, when you don't have good data on where they all are and what their relative impacts are?

Chemotropism answered 6/11, 2008 at 14:51 Comment(7)
First question: run the code separately in a long loop, and take your time about squeezing it.Schaerbeek
Second question: That's why you take several samples. The bigger each bottleneck is, the more it will stand out. And it doesn't matter what order you tackle them, because each one will make it faster.Schaerbeek
The point is, you don't have to wonder where they are. It pinpoints each one. All you have to do is figure out which ones you can do something about.Schaerbeek
Actually, the term "bottleneck" bothers me, because it gives a misleading image of typical problems. They are more like government waste. The more layers there are, the more likely it's in there somewhere.Schaerbeek
I think you are missing my point. The issue is that a debugger can't always stop a program just anywhere. For instance, it typically can't stop you inside a system call. Instead it will break an instruction or two later, and you will think the problem is there.Chemotropism
Good point. Fortunately it's not a serious problem, because it's no different from a lengthy instruction. If you halt right after "call FileOpen", you're looking at a gold nugget of information. Is the file being opened/closed unnecessarily? Look higher up.Schaerbeek
Where I used to run into difficulty was some 90's version of VC, where if you hit the pause button, it wouldn't bother stopping until the next windows message. Man, that was annoying.Schaerbeek
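
The "run the code separately in a long loop" suggestion from the first comment can be sketched like this (`per_frame_update` is a hypothetical stand-in for the real per-frame code):

```python
import time

def per_frame_update():
    # hypothetical stand-in for the work a game does once per frame
    total = 0
    for i in range(10_000):
        total += i * i
    return total

# Instead of trying to pause a 60 fps loop at exactly the right moment,
# run the frame code flat out; any halt or sample now lands inside it.
start = time.perf_counter()
iterations = 0
while time.perf_counter() - start < 1.0:  # run flat out for about a second
    per_frame_update()
    iterations += 1

print(f"ran per_frame_update {iterations} times in about a second")
```

With the frame code dominating the run, halting (or sampling) at a random moment almost always lands inside it, sidestepping the timing problem the answer raises.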
D
3

The larger your program gets, the more useful a profiler will be. If you need to optimize a program which contains thousands of conditional branches, a profiler can be indispensable. Feed in your largest sample of test data, and when it's done, import the profiling data into Excel. Then you check your assumptions about likely hot spots against the actual data. There are always surprises.

Dig answered 23/11, 2008 at 0:45 Comment(1)
Thanks for your comment. Few people have actually tried this, instead relying on intuition. Profilers are fine for what they do. But if you actually take some samples and study them, you'll be surprised, especially in big programs. I know it's hard to believe.Schaerbeek
