Performance optimization strategies of last resort [closed]

Asked 29/5, 2009 at 14:26 Answered 29/5, 2009 at 14:42

Solved performance optimization language-agnostic

639

There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.

Let's assume:

the code already is working correctly
the algorithms chosen are already optimal for the circumstances of the problem
the code has been measured, and the offending routines have been isolated
all attempts to optimize will also be measured to ensure they do not make matters worse

What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.

Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.

I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.

Haulage answered 29/5, 2009 at 14:27 Comment(0)

457

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobb's article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was manual stack sampling (see the P.S. below). The sequence of changes was as follows:

The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.
Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.
Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.

Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.

That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.
Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.
Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.
Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

Total speedup factor: 43.6
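The "recycle objects" step in the list above can be sketched with a free list. This is a minimal illustration in C, not the code from the article: the `task` type and the `task_alloc` / `task_release` names are hypothetical, and the sketch assumes single-threaded use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical task record; the real program's types are not shown. */
typedef struct task {
    int id;
    struct task *next_free;   /* link used only while the node is on the free list */
} task;

static task *free_list = NULL;   /* recycled nodes waiting for reuse */

/* Take a node from the free list if one is available, else fall back to malloc. */
task *task_alloc(int id)
{
    task *t = free_list;
    if (t != NULL) {
        free_list = t->next_free;
    } else {
        t = malloc(sizeof *t);
        if (t == NULL)
            return NULL;
    }
    t->id = id;
    t->next_free = NULL;
    return t;
}

/* Return a node to the free list instead of calling free(). */
void task_release(task *t)
{
    assert(t != NULL);
    t->next_free = free_list;
    free_list = t;
}
```

Down-sides: recycled memory is never handed back to the allocator, and the list needs locking (or one list per thread) if several threads allocate. Whether it actually beats the platform's malloc has to be measured.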

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.

P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.

I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:

/* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist)){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task);

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

/* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask);
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn);

These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.
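To illustrate why item-by-item appends can cost more than building a list in one pass, here is a small sketch. It assumes a singly linked list with no tail pointer, which is only a guess at how ILST behaved; the `node` type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} node;

/* Append by walking to the end: O(n) per call, O(n^2) to build the list. */
void list_append_slow(node **head, node *item)
{
    assert(item != NULL);
    item->next = NULL;
    while (*head != NULL)
        head = &(*head)->next;
    *head = item;
}

/* Link an array of nodes into a list in one pass: O(n) overall. */
node *list_from_array(node *items, size_t count)
{
    if (count == 0)
        return NULL;
    for (size_t i = 0; i + 1 < count; i++)
        items[i].next = &items[i + 1];
    items[count - 1].next = NULL;
    return &items[0];
}
```

Both functions produce the same list; the second one touches each node once and also keeps the nodes contiguous in memory.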

I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED: The source code, both original and redesigned, can be found in www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.

EDIT 2011/11/26: There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.

Gailgaile answered 29/5, 2009 at 14:27 Comment(13)

I'd love to read some of the details of the steps you outline above. Is it possible to include some fragments of the optimizations for flavour? (without making the post too long?) – Haulage 30/5, 2009 at 12:27

@jerryjvl: I'll see if I can post the sequence of sources (and maybe page images) somewhere. Bear with me. – Gailgaile 30/5, 2009 at 14:4

... I'm not sure if I've got copyright issues with Dr. Dobb's. – Gailgaile 30/5, 2009 at 14:6

... I also wrote a book that's now out of print, so it's going for a ridiculous price on Amazon - "Building Better Applications" ISBN 0442017405. Essentially the same material is in the first chapter. – Gailgaile 30/5, 2009 at 14:9

Amazon.co.uk link: amazon.co.uk/Building-Better-Applications-Efficient-Development/… – Lenity 1/4, 2011 at 16:2

@Thorbjørn: Yeah. Pretty amazing huh? I've only got 2 copies myself (plus 3 copies in Chinese) that I'm hanging onto. – Gailgaile 2/4, 2011 at 16:33

@Mike Dunlavey, Considered allowing Google to scan it in? – Lenity 2/4, 2011 at 17:52

@Thorbjørn: I guess I don't know how to do that, also I don't know if it's allowed, especially since the publisher has been bought out a couple times over. I have scanned it myself and sent it to a couple interested people, but in jpg form it's a bit massive to email. – Gailgaile 2/4, 2011 at 18:18

@Mike Dunlavey, I would suggest telling Google you have it scanned in already. They probably already have an agreement with whoever bought your publisher. – Lenity 2/4, 2011 at 18:36

@Thorbjørn: Just to follow up, I did hook up with GoogleBooks, filled out all forms, and sent them a hard copy. I got an email back asking if I really really owned the copyright. The publisher was Van Nostrand Reinhold, which was bought by International Thomson, which was bought by Reuters, and when I try to call or email them it's like a black hole. So it's in limbo - I haven't yet had the energy to really chase it down. – Gailgaile 4/9, 2011 at 22:8

@Demetri: Try this: sourceforge.net/projects/randompausedemo – Gailgaile 25/9, 2013 at 12:47

Google Books link: books.google.dk/books?id=8A43E1UFs_YC – Lenity 16/7, 2014 at 11:14

@Thorbjørn: Thanks for noticing that, and I hope you are doing well. It still only shows the first chunk of the book. I kinda gave up trying to figure out who owns the copyright and posting the whole thing. Sometimes I email it to people who ask. – Gailgaile 16/7, 2014 at 12:23

196

Suggestions:

Pre-compute rather than re-calculate: for any loops or repeated calls containing calculations with a relatively limited range of inputs, consider making a lookup (array or dictionary) that contains the result of that calculation for all values in the valid range of inputs. Then use a simple lookup inside the algorithm instead.
Down-sides: if few of the pre-computed values are actually used this may make matters worse, also the lookup may take significant memory.
Don't use library methods: most libraries need to be written to operate correctly under a broad range of scenarios, and perform null checks on parameters, etc. By re-implementing a method you may be able to strip out a lot of logic that does not apply in the exact circumstance you are using it.
Down-sides: writing additional code means more surface area for bugs.
Do use library methods: to contradict myself, language libraries get written by people that are a lot smarter than you or me; odds are they did it better and faster. Do not implement it yourself unless you can actually make it faster (i.e.: always measure!)
Cheat: in some cases although an exact calculation may exist for your problem, you may not need 'exact', sometimes an approximation may be 'good enough' and a lot faster in the deal. Ask yourself, does it really matter if the answer is out by 1%? 5%? even 10%?
Down-sides: Well... the answer won't be exact.
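As a minimal sketch of the pre-compute suggestion, the following C code builds a 256-entry table once and then counts set bits with one lookup per byte. The function and table names are made up for illustration, and counting bits is just a convenient stand-in for any calculation with a small input range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t bit_count_table[256];
static int table_ready = 0;

/* Fill the table once; each entry holds the number of set bits in its index. */
void bit_count_init(void)
{
    for (int i = 0; i < 256; i++) {
        int n = 0;
        for (int v = i; v != 0; v >>= 1)
            n += v & 1;
        bit_count_table[i] = (uint8_t)n;
    }
    table_ready = 1;
}

/* Count set bits in a buffer with one table lookup per byte. */
size_t bit_count(const uint8_t *buf, size_t len)
{
    assert(table_ready);
    size_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += bit_count_table[buf[i]];
    return total;
}
```

A table this small fits comfortably in cache. Larger tables may not, and many CPUs have a dedicated population-count instruction that could beat the lookup, so the usual rule applies: measure both.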

Haulage answered 29/5, 2009 at 14:38 Comment(12)

Precomputation doesn't always help, and it can even hurt sometimes -- if your lookup table is too big, it can kill your cache performance. – Recent 29/5, 2009 at 14:42

Cheating can often be the win. I had a color correction process that at the core was a 3-vector dotted with a 3x3 matrix. The CPU had a matrix multiply in hardware that left out some of the cross terms and went real fast compared to all the other ways to do it, but only supported 4x4 matrices and 4-vectors of floats. Changing the code to carry around the extra empty slot and converting the calculation to floating point from fixed point allowed for a slightly less-accurate but much faster result. – Rafiq 30/5, 2009 at 2:19

+ Yes, these are things that can be done when you have found a problem. The part I would emphasize is the "finding" skill. Most folks would say "All you gotta do is profile / measure", but to me, that is not enough. What's more, the repetition is key. – Gailgaile 1/6, 2009 at 20:24

@RBerteig, out of curiosity - which CPU was this? – Lenity 4/9, 2011 at 22:37

@Thorbjørn, it was an Hitachi SH4 family member. I'd have to dig into a dusty archive box to identify the specific part number. At least it did have hardware floating point support, so switching from fixed point to floating point didn't introduce any other huge burdens. – Rafiq 6/9, 2011 at 18:59

@Rafiq how was this cheating? in using floating point? That's not cheating, floating point is more accurate than integer for data that was originally analog (like the color correction algorithm probably was fed with). What's really cheating is to use integers to represent analog data! – Duffie 28/9, 2011 at 0:6

The cheating was in using a matrix multiply that left out some of the inner products, making it possible to implement in microcode for a single CPU instruction that completed faster than even the equivalent sequence of individual instructions could. It's a cheat because it doesn't get the "correct" answer, just an answer that is "correct enough". – Rafiq 28/9, 2011 at 8:19

@RBerteig: just "correct enough" is an opportunity for optimisation that most people miss in my experience. – Kareem 23/2, 2013 at 15:15

Add this: Never assume your algorithm is optimal unless you can prove it. Frequently, it is possible to use cheating/application specific knowledge to drop your algorithm into a lower complexity class. Would you have known, for instance, that there are cases where sorting can be done in O(n) instead of O(n log n)? Ponder this on an extended walk / traffic jam / sleepless night / whatever helps you to have bright ideas... – Dinerman 28/6, 2013 at 19:13

You cannot always assume that everybody is more intelligent than you. At the end we are all professionals. You can assume, however, that a specific library that you use exists and has reached your environment because of its quality, therefore the writing of this library must be very thorough; you can't do it as well only because you are not specialized in that field, and you don't invest the same kind of time in it. Not because you are less smart. Come on. – Macarthur 31/7, 2014 at 6:35

A good example of "cheating" is to sometimes use faster methods for unintended purposes than slower methods for intended purposes. For example, if you are simulating a "drug trip" in a video game, instead of adding "rand()" to every vertex in the renderer, you could instead use en.wikipedia.org/wiki/Fast_inverse_square_root with single precision floating point decimal everywhere, resulting in unprecedentedly more performant blurriness. – Marcelinomarcell 12/2, 2018 at 23:22

Concerning the approximate solution: This is especially useful / elaborated out for NP-complete problems (or even for #P-complete problems). Not to mention the keyword "Randomized Algorithms", which are fast on average (as opposed to the usual worst-case-analysis). – Turnabout 21/9, 2023 at 16:3

176

When you can't improve the performance any more - see if you can improve the perceived performance instead.

You may not be able to make your fooCalc algorithm faster, but often there are ways to make your application seem more responsive to the user.

A few examples:

Anticipating what the user is going to request and starting work on it before they ask
Displaying results as they come in, instead of all at once at the end
Showing an accurate progress meter

These won't make your program faster, but they might make your users happier with the speed you have.

Fireworks answered 29/5, 2009 at 14:27 Comment(2)

A progress bar speeding up at the end may be perceived as faster than an absolutely accurate one. In "Rethinking the Progress Bar" (2007), Harrison, Amento, Kuznetsov and Bell test multiple types of bars on a group of users and discuss some ways to rearrange the operations so that the progress may be perceived as faster. – Cygnus 26/6, 2012 at 11:38

naxa, most progress bars are fake because predicting multiple widely differing steps of a flow into a single percentage is hard or sometimes impossible. Just look at all those bars that get stuck at 99% :-( – Cygnus 18/6, 2013 at 20:38

147

I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:

Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.
Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
Sequential floating-point ops. Make these SIMD.
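Two of the items above, the aliasing promise and the branch-to-select rewrite, can be sketched in portable C99. This is an illustration under assumptions rather than a guaranteed win: the function names are hypothetical, `restrict` is the standard C99 spelling of `__restrict`, and whether the compiler really emits a conditional move or a max instruction depends on the target and flags.

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises the compiler that dst and src never overlap,
   so it need not reload src after each store to dst. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}

/* Clamp negatives to zero with a select expression instead of an if/else.
   Compilers commonly lower this to a conditional move or a max
   instruction, but that is not guaranteed. */
void clamp_to_zero(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x = data[i];
        data[i] = (x > 0.0f) ? x : 0.0f;
    }
}
```

Breaking the `restrict` promise by passing overlapping buffers is undefined behaviour, so it is only safe where the call sites are under your control.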

And one more thing I like to do:

Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.

Gerstner answered 29/5, 2009 at 14:27 Comment(16)

What profiler(s) do you use? And are any of them applicable to C# .NET development? I'd love to try some of these myself. – Haulage 30/5, 2009 at 0:32

I mostly use Intel VTune and PIX. No idea if they can adapt to C#, but really once you've got that JIT abstraction layer most of these optimizations are beyond your reach, except for improving cache locality and maybe avoiding some branches. – Gerstner 31/5, 2009 at 0:33

Even so, checking on the post-JIT output may help figure out if there are any constructs that just do not optimize well through the JIT stage... investigation can never hurt, even if turns out a dead end. – Haulage 31/5, 2009 at 14:12

I think many people, including myself, would be interested in this "wtf assembly" produced by gcc. Yours sounds like a very interesting job :) – Peng 28/4, 2011 at 21:12

Examples on the PowerPC ... <-- That is, some implementations of PowerPC. PowerPC is an ISA, not a CPU. – Augmentation 20/2, 2013 at 19:41

@BillyONeal True, but until someone manufactures a PPC implementation that doesn't have a microcoded imul and idiv, the distinction is sort of academic. – Gerstner 21/2, 2013 at 8:48

@Crashworks: A G5 or a full blown Cell/POWER6/POWER7 (the server version) require microcode for a simple multiply? I can see divide being microcoded but I find multiply unusual. – Augmentation 21/2, 2013 at 17:49

@BillyONeal For integer multiply, yes, on all those cores. I think it's probably because of the 128-bit precision (64 bits times 64 bits needs a high and low word)... it might be more precise to say that it's nonpipelined rather than microcoded since I don't know the internal circuitry. Floating point math is pipelined on those implementations, although the pipe is quite deep so you can't actually read the result for many cycles without stalling. I spent over a year of my life fretting over the exact cycle count for Cell opcodes and manually scheduling around the pipeline. – Gerstner 21/2, 2013 at 22:13

@Crashworks: :sigh: wow, that's insane. I can't believe there are still CPUs in common use that require microcoded instructions for every array access. – Augmentation 28/2, 2013 at 18:48

@BillyONeal Even on modern x86 hardware, imul can stall the pipeline; see "Intel® 64 and IA-32 Architectures Optimization Reference Manual" §13.3.2.3: "Integer multiply instruction takes several cycles to execute. They are pipelined such that an integer multiply instruction and another long-latency instruction can make forward progress in the execution phase. However, integer multiply instructions will block other single-cycle integer instructions from issuing due to requirement of program order." That's why it's usually better to use word-aligned array sizes and lea. – Gerstner 1/3, 2013 at 9:47

@Crashworks: I could see it having higher latency than, say, addition. But 40-50 cycle and/or pipeline stalls are kind of extreme for that. (Particularly given that lea is a multiply instruction and is frequently abused for this purpose) – Augmentation 1/3, 2013 at 18:2

@Crashworks: That Intel ref manual must be old (or not updated). On current Intel designs, imul r64 is 2 uops, 3 cycle latency, and is fully pipelined (can issue 1 per cycle, on port1). On P4, imul r32 is 16 cycle latency, recip throughput one per 8 cycles. lea is 1 or 3 cycle latency, (simple or complex/rip-relative addressing mode). lea is often really useful as a non-destructive shift+add+add_offset. Often a good move is to increment your loop counter by 8 or 16, instead of having to multiply or use a complex addressing mode. – Larianna 2/7, 2015 at 15:16

Load-hit-store is also a PowerPC-specific problem. Or at least it doesn't affect x86 at all. FP store / integer reload is no slower than FP store / FP reload, and the other direction is fine too. (FP includes SSE/AVX). On typical Intel CPUs, it's about 6 cycle store-forwarding latency. (Of course, there are also ALU instructions to move data directly between vector and integer registers, with lower latency but potentially worse throughput for getting all the elements of a vector.) – Larianna 1/11, 2017 at 16:43

@PeterCordes Moving data between SSE registers and integer registers used to be abysmally slow because of an LHS (hidden inside the architecture). Maybe that's improved since 2013, I haven't timed it recently. – Gerstner 11/1, 2018 at 20:6

@Crashworks: Are you talking about AMD CPUs? integer<->xmm registers isn't great on AMD, but it's just a medium-latency ALU instruction at worst, not a mis-speculation that requires a rollback or anything. It's always been single-cycle on Intel for movd eax, xmm0 or movd xmm0, eax. On AMD Jaguar, those are 4 and 6 cycle latency respectively, with one per clock throughput, according to agner.org/optimize. They're worse on Bulldozer/Piledriver, like 8 and 10c latency, but still good throughput and doesn't stall the pipeline or cause mis-speculation. So it's not a "load hit store" – Larianna 11/1, 2018 at 20:13

Bulldozer has slow-ish store-forwarding normally, even integer store / integer reload is 8c round trip vs. 5c on Intel. (Or even 4c with simple addressing modes on Skylake.) – Larianna 11/1, 2018 at 20:16

Throw more hardware at it!

Pejoration answered 29/5, 2009 at 14:32 Comment(13)

My thoughts exactly. When you start talking about "last few percents" when there is nothing left, the clearly cheapest way of making it run faster is faster hardware rather than spending a ton of programmer time on squeezing those last percents out of it. – Patin 29/5, 2009 at 14:35

+1 an answer that especially managers seem to overlook too often ;) – Haulage 29/5, 2009 at 14:39

Although I want to encourage more software-oriented answers as well, because sometimes you may already have developers with spare time on their hands, and no money to invest in hardware... or sometimes you may already have the fastest available hardware and still not enough performance. Especially considering that 'faster hardware' is looking less and less likely to be a never-ending path, and not everything parallelises. – Haulage 29/5, 2009 at 14:41

more hardware isn't always an option when you have software that is expected to run on hardware already out in the field. – Nitrite 29/5, 2009 at 21:14

Not a very helpful answer to someone making consumer software: the customer isn't going to want to hear you say, "buy a faster computer." Especially if you're writing software to target something like a video game console. – Gerstner 29/5, 2009 at 22:7

If you believe Moore's Law runs at x2/18 months, that's a compounded rate of ~0.13%/day (and it works weekends and doesn't take holidays :^). Busting a gut on last-few-percent optimization has to be weighed against what you can get for free by just waiting. If you take a long enough view, this applies just as much in the console/consumer/embedded space (the performance graphs are just more quantized), and you can still make a case for moving on to the next thing rather than investing time on something with small and diminishing returns. – Plumage 29/5, 2009 at 23:14

@Crashworks, or for that matter, an embedded system. When the last feature is finally in and the first batch of boards are already spun is not the moment to discover that you should have used a faster CPU in the first place... – Rafiq 30/5, 2009 at 2:15

Technically, that's the performance strategy of first resort. – Asynchronism 19/6, 2009 at 16:54

I once had to debug a program that had a huge memory leak -- its VM size grew by about 1Mb per hour. A colleague joked that all I needed to do was add memory at a constant rate. :) – Piceous 27/6, 2010 at 14:12

More hardware: ah yes the mediocre developer's lifeline. I don't know how many times I've heard "add another machine and double the capacity!" – Sarraceniaceous 31/3, 2011 at 22:59

All depends on how much of a problem it is for the user, and whether or not you can create more value for the users by spending the same amount of time in creating new features. Some programmers are willing to spend 6 months shaving off a couple seconds on a little used feature, but refuse to spend 5 minutes to implement a shortcut that would save several minutes of work every day for the users. All in the name of performance. And no, it's not a rhetorical example – Varico 11/4, 2011 at 12:51

This answer actually made me laugh :D – Magpie 6/11, 2013 at 14:39

I just optimized some code that ran ~11 hours and couldn't be fully completed because the 64GB RAM was not enough. The newer hardware (128GB RAM and faster CPU/disks) took about 6.5 hours for the same data, which was obviously better. After my optimizations the OLD machine took 1:40 minutes and could complete the whole data with the 64GB RAM, while the new machine is now down to 1:15 minutes total time. So yes, adding hardware can help, but proper optimization that includes the whole production process can drastically improve utilization. And such servers are not exactly cheap either. – Zygoma 4/2, 2016 at 11:25

More suggestions:

Avoid I/O: Any I/O (disk, network, ports, etc.) is always going to be far slower than any code that is performing calculations, so get rid of any I/O that you do not strictly need.
Move I/O up-front: Load up all the data you are going to need for a calculation up-front, so that you do not have repeated I/O waits within the core of a critical algorithm (and maybe as a result repeated disk seeks, when loading all the data in one hit may avoid seeking).
Delay I/O: Do not write out your results until the calculation is over, store them in a data structure and then dump that out in one go at the end when the hard work is done.
Threaded I/O: For those daring enough, combine 'I/O up-front' or 'Delay I/O' with the actual calculation by moving the loading into a parallel thread, so that while you are loading more data you can work on a calculation on the data you already have, or while you calculate the next batch of data you can simultaneously write out the results from the last batch.
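The "move I/O up-front" and "delay I/O" items can be combined in a few lines of C. The sketch below is an assumption-laden illustration (the function names are invented, and summing bytes stands in for the real calculation): it reads the whole input in large chunks, computes purely in memory, and writes the result once at the end.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the rest of an open stream into one heap buffer.
   Returns NULL on failure; the caller frees the buffer. */
unsigned char *read_all(FILE *in, size_t *out_len)
{
    assert(in != NULL && out_len != NULL);
    size_t cap = 1 << 16, len = 0;
    unsigned char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (;;) {
        len += fread(buf + len, 1, cap - len, in);
        if (len < cap)
            break;               /* short read: end of file or error */
        unsigned char *bigger = realloc(buf, cap * 2);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }
    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}

/* Compute over the in-memory data with no I/O inside the loop,
   then emit the result with a single write at the end. */
int sum_bytes_to(FILE *in, FILE *out)
{
    size_t len;
    unsigned char *data = read_all(in, &len);
    if (data == NULL)
        return -1;
    unsigned long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    free(data);
    return fprintf(out, "%lu\n", sum) < 0 ? -1 : 0;
}
```

Down-side: the whole input has to fit in memory. For inputs that do not, the same idea applies per large batch rather than per file.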

Forbes answered 29/5, 2009 at 14:27 Comment(4)

Note that "moving the IO to a parallel thread" should be done as asynchronous IO on many platforms (e.g. Windows NT). – Augmentation 28/2, 2013 at 18:49

I/O is indeed a critical point, because it is slow and has huge latencies, and you can get faster with this advice, but it's still fundamentally flawed: The points are the latency (which has to be hidden) and the syscall overhead (which has to be reduced by reducing the number of I/O calls). Best advice is: use mmap() for input, do appropriate madvise() calls and use aio_write() to write large chunks of output (= a few MiB). – Dinerman 28/6, 2013 at 19:27

This last option is fairly easy to implement in Java, especially. It gave HUGE performance increases for applications I've written. Another important point (more than moving I/O upfront) is to make it SEQUENTIAL and large-block I/O. Lots of small reads is far more expensive than 1 big one, due to disk seek time. – James 26/8, 2013 at 18:7

At one point I cheated in avoiding I/O, by just temporarily moving all the files to a RAM disk before the computation and moving them back afterwards. This is dirty, but might be useful in situation where you do not control the logic that makes the I/O calls. – Eurydice 29/4, 2019 at 14:15

Since many of the performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.

Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record by record processing. This includes not reusing a single record stored procedure when you want to insert 1,000,000 records at once.

Never use select *, only return the fields you actually need. This is especially true if there are any joins as the join fields will be repeated and thus cause unnecessary load on both the server and the network.

Avoid the use of correlated subqueries. Use joins (including joins to derived tables where possible) (I know this is true for Microsoft SQL Server, but test the advice when using a different backend).

Index, index, index. And get those stats updated if applicable to your database.

Make the query sargable. That means avoiding things which make it impossible to use the indexes, such as using a wildcard in the first character of a like clause, or a function in the join or as the left part of a where statement.

Use correct data types. It is faster to do date math on a date field than to have to try to convert a string datatype to a date datatype, then do the calculation.

Never put a loop of any kind into a trigger!

Most databases have a way to check how the query execution will be done. In Microsoft SQL Server this is called an execution plan. Check those first to see where problem areas lie.

Consider how often the query runs as well as how long it takes to run when determining what needs to be optimized. Sometimes you can gain more performance from a slight tweak to a query that runs millions of times a day than you can from wiping time off a long-running query that only runs once a month.

Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time in the past where we couldn't figure out why the page was so slow to load when the stored procedure was fast and found out through profiling that the webpage was asking for the query many many times instead of once.

The profiler will also help you find who is blocking whom. Some queries that execute quickly when running alone may become really slow due to locks from other queries.

Volding answered 29/5, 2009 at 14:27 Comment(0)

The single most important limiting factor today is limited memory bandwidth. Multicores are just making this worse, as the bandwidth is shared between cores. The limited chip area devoted to implementing caches is also divided among the cores and threads, worsening the problem even more. Finally, the inter-chip signalling needed to keep the different caches coherent also increases with the number of cores, which adds a further penalty.

These are the effects that you need to manage. Sometimes through micro managing your code, but sometimes through careful consideration and refactoring.

A lot of comments already mention cache friendly code. There are at least two distinct flavors of this:

Avoid memory fetch latencies.
Lower memory bus pressure (bandwidth).

The first problem specifically has to do with making your data access patterns more regular, allowing the hardware prefetcher to work efficiently. Avoid dynamic memory allocation which spreads your data objects around in memory. Use linear containers instead of linked lists, hashes and trees.

The second problem has to do with improving data reuse. Alter your algorithms to work on subsets of your data that do fit in available cache, and reuse that data as much as possible while it is still in the cache.

Packing data tighter and making sure you use all data in cache lines in the hot loops, will help avoid these other effects, and allow fitting more useful data in the cache.
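
As a rough illustration of the "work on subsets that fit in cache" idea, here is a minimal sketch of a tiled matrix transpose next to the naive one. The function names and the tile size of 32 are my own assumptions for the example, not something from the answer above; the right tile size depends on your cache and has to be measured.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive transpose: reads src row by row but writes dst with a stride of n,
// so nearly every write touches a different cache line.
void transpose_naive(const std::vector<float>& src, std::vector<float>& dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

// Assumed starting point for the tile edge; tune it for the target cache.
constexpr std::size_t kTile = 32;

// Tiled transpose: works on kTile x kTile blocks so the cache lines touched
// in both src and dst are reused while they are still resident.
void transpose_tiled(const std::vector<float>& src, std::vector<float>& dst, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += kTile)
        for (std::size_t jj = 0; jj < n; jj += kTile) {
            const std::size_t i_end = std::min(ii + kTile, n);
            const std::size_t j_end = std::min(jj + kTile, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    dst[j * n + i] = src[i * n + j];
        }
}
```

Both functions produce the same result; only the order in which memory is touched differs. Whether the tiled version wins depends on the matrix size relative to the cache, so time both.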

Highclass answered 29/5, 2009 at 14:27 Comment(0)

What hardware are you running on? Can you use platform-specific optimizations (like vectorization)?
Can you get a better compiler? E.g. switch from GCC to Intel?
Can you make your algorithm run in parallel?
Can you reduce cache misses by reorganizing data?
Can you disable asserts?
Micro-optimize for your compiler and platform. In the style of, "at an if/else, put the most common statement first"

Nefertiti answered 29/5, 2009 at 14:42 Comment(4)

Should be "switch from GCC to LLVM" :) – Truthful 30/5, 2009 at 15:43

Can you make your algorithm run in parallel? -- the inverse also applies – Ineffective 29/4, 2011 at 16:12

True that, reducing amount of threads can be an equally good optimization – Nefertiti 30/4, 2011 at 8:16

re: micro-optimizing: if you check the compiler's asm output, you can often tweak the source to hand-hold it into producing better asm. See Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? for more about helping or beating the compiler on modern x86. – Larianna 1/11, 2017 at 17:5

Although I like Mike Dunlavey's answer (it is a great answer indeed, with a supporting example), I think it could be expressed very simply thus:

Find out what takes the largest amounts of time first, and understand why.

It is the identification process of the time hogs that helps you understand where you must refine your algorithm. This is the only all-encompassing, language-agnostic answer I can find to a problem that's already supposed to be fully optimised, presuming you also want to be architecture independent in your quest for speed.

So while the algorithm may be optimised, the implementation of it may not be. The identification allows you to know which part is which: algorithm or implementation. So whichever hogs the time the most is your prime candidate for review. But since you say you want to squeeze the last few % out, you might want to also examine the lesser parts, the parts that you have not examined that closely at first.

Lastly a bit of trial and error with performance figures on different ways to implement the same solution, or potentially different algorithms, can bring insights that help identify time wasters and time savers.

HPH, asoudmove.

Corabella answered 29/5, 2009 at 14:27 Comment(0)

You should probably consider the "Google perspective", i.e. determine how your application can become largely parallelized and concurrent, which will inevitably also mean at some point to look into distributing your application across different machines and networks, so that it can ideally scale almost linearly with the hardware that you throw at it.

On the other hand, the Google folks are also known for throwing lots of manpower and resources at solving some of the issues in projects, tools and infrastructure they are using, such as for example whole program optimization for gcc by having a dedicated team of engineers hacking gcc internals in order to prepare it for Google-typical use case scenarios.

Similarly, profiling an application no longer means to simply profile the program code, but also all its surrounding systems and infrastructure (think networks, switches, server, RAID arrays) in order to identify redundancies and optimization potential from a system's point of view.

Filipe answered 29/5, 2009 at 14:27 Comment(0)

Inline routines (eliminate call/return and parameter pushing)
Try eliminating tests/switches with table look ups (if they're faster)
Unroll loops (Duff's device) to the point where they just fit in the CPU cache
Localize memory access so as not to blow your cache
Localize related calculations if the optimizer isn't already doing that
Eliminate loop invariants if the optimizer isn't already doing that
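
To illustrate the "replace tests with table look-ups" item, here is a small sketch that turns a chain of range tests into a single indexed load. The hex-digit example and all names in it are assumptions picked for illustration; as the item says, it only pays off if the table version actually measures faster.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Branchy version: up to three range tests per character.
int hex_value_branchy(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;  // not a hex digit
}

// Builds the 256-entry table once by running the branchy version over
// every possible byte value.
std::array<std::int8_t, 256> make_hex_table() {
    std::array<std::int8_t, 256> t{};
    for (int c = 0; c < 256; ++c)
        t[static_cast<std::size_t>(c)] =
            static_cast<std::int8_t>(hex_value_branchy(static_cast<unsigned char>(c)));
    return t;
}

// Table version: each call is a single load from a 256-byte table.
int hex_value_table(unsigned char c) {
    static const std::array<std::int8_t, 256> table = make_hex_table();
    return table[c];
}
```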

Shipper answered 29/5, 2009 at 14:27 Comment(1)

IIRC Duff's device is very rarely faster. Only when the op is very short (like a single small math expression) – Wernsman 17/6, 2009 at 20:23

Divide and conquer

If the dataset being processed is too large, loop over chunks of it. If you've done your code right, implementation should be easy. If you have a monolithic program, now you know better.

Tennant answered 29/5, 2009 at 14:27 Comment(1)

+1 for the flyswatter "smack" sound I heard while reading the last sentence. – Anderegg 27/9, 2011 at 15:50

When you get to the point that you're using efficient algorithms, it's a question of what you need more: speed or memory. Use caching to "pay" in memory for more speed, or use calculations to reduce the memory footprint.
If possible (and more cost effective), throw hardware at the problem - a faster CPU, more memory or a faster HD could solve the problem sooner than trying to code around it.
Use parallelization if possible - run part of the code on multiple threads.
Use the right tool for the job. Some programming languages create more efficient code: managed code (i.e. Java/.NET) speeds up development, but native programming languages create faster-running code.
Micro optimize. Only where applicable, you can use optimized assembly to speed up small pieces of code; using SSE/vector optimizations in the right places can greatly increase performance.

Myrta answered 29/5, 2009 at 14:27 Comment(0)

First of all, as mentioned in several prior answers, learn what bites your performance - is it memory or processor or network or database or something else. Depending on that...

...if it's memory - find one of the books written a long time ago by Knuth, from "The Art of Computer Programming" series. Most likely it's the one about sorting and searching - if my memory is wrong then you'll have to find out in which volume he talks about how to deal with slow tape data storage. Mentally transform his memory/tape pair into your pair of cache/main memory (or L1/L2 cache) respectively. Study all the tricks he describes - if you don't find something that solves your problem, then hire a professional computer scientist to conduct professional research. If your memory issue is by chance with FFT (cache misses at bit-reversed indexes when doing radix-2 butterflies) then don't hire a scientist - instead, manually optimize the passes one by one until you either win or reach a dead end. You mentioned squeezing out up to the last few percent, right? If it's few indeed, you'll most likely win.
...if it's processor - switch to assembly language. Study the processor specification - what takes ticks, VLIW, SIMD. Function calls are most likely replaceable tick-eaters. Learn loop transformations - pipeline, unroll. Multiplies and divisions might be replaceable / interpolated with bit shifts (multiplies by small integers might be replaceable with additions). Try tricks with shorter data - if you're lucky, one instruction on 64 bits might turn out to be replaceable with two on 32, or even 4 on 16 or 8 on 8 bits, go figure. Try also longer data - e.g. your float calculations might turn out slower than double ones on a particular processor. If you have trigonometric stuff, fight it with pre-calculated tables; also keep in mind that the sine of a small value might be replaced with that value if the loss of precision is within allowed limits.
...if it's network - think of compressing data you pass over it. Replace XML transfer with binary. Study protocols. Try UDP instead of TCP if you can somehow handle data loss.
...if it's database, well, go to any database forum and ask for advice. In-memory data-grid, optimizing query plan etc etc etc.
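
For the pre-calculated trig tables mentioned in the processor case, a minimal sketch could look like the following. The table size of 1024 and the names are assumptions for the example; the lookup trades accuracy (about 0.003 worst-case error at this size) for speed, so it only applies where that loss is acceptable.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kSineTableSize = 1024;  // assumed size; bigger is more accurate
constexpr double kTwoPi = 6.283185307179586;

// Fills the table once with one full period of sine.
std::array<float, kSineTableSize> make_sine_table() {
    std::array<float, kSineTableSize> t{};
    for (std::size_t i = 0; i < kSineTableSize; ++i)
        t[i] = static_cast<float>(std::sin(kTwoPi * static_cast<double>(i) / kSineTableSize));
    return t;
}

// Nearest-entry lookup for any finite angle in radians.
// Worst-case error is roughly pi / kSineTableSize.
float fast_sin(float radians) {
    static const std::array<float, kSineTableSize> table = make_sine_table();
    const double turns = radians / kTwoPi;
    const double frac = turns - std::floor(turns);  // position within one period
    const std::size_t idx =
        static_cast<std::size_t>(frac * kSineTableSize + 0.5) % kSineTableSize;
    return table[idx];
}
```

On hardware with a fast sine instruction or a good math library this may not be a win at all, so measure against `std::sin` before committing to it.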

HTH :)

Oxalate answered 29/5, 2009 at 14:27 Comment(0)

Caching! A cheap way (in programmer effort) to make almost anything faster is to add a caching abstraction layer to any data movement area of your program. Be it I/O or just passing/creation of objects or structures. Often it's easy to add caches to factory classes and reader/writers.

Sometimes the cache will not gain you much, but it's an easy method to just add caching all over and then disable it where it doesn't help. I've often found this to gain huge performance without having to micro-analyse the code.
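
A caching layer of this kind can be sketched in a few lines. The class below is a hypothetical example (the name `CachedLookup` and its interface are mine), and it assumes the wrapped function is pure; it has no eviction and is not thread safe. The hit and miss counters are there so you can see where the cache is not helping and switch it off.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Wraps an expensive, pure function of one integer key with a memoizing cache.
class CachedLookup {
public:
    explicit CachedLookup(std::function<std::uint64_t(std::uint64_t)> compute)
        : compute_(std::move(compute)) {}

    std::uint64_t get(std::uint64_t key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            ++hits_;
            return it->second;  // served from memory, no recomputation
        }
        ++misses_;
        const std::uint64_t value = compute_(key);
        cache_.emplace(key, value);
        return value;
    }

    // Counters to judge whether the cache earns its memory.
    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::function<std::uint64_t(std::uint64_t)> compute_;
    std::unordered_map<std::uint64_t, std::uint64_t> cache_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```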

Circumferential answered 29/5, 2009 at 14:27 Comment(0)

I've spent some time working on optimising client/server business systems operating over low-bandwidth and long-latency networks (e.g. satellite, remote, offshore), and been able to achieve some dramatic performance improvements with a fairly repeatable process.

Measure: Start by understanding the network's underlying capacity and topology. Talk to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location, during typical operational periods. Next, take accurate time measurements of specific end user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system.
Analyze: Use any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (e.g. closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level.
Cache: Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed - many common caching methods (e.g. HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach.
Parallelise: Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip.
Combine: Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities.
Compress: Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (i.e. within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss and therefore have an outsize impact on performance.
Repeat: Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, record and report your results. As with all optimisation, some problems may have been solved exposing others that now dominate.

In the steps above, I focus on the application related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyze step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the application's data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.

Knack answered 29/5, 2009 at 14:27 Comment(0)

I think this has already been said in a different way. But when you're dealing with a processor intensive algorithm, you should simplify everything inside the most inner loop at the expense of everything else.

That may seem obvious to some, but it's something I try to focus on regardless of the language I'm working with. If you're dealing with nested loops, for example, and you find an opportunity to take some code down a level, you can in some cases drastically speed up your code. As another example, there are the little things to think about like working with integers instead of floating point variables whenever you can, and using multiplication instead of division whenever you can. Again, these are things that should be considered for your most inner loop.

Sometimes you may find benefit of performing your math operations on an integer inside the inner loop, and then scaling it down to a floating point variable you can work with afterwards. That's an example of sacrificing speed in one section to improve the speed in another, but in some cases the pay off can be well worth it.
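
Here is a small sketch of that integer-inside, floating-point-outside pattern. The scenario (averaging 8-bit samples into a 0..1 level) and the function names are assumptions for the example only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Floating-point version: converts and divides on every iteration.
float mean_level_float(const std::vector<std::uint8_t>& samples) {
    float sum = 0.0f;
    for (std::uint8_t s : samples)
        sum += static_cast<float>(s) / 255.0f;
    return samples.empty() ? 0.0f : sum / static_cast<float>(samples.size());
}

// Integer version: the inner loop is a plain integer add; the conversion to
// floating point and the single division happen once, after the loop.
float mean_level_int(const std::vector<std::uint8_t>& samples) {
    std::uint64_t sum = 0;
    for (std::uint8_t s : samples)
        sum += s;
    if (samples.empty()) return 0.0f;
    return static_cast<float>(static_cast<double>(sum) /
                              (255.0 * static_cast<double>(samples.size())));
}
```

A side effect worth noting: the integer version does not accumulate rounding error in the loop, so the two results can differ slightly in the last digits.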

Cerebritis answered 29/5, 2009 at 14:27 Comment(0)

Not nearly as in depth or complex as previous answers, but here goes: (these are more beginner/intermediate level)

obvious: dry
run loops backwards so you're always comparing to 0 rather than a variable
use bitwise operators whenever you can
break repetitive code into modules/functions
cache objects
local variables have slight performance advantage
limit string manipulation as much as possible

Florettaflorette answered 29/5, 2009 at 14:27 Comment(3)

About looping backwards: yes, the comparison for loop end will be faster. Typically you use the variable to index into memory though, and accessing it reversed may be counter productive due to frequent cache misses (no prefetch). – Och 10/7, 2013 at 21:30

AFAIK, in most cases, any reasonable optimiser will do just fine with loops, without the programmer having to explicitly run in reverse. Either the optimiser will reverse the loop itself, or it has another way that's equally good. I've noted identical ASM output for (admittedly relatively simple) loops written both ascending vs max and descending vs 0. Sure, my Z80 days have me in the habit of reflexively writing backwards loops, but I suspect mentioning it to newbies is usually a red herring/premature optimisation, when readable code & learning more important practices should be priorities. – Haste 21/2, 2016 at 23:29

On the contrary, running a loop backwards will be slower in lower level languages because in a war between comparison to zero plus additional subtraction vs a single integer comparison, the single integer comparison is faster. Instead of decrementing, you can have a pointer to the start address in memory and a pointer to the end address in memory. Then, increment the start pointer till it is equal to the end pointer. This will eliminate the extra memory offset operation in the assembly code, thus proving much more performant. – Marcelinomarcell 12/2, 2018 at 23:52

Did you know that a CAT6 cable is capable of 10x better shielding against external interference than a default Cat5e UTP cable?

For any non-offline project, even with the best software and the best hardware, if your throughput is weak, then that thin line is going to squeeze data and give you delays, albeit in milliseconds...

Also, the maximum throughput is higher on CAT6 cables because there is a higher chance that you will actually receive a cable whose strands consist of copper cores, instead of CCA (Copper Clad Aluminium), which is often found in standard CAT5e cables.

If you are facing lost packets or packet drops, then an increase in throughput reliability for 24/7 operation can make the difference that you may be looking for.

For those who seek the ultimate in home/office connection reliability (and are willing to say NO to this year's fast-food restaurants), at the end of the year you can gift yourself the pinnacle of LAN connectivity in the form of CAT7 cable from a reputable brand.

Vassalage answered 29/5, 2009 at 14:27 Comment(0)

Last few % is a very CPU and application dependent thing....

cache architectures differ, some chips have on-chip RAM you can map directly, ARM's (sometimes) have a vector unit, SH4's a useful matrix opcode. Is there a GPU - maybe a shader is the way to go. TMS320's are very sensitive to branches within loops (so separate loops and move conditions outside if possible).

The list goes on.... But these sorts of things really are the last resort...

Build for x86, and run Valgrind/Cachegrind against the code for proper performance profiling. Or Texas Instruments' CCStudio has a sweet profiler. Then you'll really know where to focus...

Forbes answered 29/5, 2009 at 14:27 Comment(0)

Very difficult to give a generic answer to this question. It really depends on your problem domain and technical implementation. A general technique that is fairly language neutral: Identify code hotspots that cannot be eliminated, and hand-optimize assembler code.

P answered 29/5, 2009 at 14:32 Comment(0)

Adding this answer since I didn't see it included in all the others.

Minimize implicit conversion between types and sign:

This applies to C/C++ at least. Even if you already think you're free of conversions, sometimes it's good to test adding compiler warnings around functions that require performance; especially watch out for conversions within loops.

GCC specific: You can test this by adding some verbose pragmas around your code,

#ifdef __GNUC__
#  pragma GCC diagnostic push
#  pragma GCC diagnostic error "-Wsign-conversion"
#  pragma GCC diagnostic error "-Wdouble-promotion"
#  pragma GCC diagnostic error "-Wsign-compare"
#  pragma GCC diagnostic error "-Wconversion"
#endif

/* your code */

#ifdef __GNUC__
#  pragma GCC diagnostic pop
#endif

I've seen cases where you can get a few percent speedup by reducing conversions raised by warnings like this.

In some cases I have a header with strict warnings that I keep included to prevent accidental conversions, however this is a trade-off since you may end up adding a lot of casts to quiet intentional conversions which may just make the code more cluttered for minimal gains.
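
As an illustration of the kind of conversion these warnings catch, here is a minimal sketch with a hypothetical pair of functions. In the first one the `double` literal forces a float-to-double promotion and a conversion back on every element; whether the compiler manages to remove that on its own depends on the compiler and flags, so treat any speed difference as something to measure.

```cpp
#include <cstddef>
#include <vector>

// The literal 0.5 is a double, so each element is promoted to double,
// multiplied, and converted back to float. -Wdouble-promotion reports this.
void halve_promoting(std::vector<float>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = v[i] * 0.5;
}

// The literal 0.5f keeps the whole expression in single precision.
void halve_float(std::vector<float>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = v[i] * 0.5f;
}
```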

Wiersma answered 29/5, 2009 at 14:27 Comment(2)

This is why I like that in OCaml, casting between numeric types must be explicit. – Tijerina 25/7, 2014 at 11:1

@Tijerina fair point - but in many cases changing languages isn't a realistic choice. Since C/C++ are so widely used it's useful to be able to make them more strict, even if it's compiler specific. – Wiersma 3/8, 2014 at 16:53

If you have a lot of highly parallel floating point math-especially single-precision-try offloading it to a graphics processor (if one is present) using OpenCL or (for NVidia chips) CUDA. GPUs have immense floating point computing power in their shaders, which is much greater than that of a CPU.

Lyre answered 29/5, 2009 at 14:27 Comment(0)

Here are some quick and dirty optimization techniques I use. I consider this to be a 'first pass' optimization.

Learn where the time is spent Find out exactly what is taking the time. Is it file IO? Is it CPU time? Is it the network? Is it the Database? It's useless to optimize for IO if that's not the bottleneck.

Know Your Environment Knowing where to optimize typically depends on the development environment. In VB6, for example, passing by reference is slower than passing by value, but in C and C++, by reference is vastly faster. In C, it is reasonable to try something and do something different if a return code indicates a failure, while in Dot Net, catching exceptions is much slower than checking for a valid condition before attempting.

Indexes Build indexes on frequently queried database fields. You can almost always trade space for speed.

Avoid lookups Inside of the loop to be optimized, I avoid having to do any lookups. Find the offset and/or index outside of the loop and reuse the data inside.

Minimize IO Try to design in a manner that reduces the number of times you have to read or write, especially over a networked connection.

Reduce Abstractions The more layers of abstraction the code has to work through, the slower it is. Inside the critical loop, reduce abstractions (e.g. reveal lower-level methods that avoid extra code)

Spawn Threads For projects with a user interface, spawning a new thread to perform slower tasks makes the application feel more responsive, although it isn't actually faster.

Pre-process You can generally trade space for speed. If there are calculations or other intense operations, see if you can precompute some of the information before you're in the critical loop.
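
The "avoid lookups" and "pre-process" points can be sketched together: find the value once, outside the loop, and reuse it inside. The currency-rate scenario and the names below are assumptions for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using Rates = std::unordered_map<std::string, double>;

// Lookup inside the loop: hashes the same key once per element.
double total_slow(const std::vector<double>& amounts, const Rates& rates,
                  const std::string& currency) {
    double total = 0.0;
    for (double a : amounts)
        total += a * rates.at(currency);
    return total;
}

// Lookup hoisted: the rate is found once, then reused inside the loop.
// Note one behavioral difference: this version throws for an unknown
// currency even when amounts is empty.
double total_fast(const std::vector<double>& amounts, const Rates& rates,
                  const std::string& currency) {
    const double rate = rates.at(currency);
    double total = 0.0;
    for (double a : amounts)
        total += a * rate;
    return total;
}
```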

Farkas answered 29/5, 2009 at 14:27 Comment(0)

The google way is one option "Cache it.. Whenever possible don't touch the disk"

Babe answered 29/5, 2009 at 14:27 Comment(0)

If better hardware is an option then definitely go for that. Otherwise

Check you are using the best compiler and linker options.
If the hotspot routine is in a different library from its frequent caller, consider moving or cloning it into the caller's module. This eliminates some of the call overhead and may improve cache hits (cf. how AIX links strcpy() statically into separately linked shared objects). It could of course decrease cache hits as well, which is why one measures.
See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
Consider: are you really using the best algorithm? Is it the best algorithm for your input size?

Solander answered 29/5, 2009 at 14:27 Comment(1)

I would add to your first par.: do not forget turning off all the debugging info in your compiler options. – Caliph 4/9, 2013 at 19:40

Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that, how to optimize it.

Better cache locality, loop unrolling, Try to eliminate long dependency chains, to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.
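
To make the "eliminate long dependency chains" point concrete, here is a minimal sketch of a sum with one accumulator versus four independent ones. The function names and the choice of four chains are assumptions for the example; the useful number depends on the latency and throughput of the add unit on your CPU.

```cpp
#include <cstddef>
#include <vector>

// Single accumulator: every add has to wait for the previous one.
float sum_serial(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Four independent accumulators: the adds within one iteration do not depend
// on each other, so the CPU can overlap them. This changes the order of the
// floating-point additions, so results can differ in the last bits.
float sum_four_chains(const std::vector<float>& v) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

A compiler will usually not do this reordering for floating point on its own unless told that reassociation is allowed, which is why it is sometimes worth writing by hand.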

Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve performance of your code. That's really the only truly general piece of advice I can think of.

Well, that, and "Show the code on SO and ask for optimization advice for that specific piece of code".

Draughty answered 29/5, 2009 at 14:27 Comment(0)

Reduce variable sizes (in embedded systems)

If your variable size is larger than the word size on a specific architecture, it can have a significant effect on both code size and speed. For example, if you have a 16 bit system, and use a long int variable very often, and later realize that it can never get outside the range (−32,768 ... 32,767), consider reducing it to short int.

From my personal experience, if a program is ready or almost ready, but we realize it takes up about 110% or 120% of the target hardware's program memory, a quick normalization of variables solves the problem more often than not.

By this time, optimizing the algorithms or parts of the code itself can become frustratingly futile:

reorganize the whole structure and the program no longer works as intended, or at least you introduce a lot of bugs.
do some clever tricks: usually you spend a lot of time optimizing something, and discover no or very small decrease in code size, as the compiler would have optimized it anyway.

Many people make the mistake of having variables which exactly store the numerical value of a unit they use the variable for: for example, their variable time stores the exact number of milliseconds, even if only time steps of say 50 ms are relevant. Maybe if your variable represented 50 ms for each increment of one, you would be able to fit into a variable smaller or equal to the word size. On an 8 bit system, for example, even a simple addition of two 32-bit variables generates a fair amount of code, especially if you are low on registers, while 8 bit additions are both small and fast.
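
The 50 ms example can be sketched like this. The type alias, the function names, and the choice to saturate at the maximum are assumptions of mine; the saturation is there so that an out-of-range input clamps instead of silently wrapping around.

```cpp
#include <cstdint>

constexpr std::uint16_t kTickMs = 50;  // resolution the application actually needs

// One byte covers 0..255 ticks, i.e. 0..12750 ms at 50 ms per tick.
using Ticks = std::uint8_t;

// Converts milliseconds to ticks, rounding down and saturating at 255.
Ticks ms_to_ticks(std::uint32_t ms) {
    const std::uint32_t ticks = ms / kTickMs;
    return ticks > 255u ? static_cast<Ticks>(255) : static_cast<Ticks>(ticks);
}

// Converts ticks back to milliseconds; the largest result, 12750, fits in 16 bits.
std::uint16_t ticks_to_ms(Ticks t) {
    return static_cast<std::uint16_t>(t * kTickMs);
}
```

The conversion happens only at the edges (input and display); everything in between can work on the one-byte tick count.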

Fingered answered 29/5, 2009 at 14:27 Comment(6)

Not at all. Not everyone develops web applications or overdesigned GUIs with memory-managed development systems which run on high-end systems with gigabytes of RAM, where you can allow a small integer to use kilobytes of memory. Even today, there is a need for systems where the entire hardware cost for the embedded electronics must be under $1. For such firmware, every byte counts even today. – Fingered 6/10, 2011 at 17:4

I actually work in embedded, and even there you don't for 99.9% of the times. (microchip, motor control) And the few that are (like for small bulk electronic equipments) are mostly based on older ones. Not much new development going on there. Give or take the odd new "wellness" category device. – Countenance 7/10, 2011 at 8:4

I work in embedded too, and I need to use it all the time. Maybe you work in an area where you have a big enough processor, lot of space on your layout, and the number of units produced is very small compared to the task complexity, so the development costs are more important than the unit costs. That is not always the case. – Fingered 7/10, 2011 at 15:44

I'm totally with @Fingered here. So many times have I run into memory problems even when programming a moderately large DSP. It really depends on the specific application being implemented. For example, trying to push a modem software on a BlackFin DSP, you run into a wall very early in the design process... which forces you to start thinking cleverly and out-of-the-box on solving the memory issues. – Concettaconcettina 23/11, 2011 at 0:40

Be careful doing this. Somebody wasn't when they wrote the firmware for my thermostat, and if I leave it in "auto" mode (to switch between hot and cool) it will blast the AC when it hits -1F outside because of a signed -> unsigned conversion. The thermostat thinks it's 255F out :( – Anderegg 28/2, 2012 at 20:41

A state of the art server park running a managed web server application often consumes huge amounts of power because the application is inefficient to begin with. Maybe power consumption plays last fiddle to having something up and running. Why shouldn't architecture goals also aim for a target environment requiring just a few servers? It isn't THAT difficult. – Sarraceniaceous 5/3, 2012 at 11:46

pass by reference instead of by value

Dirham answered 29/5, 2009 at 14:27 Comment(2)

Though the question is language-agnostic, let me mention that with the advent of c++0x (including move semantics and extended const rvalue reference lifetime extensions) the compiler will (many times) be able to elide copies (NRVO, URVO) but only if the parameter was passed by value. End answer: profile and understand your hotspots – Seaworthy 14/6, 2011 at 15:1

not useful: C++11 aside (but very good point), in the languages that allow passing by value, avoiding it when appropriate is surely a basic mantra at the early levels of learning, not an "optimisation strategy of last resort" as asked in the question. – Haste 21/2, 2016 at 23:41

Tweak the OS and framework.

It may sound like overkill, but think about it like this: operating systems and frameworks are designed to do many things. Your application only does very specific things. If you could get the OS to do exactly what your application needs and have your application understand how the framework (PHP, .NET, Java) works, you could get much more out of your hardware.

Facebook, for example, changed some kernel level thingys in Linux, changed how memcached works (for example they wrote a memcached proxy, and used udp instead of tcp).

Another example of this is Windows 2008. Win2K8 has a version where you can install just the basic OS needed to run X applications (e.g. Web-Apps, Server Apps). This removes much of the overhead that the OS has in running processes and gives you better performance.

Of course, you should always throw in more hardware as the first step...

Algoid answered 29/5, 2009 at 14:27 Comment(1)

That would be a valid approach after all other approaches failed, or if a specific OS or Framework feature was responsible for markedly decreased performance, but the level of expertise and control needed to pull that off may not be available to every project. – Farkas 20/7, 2011 at 16:20

Sometimes changing the layout of your data can help. In C, you might switch from an array of structures to a structure of arrays, or vice versa.
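
A minimal sketch of the two layouts, written in C++ for brevity (the particle example and the names are assumptions, not from any particular codebase):

```cpp
#include <cstddef>
#include <vector>

// Array of structures: each particle's fields sit together, so a loop that
// only reads x still pulls y, z and mass through the cache.
struct ParticleAoS {
    float x, y, z, mass;
};

float sum_x_aos(const std::vector<ParticleAoS>& ps) {
    float s = 0.0f;
    for (const ParticleAoS& p : ps) s += p.x;
    return s;
}

// Structure of arrays: each field is contiguous, so the same loop reads
// only the bytes it needs.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

float sum_x_soa(const ParticlesSoA& ps) {
    float s = 0.0f;
    for (float v : ps.x) s += v;
    return s;
}
```

The "vice versa" matters: if the hot loop touches every field of one element at a time, the array of structures is usually the better fit.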

Gravy answered 29/5, 2009 at 14:27 Comment(0)

In a language with templates (C++/D) you can try propagating constant values via template args. You can even do this for small sets of not really constant values with a switch.

Foo(i, j); // i always in 0-4.

becomes

switch(i)
{
    case 0: Foo<0>(j); break;
    case 1: Foo<1>(j); break;
    case 2: Foo<2>(j); break;
    case 3: Foo<3>(j); break;
    case 4: Foo<4>(j); break;
}

The downside is cache pressure so this would only be a gain in deep or long running call trees where the value is constant for the duration.
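
For completeness, here is a compilable sketch of the same idea with a hypothetical kernel in place of Foo. The function names and the strided-sum workload are assumptions; the point is only that each instantiation sees its stride as a compile-time constant.

```cpp
// Hypothetical kernel: Stride is a compile-time constant, so the compiler can
// specialize the loop separately for each instantiation.
template <int Stride>
int sum_strided(const int* data, int count) {
    int total = 0;
    for (int k = 0; k < count; k += Stride)
        total += data[k];
    return total;
}

// Runtime dispatch over the small set of values that actually occur (1-4).
int sum_strided_dispatch(int stride, const int* data, int count) {
    switch (stride) {
        case 1: return sum_strided<1>(data, count);
        case 2: return sum_strided<2>(data, count);
        case 3: return sum_strided<3>(data, count);
        case 4: return sum_strided<4>(data, count);
        default: {
            // Generic fallback for any other positive stride.
            int total = 0;
            if (stride > 0)
                for (int k = 0; k < count; k += stride) total += data[k];
            return total;
        }
    }
}
```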

Wernsman answered 29/5, 2009 at 14:27 Comment(0)

There is no such blanket statement possible, it depends on the problem domain. Some possibilities:

Since you don't specify outright that your application is 100% calculating:

Search for calls that block (database, network, hard disk, display update), and isolate them and/or put them in a thread.

If you use a database and it happens to be Microsoft SQL Server:

investigate nolock and rowlock directives. (There are threads on this forum.)

If your app is purely calculating, you can look at this question of mine about cache optimization for rotating large images. The increase in speed flabbergasted me.

It is a long shot, but maybe it gives an idea, especially if your problem is in the imaging domain: rotating-bitmaps-in-code

Another one is avoiding dynamic memory allocation as much as possible. Allocate multiple structs at once, release them at once.
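
"Allocate multiple structs at once, release them at once" could look roughly like this. The `NodePool` class is a hypothetical sketch with a fixed capacity and no per-node free; it is only meant to show the shape of the idea.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Hands out nodes from one block allocated up front and releases them all at
// once when the pool is destroyed. Individual nodes are never freed.
class NodePool {
public:
    explicit NodePool(std::size_t capacity) : storage_(capacity), used_(0) {}

    // Returns nullptr when the pool is exhausted.
    Node* allocate(int value) {
        if (used_ == storage_.size()) return nullptr;
        Node* n = &storage_[used_++];
        n->value = value;
        n->next = nullptr;
        return n;
    }

    std::size_t used() const { return used_; }

private:
    std::vector<Node> storage_;  // one contiguous allocation
    std::size_t used_;
};
```

Besides saving the per-allocation cost, consecutive nodes end up next to each other in memory, which also helps the cache.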

Otherwise, identify your tightest loops and post them here, either in pseudo or not, with some of the datastructures.

Countenance answered 29/5, 2009 at 14:27 Comment(0)
