Code coverage with optimization

Currently I have a bunch of unit tests for my C++ project, but I am not (yet) measuring code coverage. I am compiling the tests with the -O3 optimization flag to expose potential subtle bugs, but it seems that if I want to collect coverage information with tools like gcov, all optimization flags must be disabled. Should I build the tests twice (once with -O3, once without)? How is this problem usually dealt with?

Blaineblainey answered 29/4, 2016 at 5:27 Comment(3)
How is this problem usually dealt with? I compile tests with -O0. To find potential bugs, tools like Valgrind (or some of gcc's sanitizer flags) are more suitable. -O3 is suitable for performance benchmarks.Nonpros
@Nonpros Yes, I am using Valgrind to run my tests. Is -O0 just the same as passing no optimization flag, since it's the default?Blaineblainey
Is -O0 just the same as no optimization flags? Yep.Nonpros

There are typically many kinds of tests that one performs to assure the quality of the software, and different criteria for which compiler options to use for each.

Typically, a build system offers two or more choices of builds, for example:

Debug: -O0 (no optimisation) with asserts

Release: "higher optimisation" (-O2, -Os or -O3, depending on what is "best" for your project) without asserts. This is usually the mode in which you deliver the code to customers.

Sometimes there is also a "Release+Asserts" configuration, so that you can still check correctness in the code while running with some semblance of performance.
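
As a rough illustration only (the file name foo.cpp is a placeholder, and the exact optimisation levels are a project choice), those configurations might map onto gcc invocations like:

    g++ -O0 -g       -c foo.cpp   # Debug: no optimisation, asserts enabled by default
    g++ -O2 -DNDEBUG -c foo.cpp   # Release: optimised, NDEBUG compiles asserts out
    g++ -O2          -c foo.cpp   # Release+Asserts: optimised, asserts left in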

Here are some categories that I think tests can be classed into:

  1. Functional correctness (aka "positive tests"). This is where you check that "the code works correctly under normal circumstances". Run both Debug and Release.

  2. Negative tests. Check that error conditions work correctly - passing rubbish values that should give errors ("file that doesn't exist" should give E_NO_SUCH_FILE). Typically run in both Debug and Release.

  3. Stress tests - harsh tests that check that the software behaves correctly when you run it for a long time, with lots of threads, and so on. Typically Debug mode - maybe both.

  4. Coverage. Run a set of tests to ensure that you "cover all paths" - often with some allowance for uncovered code, such as requiring 95% of functions and 85% of branches, since some conditions are extremely difficult to reach without manually instrumenting the code (for example, errors that only occur when the disk is completely full, or when the OS can't create a new process). Typically compiled as Debug (a minimal gcov sketch follows after this list).

  5. Fault tolerance tests. A form of "negative test" where you insert "mock" functionality for memory allocations and similar, which simulates failures either sequentially or at random, to discover cases where errors are not detected and the code fails as a follow-on consequence of an earlier error rather than producing the correct error at the correct place. Again, typically run with Debug - but it may be worth running in Release as well.

  6. Performance testing. Where you measure the performance of your program - frames per second generated, lines per second in a compiler, or gigabytes per hour in a file download system, etc. This should be compiled as Release, since measuring the performance of "not optimised" code is nearly always pointless.
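
For the coverage case in point 4, a minimal gcov-based sketch (assuming g++, with tests.cpp and run_tests as placeholder names) could look like:

    g++ -O0 -g --coverage -o run_tests tests.cpp   # --coverage implies -fprofile-arcs -ftest-coverage
    ./run_tests                                     # running the tests writes the .gcda count files
    gcov tests.cpp                                  # produces tests.cpp.gcov with per-line execution counts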

For complex software products, you often have to compromise between "running everything" and "the time it takes" - for example, running ALL 4000 functional tests in both Debug and Release mode may take 12 hours, while running them only in Debug mode takes 7 hours, which may be preferable. This compromise is the usual "engineering decision" - "In an ideal world, you'd do this, but in the real world, we have to compromise, and here's why I think this configuration of tests is right".

Many test systems run light testing on every change to the source code [after "I think this works" from the engineer him/herself], heavier testing each night, and more tests over the weekend, for example. This allows a compromise between the time it takes to run ALL tests and the time it takes one engineer to make a small change.

Ruwenzori answered 29/4, 2016 at 7:0 Comment(2)
As you said in point 4, -O0 should be used for coverage testing. Is that right?Cadelle
Yes, that's often the case, but not always. Particularly if you have #if DEBUG blocks that call functions to check things, turning debug off will give different coverage results. On the other hand, depending on how good the coverage tool is [how well it interacts with the compiler], it may not "see" coverage that comes from inlined functions, so coverage with full optimisation may not work well either. A portion of this depends on exactly what you are trying to achieve with the coverage tool - testing the production version of the code, checking that tests cover all branches, etc...Ruwenzori

I am compiling the tests with -O3 optimization flag to expose potential subtle bugs

Bugs that might arise from optimising your build include timing-related bugs. They may indicate race conditions or deeper problems with your software design.

However, you may also observe more evidence of existing bugs in the form of undefined behaviour. To test for UB, run your tests with sanitizers enabled.
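
As a sketch only (file names are placeholders, and the exact sanitizer set is a choice rather than a prescription), an optimised test build with sanitizers enabled might be built as:

    g++ -O3 -g -fsanitize=address,undefined -fno-omit-frame-pointer -o tests_san tests.cpp
    ./tests_san   # AddressSanitizer and UBSan report memory errors and undefined behaviour at run time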

it seems if I want to collect coverage information with tools like gcov, any optimization flag must be disabled

Coverage testing is inexact because the compiler isn't required to generate code which maps neatly back to lines in a source file. However, disabling certain optimisations will simplify code to your advantage, so consider using -O0 or -Og when measuring coverage and see how that helps.
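
For instance (illustrative commands only; lcov and genhtml are optional front-ends on top of gcov's data), a coverage run might look like:

    g++ -Og -g --coverage -o run_tests tests.cpp   # or -O0 for the most literal line-by-line mapping
    ./run_tests
    lcov --capture --directory . --output-file coverage.info
    genhtml coverage.info --output-directory coverage-report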

Should I build the tests twice (one with -O3, the other without)?

That would be my advice. You are testing two things:

  1. Is the code correct? To test this, favour a release build configuration so that what you are testing is closer to a production build, but consider enabling instrumentation, such as sanitizers and asserts, that helps catch bugs. Start with -fsanitize=address,undefined, and consider ThreadSanitizer if there's a possibility of race conditions.

    If these tests fail, they are your highest priority as they indicate that you have bugs in your code.

  2. How much of your code is tested? To test this, run the same tests, but gather coverage metrics. Consider disabling optimisations so that the coverage data maps more clearly onto your source.

    Disable asserts and other development-time constructs that you use to detect bugs: you have already eliminated all previously-detected bugs, so that code is unlikely to be covered. A sketch combining both builds follows below.
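
Putting the two points together, the "build twice" approach from the question might look roughly like this (names are placeholders; swap in -fsanitize=thread instead of address,undefined when hunting race conditions, since ThreadSanitizer cannot be combined with AddressSanitizer):

    # 1. Correctness: optimised, asserts on, sanitizers on
    g++ -O3 -g -fsanitize=address,undefined -o tests_checked tests.cpp
    ./tests_checked

    # 2. Coverage: optimisation off, asserts off, coverage instrumentation on
    g++ -O0 -g -DNDEBUG --coverage -o tests_coverage tests.cpp
    ./tests_coverage && gcov tests.cpp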

How is this problem usually dealt with?

The above advice is just what I've tended towards through personal experience. Everyone, including @mats-petersson, has their own experience. Notably, projects tend to get complicated and complexity impedes change. So try to set up a good testing regime as early as possible and insist on minimising unnecessary complexity in projects that are going to be around for a while.

Footbridge answered 28/6, 2022 at 9:42 Comment(0)
