Quantifiable metrics (benchmarks) on the usage of header-only c++ libraries
I've tried to find an answer to this using SO. There are a number of questions that list the various pros and cons of building a header-only library in c++, but I haven't been able to find one that does so in quantifiable terms.

So, in quantifiable terms, what's different between using traditionally separated c++ header and implementation files versus header only?

For simplicity, I'm assuming that templates are not used (because they require header only).
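To make the comparison concrete, here is a minimal sketch of the two styles (the function and file names are purely illustrative):

```cpp
// --- Traditional: declaration in the header, definition in a .cpp ---

// clamp.h
int clamp(int value, int lo, int hi);

// clamp.cpp
#include "clamp.h"
int clamp(int value, int lo, int hi) {
    return value < lo ? lo : (value > hi ? hi : value);
}

// --- Header-only: the header carries the definition itself ---

// clamp.hpp
inline int clamp(int value, int lo, int hi) {
    return value < lo ? lo : (value > hi ? hi : value);
}

// In the traditional layout the body is compiled once, in clamp.cpp, and
// callers see only the declaration. In the header-only layout every
// translation unit that includes clamp.hpp recompiles the body, but the
// compiler can also see it (and inline it) at every call site.
```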

To elaborate, I've listed what I have seen from the articles to be the pros and cons. Obviously, some are not easily quantifiable (such as ease of use), and are therefore useless for quantifiable comparison. I'll mark those for which I expect quantifiable metrics with (quantifiable).

Pros for header-only

  1. It's easier to include, since you don't need to specify linker options in your build system.
  2. You always compile all the library code with the same compiler (options) as the rest of your code, since the library's functions get inlined in your code.
  3. It may be a lot faster. (quantifiable)
  4. May give compiler/linker better opportunities for optimization (explanation/quantifiable, if possible)
  5. Is required if you use templates anyways.

Cons for header-only

  1. It bloats the code. (quantifiable) (how does that affect both execution time and the memory footprint)
  2. Longer compile times. (quantifiable)
  3. Loss of separation of interface and implementation.
  4. Sometimes leads to hard-to-resolve circular dependencies.
  5. Prevents binary compatibility of shared libraries/DLLs.
  6. It may aggravate co-workers who prefer the traditional ways of using C++.

Any examples that you can use from larger, open source projects (comparing similarly-sized codebases) would be very much appreciated. Or, if you know of a project that can switch between header-only and separated versions (using a third file that includes both), that would be ideal. Anecdotal numbers are useful too because they give me a ballpark with which I can gain some insight.

sources for pros and cons:

Thanks in advance...

UPDATE:

For anyone that may be reading this later and is interested in getting a bit of background information on linking and compiling, I found these resources useful:

UPDATE: (in response to the comments below)

Just because answers may vary doesn't mean that measurement is useless. You have to start measuring at some point. And the more measurements you have, the clearer the picture is. What I'm asking for in this question is not the whole story, but a glimpse of the picture. Sure, anyone can use numbers to skew an argument if they wanted to unethically promote their bias. However, if someone is curious about the differences between two options and publishes those results, I think that information is useful.

Has no one been curious about this topic, enough to measure it?

I love the shootout project. We could start by removing most of those variables. Only use one version of gcc on one version of Linux. Only use the same hardware for all benchmarks. Do not compile with multiple threads.

Then, we can measure:

  • executable size
  • runtime
  • memory footprint
  • compile time (for both entire project and by changing one file)
  • link time
Hampton answered 5/9, 2012 at 22:23 Comment(11)
Pre-compiled headers are an interesting solution in this scenario, and could decrease some of the build-time issues. – Erelong
Interesting... any numbers on that? – Hampton
Not for C++ directly, no. But for Objective-C and including something like <Foundation/Foundation.h> (approx 100k lines of code), using a PCH instead of a normal header can improve build times by about 2x. – Erelong
Very useful. Thank you. I consider Objective-C and C++ very comparable in your example. – Hampton
One thing that could mess up pre-compiled headers is templates. Not quite sure how those would work. – Erelong
The shootout's benchmarks aren't that good for this particular test; AFAICT, they are all one unit. For my test I will use Box2D to start with, and look for more compute-intensive, multi-unit projects. Any suggestions? Perhaps something string-operation intensive that uses ICU, for example. – Performing
I don't know of any projects that are good candidates. I had looked around for a while. I'm sure that whatever ideas you come up with will be fine. – Hampton
@Hampton I've been considering adding gmp+gmpbench to the mix, though it is a C library/benchmark. Would this be helpful? – Performing
It's not identical to c++, but I think it's pretty comparable. The only thing that I can think of that may be an issue is the automatic inlining of class methods in c++. And because of the nature of a header-only file (all methods are defined in the header), that feature may end up being significant. What do you think? – Hampton
Yes, it may be slightly different, because it isn't doing many C++ things like polymorphism etc. But I think it will still give you a general idea of what combining compilation units can do, even if it doesn't reflect fully on C++. Since I haven't actually done the benchmark, I don't even know what results to expect. From my results so far, performance was surprisingly slower when everything was included. Must be instruction cache misses or some such. If I have time, I'll perhaps run valgrind/kcachegrind and add that to the results. – Performing
I'm sure that any contributions you make will be valuable, and I really appreciate them. Your time is valuable and I appreciate you giving it to help clarify this issue for me. – Hampton

Summary (notable points):

  • Two packages benchmarked (one with 78 compilation units, one with 301 compilation units)
  • Traditional Compiling (Multi Unit Compilation) resulted in a 7% faster application (in the 78 unit package); no change in application runtime in the 301 unit package.
  • Both Traditional Compiling and Header-only benchmarks used the same amount of memory when running (in both packages).
  • Header-only Compiling (Single Unit Compilation) resulted in an executable size that was 10% smaller in the 301 unit package (only 1% smaller in the 78 unit package).
  • Traditional Compiling used about a third of the memory to build, across both packages.
  • Traditional Compiling took three times as long to compile (on the first compilation) and took only 4% of the time on recompile (as header-only has to recompile all the sources).
  • Traditional Compiling took longer to link on both the first compilation and subsequent compilations.

Box2D benchmark, data:

box2d_data_gcc.csv

Botan benchmark, data:

botan_data_gcc.csv

Box2D SUMMARY (78 Units)

[summary table image]

Botan SUMMARY (301 Units)

[summary table image]

NICE CHARTS:

Box2D executable size:

[chart]

Box2D compile/link/build/run time:

[chart]

Box2D compile/link/build/run max memory usage:

[chart]

Botan executable size:

[chart]

Botan compile/link/build/run time:

[chart]

Botan compile/link/build/run max memory usage:

[chart]


Benchmark Details



The projects tested, Box2D and Botan, were chosen because they are potentially computationally expensive, contain a good number of units, and actually had few or no errors compiling as a single unit. Many other projects were attempted but were consuming too much time to "fix" into compiling as one unit. The memory footprint is measured by polling it at regular intervals and taking the maximum, and thus might not be fully accurate.

Also, this benchmark does not do automatic header dependency generation (to detect header changes). In a project using a different build system, this may add time to all benchmarks.

There are 3 compilers in the benchmark, each with 5 configurations.

Compilers:

  • gcc
  • icc
  • clang

Compiler configurations:

  • Default - default compiler options
  • Optimized native - -O3 -march=native
  • Size optimized - -Os
  • LTO/IPO native - -O3 -flto -march=native with clang and gcc, -O3 -ipo -march=native with icpc/icc
  • Zero optimization - -O0

I think these each can have different bearings on the comparisons between single-unit and multi-unit builds. I included LTO/IPO so we might see how the "proper" way to achieve single-unit-effectiveness compares.

Explanation of csv fields:

  • Test Name - name of the benchmark. Examples: Botan, Box2D.
  • Test Configuration - name a particular configuration of this test (special cxx flags etc.). Usually the same as Test Name.
  • Compiler - name of the compiler used. Examples: gcc,icc,clang.
  • Compiler Configuration - name of a configuration of compiler options used. Example: gcc opt native
  • Compiler Version String - first line of output of compiler version from the compiler itself. Example: g++ --version produces g++ (GCC) 4.6.1 on my system.
  • Header only - a value of True if this test case was built as a single unit, False if it was built as a multi-unit project.
  • Units - number of units in the test case, even if it is built as a single unit.
  • Compile Time,Link Time,Build Time,Run Time - as it sounds.
  • Re-compile Time AVG,Re-compile Time MAX,Re-link Time AVG,Re-link Time MAX,Re-build Time AVG,Re-build Time MAX - the times across rebuilding the project after touching a single file. Each unit is touched, and for each, the project is rebuilt. The maximum times, and average times are recorded in these fields.
  • Compile Memory,Link Memory,Build Memory,Run Memory,Executable Size - as they sound.

To reproduce the benchmarks:

  • The main driver is run.py.
  • Requires psutil (for memory footprint measurements).
  • Requires GNUMake.
  • As it is, requires gcc, clang, icc/icpc in the path. Can be modified to remove any of these of course.
  • Each benchmark should have a data file that lists the units of that benchmark. run.py will then create two test cases, one with each unit compiled separately and one with all units compiled together. Example: box2d.data. The file format is a JSON string containing a dictionary with the following keys (a minimal example file is sketched after this list):
    • "units" - a list of c/cpp/cc files that make up the units of this project
    • "executable" - A name of the executable to be compiled.
    • "link_libs" - A space separated list of installed libraries to link to.
    • "include_directores" - A list of directories to include in the project.
    • "command" - optional. special command to execute to run the benchmark. For example, "command": "botan_test --benchmark"
  • Not all C++ projects can be handled this easily; there must be no conflicts/ambiguities in the single unit.
  • To add a project to the test cases, modify the list test_base_cases in run.py with the information for the project, including the data file name.
  • If everything runs well, the output file data.csv should contain the benchmark results.
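For illustration, a minimal data file following the format described above might look like this (the file names, library, and command here are hypothetical, not taken from the actual benchmark data files):

```json
{
    "units": ["src/Body.cpp", "src/World.cpp", "bench/main.cpp"],
    "executable": "mybench",
    "link_libs": "pthread",
    "include_directores": ["include"],
    "command": "mybench --iterations 1000"
}
```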

To produce the bar charts:

  • You should start with a data.csv file produced by the benchmark.
  • Get chart.py. Requires matplotlib.
  • Adjust the fields list to decide which graphs to produce.
  • Run python chart.py data.csv.
  • A file, test.png should now contain the result.

Box2D

  • Box2D was used from svn as is, revision 251.
  • The benchmark was taken from here, modified here and might not be representative of a good Box2D benchmark, and it might not use enough of Box2D to do this compiler benchmark justice.
  • The box2d.data file was manually written, by finding all the .cpp units.

Botan

  • Using Botan-1.10.3.
  • Data file: botan_bench.data.
  • First ran ./configure.py --disable-asm --with-openssl --enable-modules=asn1,benchmark,block,cms,engine,entropy,filters,hash,kdf,mac,bigint,ec_gfp,mp_generic,numbertheory,mutex,rng,ssl,stream,cvc, this generates the header files and Makefile.
  • I disabled assembly because assembly might interfere with optimizations that can occur when function boundaries do not block optimization. However, this is conjecture and might be totally wrong.
  • Then ran commands like grep -o "\./src.*cpp" Makefile and grep -o "\./checks.*" Makefile to obtain the .cpp units and put them into botan_bench.data file.
  • Modified /checks/checks.cpp to not call the x509 unit tests, and removed the x509 check, because of a conflict between a Botan typedef and OpenSSL.
  • The benchmark included in the Botan source was used.

System specs:

  • OpenSuse 11.4, 32-bit
  • 4GB RAM
  • Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz
Performing answered 27/11, 2012 at 21:5 Comment(9)
Looking good, Realz... if you're looking for a charting solution, maybe we could plug it into this: joedesigns.com/labs/Beautiful-Analytics-Chart – Hampton
@Hampton I think a bar graph like the one I copied in my other answer is more appropriate, no? I am able to make the charts manually in OpenOffice, but it's quite tedious for this type of data. I am now trying it with matplotlib instead. – Performing
Unless you particularly want me to plug it into that lib. – Performing
No, you've already done a lot more than expected. Anything suitable is awesome. I just understood that you were looking for a visualization library and I had used that one in the past with a lot of success. tyvm – Hampton
Hey... I added a summary table. You may choose to remove it, but that was exactly what I was looking for. You nailed it, buddy! – Hampton
@Hampton you made it worth my while. Must satisfy my stackexchange addiction! – Performing
Let us continue this discussion in chat. – Performing
I'm going to give you more points after this (if it lets me give you 200 more). The last bounty required me to start at 400. But if I can give you 200 more, I will. Great job, man. – Hampton
It'll only let me give you 500 more... but what the hey... :-) – Hampton

Update

This was Real Slaw's original answer. His answer above (the accepted one) is his second attempt. I feel that his second attempt answers the question entirely. - Homer6

Well, for comparison, you can look up the idea of a "unity build" (nothing to do with the Unity game engine). Basically, a "unity build" is where you include all the cpp files into a single file and compile them all as one compilation unit. I think this should provide a good comparison, as AFAICT, this is equivalent to making your project header-only. You'd be surprised about the 2nd "con" you listed; the whole point of "unity builds" is to decrease compile times. Supposedly unity builds compile faster because they:

.. are a way of reducing build over-head (specifically opening and closing files and reducing link times by reducing the number of object files generated) and as such are used to drastically speed up build times.

altdevblogaday
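As a rough sketch of the technique (file names are invented for illustration): a unity build adds one translation unit that #includes all the implementation files, and only that file is handed to the compiler.

```cpp
// unity.cpp -- the single translation unit of a "unity build".
// Instead of compiling physics.cpp, renderer.cpp, input.cpp and main.cpp
// separately and linking four object files, the build compiles only this file.
#include "physics.cpp"
#include "renderer.cpp"
#include "input.cpp"
#include "main.cpp"
```

The usual caveat is that file-local names (statics, anonymous namespaces, macros) from different .cpp files now share one translation unit and can collide, which is why not every project compiles cleanly as a single unit.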

Compilation time comparison (from here):

[chart]

Three major references for "unity builds":

I assume you want reasons for the pros and cons listed.

Pros for header-only

[...]

3) It may be a lot faster. (quantifiable) The code might be optimized better. The reason is that when the units are separate, a call into another unit is just a function call, and must be left as such. No information about this call is known, for example:

  • Will this function modify memory (and thus our registers reflecting those variables/memory will be stale when it returns)?
  • Does this function look at global memory (and thus we cannot reorder where we call the function)
  • etc.

Furthermore, if the function's internal code is known, it might be worthwhile to inline it (that is, to dump its code directly into the calling function). Inlining avoids the function call overhead. Inlining also allows a whole host of other optimizations to occur (for example, constant propagation: say we call factorial(10); if the compiler doesn't know the code of factorial(), it is forced to leave the call as is, but if we know the source code of factorial(), we can propagate the constant 10 into the function's body, and if we are lucky we can even end up with the answer at compile time, without running anything at all at runtime). Other optimizations after inlining include dead-code elimination and (possibly) better branch prediction.
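As a hypothetical illustration of the constant-propagation point (a sketch, not taken from any of the benchmarked projects):

```cpp
// factorial.h -- the definition is visible to callers, so it can be inlined
inline unsigned factorial(unsigned n) {
    unsigned result = 1;
    for (unsigned i = 2; i <= n; ++i)
        result *= i;
    return result;
}

// main.cpp
#include "factorial.h"

int main() {
    // With the body visible, an optimizing compiler can inline the call,
    // propagate the constant 10, and fold the whole expression to 3628800
    // at compile time. If only a declaration were visible (definition in a
    // separate .cpp), a real call would remain unless LTO/IPO is used.
    return factorial(10) == 3628800 ? 0 : 1;
}
```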

4) May give compiler/linker better opportunities for optimization (explanation/quantifiable, if possible)

I think this follows from (3).

Cons for header-only

1) It bloats the code. (quantifiable) (how does that affect both execution time and the memory footprint) Header-only can bloat the code in a few ways that I know of.

The first is template bloat, where the compiler instantiates unnecessary templates of types that are never used. This isn't particular to header-only but rather to templates, and modern compilers have improved on this to make it of minimal concern.

The second, more obvious way is the (over)inlining of functions. If a large function is inlined everywhere it is used, those calling functions will grow in size. This might have been a concern about executable size and executable-image-memory size years ago, but HDD space and memory have grown enough to make it almost pointless to care about. The more important issue is that this increased function size can ruin the instruction cache (so that the now-larger function doesn't fit into the cache, and the cache has to be refilled as the CPU executes through the function). Register pressure will also be increased after inlining (registers are the limited on-CPU storage that the CPU works with directly). This means that the compiler will have to juggle the registers in the middle of the now-larger function, because there are too many live variables.

2) Longer compile times. (quantifiable)

Well, header-only compilation can logically result in longer compile times for many reasons (notwithstanding the performance of "unity builds"; the logic doesn't necessarily hold in the real world, where other factors get involved). One reason is that if an entire project is header-only, we lose incremental builds. Any change in any part of the project means the entire project has to be rebuilt, while with separate compilation units, a change in one cpp file just means that object file must be rebuilt and the project relinked.

In my (anecdotal) experience, this is a big hit. Header-only increases performance a lot in some special cases, but productivity wise, it is usually not worth it. When you start getting a larger codebase, compilation time from scratch can take > 10 minutes each time. Recompiling on a tiny change starts getting tiresome. You don't know how many times I forgot a ";" and had to wait 5 mins to hear about it, only to go back and fix it, and then wait another 5 mins to find something else I just introduced by fixing the ";".

Performance is great, productivity is much better; it will waste a large chunk of your time, and demotivate/distract you from your programming goal.

Edit: I should mention that interprocedural optimization (see also link-time optimization, and whole program optimization) tries to accomplish the optimization advantages of the "unity build". Implementations of this are still a bit shaky in most compilers AFAIK, but eventually it might deliver the same performance advantages.

Performing answered 12/9, 2012 at 2:14 Comment(9)
Excellent post. +1 for introducing the term unity builds and for your explanation of potential sources of speedups. Unfortunately, just like the other posts that I've seen, relying on explanations alone is a bit of conjecture. Programmers are notoriously bad at predicting where a speedup or slowdown will occur. That's really the point of the quantifiable metrics. It's meant to show, to what degree, how much faster or slower something is. For example, if something is 80% faster, that's quite different from it being 2% faster. – Hampton
And if it's only 2% faster, then it's not much of a factor to consider in the overall picture. So, I can't accept this as the answer unless you provide numbers for the questions. Thanks. – Hampton
PS. I read all of the articles. Thank you for including them. :-) – Hampton
I included the chart for some numbers, but it would take a lot of work to get numbers for all the questions, and they wouldn't even generalize (they would be very particular to the type of test and the type of code etc.). But I guess that's ok; it's a hard question. Not every question has a good answer :) – Performing
I'm just surprised that no one has these numbers on hand. I thought this would be an easy question to answer. I guess people don't necessarily make a habit of challenging established build methods. I kind of expected to see an open-and-shut case, if the numbers were there to back it up. – Hampton
Well, the numbers are highly subjective, so there is no yes/no answer to "should I use something like a unity build (or lean toward header-only or not)". It depends on the compiler, the project, the design and layout of the code, the type of code, the bottlenecks of the code etc. If someone were to give you numbers and say definitively: Use/Don't use unity builds, they would be deceiving you, because there is no right answer in general. In a particular case, you can get an answer, and all the numbers would be particular to that case (also to a particular platform/compiler etc.). – Performing
To answer this question generally, you'd need some sort of crazy "shootout" (in the spirit of shootout.alioth.debian.org), except you'd be testing only C++, among all the compilers and platforms, and measuring compile times, link times, runtimes, code size, multiplied by CPU type/HDD type, multiplied by all the programs in the shootout (which wouldn't be the same as on their site, since those are mostly short programs; you'd want a selection of larger FOSS projects etc., so one can better measure things like code size). – Performing
Good comments; I've added a response to them as an update to the end of the question. – Hampton
Nice post, but a unity build can be far different from separate compilation with header-only libraries. In a unity build, header include guards will normally prevent header content from being included more than once; you get only one copy of the include-file code. With separate compilation and include-only libraries, you may end up with one copy of a library per compilation unit. Hence the code bloat described by the OP. – Hippocrene

I hope this isn't too similar to what Realz said.

Executable (/object) size: (executable 0% / object up to 50% bigger on header only)

I would assume defined functions in a header file will be copied into every object. When it comes to generating the executable, I'd say it should be rather easy to cut out duplicate functions (no idea which linkers do/don't do this, I assume most do), so there is (probably) no real difference in the executable size, but a real difference in the object size. The difference should largely depend on how much code is actually in the headers versus the rest of the project. Not that the object size really matters these days, except for link time.
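For instance (a minimal sketch, with invented names): a function defined in a header that several .cpp files include must be marked inline; each object file may then carry its own copy, and the linker keeps only one in the final executable, which is why the object files grow while the executable barely does.

```cpp
// util.hpp -- header-only helper included from several translation units
#pragma once
inline int square(int x) { return x * x; }  // a copy may be emitted in every TU that uses it

// a.cpp
#include "util.hpp"
int a_value() { return square(3); }

// b.cpp
#include "util.hpp"
int b_value() { return square(4); }

// Both a.o and b.o may contain square() as a weak/COMDAT symbol; the linker
// discards the duplicates, so the executable ends up with a single copy.
```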

Runtime: (1%)

I'd say basically identical (a function address is a function address), except for inline functions. I'd expect inline functions to make less than a 1% difference in your average program, because function calls do have some overhead, but it is small compared to the cost of the work the program actually does.

Memory footprint: (0%)

Same things in the executable = same memory footprint (during runtime), assuming the linker cuts out duplicate functions. If duplicate functions aren't cut out, it can make quite a difference.

Compile time (for both entire project and by changing one file): (entire up to 50% faster for either one, single up to 99% faster for not header only)

Huge difference. Changing something in a header file causes everything that includes it to recompile, while changes in a cpp file just require that object to be recreated and a re-link. An easy 50% slower for a full compile with header-only libraries. However, with pre-compiled headers or unity builds, a full compile with header-only libraries would probably be faster, but one change requiring a lot of files to recompile is a huge disadvantage, and I'd say that makes it not worth it. Full recompiles aren't needed often. Also, you can include something in a cpp file but not in its header file (this can happen often), so, in a properly designed program (tree-like dependency structure / modularity), when changing a function declaration or something (which always requires changes to the header file), header-only would cause a lot of things to recompile, but with non-header-only you can limit this greatly.

Link time: (up to 50% faster for header-only)

The objects are likely bigger, thus it would take longer to process them. Probably linearly proportional to how much bigger the files are. From my limited experience in big projects (where compile + link time is long enough to actually matter), link time is almost negligible compared to compile time (unless you keep making small changes and building, then I'd expect you'd feel it, which I suppose can happen often).

Avarice answered 2/10, 2012 at 15:31 Comment(1)
Thanks Dukeling, this definitely helps. However, it is a bit similar to the other post because it's speculative. I was more looking for actual measurements, regardless of platform, etc. – Hampton
