Reducing the footprint of debug symbols (executable is bloated to 4 GB)

So the basic problem is that my built executable is 4GB in size with debug symbols turned on (between 75 MB and 300 MB with no debug symbols and varying optimization levels). How can I diagnose/analyze where all these symbols are coming from, and which are the biggest offenders in terms of taking up space? I have found some questions on reducing the non-debug executable size (though they have not been terribly illuminating), but here I am mainly concerned with reducing the debug symbol clutter. The executable is so large that it takes gdb a significant amount of time to load up all the symbols, which is hindering debugging. Perhaps reducing the code bloat is the fundamental task, but I would first like to know where my 4GB is being spent.

Running the executable through 'size --format=SysV' I get the following output:

section                    size       addr
.interp                      28    4194872
.note.ABI-tag                32    4194900
.note.gnu.build-id           36    4194932
.gnu.hash                714296    4194968
.dynsym                 2728248    4909264
.dynstr                13214041    7637512
.gnu.version             227354   20851554
.gnu.version_r              528   21078912
.rela.dyn                 37680   21079440
.rela.plt                 15264   21117120
.init                        26   21132384
.plt                      10192   21132416
.text                  25749232   21142608
.fini                         9   46891840
.rodata                 3089441   46891872
.eh_frame_hdr            584228   49981316
.eh_frame               2574372   50565544
.gcc_except_table       1514577   53139916
.init_array                2152   56753888
.fini_array                   8   56756040
.jcr                          8   56756048
.data.rel.ro             332264   56756064
.dynamic                    992   57088328
.got                        704   57089320
.got.plt                   5112   57090048
.data                     22720   57095168
.bss                    1317872   57117888
.comment                     44          0
.debug_aranges          2978704          0
.debug_info           278337429          0
.debug_abbrev           1557345          0
.debug_line            13416850          0
.debug_str           3620467085          0
.debug_loc            236168202          0
.debug_ranges          37473728          0
Total                4242540803

from which I guess we can see that '.debug_str' takes up ~3.6 GB. I don't 100% know what '.debug_str' is, but I guess it might literally be the string names of the debug symbols? So is this telling me that the de-mangled names of my symbols are just insanely big? How can I figure out which ones and fix them?

I guess I can somehow do something with 'nm', directly inspecting the symbol names, but the output is enormous and I'm not sure how best to search it. Are there any tools to do this kind of analysis?

The compiler used was 'c++ (GCC) 4.9.2'. And I guess I should mention that I am working in a Linux environment.

Jointless answered 25/10, 2016 at 14:35 Comment(4)
4Gb is only 512MB and most probably won't cause any problem. Unless the binary is 4GB ~ 32Gb - Georgetown
Sorry yes I mean 4 GB. i.e. ~4e9 bytes, as in the 'size' output - Jointless
Even the release version of your program is very large. Maybe it's time to think about a redesign, and split the program into multiple (smaller) modules? Either multiple executable files, or one executable and a set of small shared libraries? - Winkle
@Someprogrammerdude: That may actually worsen the situation; you need symbol tables for imported and exported functions. How big are those going to be?! - Bungalow

So I have tracked down the main culprit by doing the following, based mostly on John Zwinck's answer. Essentially I just followed his suggestion to run "strings" on the executable and analyzed the output.

strings my_executable > exec_strings.txt

I then sorted the output mostly following mindriot's method:

cat exec_strings.txt | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- > exec_strings_sorted.txt

and had a look at the longest strings. Indeed it all seemed to be some insane template bloat, from a particular library. I then did a little more counting like:

cat exec_strings.txt | wc -l
2928189
cat exec_strings.txt | grep <culprit_libname> | wc -l
1108426

to see that of the approximately 3 million strings that are extracted, it seems like ~1 million of them were coming from this library. Finally, doing

cat exec_strings.txt | wc -c
3659369876
cat exec_strings.txt | grep <culprit_libname> | wc -c
3601918899

it became apparent that these million strings are all super long and constitute the great bulk of the debug symbol garbage. So at least now I can focus on this one library while trying to remove the root of the problem.
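The counting steps above can be folded into a single pass over the file. The following is only a sketch: the function name string_share is made up for illustration, and it assumes the strings output is plain ASCII, so that awk's character count equals the byte count that wc -c reports.

```shell
# Hypothetical helper (not part of any tool): report how many strings, and
# how many bytes, match a pattern in a file produced by `strings`.
# Usage: string_share PATTERN FILE   (PATTERN is an awk regular expression)
string_share() {
    awk -v pat="$1" '
        # +1 per line for the trailing newline, so totals agree with wc -c
        { total_n++; total_b += length($0) + 1 }
        $0 ~ pat { hit_n++; hit_b += length($0) + 1 }
        END {
            # %.0f rather than %d, since some awk builds clamp %d at 2^31-1
            printf "matching: %.0f of %.0f strings, %.0f of %.0f bytes (%.1f%%)\n",
                   hit_n, total_n, hit_b, total_b,
                   total_b ? 100 * hit_b / total_b : 0
        }' "$2"
}
```

It would be called as string_share 'culprit_libname' exec_strings.txt, with the real library name substituted for the placeholder.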

Jointless answered 25/10, 2016 at 16:36 Comment(0)

One trick I use is to run strings on the executable, which will print all those long (probably due to templates) and numerous (ditto) debug symbol names. You can pipe it to sort | uniq -c | sort -n and look at the results. In many large C++ executables you'll see patterns like this:

my_template<std::basic_string<char, traits, allocator>, std::unordered_map<std::basic_string<char, traits, allocator>, 1L>
my_template<std::basic_string<char, traits, allocator>, std::unordered_map<std::basic_string<char, traits, allocator>, 2L>
my_template<std::basic_string<char, traits, allocator>, std::unordered_map<std::basic_string<char, traits, allocator>, 3L>

You get the idea.

In some cases I've decided to simply reduce the amount of templating. Sometimes it gets out of hand. Other times you may win something by using explicit template instantiation, or compiling specific parts of your project without debugging symbols, or even disabling RTTI if you don't rely on dynamic_cast or typeid.
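Since near-identical instantiations like the ones above rarely repeat exactly, uniq -c alone may not group them. One possible refinement, sketched here under the assumption that everything before the first '<' identifies the template, is to total the bytes per template name. The function name template_families is hypothetical.

```shell
# Hypothetical helper: group lines by the text before the first '<' and
# print "bytes<TAB>count<TAB>name", largest total first.
# Reads the files given as arguments, or standard input if there are none.
template_families() {
    awk -F'<' '
        NF > 1 { n[$1]++; b[$1] += length($0) }   # only lines containing "<"
        END { for (k in n) printf "%.0f\t%d\t%s\n", b[k], n[k], k }
    ' "$@" | sort -k1,1nr
}
```

For example: strings my_executable | template_families | head -n 20. A template name that dominates the byte column is a reasonable first candidate for explicit instantiation or for building without debug info.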

Rosemari answered 25/10, 2016 at 14:48 Comment(0)

> I guess I can somehow do something with 'nm', directly inspecting the symbol names, but the output is enormous and I'm not sure how best to search it. Are there any tools to do this kind of analysis?

You can run the following to order all nm's symbol output by symbol length:

nm --no-demangle -a -P --size-sort myexecutable \
    | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-

(Kudos to the question "Sort a text file by line length including spaces" for everything after the first |.) This will show the longest names last. You can further pipe the output into c++filt -t to get demangled names, which may help you with your search.
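If only the worst offenders are of interest, a variation is to measure just the name column rather than the whole line. This is a sketch: longest_names is a made-up name, and it assumes the nm -P layout of "name type value size" with no spaces inside the name.

```shell
# Hypothetical helper: read `nm -P` output on standard input and print the
# N longest symbol names (default 10) as "length name", longest first.
longest_names() {
    awk '{ print length($1), $1 }' | sort -k1,1nr | head -n "${1:-10}"
}
```

For example: nm --no-demangle -a -P myexecutable | longest_names 20.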

Depending on your situation, it could be useful to split the executable and its debug symbols into separate files, which would allow you to distribute a less bloated executable to your target environments/clients/etc., and keep the debug symbols in a single location if needed. See the question "How to generate gcc debug symbol outside the build target?" for some details.
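As a rough sketch of that split, the usual binutils recipe is three commands: copy the debug sections out, strip them from the executable, and record a link to the debug file so that gdb can find it. The wrapper name split_debug and the ".debug" suffix are choices made for this example only, and it assumes GNU objcopy and strip are installed.

```shell
# Hypothetical wrapper around the standard objcopy/strip sequence.
# Usage: split_debug EXECUTABLE   (modifies EXECUTABLE in place and
# writes the debug sections to EXECUTABLE.debug next to it)
split_debug() {
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "usage: split_debug EXECUTABLE" >&2
        return 2
    fi
    objcopy --only-keep-debug "$1" "$1.debug" &&    # save the debug sections
    strip --strip-debug "$1" &&                     # drop them from the binary
    objcopy --add-gnu-debuglink="$1.debug" "$1"     # point gdb at the .debug file
}
```

Note that this only shrinks the file you ship; gdb still has to read the same debug information when it loads the separate file.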

Encouragement answered 25/10, 2016 at 14:53 Comment(0)
