Is it possible to strip type names from executable while keeping RTTI enabled?

I recently disabled RTTI on my compiler (MSVC10) and the executable size decreased significantly. By comparing the produced executables using a text editor, I found that the RTTI-less version contains far fewer symbol names, which explains the saved space.

AFAIK, those symbol names are only used to fill the type_info structure associated with each polymorphic type, and one can programmatically access them by calling type_info::name().

According to the standard, the format of the string returned by type_info::name() is unspecified. That is, no one can rely on it to do serious things portably. So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).
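To make that last point concrete, here is a minimal sketch (assuming a C++11 compiler) of RTTI being used without ever calling type_info::name(). The Shape, Circle and Square classes and both helper functions are hypothetical names invented for the illustration.

```cpp
#include <typeindex>
#include <typeinfo>

// Hypothetical polymorphic hierarchy, only here to have something to inspect.
struct Shape { virtual ~Shape() = default; };
struct Circle : Shape {};
struct Square : Shape {};

// True when both objects have the same dynamic type; name() is never called.
inline bool same_dynamic_type(const Shape& a, const Shape& b) {
    return typeid(a) == typeid(b);
}

// std::type_index wraps a type_info so it can be stored or used as a map key.
inline std::type_index dynamic_type_key(const Shape& s) {
    return std::type_index(typeid(s));
}
```

Whether the implementation consults the name strings internally to answer these comparisons is exactly what the answers below discuss; nothing in this code depends on it.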

But... is it possible? I'm using MSVC10 and I haven't found any option to do that. I can either disable RTTI completely (/GR-), or enable it with full type names (/GR). Does any compiler provide such an option?

Draught answered 9/7, 2012 at 21:5 Comment(22)
I once experimented with it, and at least under gcc on linux, comparison between typeinfo objects is purely name-based. If you declare different classes and give them the same name (in different compilation units, of course), and then somehow get their typeinfos together, the comparison returns true, so the name must be important. There doesn't seem to be any other sensible way to provide a functional type-info object.Trichome
What, you are not exactly proud of some of the names you gave to your types?Monophyletic
@Trichome Was it possible to compile what you describe ? How does the linker handle duplicate symbols when two classes have the same name ?Vasili
@AndreyT Of course I am. But I don't want my users to discover how crazy they sometimes are :)Vasili
Classes themselves don't produce any symbols. I haven't written any constructors or destructors for them, and all methods had different names. I was a bit afraid that the typeinfo data itself might also create a colliding symbol, but it seems those are kept in every translation unit for its own use, so there was no problem.Trichome
@Trichome I think it could be implemented easily. Just remap each string to an unique number, and that's it.Vasili
@FrédéricTerrazzoni And what if user loads another shared lib with some other classes? Who should be responsible for assigning their numbers so those don't collide with existing ones?Trichome
@Trichome I think it could be done at link time (the .lib should maybe contain some additional information?), but I'm not sure if it is possible. That's a good remark x)Vasili
Are you asking whether it's possible to develop a new implementation (compiler, linker, etc.) that uses empty names for all types, works, and conforms with the standard? Or whether it's possible for you to strip out all the names that MSVC10 puts in and expect everything to keep working? Or whether it must be possible if MSVC conforms? The answers are yes, probably no, and no.Phonology
@Phonology At the beginning I was hoping to do that with MSVC. Unfortunately, I now understand that this is not implemented... But if it is implementable at all, why isn't this feature already available everywhere ? It'd be much better than the double-edged "disable RTTI" switch (tradeoff standard compatibility VS executable size). So now I'm wondering if it is implementable at all. The j_kubik remark about shared libs makes me think it's not as simple as it seems to be :/ I hope to get a complete explanation here :)Vasili
Here's a really easy solution: Use a UUID instead of a name in the ABI. For dllexport, you create a UUID, store that UUIDs in the actual RTTI structures, and stash the name:UUID mapping in the .LIB import library. For dllimport, you look up the UUID in the .LIB. At runtime, the names don't exist anywhere, just the UUIDs. (Of course all of your class names—or at least the ones that have exported methods—are still going to be visible as part of the mangled names of those methods' symbols…)Phonology
@Phonology UUIDs sound ok, but think about this scenario: a standard template class used by many libraries compiled by different vendors, now used in my application. Who should assign its UUID? What about other compilers/platforms? Unix/Linux doesn't use .LIB files or anything similar to store additional type-info.Trichome
@j_kubik: For your first question, class templates don't reside in DLLs. Their template class instantiations can, but in that case, the DLL that instantiates it assigns the UUID. (So, what if two DLLs expose the same instantiation of the same template? If it matters which one your app gets, it's broken with any ABI. If it doesn't matter, then it doesn't matter which UUID your app sees either.)Phonology
@j_kubik: For your second question, the MS ABI also doesn't use .LIB files to store additional typeinfo, but the OP is just asking whether it's implementable at all, and by modifying the MS ABI to store additional typeinfo in those .LIB files, it's implementable. Likewise, you could extend the HP-Intel/ELF/linux ABI to add .LIB files, or you could replace the .so->.so.6 symlinks with actual .so files that serve a similar purpose, or whatever, and then it's implementable.Phonology
@Phonology First issue: It shouldn't matter which implementation is being used, but not what UUID it has. Imagine DLL creating class instance and passing it to you in some variant container along with type_info reference, so you know what you got. But you are using different template instantiation (even from the same header as DLL, why not), your app will not recognize type_info (different UUID) and show "Unknown UUID {123...}" in its log (at very best). Or do you imply that std::string that your DLL is using is different type than std::string used by my app linking your DLL?Trichome
@Phonology Plus many UUIDs would be even longer than typenames so it would make the problem only worse.Trichome
j_kubik: The OP asks whether it would be possible for any implementation to work without names. I've suggested one, as an existence proof. Yes, it's pointless, but so is the whole original idea, so that's not a fault of the implementation. As for passing C++ objects across DLL boundaries, that already doesn't work in many cases (e.g., in some versions of VC, each module's std::map<int> has a different static sentinel, so you can't iterate a map from a DLL in your EXE), and there's nothing in the C++ standard that even implies it should. That being said, there are ways to make it work.Phonology
@Phonology "there's nothing in the C++ standard that even implies it should" - because the standard doesn't use the concept of a shared library (or any other). But there is also the implementation's specification, and if the implementation is designed to support such libs (for OSes that use them at least), I would expect that it would also fill in the gaps to complete the standard.Trichome
@Phonology "in some versions of VC, each module's std::map<int> has a different static sentinel, so you can't iterate a map from a DLL in your EXE". Perhaps. I don't have that much experience with VC - tell me: is VC evaluating comparisons between their type_info objects to true or false then? Are those types considered to be the same? Could you check for me? (I cannot, I am running linux)Trichome
@j_kubik: SO comments are not sufficient for this argument. But here's an even simpler existence proof. The Itanium ABI (that your linux box probably uses) explicitly does not use names: "It is intended that two type_info pointers point to equivalent type descriptions if and only if the pointers are equal. An implementation must satisfy this constraint, e.g. by using symbol preemption, COMDAT sections, or other mechanisms."Phonology
@Phonology My implementation seems to use symbol preemption to achieve that. Now, it seems possible to remove the class-name strings from the binary file, thus saving some space. However, class names are contained in symbol names, so they are still present (app users can see them). So if you disable it, you will save some space because names are in fact doubled: symbol names, and strings in the executable.Trichome
Unless you are worried about executable-file size (embedded app, slow internet connection) you will not care about file size, because the sections with that data, as long as you don't actually use it, will never be swapped into memory, so you are losing only a bit of disk space and a bit of available address space. If comparing type_info pointer values is equivalent to type_info::operator==(), then this memory will never be touched, as there is no more info than its name that can be squeezed out of it (I assume before() could do pointer comparison as well, instead of string comparison.)Trichome
P
3

You're asking three different questions here.

  1. The initial question asks whether there's any way to get MSVC to not generate names, or whether it's possible with other compilers, or, failing that, whether there's any way to strip the names out of the generated type_info without breaking things.

  2. Then you want to know whether it would be possible to modify the MS ABI (presumably not too radically) so that it would be possible to strip the names.

  3. Finally, you want to know whether it would be possible to design an ABI that didn't have names.

Question #1 is itself a complex question. As far as I know, there's no way to get MSVC to not generate names. And most other compilers are aimed at ABIs that specifically define what typeid(foo).name() must return, so they also can't be made to not generate names.

The more interesting question is, what happens if you strip out the names. For MSVC, I don't know the answer. The best thing to do here is probably to try it—go into your DLLs and change the first character of each name to \0 and see if it breaks dynamic_cast, etc. (I know that you can do this with Mac and linux x86_64 executables generated by g++ 4.2 and it works, but let's put that aside for now.)

On to question #2, assuming blanking the names doesn't work, it wouldn't be that hard to modify a name-based system to no longer require names. One trivial solution is to use hashes of the names, or even ROT13-encoded names (remember that the original goal here is "I don't want casual users to see the embarrassing names of my classes"). But I'm not sure that would count for what you're looking for. A slightly more complex solution is as follows:
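As a rough sketch of the "hashes of the names" idea, the type descriptor could store a fixed-size hash instead of the string. The code below uses 64-bit FNV-1a purely as an example hash; HashedTypeId and make_type_id are hypothetical names, and nothing here reflects how MSVC actually lays out its RTTI data.

```cpp
#include <cstdint>

// 64-bit FNV-1a over a NUL-terminated string, usable at compile time (C++11).
constexpr std::uint64_t fnv1a(const char* s,
                              std::uint64_t h = 0xcbf29ce484222325ULL) {
    return *s == '\0'
        ? h
        : fnv1a(s + 1, (h ^ static_cast<unsigned char>(*s)) * 0x100000001b3ULL);
}

// Hypothetical descriptor: only the hash would end up in the binary.
struct HashedTypeId {
    std::uint64_t hash;
};

constexpr bool operator==(HashedTypeId a, HashedTypeId b) {
    return a.hash == b.hash;
}

// The name would be known to the compiler, but need not be emitted as a string.
constexpr HashedTypeId make_type_id(const char* type_name) {
    return HashedTypeId{fnv1a(type_name)};
}
```

Two different names can in principle hash to the same value, so a real scheme would need a wider hash or a collision check at link time. The sketch also ignores how an ID would stay consistent across separately built modules.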

  • For "dllexport"ed classes, generate a UUID, put that in the typeinfo, and also put it in the .LIB import library that gets generated along with the DLL.
  • For "dllimport"ed classes, read the UUID out of the .LIB and use that instead.

So, if you manage to get the dllexport/dllimport right, it will work, because your exe will be using the same UUID as the dll. But what if you don't? What if you "accidentally" specify identical classes (e.g., an instantiation of the same template with the same parameters) in your DLL and your EXE, without marking one as dllexport and one as dllimport? RTTI won't see them as the same type.
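A small sketch of what that failure mode would look like, assuming a hypothetical UuidTypeInfo descriptor that holds nothing but a 16-byte ID (this is not an existing ABI):

```cpp
#include <array>
#include <cstdint>

// Hypothetical 16-byte identifier stored in place of the type name.
using Uuid = std::array<std::uint8_t, 16>;

// Hypothetical descriptor for the scheme described above.
struct UuidTypeInfo {
    Uuid id;
};

// Types are "the same" only when their IDs match byte for byte.
inline bool same_type(const UuidTypeInfo& a, const UuidTypeInfo& b) {
    return a.id == b.id;
}
```

With correct dllexport/dllimport, the EXE's descriptor would be built from the ID read out of the import library and would compare equal to the DLL's. A class that is accidentally defined in both modules would get two independently generated IDs, and same_type would report two distinct types.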

Is this a problem? Well, the C++ standard doesn't say it is. And neither does any MS documentation. In fact, the documentation explicitly says that you're not allowed to do this. You cannot use the same class or function in two different modules unless you explicitly export it from one module and import it into another. The fact that this is very hard to do with class templates is a problem, and it's a problem they don't try to solve.

Let's take a realistic example: Create a node-based linkedlist class template with a global static sentinel, where every list's last node points to that sentinel, and the end() function just returns a pointer to it. (Microsoft's own implementation of std::map used to do exactly this; I'm not sure if that's still true.) New up a linkedlist<int> in your exe, and pass it by reference to a function in your dll that tries to iterate from l.begin() to l.end(). It will never finish, because none of the nodes created by the exe will point to the copy of the sentinel in the dll. Of course if you pass l.begin() and l.end() into the DLL, instead of passing l itself, you won't have this problem. You can usually get away with passing a std::string or various other types by reference, just because they don't depend on anything that breaks. But you're not actually allowed to do so, you're just getting lucky. So, while replacing the names with UUIDs that have to be looked up at link time means types can't be matched up at link-loader time, the fact that types already can't be matched up at link-loader time means this is irrelevant.
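The sentinel problem can be imitated inside a single program. In the sketch below, the ModuleTag template parameter is a hypothetical stand-in for "which binary instantiated the template": each tag gets its own static sentinel, just as each module would get its own copy. All names are invented for the illustration, and the walk stops at a null link so that the mismatch can be observed instead of looping forever.

```cpp
// Minimal singly linked node.
struct Node {
    int value;
    Node* next;
};

// Hypothetical tags standing in for the EXE and the DLL.
struct ExeTag {};
struct DllTag {};

template <class ModuleTag>
struct LinkedList {
    static Node sentinel;      // one copy per "module"
    Node* head = &sentinel;    // an empty list points straight at the sentinel

    void push_front(Node& n) { n.next = head; head = &n; }
    Node* begin() const { return head; }
    static Node* end() { return &sentinel; }
};

template <class ModuleTag>
Node LinkedList<ModuleTag>::sentinel{0, nullptr};

// Walks the chain from first and reports whether last is ever reached.
inline bool reaches(const Node* first, const Node* last) {
    for (; first != nullptr; first = first->next) {
        if (first == last) return true;
    }
    return false;
}
```

A list built as LinkedList<ExeTag> reaches LinkedList<ExeTag>::end() but never LinkedList<DllTag>::end(), which is the situation the DLL function is in when it computes end() from its own copy of the sentinel.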

It would be possible to build a name-based system that didn't have these problems. The ARM C++ ABI (and the iOS and Android ABIs based on it) restricts what programmers can get away with much less than MS, and has very specific requirements on how the link-loader has to make it work (3.2.5). This one couldn't be modified to not be name-based because it was an explicit choice in the design that:

• type_info::operator== and type_info::operator!= compare the strings returned by type_info::name(), not just the pointers to the RTTI objects and their names.

• No reliance is placed on the address returned by type_info::name(). (That is, t1.name() != t2.name() does not imply that t1 != t2).

The first condition effectively requires that these operators (and type_info::before()) must be called out of line, and that the execution environment must provide appropriate implementations of them.
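In code, a name-based comparison along those lines might look roughly like this. NamedTypeInfo is a hypothetical stand-in for the ABI's descriptor, not the real std::type_info layout.

```cpp
#include <cstring>

// Hypothetical descriptor: the name string is the type's identity.
struct NamedTypeInfo {
    const char* name;
};

// Equal when the name contents match, even if the descriptors (and the
// strings themselves) live at different addresses in different modules.
inline bool equal_by_name(const NamedTypeInfo& a, const NamedTypeInfo& b) {
    return std::strcmp(a.name, b.name) == 0;
}

// An ordering in the spirit of type_info::before(), also derived from names.
inline bool before_by_name(const NamedTypeInfo& a, const NamedTypeInfo& b) {
    return std::strcmp(a.name, b.name) < 0;
}
```

Under such a scheme, blanking the names would make every type compare equal to every other, which is why this design cannot simply drop them.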

But it's also possible to build an ABI that doesn't have this problem and that doesn't use names. Which segues nicely to #3.

The Itanium ABI (used by, among other things, both OS X and recent linux on x86_64 and i386) does guarantee that a linkedlist<int> generated in one object and a linkedlist<int> generated from the same header in another object can be linked together at runtime and will be the same type, which means they must have equal type_info objects. From 2.9.1:

It is intended that two type_info pointers point to equivalent type descriptions if and only if the pointers are equal. An implementation must satisfy this constraint, e.g. by using symbol preemption, COMDAT sections, or other mechanisms.

The compiler, linker, and link-loader must work together to make sure that a linkedlist<int> created in your executable points to the exact same type_info object that a linkedlist<int> created in your shared object would.

So, if you just took out all the names, it wouldn't make any difference at all. (And this is pretty easily tested and verified.)
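For contrast with the name-based approach, here is a sketch of address-based identity where every name is empty. AddressTypeInfo and descriptor_for are hypothetical; the function-local static inside a function template is used because the toolchain has to merge such objects into a single instance per program, which is the same kind of guarantee the quoted ABI text asks for.

```cpp
// Hypothetical descriptor whose name carries no information at all.
struct AddressTypeInfo {
    const char* name;
};

// Identity is the descriptor's address, never its contents.
inline bool equal_by_address(const AddressTypeInfo& a, const AddressTypeInfo& b) {
    return &a == &b;
}

// One descriptor object per type T, shared by every caller.
template <class T>
const AddressTypeInfo& descriptor_for() {
    static const AddressTypeInfo descriptor{""};
    return descriptor;
}
```

Here descriptor_for<int>() and descriptor_for<long>() hold identical (empty) names yet remain distinguishable, because they are different objects.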

But how could you possibly implement this ABI spec? j_kubik effectively argues that it's impossible because you'd have to preserve some link-time information in the .so files. Which points to the obvious answer: preserve some link-time information in the .so files. In fact, you already have to do that to handle, e.g., load-time relocations; this just extends what you need to preserve. And in fact, both Apple and GNU/linux/g++/ELF do exactly that. (This is part of the reason everyone building complex linux systems had to learn about symbol visibility and vague linkage a few years ago.)

There's an even more obvious way to solve the problem: Write a C++-based link loader, instead of trying to make the C++ compiler and linker work together to trick a C-based link loader. But as far as I know, nobody's tried that since Be.

Phonology answered 10/7, 2012 at 23:47 Comment(2)
The original goal is more to save space than avoiding the users to see class names, although it could be a desirable side effect. Anyway, this is a very complete answer and I got the point even if I didn't fully understand everything. I didn't know the "Itanium ABI" (totally unrelated to Itanium CPU, right?), so I'll do some research to understand how it works. ThanksVasili
Footnote on the Itanium C++ ABI: As the name implies, this was originally designed for the Itanium CPU family, by a coalition originally put together by Intel and HP. But with a few trivial changes, it worked for linux x86_64, and Darwin i386, and so on, so it spread far beyond the Itanium CPU. (Its second most common name is "G++ ABI", which was a little more fitting until Apple's move from g++/libstdc++ to clang/libc++.)Phonology
T
4

So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).

You are misreading the standard. The intent of making the return value from type_info::name() unspecified (other than a null-terminated binary string) was to give the implementers of the compiler/library/run-time environment free rein to implement the RTTI requirements as they see best. You, the programmer, have no say in how the Application Binary Interface (if there is one) is designed or implemented.

Teleview answered 9/7, 2012 at 21:31 Comment(8)
Independently of what the intent is, though, the language in the spec does permit every single name to be the empty string if this is what's desired. It should be possible, in theory, to implement RTTI without storing any identifiable names anywhere.Standridge
And even if the standard would say something about strings being unique, they could still be super-compact, non-human-readable byte sequences. Plus, compilers are free to do whatever they want to, if the user tells them to -- if I say "I don't need those strings", a compiler could very well never generate these and give me an error when I use type_info::name()Roadster
It's not necessary to generate an error, since it'd break standard compatibility. Returning a static "" is both standard compliant & safe. What I would like to have is just a compiler switch to toggle this behavior.Vasili
Yes, it would be standard-compliant for an implementation to use "" for all names, and use some other means of distinguishing types. But if your implementation (or the ABI it's written to) depends on the names to distinguish types, and you strip the names out, the code will break. And that's not a bug with the implementation. In exactly the same way, the standard doesn't mandate vtables, but if you strip out all the vtables, the code will not work.Phonology
@Phonology Yes but are the full names necessary to distinguish different types ? Isn't possible to assign them an unique ID to make efficient comparisons between them for instance ? I know that the Objective-C compiler does something similar for each selector, so I imagine the same could be done for C++ (modifying the implementation of course)Vasili
@FrédéricTerrazzoni: No, the ObjC compiler does nothing similar. In effect, @selector(foo:withBar:) is (void*)"foo:withBar:" (except that it guarantees that every instance of that selector is a pointer to the same instance of the string, which isn't normally guaranteed by C). That's why the C link-loader doesn't have to understand C++, and that's why you can do NSSelectorFromString at runtime. Also, ObjC doesn't guarantee uniqueness: -[Baz foo:withBar:] and -[Qux foo:withBar:] have the same selector, and are therefore the same message. That wouldn't work with C++.Phonology
Hum ok. Last thing : since each class having at least one virtual method has a unique associated vtable, could the vtable pointer be used for that purpose ? If yes, would it be possible for a programmer to create their own RTTI without requiring names to be stored ? (it would be non-portable of course)Vasili
@FrédéricTerrazzoni: If you know that your compiler uses a single vtbl per class, and you're willing to write undefined (but working) code that gets at the vptr from any class instance (which may be harder than you think once you take MI and VI into account), then yes, you could do type equality by comparing vptrs. But you need more than equality to make dynamic_cast work, and if you turn off compiler RTTI, where do you get all the other info from?Phonology
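Rather than reading the vptr (which is undefined behavior), a hand-rolled scheme can get a similar effect with defined behavior by giving each class a static tag object and a virtual function that returns it. This is a different technique from the vptr comparison discussed above, and every name in the sketch is hypothetical. The parent pointer is the "other info" that a checked downcast needs; it only covers single inheritance.

```cpp
// Hypothetical per-class tag: identity is its address, parent links to the base.
struct TypeTag {
    const TypeTag* parent;
};

struct Base {
    virtual ~Base() = default;
    static const TypeTag* static_tag() {
        static const TypeTag tag{nullptr};
        return &tag;
    }
    virtual const TypeTag* dynamic_tag() const { return static_tag(); }
};

struct Derived : Base {
    static const TypeTag* static_tag() {
        static const TypeTag tag{Base::static_tag()};
        return &tag;
    }
    const TypeTag* dynamic_tag() const override { return static_tag(); }
};

struct Other : Base {
    static const TypeTag* static_tag() {
        static const TypeTag tag{Base::static_tag()};
        return &tag;
    }
    const TypeTag* dynamic_tag() const override { return static_tag(); }
};

// Walks the parent chain to see whether tag is target or derives from it.
inline bool is_a(const TypeTag* tag, const TypeTag* target) {
    for (; tag != nullptr; tag = tag->parent) {
        if (tag == target) return true;
    }
    return false;
}

// A dynamic_cast-like helper that never touches compiler RTTI or names.
template <class To>
To* checked_cast(Base* object) {
    if (object != nullptr && is_a(object->dynamic_tag(), To::static_tag())) {
        return static_cast<To*>(object);
    }
    return nullptr;
}
```

This should build with compiler RTTI turned off, at the cost of boilerplate in every class, and it inherits the same cross-module questions discussed in the other answers.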
T
0

Requirements for type-descriptor:

  • Works correctly in multi compilation-unit and shared-library environment;
  • Works correctly for different versions of shared libraries;
  • Works correctly even though different compilation units don't share any information about a type except its name: usually one header is used by all compilation units to define the same type, but that's not required, and even when it is done, it doesn't affect the resulting object file.
  • Works correctly despite the fact that template instantiations must be fully defined (including type_info data) in every library that uses them, and yet behave like one type if several such libs are used together.

The fourth rule essentially bans all non-name based type-descriptors like UUIDs (unless specifically mentioned in type definition, but that is just name-replacement at best, and probably requires standard-alterations).

Storing those UUIDs in separate files, like the suggested .LIB files, also causes trouble: different library versions implementing new types would cause trouble.

Compilation units should be able to share the same type (and its type_info) without the need to involve linker - because it should stay free of any language-specifics.

So the type name is the only possible unique type descriptor without completely remodeling compilation and linking (including dynamic linking). I could imagine it working, but not under the current scheme.

Trichome answered 10/7, 2012 at 2:3 Comment(4)
Pointing out the problems with shared libs explains well why it's not possible in the current implementations. Thank you.Vasili
You're adding a requirement that they not only work in shared library environments, but work better than the name-based implementation in MSVC do, which seems a bit extreme. It's certainly not motivated by the standard, or by the MS ABI. Also, "not under current scheme" isn't relevant here. You're arguing that an ABI that uses names requires names, and one that didn't use names would be a different ABI. Well, yeah, sure, but so what? That doesn't mean that no such different ABI could exist.Phonology
Looking over different ABIs that have partly or completely solved the C++-and-shared-libraries problems, not a single one of them meets these requirements. That includes the Itanium ABI, the MS ABI, the Be ABI, and the old loosely-specified g++/linux ABI.Phonology
And again, the fact that it works on your own system without relying on names means that your inability to imagine it working isn't that compelling.Phonology
