Non-alphanumeric characters in COM/.NET interface names

I'm thinking of using the characters #@! in some COM interfaces our system generates. The COM type library is also exported to .NET. Will those characters cause me trouble later on?

I've tested it out for most of the day today, and it all seems fine. Our system continues to work just like it always did.

The reason I'm cautious is that those characters are illegal in MIDL, which uses C syntax for type names. But we don't use MIDL - we build our type libraries with ICreateTypeInfo and ICreateTypeLib. Looks like that's just a MIDL restriction, and COM and .NET are happy with the non-alphanumeric characters. But maybe there's something I don't know...

Kinetic answered 12/11, 2010 at 5:30 Comment(1)
Good question because it demonstrates thoroughness. I like questions like this because they mean at least some programmers try to anticipate problems ahead of time. – Vesicle

This is what I've found.

I think there's no question that the names are legal at the binary level in COM, since a COM interface’s name is its IID and the text name is just documentation.

On the .NET side, the relevant specification is the Common Language Infrastructure specification (ECMA-335, http://www.ecma-international.org/publications/standards/Ecma-335.htm). I wonder whether .NET or Mono add their own restrictions on top – to do so would reduce interoperability, but this is the real world.

Section 8.5.1 covers valid type names in the Common Type System, and simply says that names are compared using code points. Odd that it says nothing about the composition of a name, only how names are compared. This section is paraphrased by MSDN at http://msdn.microsoft.com/en-us/library/exy17tbw%28v=VS.85%29.aspx, which says that the only two restrictions are (1) type names are "encoded as strings of Unicode (16-bit) characters", and (2) they can't contain an embedded 0x0000.

I've quoted the bit about 16-bit Unicode, rather than paraphrase it, because it uses imprecise language. Presumably the author of that page meant UTF-16. In any case, ECMA-335 specifies byte-by-byte comparison, and makes no mention of Unicode (regarding type names), and neither does it prohibit embedded zeros. Perhaps .NET has deviated from the CTS here, although I doubt it. More likely, the author of this MSDN page was thinking about programming languages when he wrote it.
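To make "compared using code points" concrete, here is a small Python sketch of what such an ordinal comparison implies. The helper name `same_type_name` is hypothetical, and this only models the comparison rule as I read it; it is not how any runtime actually implements it.

```python
import unicodedata


def same_type_name(a: str, b: str) -> bool:
    """Ordinal comparison: equal only if the code point sequences match exactly."""
    return [ord(ch) for ch in a] == [ord(ch) for ch in b]


# Two renderings of the same visible text: precomposed vs. decomposed 'e' + accent.
composed = "Caf\u00e9"
decomposed = unicodedata.normalize("NFD", composed)

print(same_type_name("#Whatever", "#Whatever"))  # identical sequences
print(same_type_name("#Whatever", "#whatever"))  # differs only by case
print(same_type_name(composed, decomposed))      # differs only by normalization
```

Under a rule like this there is no case folding, no locale and no normalization, so nothing in the comparison itself singles out characters such as #, @ or !.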

The Common Language Specification (also defined in ECMA-335) defines the rules for identifiers in source code. Identifiers aren't directly relevant to my question because my internal type names never appear in source code, but I looked into it anyway. The CLS is a subset of the CTS, and as such its restrictions aren’t necessarily part of the broader CTS. CLS Rule 4 says that identifiers must follow the rules of Annex 7 of Technical Report 15 of the Unicode Standard 3.0 - see http://www.unicode.org/reports/tr15/tr15-18.html. That document too is a little vague, in that it refers to "other letter" and "connector punctuations" but doesn't define them. This helped: http://notes.jschutz.net/topics/unicode/.
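For anyone who wants to see why #, @ and ! fall foul of Rule 4, here is a rough Python sketch (Python only because its standard library ships Unicode category data). It assumes the identifier rule from that annex is the usual one: the first character must be in category Lu, Ll, Lt, Lm, Lo or Nl, and later characters may additionally be Mn, Mc, Nd, Pc or Cf. The function name `violates_rule4` is made up for illustration, the check ignores the normalization requirement, and Python uses a newer Unicode database than 3.0, so treat it as an approximation and not a conformance test.

```python
import unicodedata

# Assumed character classes for identifiers (approximation of the annex's rule).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc", "Cf"}


def violates_rule4(name: str) -> bool:
    """Return True if the name breaks the assumed identifier character rules."""
    if not name:
        return True
    if unicodedata.category(name[0]) not in START:
        return True
    return any(unicodedata.category(ch) not in CONTINUE for ch in name[1:])


for candidate in ["IWhatever", "#Whatever", "@Whatever", "!Whatever"]:
    first = candidate[0]
    print(candidate, unicodedata.category(first), violates_rule4(candidate))
```

All three of #, @ and ! are in category Po ("other punctuation"), which appears in neither set, so any name containing them fails this check regardless of position.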

Section 8.5.1 of the ECMA spec includes a non-normative note that a CLS consumer (such as C# or the Visual Studio type browser, I suppose) “need not consume types that violate CLS Rule 4.” My proposed interface names do violate this Rule 4. This note seems to imply that a valid type may have a name that violates rule 4, and that a CLS consumer should either accept the rogue name or safely ignore it. (The Visual Studio type browser displays it without complaint.)

So my proposed type names are generally illegal in source code. But note that section 10.1 (about identifiers in the CLS) says “Since its rules apply only to items exported to other languages, private members or types that aren’t exported from an assembly can use any names they choose.”

I conclude that it's safe to use the characters #@! in my type names as long as they remain in the binary domain and never need to appear in source code or outside the assembly. And in fact they're never used outside the COM server.

A word about future-proofing... The CTS pretty much has nothing to say about the composition of type names, despite having a section called “Valid names” (section 8.5.1). They might change that in the future, but this broad and liberal specification has invited us all to do what we like. If the CTS designers had wanted to leave room for change then surely they would have built in some provision for that, or at least been less generous.

Kinetic answered 29/11, 2010 at 3:4 Comment(6)
Nice write-up. You have a typo, I think you mean ECMA-335. My research followed this same thread. While I might weigh the pros and cons differently, you make a good case. My remaining arguments against are Off Topic, so instead, let me share this supporting reference. ECMA-335 Partition II Section 5.3 on Metadata Identifier Syntax says, "ILAsm syntax allows the use of any identifier that can be formed using the Unicode character set. To achieve this, an identifier shall be placed within single quotation marks." It seems that, unlike MIDL, ILAsm allows and supports your unusual names. – Amputee
Please post back here if you do run into any issues. Thanks. – Amputee
Thanks, Jim :-) I'd be interested to hear your arguments against... they might cover something I haven't thought of. And yes, I'll post back here if there are any developments. (And I'll fix up that typo, too.) – Kinetic
I feel a little awkward marking my own answer as the accepted answer. But it really is the right answer, and nobody's disputed it (would still be interested in Jim's off-topic arguments, though...) – Kinetic
By the way, a number of .Net obfuscators will rename all types and members to single unprintable Unicode characters. – Administration
I hadn't thought about that. Thanks for pointing it out. (Actually, I'm not sure whether you intended it as endorsement of strange characters or as a warning.) – Kinetic

It's interesting that you seem to have found a loophole in COM type naming. Microsoft restricts the use of characters '#@!' as identifiers in MIDL, but they don't duplicate that restriction in the ICreateTypeInfo and ICreateTypeLib interfaces.

Using these characters works today, so what's the risk?

  1. Well, Microsoft could see this as a bug and 'fix' ICreateTypeInfo, ICreateTypeLib, .Net COM Interop, and/or .Net type naming restrictions in the next release.

  2. You're creating and using an interface that doesn't have any valid MIDL definition.

  3. You're using names that will probably have to change if (when) you transition from COM to .Net. Even if you just want to create an adapter type in .Net, you will not be able to reuse any of the "invalid" names.

  4. Is this compatible with Mono and other non-Microsoft .Net compatible technologies?

  5. There are plenty of known valid names that could be used (use something like '_at_' instead of '@', etc.) to avoid any possible future issue.

If none of this matters to you, then you'll probably be fine. But I suspect by the very fact that you asked this question, at some level it doesn't 'feel' right to you.

Good luck.

Amputee answered 24/11, 2010 at 3:16 Comment(9)
Here's where my thinking is up to: In COM the text name is not a part of the type definition - it's just documentation. So at the COM binary level there's no problem. But I'm not familiar enough with the low level details of .NET to make the same determination for .NET. (At least, not yet.) – Kinetic
(Oops, hit return too soon.) The product is already very much in .NET so the .NET angle is a live issue today. However, the interface names never appear in source code - these interfaces are only used internally and are loaded by ITypeLib etc. (Another person's comments on that topic mysteriously disappeared from SO the other day... maybe because I didn't vote his answer up or something like that.) – Kinetic
Microsoft ain't going to change the behaviour of ICreateTypeInfo etc. at this late stage of the game. Not so sure about .NET, and certainly not sure at all about Mono. – Kinetic
Names like you suggest won't work. The problem is this: Our product allows the developer to define types in an object model. Internally the product implements that object model using several COM interfaces. If he defines a type called Whatever then we define IWhatever, IDispWhatever, IWhateverEvents. Up until now those internal interfaces share the same namespace as the developer's object model, so this precludes him from defining his own types with those names, and we're not too happy with that. So I'm thinking of #Whatever, @Whatever and !Whatever instead. – Kinetic
"In COM the text name is not a part of the type definition - it's just documentation." You say this because in COM it's the GUID that's important; the ProgIDs and other text names are just text. You have a point. – Amputee
On the one hand these strangely named types will be used by customers (which is a minus), but customers are not going to use the classes directly, and never use the type names (a plus for your approach). I spent some time looking and was surprised I could not find an authoritative lexical definition of an <identifier>. – Amputee
Personally I would probably use a prefix and create names like __jimh_IWhatever (but with the company or product name instead of my name), but that's just me. I'll leave you with one final suggestion: pick just one of "#@!" and use it to create names like @IWhatever, @IDispWhatever, @IWhateverEvents. That way you're picking just one unusual character, which gives you a little less exposure to Microsoft changing the rules (or enforcement) under you. – Amputee
I've examined many schemes using legal prefixes and none eliminated the possibility of conflicts, nor even reduced it to non-trivial cases. For example, what if the developer defines a type called __jimh_IWhatever? Your suggestion of mixing @ with the fully-spelled-out interface names was initially attractive but also has the same problem: what if the developer defines two types Whatever and WhateverEvents? (Surely a reasonable thing for him to do.) We'd end up with two internal interfaces called @IWhateverEvents. – Kinetic
Thanks for your replies so far. I think maybe what's left in this question now is whether there exists some disconnect between the source-code text name and the runtime type identity in .NET, similar to that in COM. – Kinetic
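
The collision described in the comments above (two developer types, Whatever and WhateverEvents, both producing @IWhateverEvents) can be checked mechanically. This is only a sketch: the function names are hypothetical, and the mapping of #, @ and ! to the three internal interfaces is an assumption, since the thread does not say which character goes with which interface.

```python
from collections import Counter


def prefix_scheme(type_name: str) -> list[str]:
    """One unusual character plus the spelled-out interface names."""
    return [f"@I{type_name}", f"@IDisp{type_name}", f"@I{type_name}Events"]


def sigil_scheme(type_name: str) -> list[str]:
    """A distinct leading character per internal interface (assumed mapping)."""
    return [f"#{type_name}", f"@{type_name}", f"!{type_name}"]


def collisions(scheme, type_names: list[str]) -> list[str]:
    """Return the generated interface names that occur more than once."""
    counts = Counter(name for t in type_names for name in scheme(t))
    return sorted(name for name, count in counts.items() if count > 1)


developer_types = ["Whatever", "WhateverEvents"]
print(collisions(prefix_scheme, developer_types))
print(collisions(sigil_scheme, developer_types))
```

With one leading character per interface kind, each generated name maps back to exactly one developer type and one interface kind, so distinct developer types cannot produce the same internal name. That holds only as long as developer-defined names themselves cannot start with #, @ or !.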
