String interning. How does the compiler know?
I know what string interning is, and why the following code behaves the way it does:

var hello = "Hello";
var he_llo = "He" + "llo";
var b = ReferenceEquals(hello, he_llo); //true

Or

var hello = "Hello";
var h_e_l_l_o = new string(new char[] { 'H', 'e', 'l', 'l', 'o' });
var b = ReferenceEquals(hello, h_e_l_l_o); //false

...or I thought I did, because a subtle bug has cropped up in some code I'm working on due to this:

var s = "";
var sss = new string(new char[] { });
var b = ReferenceEquals(s, sss); //True!?

How does the compiler know that sss will in fact be an empty string?

Aldershot answered 27/1, 2017 at 13:58 Comment(11)
Because the string constructor for char[] has exceptional logic for this in the CLR internally, and will simply point to the one, true, empty string if you pass an empty array rather than actually construct a new object. There is a question on SO (with a bad title) that explains it. To be clear, this is a runtime issue -- the surprise is not that the compiler is clairvoyant but that new doesn't always new. - Addendum
An interesting follow-up question would be: is there any way whatsoever to create an empty string s at runtime (such that s.Length == 0) for which Object.ReferenceEquals(s, "") does not hold? If there is, I haven't found it -- creating one by manipulating an initially non-empty string doesn't seem to do it, no matter how clever you get. - Addendum
If you look at the compiled->decompiled code, you'll see that the example you are asking about is compiled as written (look at the right pane). - Zoosperm
A fiddle of some example code: dotnetfiddle.net/xdtcRG - Chalkboard
@JeroenMostert wow, thanks for the link; if Jon Skeet considered this a strange corner case, I feel better already. - Aldershot
It is a corner case and a rather terrible one at that. Even the ECMA spec for CLI states, without reservation, that the newobj opcode creates "a new object or a new instance of a value type". Nowhere does it say that the runtime is allowed to return a reference to an existing instance in this case, but this is exactly what the CLR does anyway. It wouldn't be so bad if this wasn't an observable difference, but it is. I'd be tempted to call it a bug, except the behavior is so old (and the optimization demonstrably useful) that it's more of a quirk. - Addendum
@JeroenMostert What does it say about reference types? String is an immutable reference type, not a value type. - Chalkboard
@JeroenMostert Oh, my bad ... that's what the "a new object" part is on about! - Chalkboard
@AndyJ: "The newobj instruction allocates a new instance of the class associated with ctor and initializes all the fields in the new instance to 0 (of the proper type) or null as appropriate. It then calls the constructor with the given arguments along with the newly created instance. After the constructor has been called, the now initialized object reference is pushed on the stack." First of all, this is obviously not what literally happens for string (just in effect), but even here, I would never expect the same reference to be returned twice based on this description! - Addendum
@JeroenMostert Many thanks for all the input! Very instructive. - Aldershot
@JeroenMostert Yeah, thanks for the pointer to that page, this is really interesting stuff. - Chalkboard

If an empty or null array is passed to the string constructor, it returns an empty string.

It is specified in a comment in the reference code.

 // Creates a new string with the characters copied in from ptr. If
 // ptr is null, a 0-length string (like String.Empty) is returned.

You can see the same result with a null array:

char[] tempArray = null;
var s = "";
var sss2 = new string(tempArray);
var b = ReferenceEquals(s, sss2); //True!?
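
To put the compile-time and runtime cases side by side, here is a minimal sketch. The class name InterningDemo and the variable names are made up for illustration. It prints each comparison instead of asserting a result, because the empty-string behaviour is a runtime detail and may not be identical on every runtime version.

```csharp
using System;

class InterningDemo
{
    static void Main()
    {
        string literal = "Hello";

        // Constant expression: the C# compiler folds "He" + "llo" into the
        // single literal "Hello", so no concatenation happens at runtime.
        // The question reports true for this comparison.
        string folded = "He" + "llo";
        Console.WriteLine(ReferenceEquals(literal, folded));

        // Built at runtime from a non-empty char[]: a separate string object.
        // The question reports false for this comparison.
        string built = new string(new[] { 'H', 'e', 'l', 'l', 'o' });
        Console.WriteLine(ReferenceEquals(literal, built));

        // string.Intern returns the pooled instance that has the same value,
        // which is how a runtime-built string can be mapped to the literal.
        Console.WriteLine(ReferenceEquals(literal, string.Intern(built)));

        // Empty and null char[] arguments: the constructor call is handled
        // inside the runtime, which may return an existing empty string
        // instead of allocating one. The question and this answer report true.
        string fromEmpty = new string(new char[0]);
        string fromNull = new string((char[])null);
        Console.WriteLine(ReferenceEquals("", fromEmpty));
        Console.WriteLine(ReferenceEquals("", fromNull));
    }
}
```

Only the first comparison is decided by the compiler; the rest depend on what the runtime does when the code executes.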
Tiga answered 27/1, 2017 at 14:8 Comment(7)
You should specify that this happens at runtime, not at compile time. (Compare it with "A" + "B" becoming "AB", which happens at compile time.) - Zoosperm
Sorry but this is a rather superficial answer. That it creates an empty string is obvious, but the subtle part here is not about returning an empty string, it's about always returning the specific interned empty string. You can't do that in C#; as already said in the comments, new is not newing at all, which is very unexpected. - Aldershot
@Aldershot But it's not a "C#" constructor that's being called. If you look at the link to the reference code in the answer you can see the comment is on an extern interop definition which is a "MethodImplOptions.InternalCall". That means it's making a call into the CLR, and the CLR can return whatever instance it wants. - Chalkboard
@AndyJ and is that being said anywhere in this answer? - Aldershot
@Aldershot Well it is if you read the links ;) I agree that it should be in the actual answer text so if that link dies then the information is still there. - Chalkboard
@InBetween, yes it is a CLR call, but as far as interning is concerned, some versions of the .NET Framework automatically intern the empty string and some don't. There is a good blog post by Eric Lippert about that. The statement that the CLR call returns the specific interned empty string is somewhat wrong; it just returns an empty string, which is interned later by the .NET Framework. - Tiga
In what way do you think "the .NET Framework" and the CLR are different in this discussion? The CLR is the runtime; it both implements the newobj instruction that results in this constructor call and the constructor call itself. There is no layer in between that does something "later". It's just as correct to say that some versions of the CLR exhibit this behavior and some don't (though Eric unfortunately doesn't mention which ones don't, or whether this is subject to runtime settings). - Addendum
