Why does .NET create new substrings instead of pointing into existing strings?

From a brief look using Reflector, it looks like String.Substring() allocates memory for each substring. Am I correct that this is the case? I thought that wouldn't be necessary since strings are immutable.

My underlying goal was to create an IEnumerable<string> Split(this String, Char) extension method that allocates no additional memory.

Joses answered 4/7, 2009 at 15:42 Comment(2)
I haven't thought about it very hard, or looked at StringBuilder's implementation with Reflector, but would an IEnumerable<StringBuilder> Split(this StringBuilder, Char) method work? - Shavonda
If String.Substring() didn't allocate new memory, strings wouldn't be immutable. - Willtrude
P
24

One reason most languages with immutable strings create new substrings, rather than referring into existing strings, is that sharing would interfere with garbage collecting those strings later.

What happens if a string is used for its substring, but then the larger string becomes unreachable (except through the substring)? The larger string cannot be collected, because collecting it would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.
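The retention problem described above can be sketched as follows. `SharedSubstring` and `RetentionDemo` are hypothetical names invented for this illustration; they are not BCL types. The sketch only shows that a view which shares its parent's characters keeps the whole parent reachable.

```csharp
using System;

// Hypothetical view type: it shares the parent's characters instead of copying them.
public sealed class SharedSubstring
{
    private readonly string _parent;
    private readonly int _offset, _length;

    public SharedSubstring(string parent, int offset, int length)
    {
        _parent = parent;
        _offset = offset;
        _length = length;
    }

    // Copies the characters only when a real string is requested.
    public override string ToString()
    {
        return _parent.Substring(_offset, _length);
    }
}

public static class RetentionDemo
{
    // Builds a large string, hands back a 5-character view of it, and
    // reports the large string through a weak reference only.
    private static SharedSubstring MakeView(out WeakReference parentRef)
    {
        string big = new string('x', 10000000);
        parentRef = new WeakReference(big);
        return new SharedSubstring(big, 0, 5);
    }

    public static bool ParentSurvivesCollection()
    {
        WeakReference parentRef;
        SharedSubstring view = MakeView(out parentRef);

        GC.Collect();
        GC.WaitForPendingFinalizers();

        bool alive = parentRef.IsAlive;
        GC.KeepAlive(view); // the view is still reachable at this point
        return alive;       // true: the view's field keeps the parent alive
    }

    public static void Main()
    {
        Console.WriteLine(ParentSurvivesCollection()); // True
    }
}
```

With a copying `Substring`, the small string holds no reference to the large one, so the large string becomes eligible for collection as soon as nothing else refers to it.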

Paniculate answered 4/7, 2009 at 16:29 Comment(2)
I thought the main reason was in regards to algorithms over the strings. If you can safely assume that a string will never change you can pass references to it safely and it's also inherently threadsafe. I guess that ties in with garbage collection too. - Agni
@Agni - that is a reason for immutability. It's not a reason for avoiding shared buffers between strings. Once you have immutability and GC, you can easily implement shared buffers behind the scenes without breaking thread safety or existing algorithms. - Glovsky
A
2

This is not possible with the String class unless you poke around inside .NET's internals. You would have to pass around references to a mutable array and make sure no one modified it.

.NET will create a new string every time you ask it to. The only exception is interned strings, which are created by the compiler (and which you can also create yourself). An interned string is placed into memory once, and every use of it then points to that single copy, for memory and performance reasons.
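As a small illustration of the interning behaviour described above, the sketch below uses the real `String.Intern` method. The class name `InternDemo` and its `Build` helper are made up for the example.

```csharp
using System;

public static class InternDemo
{
    // Builds the text at run time, so each call returns a distinct object.
    public static string Build()
    {
        return new string(new[] { 'h', 'e', 'l', 'l', 'o' });
    }

    public static void Main()
    {
        string a = Build();
        string b = Build();

        // Same contents, but two separate allocations.
        Console.WriteLine(a == b);                       // True
        Console.WriteLine(object.ReferenceEquals(a, b)); // False

        // Intern returns one shared instance per distinct value.
        Console.WriteLine(object.ReferenceEquals(string.Intern(a), string.Intern(b))); // True
    }
}
```

Interning deduplicates whole strings with equal contents. It does not let one string point into the middle of another, so it does not help with substrings.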

Agni answered 4/7, 2009 at 15:49 Comment(0)
G
1

Each string has to have its own string data, given the way the String class is implemented.

You can make your own SubString structure that uses part of a string:

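using System;
using System.Text;

// A lightweight view over part of an existing string; no characters are copied.
public struct SubString {

   private readonly string _str;
   private readonly int _offset, _len;

   public SubString(string str, int offset, int len) {
      _str = str;
      _offset = offset;
      _len = len;
   }

   public int Length { get { return _len; } }

   public char this[int index] {
      get {
         // Valid indexes are 0 .. _len - 1.
         if (index < 0 || index >= _len) throw new IndexOutOfRangeException();
         return _str[_offset + index];
      }
   }

   // Appends the viewed characters without creating an intermediate string.
   public void WriteToStringBuilder(StringBuilder s) {
      s.Append(_str, _offset, _len);
   }

   // Allocates a new string only when one is explicitly requested.
   public override string ToString() {
      return _str.Substring(_offset, _len);
   }

}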

You can flesh it out with other methods, such as comparison, which can also be done without extracting the string.
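For the Split method the question asks about, a minimal sketch in the same spirit could look like the following. It assumes a runtime that provides `ReadOnlyMemory<char>` and the `AsMemory` string extension (for example .NET Core 2.1 or later), neither of which existed when this thread was written. `SplitSegments` and `StringSplitSketch` are hypothetical names, not BCL members.

```csharp
using System;
using System.Collections.Generic;

public static class StringSplitSketch
{
    // Hypothetical helper: yields slices of `source` delimited by `separator`.
    // Each slice refers to the original string's characters; none are copied.
    public static IEnumerable<ReadOnlyMemory<char>> SplitSegments(this string source, char separator)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return Iterate(source, separator);
    }

    private static IEnumerable<ReadOnlyMemory<char>> Iterate(string source, char separator)
    {
        int start = 0;
        while (true)
        {
            int next = source.IndexOf(separator, start);
            if (next < 0)
            {
                // Last segment: everything after the final separator.
                yield return source.AsMemory(start);
                yield break;
            }
            yield return source.AsMemory(start, next - start);
            start = next + 1;
        }
    }
}
```

Calling `"a,b,,c".SplitSegments(',')` yields four segments (`a`, `b`, an empty one, and `c`). The iterator object itself is still a heap allocation, so this avoids copying character data rather than avoiding allocation entirely, and each segment keeps the whole source string reachable.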

Gleason answered 4/7, 2009 at 16:8 Comment(2)
What about a substring into another substring? - Glovsky
Yes, it's easy for the SubString structure to create another that is part of itself. - Gleason
S
0

Because strings are immutable in .NET, every string operation that results in a new string object will allocate a new block of memory for the string contents.

In theory, it could be possible to reuse the memory when extracting a substring, but that would make garbage collection very complicated: what if the original string is garbage-collected? What would happen to the substring that shares a piece of it?

Of course, nothing prevents the .NET BCL team from changing this behavior in future versions of .NET. It wouldn't have any impact on existing code.

Showcase answered 4/7, 2009 at 15:55 Comment(5)
Java's String actually does it that way: Substrings are merely pointers into the original string. However, that also means that when you take a 200-character substring of a 200-MiB string, the 200-MiB string will always lie around in memory as long as the small substring isn't garbage-collected. - Efflorescence
I think it could impact existing code given that it is designed around this behaviour. If people assume that interning their string will stop it from being duplicated and this behaviour was stopped it could cause working apps to stop with out of memory exceptions. - Agni
How can you design around this behavior? Because of the immutability of strings, there's really no way to create code that would break if the internal implementation of the string class changes. - Showcase
.Net string operations indeed create new string objects, but it's not because strings are immutable. In fact, it's because strings are immutable that string operations could reuse current string objects instead of creating new ones. - Zugzwang
If C# used this approach, it wouldn't make garbage collection any different. The original string would have multiple references to it, and so it would not be garbage collected until all substrings based on it were also unreachable. Hence what Joey says. Java has faster substring, potentially much higher memory use, and C# has slow substring, potentially much more efficient memory use. - Catholicism
B
0

Adding to the point that Strings are immutable, you should be aware that the following snippet will generate multiple String instances in memory.

String s1 = "Hello", s2 = ", ", s3 = "World!";
String res = s1 + s2 + s3;

s1+s2 => new string instance (temp1)

temp1 + s3 => new string instance (temp2)

res is a reference to temp2.

Bonneau answered 4/7, 2009 at 19:41 Comment(5)
This sounds like something that the compiler folks could optimize. - Mirepoix
It's not an issue with the compiler, it's a choice made in designing the language. Java has the same rules for Strings. System.Text.StringBuilder is a good class to use that simulates the "mutable" strings. - Bonneau
Wrong - s1 + s2 + s3 gets turned into a single call to String.Concat. This is why it is NOT better to use String.Format or StringBuilder (which are both comparatively slow), for up to 4 strings. Look at the IL to see what the compiler does, and use a profiler to find out what performs well in your program. Otherwise you might as well be saying "Look, it is a shoe! He has removed his shoe and this is a sign that others who would follow him should do likewise!" Please post factual answers instead of mythical ones. - Glovsky
i.e. Ian Boyd's comment is right (except that the compiler folks already took care of it in version 1.) - Glovsky
As per the C# Language Reference, the + operator on a string is defined as: string operator +(string x, string y); string operator +(string x, object y); string operator +(object x, string y); While the implementation of the operator may use the Concat method, it doesn't change the fact that + is a binary operator; hence, s1 + s2 + s3 would be the equivalent of String.Concat( String.Concat( s1, s2), s3) with a new string object returned for each call to Concat(). - Bonneau
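For anyone who wants to check the disagreement above, the sketch below puts the two forms side by side. The three-argument overload `String.Concat(string, string, string)` is a real BCL method. Whether the compiler emits it for `s1 + s2 + s3` is the claim made in the comments, and it can only be confirmed by inspecting the IL, not by running this code. `ConcatDemo` and its method names are made up for the example.

```csharp
using System;

public static class ConcatDemo
{
    // Operator form, as written in the answer.
    public static string WithOperator(string s1, string s2, string s3)
    {
        return s1 + s2 + s3;
    }

    // Explicit single call to the three-argument overload.
    public static string WithConcat(string s1, string s2, string s3)
    {
        return string.Concat(s1, s2, s3);
    }

    public static void Main()
    {
        Console.WriteLine(WithOperator("Hello", ", ", "World!")); // Hello, World!
        Console.WriteLine(WithConcat("Hello", ", ", "World!"));   // Hello, World!
    }
}
```

Compiling both methods and comparing their IL (for example with ildasm) shows whether the operator form produces nested calls or a single call on a given compiler version.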
