Why does .NET create new substrings instead of pointing into existing strings?

Asked 4/7, 2009 at 15:42 Answered 4/7, 2009 at 19:41

Solved c# .net string memory string-interning

From a brief look using Reflector, it looks like String.Substring() allocates memory for each substring. Am I correct that this is the case? I thought that wouldn't be necessary since strings are immutable.

My underlying goal was to create an IEnumerable<string> Split(this String, Char) extension method that allocates no additional memory.

Joses answered 4/7, 2009 at 15:42 Comment(2)

I haven't thought about it very hard, or looked at StringBuilder's implementation with Reflector, but would an IEnumerable<StringBuilder> Split(this StringBuilder, Char) method work? – Shavonda 4/7, 2009 at 16:53

If String.Substring() didn't allocate new memory, strings wouldn't be immutable – Willtrude 6/7, 2009 at 14:42

One reason most languages with immutable strings create new substrings rather than pointing into existing strings is that sharing would interfere with garbage collecting those strings later.

What happens if a string is used for its substring, but then the larger string becomes unreachable (except through the substring)? The larger string cannot be collected, because that would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.
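
A minimal sketch of that retention problem, assuming a hypothetical SharedSubstring view type (not part of .NET) that points into its parent instead of copying. The names and sizes are illustrative only:

```csharp
using System;

// Hypothetical view type: shares the parent's characters instead of copying them.
public sealed class SharedSubstring
{
    public string Parent { get; }   // strong reference keeps the whole parent reachable
    public int Offset { get; }
    public int Length { get; }

    public SharedSubstring(string parent, int offset, int length)
    {
        if (parent == null) throw new ArgumentNullException(nameof(parent));
        if (offset < 0 || length < 0 || offset > parent.Length - length)
            throw new ArgumentOutOfRangeException();
        Parent = parent;
        Offset = offset;
        Length = length;
    }
}

public static class RetentionDemo
{
    public static void Main()
    {
        string big = new string('x', 10_000_000);

        SharedSubstring view = new SharedSubstring(big, 0, 200);
        string copy = big.Substring(0, 200);
        big = null; // drop our own reference to the large string

        // The view still keeps all 10,000,000 characters reachable;
        // the copy owns only its own 200 characters.
        Console.WriteLine(view.Parent.Length); // 10000000
        Console.WriteLine(copy.Length);        // 200
    }
}
```

As long as the view is reachable, the collector cannot reclaim the large string, even though only 200 of its characters are in use.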

Paniculate answered 4/7, 2009 at 16:29 Comment(2)

I thought the main reason was in regards to algorithms over the strings. If you can safely assume that a string will never change you can pass references to it safely and it's also inherently threadsafe. I guess that ties in with garbage collection too. – Agni 4/7, 2009 at 16:39

@Agni - that is a reason for immutability. It's not a reason for avoiding shared buffers between strings. Once you have immutability and GC, you can easily implement shared buffers behind the scenes without breaking thread safety or existing algorithms. – Glovsky 5/7, 2009 at 9:7

This isn't possible with the built-in String class without poking around inside .NET. You would have to pass around references to a mutable array and make sure no one modified it.

.NET will create a new string every time you ask it to. The only exception is interned strings, which are created by the compiler (and which you can also create yourself): they are placed into memory once, and every use then refers to that single instance, for memory and performance reasons.
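
A small sketch of the difference, using only String.Substring, String.Intern and Object.ReferenceEquals. The class name InternDemo is just for illustration:

```csharp
using System;

public static class InternDemo
{
    public static void Main()
    {
        string source = "Hello, World!";

        // Two Substring calls with the same arguments return equal contents
        // in two distinct string objects.
        string a = source.Substring(0, 5);
        string b = source.Substring(0, 5);
        Console.WriteLine(a == b);                       // True  (same characters)
        Console.WriteLine(object.ReferenceEquals(a, b)); // False (separate objects)

        // string.Intern maps each string onto the single pooled instance
        // with the same contents, so both calls return the same object.
        string pooledA = string.Intern(a);
        string pooledB = string.Intern(b);
        Console.WriteLine(object.ReferenceEquals(pooledA, pooledB)); // True
    }
}
```

Interning removes duplicates of equal strings; it does not make Substring share memory with the string it was taken from.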

Agni answered 4/7, 2009 at 15:49 Comment(0)

Each string has to have its own string data, given the way the String class is implemented.

You can make your own SubString structure that uses part of a string:

using System;
using System.Text;

public struct SubString {

   private string _str;
   private int _offset, _len;

   public SubString(string str, int offset, int len) {
      _str = str;
      _offset = offset;
      _len = len;
   }

   public int Length { get { return _len; } }

   // Reads one character of the view; valid indexes are 0 to Length - 1.
   public char this[int index] {
      get {
         if (index < 0 || index >= _len) throw new IndexOutOfRangeException();
         return _str[_offset + index];
      }
   }

   // Appends the viewed range without creating an intermediate string.
   public void WriteToStringBuilder(StringBuilder s) {
      s.Append(_str, _offset, _len);
   }

   // Only this method allocates a new string.
   public override string ToString() {
      return _str.Substring(_offset, _len);
   }

}

You can flesh it out with other methods, such as comparison, that can also be done without extracting the string.
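
As a rough sketch of how that could go, here is one way to add a comparison, a slice-of-a-slice method, and the lazy Split the question asks for. StringSlice, Slice, ContentEquals and SplitSlices are hypothetical names chosen for this example, not existing .NET APIs, and the code assumes a compiler that supports readonly struct:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical read-only view over part of a string (names are illustrative).
public readonly struct StringSlice
{
    private readonly string _str;
    private readonly int _offset;

    public int Length { get; }

    public StringSlice(string str, int offset, int length)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (offset < 0 || length < 0 || offset > str.Length - length)
            throw new ArgumentOutOfRangeException();
        _str = str;
        _offset = offset;
        Length = length;
    }

    // A slice of a slice still points at the original string; nothing is copied.
    public StringSlice Slice(int start, int length)
    {
        if (start < 0 || length < 0 || start > Length - length)
            throw new ArgumentOutOfRangeException();
        return new StringSlice(_str, _offset + start, length);
    }

    // Ordinal comparison against a string without extracting a substring.
    public bool ContentEquals(string other)
    {
        return other != null && other.Length == Length
            && string.CompareOrdinal(_str, _offset, other, 0, Length) == 0;
    }

    // The only member that allocates a new string.
    public override string ToString() => _str.Substring(_offset, Length);
}

public static class StringSliceExtensions
{
    // Lazily yields the pieces between separators as views into the source.
    public static IEnumerable<StringSlice> SplitSlices(this string source, char separator)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return Iterate(source, separator);
    }

    private static IEnumerable<StringSlice> Iterate(string source, char separator)
    {
        int start = 0;
        for (int i = 0; i <= source.Length; i++)
        {
            if (i == source.Length || source[i] == separator)
            {
                yield return new StringSlice(source, start, i - start);
                start = i + 1;
            }
        }
    }
}

public static class SliceDemo
{
    public static void Main()
    {
        foreach (StringSlice piece in "a,b,,c".SplitSlices(','))
            Console.WriteLine(piece.Length); // prints 1, 1, 0, 1 on separate lines
    }
}
```

The iterator itself is still one small heap object, so this is not strictly allocation-free, but no character data is copied until ToString is called. The trade-off is that every slice keeps the whole source string reachable.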

Gleason answered 4/7, 2009 at 16:8 Comment(2)

What about a substring into another substring? – Glovsky 5/7, 2009 at 9:8

Yes, it's easy for the SubString structure to create another that is part of itself. – Gleason 5/7, 2009 at 11:59

Because strings are immutable in .NET, every string operation that results in a new string object will allocate a new block of memory for the string contents.

In theory, it could be possible to reuse the memory when extracting a substring, but that would make garbage collection very complicated: what if the original string is garbage-collected? What would happen to the substring that shares a piece of it?

Of course, nothing prevents the .NET BCL team from changing this behavior in future versions of .NET. It wouldn't have any impact on existing code.

Showcase answered 4/7, 2009 at 15:55 Comment(5)

Java's String actually does it that way: Substrings are merely pointers into the original string. However, that also means that when you take a 200-character substring of a 200-MiB string, the 200-MiB string will always lie around in memory as long as the small substring isn't garbage-collected. – Efflorescence 4/7, 2009 at 16:0

I think it could impact existing code given that it is designed around this behaviour. If people assume that interning their string will stop it from being duplicated and this behaviour was stopped it could cause working apps to stop with out of memory exceptions. – Agni 4/7, 2009 at 16:32

How can you design around this behavior? Because of the immutability of strings, there's really no way to create code that would break if the internal implementation of the string class changes. – Showcase 4/7, 2009 at 16:36

.Net string operations indeed create new string objects, but it's not because strings are immutable. In fact, it's because strings are immutable that string operations could reuse current string objects instead of creating new ones. – Zugzwang 4/7, 2009 at 16:39

If C# used this approach, it wouldn't make garbage collection any different. The original string would have multiple references to it, and so it would not be garbage collected until all substrings based on it were also unreachable. Hence what Joey says. Java has faster substring, potentially much higher memory use, and C# has slow substring, potentially much more efficient memory use. – Catholicism 5/9, 2015 at 8:37

Adding to the point that Strings are immutable, you should be aware that the following snippet will generate multiple String instances in memory.

String s1 = "Hello", s2 = ", ", s3 = "World!";
String res = s1 + s2 + s3;

s1+s2 => new string instance (temp1)

temp1 + s3 => new string instance (temp2)

res is a reference to temp2.

Bonneau answered 4/7, 2009 at 19:41 Comment(5)

This sounds like something that the compiler folks could optimize. – Mirepoix 4/7, 2009 at 20:16

It's not an issue with the compiler, it's a choice made in designing the language. Java has the same rules for Strings. System.Text.StringBuilder is a good class to use that simulates the "mutable" strings. – Bonneau 4/7, 2009 at 20:26

Wrong - s1 + s2 + s3 gets turned into a single call to String.Concat. This is why it is NOT better to use String.Format or StringBuilder (which are both comparatively slow), for up to 4 strings. Look at the IL to see what the compiler does, and use a profiler to find out what performs well in your program. Otherwise you might as well be saying "Look, it is a shoe! He has removed his shoe and this is a sign that others who would follow him should do likewise!" Please post factual answers instead of mythical ones. – Glovsky 5/7, 2009 at 9:3

i.e. Ian Boyd's comment is right (except that the compiler folks already took care of it in version 1.) – Glovsky 5/7, 2009 at 9:4

As per the C# Language Reference, the + operator on a string is defined as: string operator +(string x, string y); string operator +(string x, object y); string operator +(object x, string y); While the implementation of the operator may use the Concat method, it doesn't change the fact that + is a binary operator; hence, s1 + s2 + s3 would be the equivalent of String.Concat( String.Concat( s1, s2), s3) with a new string object returned for each call to Concat() – Bonneau 6/7, 2009 at 18:24
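
For reference, a short sketch that puts the two readings from this thread side by side. The C# compiler typically lowers s1 + s2 + s3 on string operands to a single String.Concat(string, string, string) call, as the earlier comment says. That can be confirmed by inspecting the IL, but it cannot be observed from the result, since all three forms produce equal strings. The class name ConcatDemo is illustrative:

```csharp
using System;

public static class ConcatDemo
{
    public static void Main()
    {
        string s1 = "Hello", s2 = ", ", s3 = "World!";

        // The expression as written in the answer.
        string viaOperator = s1 + s2 + s3;

        // The single three-argument call: one result string, no temporary.
        string viaConcat = string.Concat(s1, s2, s3);

        // The nested reading: an intermediate "Hello, " is allocated first.
        string nested = string.Concat(string.Concat(s1, s2), s3);

        Console.WriteLine(viaOperator == viaConcat && viaConcat == nested); // True
    }
}
```

The forms differ only in how many intermediate strings are allocated, which is why the IL or a profiler is the right tool for settling the question.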
