C#: Strings with same contents

4

18

I have heard and read that a string cannot be changed (it is immutable?). That should be correct, I guess. But I have also heard that two strings with the same contents share the same memory space (or whatever you call it). Is this correct?

And if so, does that mean that if I create a List with thousands of strings, it wouldn't really take up much space at all if most of those strings were equal to each other?

Rancho answered 17/2, 2009 at 8:7 Comment(0)
33

EDIT: In the answer below I've referred to the intern pool as being AppDomain-specific; I'm pretty sure that's what I've observed before, but the MSDN docs for String.Intern suggest that there's a single intern pool for the whole process, making this even more important.

Original answer

(I was going to add this as a comment, but I think it's an important enough point to need an extra answer...)

As others have explained, string interning occurs for all string literals, but not for "dynamically created" strings (e.g. those read from a database or file, or built using StringBuilder or String.Format).

However, I wouldn't suggest calling String.Intern to get round the latter point: it will populate the intern pool for the lifetime of your AppDomain. Instead, use a pool which is local to just your usage. Here's an example of such a pool:

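using System.Collections.Generic;

// Maps each distinct string value to the first instance seen with that value.
public class StringPool
{
    private readonly Dictionary<string,string> contents =
        new Dictionary<string,string>();

    // Returns the pooled instance equal to item, adding item if it is new.
    public string Add(string item)
    {
        string ret;
        // One lookup: fetches the existing instance if there is one.
        if (!contents.TryGetValue(item, out ret))
        {
            contents[item] = item;
            ret = item;
        }
        return ret;
    }
}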

You'd then just use something like:

string data = pool.Add(ReadItemFromDatabase());

(Note that the pool isn't thread-safe; normal usage wouldn't need it to be.)

This way you can throw away your pool as soon as you no longer need it, rather than having a potentially large number of strings in memory forever. You could also make it smarter, implementing an LRU cache or something if you really wanted to.

EDIT: Just to clarify why this is better than using String.Intern... suppose you read a bunch of strings from a database or log file, process them, and then move on to another task. If you call String.Intern on those strings, they will never be garbage collected as long as your AppDomain is alive - and possibly not even then. If you load several different log files, you'll gradually accumulate strings in your intern pool until you either finish or run out of memory. Instead, I'm suggesting a pattern like this:

void ProcessLogFile(string file)
{
    StringPool pool = new StringPool();
    // Process the log file using strings in the pool
} // The pool can now be garbage collected

Here you get the benefit of multiple strings in the same file only existing once in memory (or at least, only getting past gen0 once) but you don't pollute a "global" resource (the intern pool).
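
A minimal sketch of the pool in use, assuming a hypothetical ReadItem helper that stands in for the database or file read and allocates a fresh string on every call. The StringPool class is repeated from above so that the example compiles on its own; the demo class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public class StringPool
{
    private readonly Dictionary<string, string> contents =
        new Dictionary<string, string>();

    public string Add(string item)
    {
        string ret;
        if (!contents.TryGetValue(item, out ret))
        {
            contents[item] = item;
            ret = item;
        }
        return ret;
    }
}

public static class StringPoolDemo
{
    // Hypothetical stand-in for a database read: the concatenation runs
    // at run time, so every call allocates a new string instance.
    public static string ReadItem(int i)
    {
        return "status-" + (i % 3).ToString();
    }

    // Without a pool, equal contents still mean two separate objects.
    public static bool UnpooledCopiesShareInstance()
    {
        return object.ReferenceEquals(ReadItem(0), ReadItem(3));
    }

    // With a pool, the second Add returns the instance stored by the first.
    public static bool PooledCopiesShareInstance()
    {
        StringPool pool = new StringPool();
        string first = pool.Add(ReadItem(0));
        string second = pool.Add(ReadItem(3));
        return object.ReferenceEquals(first, second);
    }

    public static void Main()
    {
        Console.WriteLine(UnpooledCopiesShareInstance()); // False
        Console.WriteLine(PooledCopiesShareInstance());   // True
    }
}
```

The duplicate that was passed to the second Add call is no longer referenced by anything, so it becomes eligible for collection straight away; only one copy per distinct value stays alive, and only for as long as the pool or your own data structures hold it.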

Saltire answered 17/2, 2009 at 8:47 Comment(9)
Jon, could you elaborate on what you gain by doing this? I assume that you will now have a faster string compare function for strings in the pool? Or am I missing the point here? - Shapeless
Oh, so interned strings exist forever? That is not so good, hehe. Thanks for noting that. - Rancho
I still don't get it: I can see why this is better than interning all your strings, but how is it better than not doing anything at all? - Shapeless
Ah, I see: you can reference the same string over and over. Sorry to bother. - Shapeless
Any particular reason why you're using TryGetValue instead of ContainsKey? - Debt
@C.B.: Yes - I want the value. Why do the lookup twice via ContainsKey and then the indexer, when TryGetValue will do both in one operation? - Saltire
P.S. "First, the memory allocated for interned String objects is not likely to be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates." - Taken from msdn.microsoft.com/en-us/library/system.string.intern.aspx - Towards
@PaulZahra: Interesting, although "can persist" isn't the same as "will definitely persist". I'll tweak my answer a little. - Saltire
Also, at least for the data set I am working with (~1.3 million items, 38% unique, 40 character average length) it is much faster to use the StringPool dictionary implementation shown: ~2 seconds for the StringPool (at cold boot) vs ~30 seconds for the interning (decreasing to ~12 seconds, for a distinct dataset, after it has already choked on the first dataset). I can also throw away the dictionary at the end of the process, as the "interned" strings are stored as part of other objects. And this is with a normal dictionary wrapper. - Yuletide
6

This is more or less true. It is called "string interning". String literals will be present in memory only once and every variable set to the same value points to this single representation. Strings that are created in code are not automatically interned though.

http://msmvps.com/blogs/manoj/archive/2004/01/09/1549.aspx
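
One way to observe this is to compare references rather than contents. The following is only a small sketch; the class and method names are made up for illustration.

```csharp
using System;

public static class LiteralInterningDemo
{
    // Two identical literals in the same program refer to one pooled instance.
    public static bool LiteralsShareInstance()
    {
        string a = "hello";
        string b = "hello";
        return object.ReferenceEquals(a, b);
    }

    // A string assembled at run time is a separate object,
    // even though its contents equal those of the literal.
    public static bool RuntimeStringSharesInstance()
    {
        string literal = "hello";
        string built = new string(new char[] { 'h', 'e', 'l', 'l', 'o' });
        return object.ReferenceEquals(literal, built);
    }

    public static void Main()
    {
        Console.WriteLine(LiteralsShareInstance());       // True
        Console.WriteLine(RuntimeStringSharesInstance()); // False
    }
}
```

Note that == on strings compares contents, so it reports the literal and the run-time string as equal; only object.ReferenceEquals reveals whether they are the same object.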

Shapeless answered 17/2, 2009 at 8:11 Comment(3)
Created in code? Aren't all strings created in code? Or do you mean hard-coded strings, as opposed to, e.g., strings fetched from a database at run time? - Rancho
Strings created in code are not automatically interned, but they can be interned using String.Intern(). Note that there are some differences (bugs?) in how the empty string is handled for interning in different versions of .NET: msdn.microsoft.com/en-us/library/… - Domination
So when fetching strings from a database, I would have to use String.Intern for that to be the case? - Rancho
1

If I remember correctly, strings that are hard-coded in your source are pooled separately. This is called "interning", and there is a method to query whether a string is interned: String.IsInterned Method

On that page under "Remarks" you can read:

The common language runtime automatically maintains a table, called the "intern pool", which contains a single instance of each unique literal string constant declared in a program, as well as any unique instance of String you add programmatically.
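
As a rough illustration of querying the pool: String.IsInterned returns the pooled reference if one exists and null otherwise. The class and method names below are made up, and the GUID is just a convenient way to get a string that is, for practical purposes, not in the pool.

```csharp
using System;

public static class IsInternedDemo
{
    // The literal is placed in the intern pool when it is loaded,
    // so the lookup is expected to find it.
    public static bool LiteralIsInterned()
    {
        return string.IsInterned("some literal") != null;
    }

    // A string produced at run time is not added to the pool automatically.
    public static bool FreshStringIsInterned()
    {
        string fresh = Guid.NewGuid().ToString();
        return string.IsInterned(fresh) != null;
    }

    public static void Main()
    {
        Console.WriteLine(LiteralIsInterned());     // True
        Console.WriteLine(FreshStringIsInterned()); // False
    }
}
```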

Hope this helps you a bit, and correct me if I'm wrong.

Matthias

Inconvincible answered 17/2, 2009 at 8:13 Comment(0)
0

The way to make strings "share" their memory location is to intern them in the intern pool, which contains a single reference to each unique string literal declared in your program, plus any string you add to it programmatically.

Note that all string literals in code are automatically interned.
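
A short sketch of adding a run-time string to the pool with String.Intern (the class and method names are illustrative). Bear in mind the caveat raised in the other answer: strings interned this way are unlikely to be released while the runtime is alive.

```csharp
using System;

public static class InternDemo
{
    // String.Intern returns the pooled instance with the same contents,
    // which here is the one already created for the literal.
    public static bool InternedMatchesLiteral()
    {
        string literal = "hello";
        string built = new string(new char[] { 'h', 'e', 'l', 'l', 'o' });
        string interned = string.Intern(built);
        return object.ReferenceEquals(literal, interned);
    }

    // The original run-time string is untouched; it is still its own object.
    public static bool BuiltIsStillSeparate()
    {
        string literal = "hello";
        string built = new string(new char[] { 'h', 'e', 'l', 'l', 'o' });
        string.Intern(built);
        return !object.ReferenceEquals(literal, built);
    }

    public static void Main()
    {
        Console.WriteLine(InternedMatchesLiteral()); // True
        Console.WriteLine(BuiltIsStillSeparate());   // True
    }
}
```

To benefit from the sharing you have to keep the reference returned by String.Intern and drop the one you passed in.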

Londoner answered 17/2, 2009 at 8:12 Comment(0)
