On string interning and alternatives

Asked 1/5, 2015 at 9:58 Answered 4/4, 2019 at 14:30

Solved c# .net string hashset string-interning

I have a large file which, in essence contains data like:

Netherlands,Noord-holland,Amsterdam,FooStreet,1,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,2,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,3,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,4,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,5,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,1,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,2,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,3,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,4,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,1,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,2,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,3,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,1,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,2,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,3,...,...
...

This is a multi-gigabyte file. I have a class that reads this file and exposes these lines (records) as an IEnumerable<MyObject>. This MyObject has several properties (Country, Province, City, ...).

As you can see there is a LOT of duplication of data. I want to keep exposing the underlying data as an IEnumerable<MyObject>. However, some other class might (and probably will) make some hierarchical view/structure of this data like:

Netherlands
    Noord-holland
        Amsterdam
            FooStreet [1, 2, 3, 4, 5]
            BarRoad [1, 2, 3, 4]
            ...
        Amstelveen
            BazDrive [1, 2, 3]
            ...
         ...
    Zuid-holland
        Rotterdam
            LoremAve [1, 2, 3]
            ...
        ...
    ...
...

When reading this file, I do, essentially, this:

foreach (var line in myfile) {
    var fields = line.Split(',');
    yield return new MyObject {
        Country = fields[0],
        Province = fields[1],
        City = fields[2],
        Street = fields[3],
        //...other fields
    };
}

Now, to the actual question at hand: I could use string.Intern() to intern the Country, Province, City, and Street strings (those are the main 'villains'; MyObject has several other properties not relevant to the question).

foreach (var line in myfile) {
    var fields = line.Split(',');
    yield return new MyObject {
        Country = string.Intern(fields[0]),
        Province = string.Intern(fields[1]),
        City = string.Intern(fields[2]),
        Street = string.Intern(fields[3]),
        //...other fields
    };
}

This will save about 42% of memory (tested and measured) when holding the entire dataset in memory since all duplicate strings will be a reference to the same string. Also, when creating the hierarchical structure with a lot of LINQ's .ToDictionary() method the keys (Country, Province etc.) of the resp. dictionaries will be much more efficient.

However, one of the drawbacks of using string.Intern() (aside from a slight loss of performance, which is not a problem) is that the strings won't be garbage collected anymore. But when I'm done with my data I do want all that stuff garbage collected (eventually).

I could use a Dictionary<string, string> to 'intern' this data but I don't like the "overhead" of having a key and value where I am, actually, only interested in the key. I could set the value to null or use the same string as value (which will result in the same reference in key and value). It's only a small price of a few bytes to pay, but it's still a price.

Something like a HashSet<string> makes more sense to me. However, I cannot get a reference to a string in the HashSet; I can see if the HashSet contains a specific string, but not get a reference to that specific instance of the located string in the HashSet. I could implement my own HashSet for this, but I am wondering what other solutions you kind StackOverflowers may come up with.

Requirements:

- My "FileReader" class needs to keep exposing an IEnumerable<MyObject>
- My "FileReader" class may do stuff (like string.Intern()) to optimize memory usage
- The MyObject class cannot change; I won't make a City class, Country class etc. and have MyObject expose those as properties instead of simple string properties
- Goal is to be (more) memory efficient by de-duplicating most of the duplicate strings in Country, Province, City etc.; how this is achieved (e.g. string interning, internal hashset / collection / structure of something) is not important. However:
    - I know I can stuff the data in a database or use other solutions in that direction; I am not interested in these kinds of solutions.
    - Speed is only of secondary concern; the quicker the better of course, but a (slight) loss in performance while reading/iterating the objects is no problem
    - Since this is a long-running process (as in: windows service running 24/7/365) that, occasionally, processes a bulk of this data I want the data to be garbage-collected when I'm done with it; string interning works great but will, in the long run, result in a huge string pool with lots of unused data
    - I would like any solutions to be "simple"; adding 15 classes with P/Invokes and inline assembly (exaggerated) is not worth the effort. Code maintainability is high on my list.

This is more of a 'theoretical' question; it's purely out of curiosity / interest that I'm asking. There is no "real" problem, but I can see that in similar situations this might be a problem to someone.

For example: I could do something like this:

public class StringInterningObject
{
    private HashSet<string> _items;

    public StringInterningObject()
    {
        _items = new HashSet<string>();
    }

    public string Add(string value)
    {
        if (_items.Add(value))
            return value;  //New item added; return value since it wasn't in the HashSet
        //MEH... this will quickly go O(n)
        return _items.First(i => i.Equals(value)); //Find (and return) actual item from the HashSet and return it
    }
}

But with a large set of (to be de-duplicated) strings this will quickly bog down. I could have a peek at the reference source for HashSet or Dictionary or... and build a similar class that doesn't return bool for the Add() method but the actual string found in the internals/bucket.
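For readers on a newer runtime: HashSet<T> gained a TryGetValue(equalValue, out actualValue) method in .NET Framework 4.7.2 and .NET Core 2.0, which returns the instance that is actually stored in the set. Assuming such a runtime is available (it was not when this question was asked), a minimal sketch could look like the following. The class name StringPool is hypothetical and not part of the original code.

```csharp
using System.Collections.Generic;

// Hypothetical helper; not thread-safe, just like HashSet<T> itself.
public class StringPool
{
    private readonly HashSet<string> _items = new HashSet<string>();

    public string Add(string value)
    {
        if (value == null) return null;

        // TryGetValue hands back the stored instance, not the probe we passed in.
        string existing;
        if (_items.TryGetValue(value, out existing))
            return existing;

        // First time we see this string: store it and return it as-is.
        _items.Add(value);
        return value;
    }
}
```

Once the last reference to the pool is dropped, the set and every string that is only reachable through it become eligible for garbage collection, which is the property string.Intern() lacks.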

The best I could come up with until now is something like:

public class StringInterningObject
{
    private ConcurrentDictionary<string, string> _items;

    public StringInterningObject()
    {
        _items = new ConcurrentDictionary<string, string>();
    }

    public string Add(string value)
    {
        return _items.AddOrUpdate(value, value, (v, i) => i);
    }
}

Which has the "penalty" of having a Key and a Value where I'm actually only interested in the Key. Just a few bytes though, small price to pay. Coincidentally this also yields 42% less memory usage; the same result that string.Intern() yields.
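The same lookup can arguably be expressed more directly with ConcurrentDictionary<TKey, TValue>.GetOrAdd(key, value), which returns the value already stored under an equal key and otherwise stores and returns the supplied one. A sketch under that assumption; the class name ConcurrentStringPool is hypothetical:

```csharp
using System.Collections.Concurrent;

// Hypothetical variant of the ConcurrentDictionary-based class above.
public class ConcurrentStringPool
{
    private readonly ConcurrentDictionary<string, string> _items =
        new ConcurrentDictionary<string, string>();

    public string Add(string value)
    {
        // Returns the instance already stored under an equal key, or stores
        // and returns `value` itself if no equal key exists yet.
        // Note: a null key throws ArgumentNullException.
        return _items.GetOrAdd(value, value);
    }
}
```

The memory footprint should be the same as with AddOrUpdate, since key and value still reference a single string instance per entry.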

tolanj came up with System.Xml.NameTable:

public class StringInterningObject
{
    private System.Xml.NameTable nt = new System.Xml.NameTable();

    public string Add(string value)
    {
        return nt.Add(value);
    }
}

(I removed the lock and string.Empty check (the latter since the NameTable already does that))

xanatos came up with a CachingEqualityComparer:

public class StringInterningObject
{
    private class CachingEqualityComparer<T> : IEqualityComparer<T> where T : class
    {
        public System.WeakReference X { get; private set; }
        public System.WeakReference Y { get; private set; }

        private readonly IEqualityComparer<T> Comparer;

        public CachingEqualityComparer()
        {
            Comparer = EqualityComparer<T>.Default;
        }

        public CachingEqualityComparer(IEqualityComparer<T> comparer)
        {
            Comparer = comparer;
        }

        public bool Equals(T x, T y)
        {
            bool result = Comparer.Equals(x, y);

            if (result)
            {
                X = new System.WeakReference(x);
                Y = new System.WeakReference(y);
            }

            return result;
        }

        public int GetHashCode(T obj)
        {
            return Comparer.GetHashCode(obj);
        }

        public T Other(T one)
        {
            if (object.ReferenceEquals(one, null))
            {
                return null;
            }

            object x = X.Target;
            object y = Y.Target;

            if (x != null && y != null)
            {
                if (object.ReferenceEquals(one, x))
                {
                    return (T)y;
                }
                else if (object.ReferenceEquals(one, y))
                {
                    return (T)x;
                }
            }

            return one;
        }
    }

    private CachingEqualityComparer<string> _cmp; 
    private HashSet<string> _hs;

    public StringInterningObject()
    {
        _cmp = new CachingEqualityComparer<string>();
        _hs = new HashSet<string>(_cmp);
    }

    public string Add(string item)
    {
        if (!_hs.Add(item))
            item = _cmp.Other(item);
        return item;
    }
}

(Modified slightly to "fit" my "Add() interface")

As per Henk Holterman's request:

public class StringInterningObject
{
    private Dictionary<string, string> _items;

    public StringInterningObject()
    {
        _items = new Dictionary<string, string>();
    }

    public string Add(string value)
    {
        string result;
        if (!_items.TryGetValue(value, out result))
        {
            _items.Add(value, value);
            return value;
        }
        return result;
    }
}

~~I'm just wondering if there's maybe a neater/better/cooler way to 'solve' my (not so much of an actual) problem.~~ By now I have enough options I guess ;-)

Here are some numbers I came up with for some simple, short, preliminary tests:

Non optimized
Memory: ~4,5Gb
Load time: ~52s

StringInterningObject (see above, the ConcurrentDictionary variant)
Memory: ~2,6Gb
Load time: ~49s

string.Intern()
Memory: ~2,3Gb
Load time: ~45s

System.Xml.NameTable
Memory: ~2,3Gb
Load time: ~41s

CachingEqualityComparer
Memory: ~2,3Gb
Load time: ~58s

StringInterningObject (see above, the (non-concurrent) Dictionary variant) as per Henk Holterman's request:
Memory: ~2,3Gb
Load time: ~39s

Although the numbers aren't very definitive, it seems that the many memory allocations of the non-optimized version cost more time than using either string.Intern() or the StringInterningObjects above, which results in (slightly) longer load times for the non-optimized version. ~~Also, string.Intern() seems to 'win' from StringInterningObject but not by a large margin;~~ << See updates.
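The question does not say how these numbers were obtained. One rough way to take comparable measurements, shown purely as an illustrative sketch, is to time the load with Stopwatch and compare GC.GetTotalMemory before and after while the whole data set is kept alive. The names LoadMeasurement and Measure are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class LoadMeasurement
{
    // Hypothetical helper: materializes everything `load` yields and reports
    // the elapsed time plus the growth of the managed heap.
    public static long Measure<T>(string label, Func<IEnumerable<T>> load)
    {
        long before = GC.GetTotalMemory(true);   // force a collection first
        var sw = Stopwatch.StartNew();

        var data = new List<T>(load());          // keep the whole set in memory

        sw.Stop();
        long after = GC.GetTotalMemory(true);

        Console.WriteLine("{0}: {1:N0} bytes, {2:N1} s",
            label, after - before, sw.Elapsed.TotalSeconds);

        GC.KeepAlive(data);                      // `data` must survive until measured
        return after - before;
    }
}
```

A single run like this is noisy; averaging several runs per variant would give more trustworthy figures.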

Fly answered 1/5, 2015 at 9:58 Comment(9)

It's only a small price of a few bytes to pay - exactly. You already have the solution here, that overhead is negligible. – Graminivorous 1/5, 2015 at 10:20

Exactly why I showed the solution and explained the (minimum) overhead. It's a (good) solution (and works / would work fine). But since I'm working on this problem I was simply wondering if someone could come up with a better alternative shaving off these last few bytes as well (without adding too much complexity because: maintainability). I guess I was wondering if the .Net BCL had/has an alternative to HashSet that would help in this regard that I missed or something. Or maybe, I dunno, just shouting out wild thoughts here, some compiler directive that would help. – Fly 1/5, 2015 at 10:24

It is a "problem" (not quite, more just out of curiosity / interest) because if I store "Amsterdam" in each MyObject (millions) I have a lot of duplicate data in memory. If all MyObjects for Amsterdam could reference the same string (which they will when I use string.Intern()) it is a lot memory friendlier. I guess I am asking for a string.Intern() that would allow garbage collection. – Fly 1/5, 2015 at 10:26

I made a start on a project in January which was to pretty much deal with this but covering a few different cases (backed by string.Intern or not, weak-referenced or not, concurrent at the expense of per-operation cost versus faster at the expense of not being thread-safe). I really must get back to it and release it. In the meantime, writing your own hashset that returns the interned item isn't tricky and I'd go with that. – Classieclassification 1/5, 2015 at 11:49

Is this in any way a reasonable alternative to using a small dbase provider like Sqlite or SQL Compact? I don't see it, interning strings is just a memory leak. – Equalizer 1/5, 2015 at 14:40

I don't want/need persistence nor do I want a dependency on an external process. Also: it's just a theoretical question (maybe try to approach it as a brainteaser / puzzle?) about memory, GC etc. as I also mentioned in the question: "I know I can stuff the data in a database or use other solutions in such direction; I am not interested in these kind of solutions.". About "interning strings is just a memory leak": this was/is also addressed in my question. – Fly 1/5, 2015 at 14:45

Do the work in a separate, temporary, process and then throw it away... – Melanochroi 1/5, 2015 at 14:45

Those benchmark results would be more relevant if you included the ones for Dictionary<string, string>. The other solutions aren't thread-safe either. – Graminivorous 1/5, 2015 at 17:10

@HenkHolterman: Added as you requested ;-) – Fly 1/5, 2015 at 17:39

I've had exactly this requirement and indeed asked on SO, but with nothing like the detail of your question, and got no useful responses. One built-in option is a (System.Xml).NameTable, which is basically a string atomization object, which is what you are looking for. We had the following (we've actually moved to Intern because we do keep these strings for the lifetime of the app):

if (name == null) return null;
if (name == "") return string.Empty; 
lock (m_nameTable)
{
      return m_nameTable.Add(name);
}

on a private NameTable

http://referencesource.microsoft.com/#System.Xml/System/Xml/NameTable.cs,c71b9d3a7bc2d2af shows it's implemented as a simple hashtable, i.e. only storing one reference per string.

Downside? It's completely string-specific. If you do cross-test for memory / speed I'd be interested to see the results. We were already using System.Xml heavily; it might of course not seem so natural if you were not.

Maidservant answered 1/5, 2015 at 15:6 Comment(7)

Cool! Since I still have my test-project I'll give it a shot and see what the memory / loadtimes do for this option. I'll add the results to my question. I like the 'creative' thinking. I'll also have a look at the reference-source to see what can be learned from it. (For future reference: I had a quick peek at your question). Edit: Noticed that the if (name == "")... is not necessary; the NameTable already does that. – Fly 1/5, 2015 at 15:11

Wahoo, it's a current winner! – Maidservant 1/5, 2015 at 15:24

It is (at least on speed, on memory it's a tie). However: I really would need to run several tests to average "scores". Having said that: well done! Coolio! Very creative (and "out of the box"). I have put back the lock and both "value checks" from your posted code; this made the loadtime "go up" by a second to 42 seconds but these measurements aren't very precise so the difference is probably negligible. – Fly 1/5, 2015 at 15:26

NB I have some other approaches which can save significant memory overall, but they are more situational. You can actually hold the data in a 'back link tree'; it's slightly odd and slower on read (not massively) but for example with 3 text 'fields' ... – Maidservant 1/5, 2015 at 15:46

...each of which has 10 values you would have 1000 MyObjects, each of which has 3 string refs to one of 30 strings, so 1030 objects and 3000 refs. You could have each of the 1000 objects storing 1 string ref (the last field) and a reference to a 'hidden' object which 'knows' the first 2 fields (there are only 100 of these though, and it's the same kind of object), thus you end up with 1140 objects but only 2220 refs (plus some fairly fixed plumbing). It can pay off very well if memory usage is vastly more important than speed. – Maidservant 1/5, 2015 at 15:46

It also depends on lifetime and whether you care about peak memory usage (i.e. would 4Gb while loading dropping down to 1Gb be better / worse than a peak of ~2.6Gb)? – Maidservant 1/5, 2015 at 15:48

Accepted this answer because it was / is the most creative, thinking-outside-the-box, solution. NOT because it performed the best (see original question which has been updated several times to include other/better solutions). – Fly 1/5, 2015 at 18:6

When in doubt, cheat! :-)

public class CachingEqualityComparer<T> : IEqualityComparer<T> where  T : class
{
    public T X { get; private set; }
    public T Y { get; private set; }

    public IEqualityComparer<T> DefaultComparer = EqualityComparer<T>.Default;

    public bool Equals(T x, T y)
    {
        bool result = DefaultComparer.Equals(x, y);

        if (result)
        {
            X = x;
            Y = y;
        }

        return result;
    }

    public int GetHashCode(T obj)
    {
        return DefaultComparer.GetHashCode(obj);
    }

    public T Other(T one)
    {
        if (object.ReferenceEquals(one, X))
        {
            return Y;
        }

        if (object.ReferenceEquals(one, Y))
        {
            return X;
        }

        throw new ArgumentException("one");
    }

    public void Reset()
    {
        X = default(T);
        Y = default(T);
    }
}

Example of use:

var comparer = new CachingEqualityComparer<string>();
var hs = new HashSet<string>(comparer);

string str = "Hello";

string st1 = str.Substring(2);
hs.Add(st1);

string st2 = str.Substring(2);

// st1 and st2 are distinct strings!
if (object.ReferenceEquals(st1, st2))
{
    throw new Exception();
}

comparer.Reset();

if (hs.Contains(st2))
{
    string cached = comparer.Other(st2);
    Console.WriteLine("Found!");

    // cached is st1
    if (!object.ReferenceEquals(cached, st1))
    {
        throw new Exception();
    }
}

I've created an equality comparer that "caches" the last Equal terms it analyzed :-)

Everything could then be encapsulated in a subclass of HashSet<T>

/// <summary>
/// A HashSet&lt;T&gt; that, through a clever use of an internal
/// comparer, can have an AddOrGet and a TryGet
/// </summary>
/// <typeparam name="T"></typeparam>
public class HashSetEx<T> : HashSet<T> where T : class
{

    public HashSetEx()
        : base(new CachingEqualityComparer<T>())
    {
    }

    public HashSetEx(IEqualityComparer<T> comparer)
        : base(new CachingEqualityComparer<T>(comparer))
    {
    }

    public T AddOrGet(T item)
    {
        if (!Add(item))
        {
            var comparer = (CachingEqualityComparer<T>)Comparer;

            item = comparer.Other(item);
        }

        return item;
    }

    public bool TryGet(T item, out T item2)
    {
        if (Contains(item))
        {
            var comparer = (CachingEqualityComparer<T>)Comparer;

            item2 = comparer.Other(item);
            return true;
        }

        item2 = default(T);
        return false;
    }

    private class CachingEqualityComparer<T> : IEqualityComparer<T> where T : class
    {
        public WeakReference X { get; private set; }
        public WeakReference Y { get; private set; }

        private readonly IEqualityComparer<T> Comparer;

        public CachingEqualityComparer()
        {
            Comparer = EqualityComparer<T>.Default;
        }

        public CachingEqualityComparer(IEqualityComparer<T> comparer)
        {
            Comparer = comparer;
        }

        public bool Equals(T x, T y)
        {
            bool result = Comparer.Equals(x, y);

            if (result)
            {
                X = new WeakReference(x);
                Y = new WeakReference(y);
            }

            return result;
        }

        public int GetHashCode(T obj)
        {
            return Comparer.GetHashCode(obj);
        }

        public T Other(T one)
        {
            if (object.ReferenceEquals(one, null))
            {
                return null;
            }

            object x = X.Target;
            object y = Y.Target;

            if (x != null && y != null)
            {
                if (object.ReferenceEquals(one, x))
                {
                    return (T)y;
                }
                else if (object.ReferenceEquals(one, y))
                {
                    return (T)x;
                }
            }

            return one;
        }
    }
}

Note the use of WeakReference so that there aren't useless references to objects that could prevent garbage collection.

Example of use:

var hs = new HashSetEx<string>();

string str = "Hello";

string st1 = str.Substring(2);
hs.Add(st1);

string st2 = str.Substring(2);

// st1 and st2 are distinct strings!
if (object.ReferenceEquals(st1, st2))
{
    throw new Exception();
}

string stFinal = hs.AddOrGet(st2);

if (!object.ReferenceEquals(stFinal, st1))
{
    throw new Exception();
}

string stFinal2;
bool result = hs.TryGet(st1, out stFinal2);

if (!object.ReferenceEquals(stFinal2, st1))
{
    throw new Exception();
}

if (!result)
{
    throw new Exception();
}

Arad answered 1/5, 2015 at 11:9 Comment(9)

The downvoter could at least put a comment. I do think it is pretty clever as an idea to "extend" the HashSet<>. I'm quite happy of it, and I do think it is the most beautiful idea I had this week. – Arad 1/5, 2015 at 11:34

Just to be clear: I didn't downvote. However, not having looked at the code in great detail yet, the sentence ""caches" the last Equal terms it analyzed" makes me think that reading "Amsterdam", "New york", "Amsterdam" results in 2 distinct "Amsterdam" strings in memory? I cannot guarantee the order of strings in the file (and don't want to do an order because of the (big) performance impact). I might just interpret that quote wrong though; I'll have a more in-depth look at the code later today. – Fly 1/5, 2015 at 11:58

@Fly No, the first class can be used to build a GetOrAdd, or a TryGet (as exemplified by the short example and by the longer full-fledged HashSet<> subclass) – Arad 1/5, 2015 at 12:2

Clever and thinks outside the box. I'm curious how it performs vs the the options. – Tebet 1/5, 2015 at 16:9

I added the results (spoiler: ~2.3Gb, ~58s) :-) – Fly 1/5, 2015 at 17:27

I accepted this answer but I was torn between that one and yours. I eventually decided for the other one but I still would like to thank you for brainstorming with me and coming up with such a creative, albeit a tad slower and more convoluted, solution. See this comment for more. – Fly 1/5, 2015 at 18:8

This is fiendish. Certainly clever; the only issue is that it relies on an implementation detail of HashSet: that when an add fails, the last compare will be between the item being added and the item that caused its rejection. While it's hard to conceive of an implementation where that wouldn't be the case, it's still enough to keep this in the 'really clever demo code not suitable for production' category for me. Really nice idea though. – Maidservant 13/5, 2015 at 8:4

@Maidservant "the last compare": note that the caching happens only on a successful comparison. So even if the HashSet<> did other unsuccessful comparisons, it wouldn't break. The only way to break it would be if, after failing the Add, the HashSet did an Equals(x, x), using the same value for both terms. – Arad 13/5, 2015 at 8:9

Yes, it seems ludicrous that it would ever break, but, well maybe I'm just paranoid. – Maidservant 13/5, 2015 at 8:43

edit3:

Instead of indexing strings, putting them in non-duplicate lists will save much more RAM.

The class MyObjectOptimized stores int indexes into those lists. Reading is instant; if a list is short (around 1,000 items), the cost of setting a value won't be noticeable.

I assumed every string has 5 characters.

This will reduce memory usage:
  per object   : 112 bytes / 16 bytes = 7x gain
  total        : 5 GB / 7 ≈ 0.7 GB  +  sizeof(Country_li, Province_li etc.)

  With an Int16 index RAM usage halves again.
  *note:* Int16 ranges from -32768 to +32767,
          so make sure a list never holds more than 32,767 items.

Usage is the same, but with the class MyObjectOptimized:

IEnumerable<MyObjectOptimized> ReadObjects(IEnumerable<string> myfile)
{
    // you can use the same code
    foreach (var line in myfile) {
        var fields = line.Split(',');
        yield return new MyObjectOptimized {
            Country = fields[0],
            Province = fields[1],
            City = fields[2],
            Street = fields[3],
            //...other fields
        };
    }
}

Required classes:

// single string size: 18 bytes (empty string size) + 2 bytes per char allocated
// RAM cost of 1 class instance: 4 * (18 + 2 * charCount)
// e.g. with 5 chars per string:
//   cost: 4 * (18 + 2 * 5) = 112 bytes
class MyObject 
{
    string Country ;
    string Province ;
    string City ;
    string Street ;
}


public static class Exts
{
    public static int AddDistinct_and_GetIndex(this List<string> list, string value)
    {
        // one linear scan instead of Contains followed by IndexOf
        int index = list.IndexOf(value);
        if (index < 0) {
            list.Add(value);
            index = list.Count - 1;
        }
        return index;
    }
}

// 1 class instance ram cost : 4*4 byte = 16 byte
class MyObjectOptimized
{
    //those int's could be int16 depends on your distinct item counts
    int Country_index ;
    int Province_index ;
    int City_index ;
    int Street_index ;

    // manually implemented properties will not increase memory size,
    // whereas fields WILL increase it
    public string Country{ 
        get {return Country_li[Country_index]; }
        set {  Country_index = Country_li.AddDistinct_and_GetIndex(value); }
    }
    public string Province{ 
        get {return Province_li[Province_index]; }
        set {  Province_index = Province_li.AddDistinct_and_GetIndex(value); }
    }
    public string City{ 
        get {return City_li[City_index]; }
        set {  City_index = City_li.AddDistinct_and_GetIndex(value); }
    }
    public string Street{ 
        get {return Street_li[Street_index]; }
        set {  Street_index = Street_li.AddDistinct_and_GetIndex(value); }
    }


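    // beware: these are static, so they are shared by all instances
    // and keep their strings alive until the lists are cleared.
    static List<string> Country_li = new List<string>();
    static List<string> Province_li = new List<string>();
    static List<string> City_li = new List<string>();
    static List<string> Street_li = new List<string>();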
}
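
List<string>.Contains and IndexOf scan the list linearly, so each property assignment costs O(n) in the number of distinct strings. If a list grows well beyond the ~1,000 items assumed above, pairing it with a dictionary from string to index keeps the setter O(1) on average. A sketch of that idea; StringIndex is a hypothetical name that does not appear in the code above:

```csharp
using System.Collections.Generic;

// Hypothetical replacement for one of the static List<string> fields.
public class StringIndex
{
    private readonly List<string> _values = new List<string>();
    private readonly Dictionary<string, int> _indexes = new Dictionary<string, int>();

    // Returns the index of `value`, adding it first if it is not known yet.
    public int GetOrAdd(string value)
    {
        int index;
        if (!_indexes.TryGetValue(value, out index))
        {
            index = _values.Count;
            _values.Add(value);
            _indexes.Add(value, index);
        }
        return index;
    }

    // Index-to-string lookup, used by the property getters.
    public string this[int index]
    {
        get { return _values[index]; }
    }
}
```

The dictionary adds one entry per distinct string, so the extra memory stays proportional to the number of distinct values rather than the number of records.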

Plymouth answered 4/4, 2019 at 14:30 Comment(0)
