Is it safe to copy a ConcurrentDictionary to a normal Dictionary by enumerating it directly, based on the current implementation of the class?

TL;DR: Is it possible for a single enumeration of a ConcurrentDictionary to emit the same key twice? Does the current implementation of the ConcurrentDictionary class (.NET 5) allow this possibility?


I have a ConcurrentDictionary<string, decimal> that is mutated by multiple threads concurrently, and I want to periodically copy it to a normal Dictionary<string, decimal> and pass it to the presentation layer for updating the UI. There are two ways to copy it, with and without snapshot semantics:

var concurrent = new ConcurrentDictionary<string, decimal>();

var copy1 = new Dictionary<string, decimal>(concurrent.ToArray()); // Snapshot

var copy2 = new Dictionary<string, decimal>(concurrent); // On-the-go

I am pretty sure that the first approach is safe, because the ToArray method returns a consistent view of the ConcurrentDictionary:

Returns a new array containing a snapshot of key and value pairs copied from the ConcurrentDictionary<TKey,TValue>.

But I would prefer to use the second approach, because it generates less contention. I am worried, though, about the possibility of getting an ArgumentException ("An item with the same key has already been added"). The documentation doesn't seem to exclude this possibility:

The enumerator returned from the dictionary ... does not represent a moment-in-time snapshot of the dictionary. The contents exposed through the enumerator may contain modifications made to the dictionary after GetEnumerator was called.

Here is the scenario that makes me worried:

  1. Thread A starts enumerating the ConcurrentDictionary, and the key X is emitted by the enumerator. Then the thread is temporarily suspended by the OS.
  2. Thread B removes the key X.
  3. Thread C adds a new entry with the key X.
  4. Thread A resumes enumerating the ConcurrentDictionary; the enumerator observes the newly added X entry and emits it.
  5. The constructor of the Dictionary class attempts to insert the key X twice into the newly constructed Dictionary, and throws an exception.

I tried to reproduce this scenario, without success. But this is not 100% reassuring, because the conditions that could cause this situation to emerge could be subtle. Maybe the values I added didn't have the "right" hashcodes, or didn't generate the "right" number of hashcode collisions. I tried to find an answer by studying the source code of the class, but unfortunately it's too complicated for me to understand.
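
For illustration, here is a sketch of the kind of stress test I have in mind (the BadComparer that forces hash collisions, the key count, the thread count and the timings are all arbitrary choices, not a known recipe for triggering the exception):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Forces every key into the same bucket, to maximize the chance of the
// enumerator and the mutating threads touching the same linked list.
class BadComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => x == y;
    public int GetHashCode(string s) => 1;
}

class Program
{
    static void Main()
    {
        var concurrent = new ConcurrentDictionary<string, decimal>(new BadComparer());
        for (int i = 0; i < 100; i++) concurrent[$"K{i}"] = i;

        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));

        // Writers: repeatedly remove a key and add it back, hoping to hit
        // the remove/re-add window while the reader is enumerating.
        var writers = new Task[4];
        for (int t = 0; t < writers.Length; t++)
        {
            writers[t] = Task.Run(() =>
            {
                var random = new Random();
                while (!cts.IsCancellationRequested)
                {
                    string key = $"K{random.Next(100)}";
                    if (concurrent.TryRemove(key, out decimal value))
                        concurrent.TryAdd(key, value + 1);
                }
            });
        }

        // Reader: keep copying via the enumerator (the "on-the-go" approach)
        // and watch for the duplicate-key ArgumentException.
        int copies = 0;
        while (!cts.IsCancellationRequested)
        {
            try
            {
                var copy = new Dictionary<string, decimal>(concurrent);
                copies++;
            }
            catch (ArgumentException ex)
            {
                Console.WriteLine($"Duplicate key after {copies} copies: {ex.Message}");
                return;
            }
        }

        Task.WaitAll(writers);
        Console.WriteLine($"No duplicate key observed in {copies} copies.");
    }
}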

My question is: is it safe, based on the current implementation (.NET 5), to create fast copies of my ConcurrentDictionary by enumerating it directly, or should I code defensively and take a snapshot every time I copy it?


Clarification: I would agree with anyone who says that relying on an API's undocumented implementation details is unwise. But alas, this is what this question is all about. It's a rather educational, out-of-curiosity question. I am not intending to use the acquired knowledge in production code, I promise. πŸ˜ƒ

Merrow answered 22/12, 2020 at 23:7 Comment(16)
I obviously haven't seen your code and you haven't said what the actual high level problem is that you are trying to solve, but from your description of (1) - (5) I suspect you have a logic flaw more than a coding issue. – Sasin
@MitchWheat my question is not localized to a specific business problem. I want to know generally whether it's possible for a single enumeration of a ConcurrentDictionary to emit the same key twice. – Merrow
If you want to snapshot a ConcurrentDictionary then ToArray() is the documented way to do that. See ConcurrentDictionary in C#?. Enumerating the dictionary could definitely reflect modifications made during the enumeration, see ConcurrentDictionary enumeration and locking – Flats
There is no need to use the constructor of Dictionary. Just create an empty dictionary, foreach through the ConcurrentDictionary, and set dict[key] = value in the loop, which will take care of duplicates for you by using the later value. – Virtually
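
(A sketch of the copy loop described in the comment above; the class and method names are purely illustrative. Because the indexer is used, a key that happens to be emitted twice simply gets its value overwritten, so the later value wins instead of an exception being thrown.)

using System.Collections.Concurrent;
using System.Collections.Generic;

static class DictionaryCopy
{
    // Copies by enumerating directly; last value wins on a duplicate key.
    public static Dictionary<string, decimal> CopyLastWins(ConcurrentDictionary<string, decimal> source)
    {
        var copy = new Dictionary<string, decimal>();
        foreach (KeyValuePair<string, decimal> pair in source)
            copy[pair.Key] = pair.Value; // overwrites if the key was already copied
        return copy;
    }
}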
@Flats I want to get fast copies of the ConcurrentDictionary, and I would like to avoid taking expensive snapshots. I am OK with the copies reflecting modifications made during the enumeration. But I am not OK with receiving the same key twice, and I am asking whether this is a possible scenario. – Merrow
Peter Duniho eloquently said what I was going to regarding "Only you can decide. It's effectively a race condition and good concurrent code doesn't care about races." – Sasin
@Theodor Zoulias: "But I am not OK with receiving the same key twice, and I am asking whether this is a possible scenario. " - well, yes if your code allows that obviously! – Sasin
@Virtually this is a good idea. But I would still like to know whether taking defensive measures like this is warranted. – Merrow
I tried to reproduce the duplicate key exception by using an artificially bad hash code algorithm. I couldn't do it, but I did get some very strange results, namely that the concurrent dictionary itself contained different contents depending on the hash code, if one were to modify the dictionary while iterating through it. See dotnetfiddle.net/z7aUZW. Can't explain it actually. – Flats
@TheodorZoulias One approach to consider may be to write your own ToSafeDictionary extension method that takes a ConcurrentDictionary. Then inside it either use ToArray (like you are now), or new up the ConcurrentDictionary and call TryAdd one by one. That way if the key does appear twice you'll be fine. This is similar to @MineR's earlier suggestion (although that suggestion will do last wins, while this suggestion does first wins). – Valentinevalentino
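
(One possible shape for such a ToSafeDictionary extension method; the TryAdd-based body below is just one way to get the first-wins behavior described in the comment above.)

using System.Collections.Concurrent;
using System.Collections.Generic;

static class ConcurrentDictionaryExtensions
{
    // Enumerates the source directly and uses TryAdd, so a key that is
    // somehow emitted twice is simply ignored the second time (first wins).
    public static Dictionary<TKey, TValue> ToSafeDictionary<TKey, TValue>(
        this ConcurrentDictionary<TKey, TValue> source) where TKey : notnull
    {
        var copy = new Dictionary<TKey, TValue>();
        foreach (KeyValuePair<TKey, TValue> pair in source)
            copy.TryAdd(pair.Key, pair.Value);
        return copy;
    }
}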
@Valentinevalentino yeap, the ToSafeDictionary extension method would be an easy to implement security measure. Interestingly the ImmutableDictionary<K,V> class does not throw when instantiated by enumerables containing duplicates. It is instantiated safely by design. – Merrow
@Valentinevalentino honestly this would be a different question. What I am interested to know here is specifically whether it's possible in practice to get the same key twice while enumerating a ConcurrentDictionary using the native enumerator. – Merrow
@TheodorZoulias The short answer is "the docs don't say it can't happen (and give general warnings that the general class of problem might occur), so even if it might be safe now it might not be tomorrow". So best to be cautious. – Valentinevalentino
@Valentinevalentino yeap, agreed. But I am really curious about the long answer as well. Does the current implementation prevent double emitted keys? And if so, does it by accident or by (undocumented) design? The short answer is not truly satisfying IMHO. :-) – Merrow
Does the current implementation prevent double emitted keys? Which current implementation are you interested in? Framework (and which version)? Core (and which version)? Mono? etc etc I suspect you see the point I am making. ;) You need to program to the contract, not the implementation (particularly when there is more than one implementation). – Valentinevalentino
@Valentinevalentino I would be satisfied with an answer regarding the .NET 5 implementation of the ConcurrentDictionary. I think it's this source file. – Merrow

Is it possible in practice for a single enumeration of a ConcurrentDictionary to emit the same key twice?

That depends on how you define "in practice". But by my definition, yes, in practice it is absolutely possible for ConcurrentDictionary to emit the same key twice. That is, you cannot write correct code that makes the assumption that it won't.

The documentation clearly states:

The contents exposed through the enumerator may contain modifications made to the dictionary after GetEnumerator was called.

It provides no other statements about behavior, which means that a key may exist when GetEnumerator() is called, be returned as e.g. the first enumerated element, be removed after that, and then be added again later in such a way that allows the enumerator to retrieve the same key again.

This is the only thing we can count on in practice.

Now, that said, speaking academically (i.e. not in practice)…

Does the current implementation of the ConcurrentDictionary class (.NET 5) allow this possibility?

On inspection of the implementation of GetEnumerator(), it seems to me that the current implementation may avoid the possibility of returning the same key more than once.

Per the comment in the code, which reads:

// Provides a manually-implemented version of (approximately) this iterator:
//     Node?[] buckets = _tables._buckets;
//     for (int i = 0; i < buckets.Length; i++)
//         for (Node? current = Volatile.Read(ref buckets[i]); current != null; current = current._next)
//             yield return new KeyValuePair<TKey, TValue>(current._key, current._value);

And then looking at the "manually-implemented version" the comment refers to…we can see that the implementation does nothing more than iterate through the buckets array, and then within each bucket, iterate through the linked list that constitutes that bucket, just as the example code in the comment suggests.

But looking at the code that adds a new element to a bucket, we see this:

// The key was not found in the bucket. Insert the key-value pair.
var resultNode = new Node(key, value, hashcode, bucket);
Volatile.Write(ref bucket, resultNode);
checked
{
    tables._countPerLock[lockNo]++;
}

There's more to the method than that, of course, but this is the crux. This code passes the head of the bucket list to the new node constructor, which in turn inserts the new node at the head of the list. Then the bucket variable, which is a ref variable, is overwritten with the new node reference.

I.e. the new node becomes the new head of the list.

So we see:

  • The enumerator captures the current _buckets array from the dictionary when MoveNext() is first called.
  • This means that even if the dictionary has to reallocate its backing storage to increase the number of buckets, the enumerator will continue to iterate through the previous array.
  • Furthermore, if reallocated, the old linked lists aren't reused. The code that reallocates the storage creates all new linked lists for the entire collection:
// Copy all data into a new table, creating new nodes for all elements
foreach (Node? bucket in tables._buckets)
{
    Node? current = bucket;
    while (current != null)
    {
        Node? next = current._next;
        ref Node? newBucket = ref newTables.GetBucketAndLock(current._hashcode, out uint newLockNo);

        newBucket = new Node(current._key, current._value, current._hashcode, newBucket);

        checked
        {
            newCountPerLock[newLockNo]++;
        }

        current = next;
    }
}
  • This means that the worst case scenario is an element being removed and re-added without the backing storage being reallocated (since this is the only way to use the same linked list currently being iterated), and so the key winds up in the same linked list.
  • But we know that new nodes always get added to the head of the list. And the enumerator doesn't have any sort of back-tracking which would allow it to see the new node that was added at the head of the list. All it can do is proceed down the rest of the list that was already there.

I believe that this means you can't get the same key twice.
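
To make the head-insertion argument concrete, here is a toy model of a single bucket (my own sketch, not the real ConcurrentDictionary internals): new nodes are always prepended, and the enumerator only ever walks forward, so an entry re-added in front of a position it has already passed is never reachable again.

#nullable enable
using System;
using System.Collections.Generic;

// A toy, single-threaded model of one bucket's linked list, mimicking the
// two behaviors that matter here: head insertion and a forward-only enumerator.
class Node
{
    public readonly string Key;
    public readonly Node? Next;
    public Node(string key, Node? next) { Key = key; Next = next; }
}

class Bucket
{
    private Node? _head;

    // Mirrors the real insertion: the new entry becomes the new head.
    public void Add(string key) => _head = new Node(key, _head);

    // Mirrors the enumerator: start from the head seen at that moment,
    // then only ever follow Next references.
    public IEnumerable<string> Enumerate()
    {
        for (Node? current = _head; current != null; current = current.Next)
            yield return current.Key;
    }
}

class Demo
{
    static void Main()
    {
        var bucket = new Bucket();
        bucket.Add("X");
        bucket.Add("Y");                  // list is now Y -> X

        using IEnumerator<string> e = bucket.Enumerate().GetEnumerator();
        e.MoveNext();                     // emits "Y"
        e.MoveNext();                     // emits "X"

        // Re-add X at the head (in the real dictionary the removal would
        // also have unlinked the old X node, which only helps the argument).
        bucket.Add("X");                  // list is now X -> Y -> X(old)

        // The new X node sits in front of everything the enumerator has
        // already passed, so the enumeration just ends without emitting
        // X a second time.
        Console.WriteLine(e.MoveNext());  // False
    }
}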

That said, I will point out: the ConcurrentDictionary code is complicated. I'm pretty good at reading code, and I think the analysis above is correct. But I can't guarantee that. Heck, even while reading through the code, I twice switched my view of what is possible and what's not, because I'd failed to consider particular possibilities. I might still have overlooked something, some corner case, e.g. where the linked-list enumeration somehow returns to the head, or the _buckets array somehow gets resized in place rather than creating a whole new copy of the original array (you can't do that in strict C# code, but the CLR has all kinds of dirty tricks it might do in the name of performance).

More to the point, none of this matters at all. The underlying implementation could change any day for any reason (for example, maybe they find a bug in the code that simply can't be fixed using the "no duplicate key during iteration" version of the code). And given that your original question was presented in the context of copying the dictionary contents as a snapshot to another data structure and the ConcurrentDictionary class does in fact have a ToArray() method to provide exactly that functionality, there's no reason to write any code that might stumble over one of those possible corner cases.

Carrasco answered 23/12, 2020 at 18:4 Comment(5)
Thanks Peter for the very interesting and perceptive analysis. This is exactly the kind of info that I was hoping for. Honestly I was expecting that proving that the same key cannot be emitted twice (I think you proved it) would also come with some evidence of intention, i.e. that it's not purely accidental. But judging from the source code you quoted, even if such an intention existed, it's not at all apparent. This adds weight to your conclusion that there is no guarantee that the current behavior will be preserved in future versions of .NET. – Merrow
Btw by "in practice" I meant "based on the actual implementation", as opposed to "in theory" = "based on the documentation". I added it in one of my earlier revisions in order to clarify the question, but now I see that it's confusing, so I just removed it. – Merrow
"even if such an intention existed, it's not at all apparent" -- agreed. If I were writing that code, and had intentionally implemented it to avoid that outcome, I would have a) included one or more comments explaining the specific points of implementation that were intended to preserve that behavior, and b) had the documentation written to reflect that. Of course, even if the behavior is intentional, since it's not documented it still can change, if e.g. there later turns out to be some higher-priority need for a change in the code that is mutually exclusive with preserving that behavior. – Carrasco
I opened an issue on github. I am curious what Microsoft thinks about this. – Merrow
Impressive answer. I tried for a while to get duplicate keys in a C# program and couldn't, but I didn't manage to understand why when reading the code. I missed the fact that the new elements were added at the beginning of the linked list and not at the end, as I intuitively thought. It all makes sense now. – Lion
