In Java can I depend on reference assignment being atomic to implement copy on write?
Asked Answered
A

5

9

If I have an unsynchronized java collection in a multithreaded environment, and I don't want to force readers of the collection to synchronize[1], is a solution where I synchronize the writers and use the atomicity of reference assignment feasible? Something like:

private Collection global = new HashSet(); // start threading after this

void allUpdatesGoThroughHere(Object exampleOperand) {
  // My hypothesis is that this prevents operations in the block being re-ordered
  synchronized(global) {
    Collection copy = new HashSet(global);
    copy.remove(exampleOperand);
    // Given my hypothesis, we should have a fully constructed object here. So a 
    // reader will either get the old or the new Collection, but never an 
    // inconsistent one.
    global = copy;    
  }
}

// Do multithreaded reads here. All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact 

Rolling your own solution seems to often fail in these type of situations, so I'd be interested in knowing other patterns, collections or libraries I could use to prevent object creation and blocking for my data consumers.


[1] The reasons being a large proportion of time spent in reads compared to writes, combined with the risk of introducing deadlocks.


Edit: A lot of good information in several of the answers and comments, some important points:

  1. A bug was present in the code I posted. Synchronizing on global (a badly named variable) can fail to protect the syncronized block after a swap.
  2. You could fix this by synchronizing on the class (moving the synchronized keyword to the method), but there may be other bugs. A safer and more maintainable solution is to use something from java.util.concurrent.
  3. There is no "eventual consistency guarantee" in the code I posted, one way to make sure that readers do get to see the updates by writers is to use the volatile keyword.
  4. On reflection the general problem that motivated this question was trying to implement lock free reads with locked writes in java, however my (solved) problem was with a collection, which may be unnecessarily confusing for future readers. So in case it is not obvious the code I posted works by allowing one writer at a time to perform edits to "some object" that is being read unprotected by multiple reader threads. Commits of the edit are done through an atomic operation so readers can only get the pre-edit or post-edit "object". When/if the reader thread gets the update, it cannot occur in the middle of a read as the read is occurring on the old copy of the "object". A simple solution that had probably been discovered and proved to be broken in some way prior to the availability of better concurrency support in java.
Alesandrini answered 15/8, 2012 at 4:5 Comment(4)
This seems like something the collections in java.util.concurrent would do well.Glomerulonephritis
I looked at those data structures but was concerned that about their get and iterator performance compared to my solution above. It would probably be worthwhile for me to benchmark this.Alesandrini
and it is interesting to see that they can sometimes go horribly wrong as well #3293077. Not that I am saying that you shouldn't use it, just that sometimes you need to do your homework and understand how your code is being executed.Alesandrini
MilesHampson - The situation described in that question would happen basically no matter what type of concurrency constructs they were using. Suddenly terminating a thread without giving it a chance to clean up after itself (i.e. release locks it's holding) is going to be catastrophic in any situation. I think the fact that the package involved in killing threads is called Unsafe should be a big enough red flag to know that bad things might happen if you use it!Glomerulonephritis
G
13

Rather than trying to roll out your own solution, why not use a ConcurrentHashMap as your set and just set all the values to some standard value? (A constant like Boolean.TRUE would work well.)

I think this implementation works well with the many-readers-few-writers scenario. There's even a constructor that lets you set the expected "concurrency level".

Update: Veer has suggested using the Collections.newSetFromMap utility method to turn the ConcurrentHashMap into a Set. Since the method takes a Map<E,Boolean> my guess is that it does the same thing with setting all the values to Boolean.TRUE behind-the-scenes.


Update: Addressing the poster's example

That is probably what I will end up going with, but I am still curious about how my minimalist solution could fail. – MilesHampson

Your minimalist solution would work just fine with a bit of tweaking. My worry is that, although it's minimal now, it might get more complicated in the future. It's hard to remember all of the conditions you assume when making something thread-safe—especially if you're coming back to the code weeks/months/years later to make a seemingly insignificant tweak. If the ConcurrentHashMap does everything you need with sufficient performance then why not use that instead? All the nasty concurrency details are encapsulated away and even 6-months-from-now you will have a hard time messing it up!

You do need at least one tweak before your current solution will work. As has already been pointed out, you should probably add the volatile modifier to global's declaration. I don't know if you have a C/C++ background, but I was very surprised when I learned that the semantics of volatile in Java are actually much more complicated than in C. If you're planning on doing a lot of concurrent programming in Java then it'd be a good idea to familiarize yourself with the basics of the Java memory model. If you don't make the reference to global a volatile reference then it's possible that no thread will ever see any changes to the value of global until they try to update it, at which point entering the synchronized block will flush the local cache and get the updated reference value.

However, even with the addition of volatile there's still a huge problem. Here's a problem scenario with two threads:

  1. We begin with the empty set, or global={}. Threads A and B both have this value in their thread-local cached memory.
  2. Thread A obtains obtains the synchronized lock on global and starts the update by making a copy of global and adding the new key to the set.
  3. While Thread A is still inside the synchronized block, Thread B reads its local value of global onto the stack and tries to enter the synchronized block. Since Thread A is currently inside the monitor Thread B blocks.
  4. Thread A completes the update by setting the reference and exiting the monitor, resulting in global={1}.
  5. Thread B is now able to enter the monitor and makes a copy of the global={1} set.
  6. Thread A decides to make another update, reads in its local global reference and tries to enter the synchronized block. Since Thread B currently holds the lock on {} there is no lock on {1} and Thread A successfully enters the monitor!
  7. Thread A also makes a copy of {1} for purposes of updating.

Now Threads A and B are both inside the synchronized block and they have identical copies of the global={1} set. This means that one of their updates will be lost! This situation is caused by the fact that you're synchronizing on an object stored in a reference that you're updating inside your synchronized block. You should always be very careful which objects you use to synchronize. You can fix this problem by adding a new variable to act as the lock:

private volatile Collection global = new HashSet(); // start threading after this
private final Object globalLock = new Object(); // final reference used for synchronization

void allUpdatesGoThroughHere(Object exampleOperand) {
  // My hypothesis is that this prevents operations in the block being re-ordered
  synchronized(globalLock) {
    Collection copy = new HashSet(global);
    copy.remove(exampleOperand);
    // Given my hypothesis, we should have a fully constructed object here. So a 
    // reader will either get the old or the new Collection, but never an 
    // inconsistent one.
    global = copy;    
  }
}

This bug was insidious enough that none of the other answers have addressed it yet. It's these kinds of crazy concurrency details that cause me to recommend using something from the already-debugged java.util.concurrent library rather than trying to put something together yourself. I think the above solution would work—but how easy would it be to screw it up again? This would be so much easier:

private final Set<Object> global = Collections.newSetFromMap(new ConcurrentHashMap<Object,Boolean>());

Since the reference is final you don't need to worry about threads using stale references, and since the ConcurrentHashMap handles all the nasty memory model issues internally you don't have to worry about all the nasty details of monitors and memory barriers!

Glomerulonephritis answered 15/8, 2012 at 4:28 Comment(9)
I quote... "Like Hashtable but unlike HashMap, this class does not allow null to be used as a key or value." ;)Inductee
@veer - Thanks! I changed it to recommend Boolean.TRUE instead.Glomerulonephritis
By the way, you may find it advantageous to use Collections.newSetFromMap.Inductee
@veer - That's really interesting! Good to know that's there. I already voted for your answer since it actually addresses all of the poster's questions. I started writing my post before I noticed your edits at the bottom about using java.util.concurrent, otherwise I wouldn't have even bothered with an answer. You obviously know your stuff!Glomerulonephritis
thanks! You've made good suggestions for alternatives, which is probably more valuable than answering his direct questions on rolling his own, anyways... ;)Inductee
That is probably what I will end up going with, but I am still curious about how my minimalist solution could fail.Alesandrini
MilesHampson - I've updated my answer with a nice long essay on the problems in the solution you posted in your question. Enjoy!Glomerulonephritis
Thanks, that edit was exactly the kind of thing I was looking for (btw you could fix that problem by just moving the synchronized keyword to the method rather than adding a new monitor). I agree, it is so hard to reason about the behavior of concurrent code that you would have to have a really good reason not to use the java utilities. And thanks for the reference to the memory model, imho another compulsory read is Doug Lea's Concurrent Programming in Java: Design Principles and Patterns.Alesandrini
MilesHampson - Good point about being able to just move synchronized to the method level. I honestly didn't even think of that because I've always avoided that. It would definitely work here as long as nothing else is synchronizing on this. Doug Lea is a smart guy—I'm sure his book is very worth reading.Glomerulonephritis
I
7

According to the relevant Java Tutorial,

We have already seen that an increment expression, such as c++, does not describe an atomic action. Even very simple expressions can define complex actions that can decompose into other actions. However, there are actions you can specify that are atomic:

  • Reads and writes are atomic for reference variables and for most primitive variables (all types except long and double).
  • Reads and writes are atomic for all variables declared volatile (including long and double variables).

This is reaffirmed by Section §17.7 of the Java Language Specification

Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.

It appears that you can indeed rely on reference access being atomic; however, recognize that this does not ensure that all readers will read an updated value for global after this write -- i.e. there is no memory ordering guarantee here.

If you use an implicit lock via synchronized on all access to global, then you can forge some memory consistency here... but it might be better to use an alternative approach.

You also appear to want the collection in global to remain immutable... luckily, there is Collections.unmodifiableSet which you can use to enforce this. As an example, you should likely do something like the following...

private volatile Collection global = Collections.unmodifiableSet(new HashSet());

... that, or using AtomicReference,

private AtomicReference<Collection> global = new AtomicReference<>(Collections.unmodifiableSet(new HashSet()));

You would then use Collections.unmodifiableSet for your modified copies as well.


// ... All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact

You should know that making a copy here is redundant, as internally for (Object elm : global) creates an Iterator as follows...

final Iterator it = global.iterator();
while (it.hasNext()) {
  Object elm = it.next();
}

There is therefore no chance of switching to an entirely different value for global in the midst of reading.


All that aside, I agree with the sentiment expressed by DaoWen... is there any reason you're rolling your own data structure here when there may be an alternative available in java.util.concurrent? I figured maybe you're dealing with an older Java, since you use raw types, but it won't hurt to ask.

You can find copy-on-write collection semantics provided by CopyOnWriteArrayList, or its cousin CopyOnWriteArraySet (which implements a Set using the former).


Also suggested by DaoWen, have you considered using a ConcurrentHashMap? They guarantee that using a for loop as you've done in your example will be consistent.

Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.

Internally, an Iterator is used for enhanced for over an Iterable.

You can craft a Set from this by utilizing Collections.newSetFromMap like follows:

final Set<E> safeSet = Collections.newSetFromMap(new ConcurrentHashMap<E, Boolean>());
...
/* guaranteed to reflect the state of the set at read-time */
for (final E elem : safeSet) {
  ...
}
Inductee answered 15/8, 2012 at 4:16 Comment(6)
Thanks for your answer, there are a couple of points that aren't clear to me though. Why do you say that memory ordering is an issue? By the definition provided either the readers get the old data structure or the new data structure, right? Also why would I want to make global volatile given that my reader can make a copy of the reference in an atomic operation? I have a feeling I am missing something here...Alesandrini
I suggested you use volatile to guarantee other threads receive the latest written value upon reading the field... otherwise there are no guarantees when your readers will receive the updated reference.Inductee
@Alesandrini I have since updated my post. You are free to ask any more questions you might have.Inductee
Actually it doesn't matter to me when readers get the updates (famous last words...) so I am still of the opinion that my solution above will work. Thanks for your detailed edits, one thing to note is that I don't believe it is correct to say that I want an immutable collection (My post used remove() for example).Alesandrini
You do want immutable collections exposed via global; your modifications are all done to the copy. What I meant was global = Collections.unmodifiableSet(copy);Inductee
... but yes, your approach should work. Remember that there is no guarantee at all that the readers are ever read updated values, by the way.Inductee
L
1

I think your original idea was sound, and DaoWen did a good job getting the bugs out. Unless you can find something that does everything for you, it's better to understand these things than hope some magical class will do it for you. Magical classes can make your life easier and reduce the number of mistakes, but you do want to understand what they are doing.

ConcurrentSkipListSet might do a better job for you here. It could get rid of all your multithreading problems.

However, it is slower than a HashSet (usually--HashSets and SkipLists/Trees hard to compare). If you are doing a lot of reads for every write, what you've got will be faster. More importantly, if you update more than one entry at a time, your reads could see inconsistent results. If you expect that whenever there is an entry A there is an entry B, and vice versa, the skip list could give you one without the other.

With your current solution, to the readers, the contents of the map are always internally consistent. A read can be sure there's an A for every B. It can be sure that the size() method gives the precise number of elements that will be returned by the iterator. Two iterations will return the same elements in the same order.

In other words, allUpdatesGoThroughHere and ConcurrentSkipListSet are two good solutions to two different problems.

Laughing answered 15/8, 2012 at 20:25 Comment(2)
I was vaguely aware of CSLs but not their performance, I'll have a read of research.ibm.com/people/m/michael/spaa-2002.pdf to see what I can get out of them. The intent of the question was to get the best 'standard' solution for my particular question, but I think one of the fun parts of SO is giving and receiving guidance on the 'how' rather than just suggesting a 'what', which is why I kept asking the initial magical data structure answers for more details. Glad for your answer even if I can't set it as the accepted one.Alesandrini
@Alesandrini My answer's just idle chatter compared to all the other hard work done here, but I thought it worth adding. My own testing shows CSLs to be 20 to 30% slower than TreeMaps, perhaps because they're skip lists instead of trees and mostly because of their fancy memory access. TreeMaps and HashMaps are hard to compare because the HMs often have to allocate new arrays and can mess up GC's when they get large, but are much faster than TMs when they don't do those things.Laughing
M
-1

Replace the synchronized by making global volatile and you'll be alright as far as the copy-on-write goes.

Although the assignment is atomic, in other threads it is not ordered with the writes to the object referenced. There needs to be a happens-before relationship which you get with a volatile or synchronising both reads and writes.

The problem of multiple updates happening at once is separate - use a single thread or whatever you want to do there.

If you used a synchronized for both reads and writes then it'd be correct but the performance may not be great with reads needing to hand-off. A ReadWriteLock may be appropriate, but you'd still have writes blocking reads.

Another approach to the publication issue is to use final field semantics to create an object that is (in theory) safe to be published unsafely.

Of course, there are also concurrent collections available.

Malone answered 15/8, 2012 at 4:9 Comment(5)
You are definitely right about needing the volatile keyword, but I don't think it's safe to remove synchronized. Without a lock you'll have a race condition between where copy is created and assigned to global, possibly resulting in lost updates.Glomerulonephritis
@Glomerulonephritis Oh possibly depending upon what you are doing. As far as updating making sure the visible state is good, it isn't necessary. If it's really a performance issue as suggested, a ReadWriteLock would be more appropriate.Malone
In the example given there would definitely be a race condition. I would agree on the read-write lock—but then again those typically also have the problem of possible starvation of writers by infinite streams of readers. So, like you said, it all depends.Glomerulonephritis
The synchronized can't be removed, as I need a happens-before to ensure my operations modifying the data structure complete before I do the assignment. What is the advantage of making global volatile? The operation to commit the global state, and the operation to copy the global state reference before reading, are already atomic right? I don't need to impose ordering between the two, and as there are two different memory areas I would have thought that all readers can continue to operate on their area oblivious to the fact that the global reference is being swapped?Alesandrini
P.S I didn't vote you down, I think bringing volatile and ReadWriteLock into the discussion was valuable even if the removal of synchronized wouldn't work.Alesandrini
E
-1

Can you use the Collections.synchronizedSet method? From HashSet Javadoc http://docs.oracle.com/javase/6/docs/api/java/util/HashSet.html

Set s = Collections.synchronizedSet(new HashSet(...));
Endure answered 15/8, 2012 at 4:15 Comment(3)
This applies to access to the collection itself, not the reference stored in the field.Inductee
@veer, from the java doc it states public static <T> Set<T> synchronizedSet(Set<T> s) Returns a synchronized (thread-safe) set backed by the specified set. In order to guarantee serial access, it is critical that all access to the backing set is accomplished through the returned set.Endure
yes, which is not what the OP wants. I didn't downvote you, by the way.Inductee

© 2022 - 2024 — McMap. All rights reserved.