In Java (1.5 or later), what is the best performing way to fetch an (any) element from a Set?

In the code below, I needed to fetch an element, any element, from toSearch. I was unable to find a useful method on the Set interface definition that returns just a single (arbitrary, not required to be random) member of the set. So, I used the toArray()[0] technique (shown in the code below).

private Set<Coordinate> floodFill(Value value, Coordinate coordinateStart)
{
    Set<Coordinate> result = new LinkedHashSet<Coordinate>();

    Set<Coordinate> toSearch = new LinkedHashSet<Coordinate>();
    toSearch.add(coordinateStart);
    while (toSearch.size() > 0)
    {
        Coordinate coordinate = (Coordinate)toSearch.toArray()[0];
        result.add(coordinate);
        toSearch.remove(coordinate);
        for (Coordinate coordinateAdjacent: getAdjacentCoordinates(coordinate))
        {
            if (this.query.getCoordinateValue(coordinateAdjacent) == value)
            {
                if (!result.contains(coordinateAdjacent))
                {
                    toSearch.add(coordinateAdjacent);
                }
            }
        }
    }

    return result;
}

The other technique I have seen discussed is to replace "(Coordinate)toSearch.toArray()[0]" with "toSearch.iterator().next()". Which technique, toArray() or iterator(), is likely to execute fastest with the least GC (garbage collection) impact?

My intuition (after composing this question) is that the second technique, using the Iterator, will be both faster to execute and cheaper for the GC. Given that I don't know the implementation of the Set being passed (HashSet or LinkedHashSet being the most likely), how much overhead is incurred by each of the toArray() and iterator() methods? Any insights on this would be greatly appreciated.

Questions (repeated from above):

  1. Which technique, toArray() or iterator(), is likely to execute fastest with the least GC (garbage collection) impact?
  2. Given that I don't know the implementation of the Set being passed (HashSet or LinkedHashSet being the most likely), how much overhead is incurred by each of the toArray() and iterator() methods? (See the timing sketch after this list.)
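
For concreteness, here is a crude timing sketch of the two techniques (added for illustration only; this is not a rigorous benchmark - there is no JIT warm-up and only a single measurement per approach):

import java.util.LinkedHashSet;
import java.util.Set;

public class FetchFirstTiming {
    public static void main(String[] args) {
        Set<Integer> set = new LinkedHashSet<Integer>();
        for (int i = 0; i < 10000; i++) set.add(i);

        long t0 = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            Integer first = (Integer) set.toArray()[0]; // allocates a 10000-element array every pass
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            Integer first = set.iterator().next(); // allocates only the Iterator object
        }
        long t2 = System.nanoTime();
        System.out.println("toArray:  " + (t1 - t0) + " ns");
        System.out.println("iterator: " + (t2 - t1) + " ns");
    }
}
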
Perfectionist answered 4/12, 2010 at 23:52 Comment(0)

toSearch.iterator().next() will be faster and less memory-intensive because it does not need to copy any data, whereas toArray will allocate and copy the contents of the set into the array. This is irrespective of the actual implementation: toArray will always have to copy data.
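
To illustrate (not part of the original answer): a minimal sketch of the iterator approach, with a hypothetical generic helper any() standing in for the "give me any element" method missing from the Set interface:

import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class AnyElement {
    // Hypothetical helper: returns an arbitrary element without copying
    // the set's contents; only the Iterator object itself is allocated.
    static <T> T any(Set<T> set) {
        Iterator<T> it = set.iterator();
        return it.next(); // throws NoSuchElementException if the set is empty
    }

    public static void main(String[] args) {
        Set<String> set = new LinkedHashSet<String>();
        set.add("a");
        set.add("b");
        System.out.println(any(set)); // "a" (LinkedHashSet preserves insertion order)
    }
}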

Phototherapy answered 4/12, 2010 at 23:57 Comment(4)
That makes sense to me. And is it even further accurate to assume that when toArray() generates the array, it likely uses the very same iterator implementation to fill it? If so, then it's pretty obvious to me that the iterator approach is far preferable.Perfectionist
It may or may not use the iterator to fill the array. An ArrayList probably won't - it can use System.arraycopy to do a fast copy. In any case, you don't want to be copying data if you can avoid it.Phototherapy
It's difficult for me to believe I am the first person to encounter this dilemma. It sure would have been nice if the Set interface defined a method, say iteratorFirstElement(), whose implementations default to iterator().next(). Instead, I have to put this snippet all over my code base. Ugh!Perfectionist
You're not the first :) However, in your case it looks like @Petro is right: you could replace the Set with a Queue and avoid the problem. You can use a Set to hold visited nodes in your search to avoid cycles and a Queue to hold the open node set.Phototherapy

From what I can see, you are doing a breadth-first search.

Below is an example of how it could be implemented without using toArray():

private Set<Coordinate> floodFill(Value value, Coordinate coordinateStart) {
    final Set<Coordinate> visitedCoordinates = new LinkedHashSet<Coordinate>();
    final Deque<Coordinate> deque = new ArrayDeque<Coordinate>();

    deque.add(coordinateStart);

    while (!deque.isEmpty()) {
        final Coordinate currentVertex = deque.poll();
        visitedCoordinates.add(currentVertex);
        for (Coordinate coordinateAdjacent : getAdjacentCoordinates(currentVertex)) {
            if (this.query.getCoordinateValue(coordinateAdjacent) == value) {
                if (!visitedCoordinates.contains(coordinateAdjacent)) {
                    deque.add(coordinateAdjacent);
                }
            }
        }
    }

    return visitedCoordinates;
}

Implementation notes:

Quoting your comment: "And now I am concerned that the contains() method's implementation on LinkedList could be doing up to a full scan of the contents before returning the answer."

You are right about the full scan (a.k.a. linear search). Nevertheless, in your case it's possible to keep an additional set that tracks the already-visited vertices (by the way, that set is actually your result!); that would reduce each contains() check to O(1).

Cheers

Rubie answered 5/12, 2010 at 0:11 Comment(6)
I just copied the method and reimplemented it using a Queue<Coordinate> toSearch = new LinkedList<Coordinate>(); By doing that, though, I lost the automatic "duplicate elimination" that comes from add() when the member is already present. So, I had to add an additional if (!toSearch.contains(coordinateAdjacent)) statement to prevent elements from being duplicated in the Queue. Without testing, it feels like I made the method more costly (how fast is the contains() method on a LinkedList? My memory from many years ago is that it is very poor).Perfectionist
I'll post my reimplementation with Queue as an answer.Perfectionist
Nice implementation. Question: Why did you use Deque as opposed to Queue? Is there a subtle advantage I am not understanding?Perfectionist
Good question. I think in this particular case there is no advantage. Deque supports element insertion and removal at both ends (as opposed to Queue). I usually use this general interface when I need a stack/queue. You could safely use Queue here. Cheers.Rubie
Petro, one more question - why the liberal use of final? I thought I had read elsewhere that it was now considered "bad practice" to do this as the JVMs were able to derive most of the assumptions from path analysis. I am getting ready to test different variations of the rewrite (I just posted the rewrite as an answer with the method named floodFind3). So, if you think my adding final ought to have a performance impact, I will add it to my empirical testing.Perfectionist
I don't think I could outsmart the JVM at path analysis, and final is not for the JVM, it's for me :-). I like the idea of having immutable objects. If you are interested, you could take a look at the presentations from which I borrowed the idea of having as much immutability as possible. ## Persistent Data Structures and Managed References. infoq.com/presentations/Value-Identity-State-Rich-Hickey ## Google Collections Library for Java (1 of 2). youtube.com/watch?v=ZeO_J2OcHYM. ## Google Collections Library for Java (2 of 2). youtube.com/watch?v=9ni_KEkHfto ##Rubie

Here's how I'd implement this:

private Set<Coordinate> floodFill(Value value, Coordinate start) {
    Set<Coordinate> result = new LinkedHashSet<Coordinate>();
    LinkedList<Coordinate> toSearch = new LinkedList<Coordinate>();
    toSearch.add(start);
    do {
        Coordinate coordinate = toSearch.removeFirst();
        if (result.add(coordinate)) {
            for (Coordinate adjacent : getAdjacentCoordinates(coordinate)) {
                if (this.query.getCoordinateValue(adjacent) == value) {
                    toSearch.add(adjacent);
                }
            }
        }
    } while (!toSearch.isEmpty());
    return result;
}

Notes:

  1. If you think about it, the toSearch data structure doesn't need to contain unique elements.
  2. Using a LinkedList for toSearch means that there is a simple method to get an element and remove it in one go.
  3. We can use the fact that Set.add(...) returns a boolean to halve the number of lookups in the result set, compared with calling Set.contains() followed by Set.add() (see the short sketch after these notes).
  4. It would be better to use HashSet rather than LinkedHashSet for the results ... unless you need to know the order in which coordinates were added by the fill.
  5. Using == to compare Value instances is potentially a bit dodgy.
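
A minimal sketch of note 3 (the class below is a throwaway, just for demonstration): Set.add() reports whether the element was newly inserted, so one call replaces a contains()/add() pair:

import java.util.HashSet;
import java.util.Set;

public class AddReturnsBoolean {
    public static void main(String[] args) {
        Set<String> result = new HashSet<String>();
        System.out.println(result.add("a")); // true: newly inserted
        System.out.println(result.add("a")); // false: already present
        // One hash lookup per element instead of the two needed by
        // contains("a") followed by add("a").
    }
}
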
Ephor answered 5/12, 2010 at 13:28 Comment(16)
Nice! Regarding point "5. Using == to compare Value is potentially a bit dodgy." - Value is an Enum. I could turn it into an equals() call, and as long as HotSpot inlined the code, it ought to result in the same comparison (given my understanding of how Enum is implemented).Perfectionist
Regarding point 4, I am using a LinkedHashSet so that I have "reproducibility". HashSet's iterator() has no ordering constraints, which makes it very difficult to re-create a precise context when I uncover an issue.Perfectionist
1) Calls to Value.equals(...) would be inlined if equals just does ==. 2) The fact that the HashSet iteration order is not reproducible suggests that Coordinate does not override equals and hashCode. That means that getAdjacentCoordinates must always return the same coordinate object instances for a given coordinate.Ephor
If Coordinate.equals and hashCode used the x & y indexes, then the iteration order of a HashSet of Coordinates would be 100% reproducible, even though it looked "random". (I'm assuming that the visit order stays the same.)Ephor
My direct experience with HashSet is that the order of instances returned from next() on iterator() is not deterministic (even with equals and hashCode validly implemented AND the same insertion order). Here's a recent SOF thread indicating this is a pretty common problem: https://mcmap.net/q/544512/-iteration-order-of-hashsetPerfectionist
I'd like to see example code. Also, the accepted answer confirms my understanding ... that there is nothing inherently non-deterministic about HashMap.Ephor
The code I last tested in this area (6 years ago, Java 1.3) involved passing an instance of Random throughout the framework so I could reliably reproduce the random number sequence. And then I couldn't get the experiment to reproduce the same results with the same starting seed (new Random(0)). It was very frustrating to eventually figure out (after 6+ hours of single-stepping a huge amount of iterative code) that iterator() on HashSet was the culprit. Ultimately, the randomness came from the GC itself. Somehow, the guarantee of zeroed memory didn't hold for iterator().Perfectionist
This is a new experiment using Java 1.6. And since you obviously have had a different (newer?) experience in this area, you have me quite curious. I am willing to spend a couple of hours converting back to HashSet and seeing if I get reproducibility over 100+ million invocations across dozens of classes and hundreds of methods. I will post here and let you know the result. My guesstimated ETA is by EOD this Sunday, Dec/12th. If I do get reproducibility, I think I ALSO get some form of performance gain, too. Unfortunately, I won't really know how to empirically measure/compare it to LinkedHashSet.Perfectionist
@Stephen: After extensive testing cycles, I was able to determine your approach (even with very maze-like patterns in the data) was anywhere from 4-12% faster than a number of variations using the extra tracking set. And I would set your answer to be "the correct answer" if I had worded my original question a bit better. Technically, the checked answer directly addresses my question. Your answer provided an even more efficient way. Thank you for sharing.Perfectionist
@Stephen: I am still trying to find the time to do a set of tests validating/invalidating the claim that HashSet/HashMap iteration order is random even when all stored instances have correctly overridden equals() and hashCode() methods.Perfectionist
@chaotic3equalibrium - make sure those tests use objects that have the same hash codes across different runs (e.g. no identityHashCodes that may be different in different runs), and that they are added to / removed from clean hashsets with the same initial size / load factor in the same order. Under those conditions, iteration order should be repeatable, and theoretically predictable. (Alternatively, read and understand the source code for HashMap / HashSet.)Ephor
@Stephen: I have just completed my run. And randomness showed up. UGH! I did as you said and overrode hashCode() in all classes which were being stored in HashSet and HashMap (as keys). I ensured the class instances had exactly the same hashCode() return value and that the value was identical across runs. And I used the default initialization (empty parameters) for every instance created so as to ensure identical initial size and load factor. This included replacing two Enum classes with a custom enum implementation (hashCode() is final in Enum and can change value between runs). More...Perfectionist
Continued... I have been meticulous across my code base (using Eclipse to search my project for all instances). And the randomness showed up in my very first run (a three-hour process). Every possible point of randomness I know of has been replaced with a deterministic mechanism. I am open to any further suggested pathways to try. I #hate# the unexplained presence of unpredictability. And I am happy to keep hunting to isolate and explain how/why this is occurring. I don't want to use LinkedHash* as my solution when it's not technically needed. Any feedback is appreciated.Perfectionist
@Stephen: After a moderate review of HashSet (which defers to HashMap for its implementation), HashMap (which depends upon AbstractMap) and AbstractMap, I cannot find any point where randomness in the iterator would be introduced. It's extraordinarily dense code (javakey.net/source/jdk/1.6/index.html - java.util.HashMap), so I don't think it will be very easy to detect where the randomness is introduced without wrapping it in a telemetry class and attempting to find the exact instruction cascade that generates the bit of randomness (maybe related to the load-factor floating point value)?Perfectionist
@chaotic3equalibrium - I don't know what to say. How can you find a point of randomness that isn't there? Or prove that there isn't one? If I was trying to track this down, I'd create a wrapper for HashMap that logged the hashcode of each and every object added and removed, and compare the logs from two runs of the algorithm.Ephor
@Stephen: Good idea! I'll try that. It may take a bit of work to set up (that's quite a bit of storage - three hours of straight run-through). Or I may create a small test app that forces the issue to occur and then create the wrapper in that. I hope to be able to do that today. I'll report back.Perfectionist

After Petro's response, I copied the method and reimplemented it according to his suggestions. It looks like this:

private Set<Coordinate> floodFind2(Value value, Coordinate coordinateStart)
{
    Set<Coordinate> result = new LinkedHashSet<Coordinate>();

    Queue<Coordinate> toSearch = new LinkedList<Coordinate>();
    toSearch.add(coordinateStart);
    while (!toSearch.isEmpty())
    {
        Coordinate coordinate = toSearch.remove();
        result.add(coordinate);
        for (Coordinate coordinateAdjacent: getAdjacentCoordinates(coordinate))
        {
            if (getCoordinateValue(coordinateAdjacent).equals(value))
            {
                if (!result.contains(coordinateAdjacent))
                {
                    if (!toSearch.contains(coordinateAdjacent))
                    {
                        toSearch.add(coordinateAdjacent);
                    }
                }
            }
        }
    }

    return result;
}

By moving from Set to Queue, my efficiency questions shifted to the new conditional check I had to add: "if (!toSearch.contains(coordinateAdjacent))". With the Set interface, duplicates were silently rejected for me. With the Queue interface, I have to check explicitly to ensure I'm not adding a duplicate.

And now I am concerned that the contains() method's implementation on LinkedList could be doing up to a full scan of the contents before returning the answer. So, comparing this method to the one I originally posted, which is likely to be more efficient (before I go spend a good quantity of time doing the empirical testing)?
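
For reference, a minimal sketch of one way to make that duplicate check O(1): pair the queue with a HashSet recording everything ever enqueued. The helper enqueueOnce below is hypothetical, purely for illustration:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class DedupQueue {
    // Hypothetical helper: enqueue an item only if it has never been
    // enqueued before. The HashSet check is O(1), unlike
    // LinkedList.contains(), which scans the list in O(n).
    static <T> void enqueueOnce(Queue<T> queue, Set<T> seen, T item) {
        if (seen.add(item)) { // add() returns false if already present
            queue.add(item);
        }
    }

    public static void main(String[] args) {
        Queue<Integer> toSearch = new LinkedList<Integer>();
        Set<Integer> seen = new HashSet<Integer>();
        enqueueOnce(toSearch, seen, 1);
        enqueueOnce(toSearch, seen, 1); // ignored: already seen
        System.out.println(toSearch);   // prints [1]
    }
}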

Perfectionist answered 5/12, 2010 at 1:10 Comment(5)
Looks like this might be a more common problem than I thought initially: #2319586Perfectionist
It's really impossible to say which variant would be faster without trying it out empirically. For one thing, it depends on how the data is distributed: will toSearch.contains() find the item quickly, or will it more often be searching through the whole list?Apicella
Is this an answer? It doesn't look like one to me.Ephor
I'd suggest you also maintain a Set of already-explored nodes. Your current implementation will enter an infinite loop if you have 4 nodes, A, B, C and D, such that they form a square (i.e. A -> B, B -> C, C -> D, D -> A and the reverse edges). You can add each node as you visit it to the explored set and check that new neighbours have not already been explored. These two operations are both constant time with a HashSet.Phototherapy
Oh! I just noticed the if (!result.contains(...)) bit. Ignore that last comment.Phototherapy

Okay, below is my latest implementation incorporating feedback (mainly from Stephen, Cameron and Petro), which completely eliminates the toArray()[0]-vs-iterator().next() conflict. I have sprinkled in comments to more accurately distinguish what's occurring and why, and to better clarify why I concretely implemented Petro's original "use a tracking Set" advice (seconded by Cameron). Just after the code snippet, I will contrast it with the other proposed solutions.

private Set<Coordinate> floodFind3(Coordinate coordinate)
{
    Set<Coordinate> area = new LinkedHashSet<Coordinate>(); // only the area of value (the value found at coordinate)

    area.add(coordinate);
    Value value = getCoordinateValue(coordinate); // value upon which to expand the area
    Set<Coordinate> checked = new LinkedHashSet<Coordinate>(); // every coordinate evaluated, regardless of value
    checked.add(coordinate);
    Queue<Coordinate> candidates = new LinkedList<Coordinate>(); // coordinates evaluated, found to be of value, and queued to iterate through their adjacents
    candidates.add(coordinate);
    while (!candidates.isEmpty())
    {
        for (Coordinate coordinateAdjacent : getAdjacentCoordinates(candidates.remove()))
        {
            if (checked.add(coordinateAdjacent)) // expands through coordinates of both value and !value, each checked exactly once
            {
                if (getCoordinateValue(coordinateAdjacent) == value)
                {
                    area.add(coordinateAdjacent); // expands only through coordinates of value
                    candidates.add(coordinateAdjacent); // expands and contracts; holds only coordinates of value
                }
            }
        }
    }

    return area;
}

I have updated the method in several significant ways:

  1. One less method parameter: I removed a parameter, as it was derivable from the search; this also eliminated a possible logical issue where the starting coordinate points at a location containing !value.
  2. Three collections track the search: area (Set), checked (Set) and candidates (Queue). The code comments clarify the specific use of each. I used LinkedHashSet for reliable reproducibility while chasing bugs and performance issues (https://mcmap.net/q/544512/-iteration-order-of-hashset). Once stable, I will likely revert to the faster HashSet implementation.
  3. Reordered the "check if already evaluated" test to come before the "is value" test so that every coordinate is visited exactly once. This avoids revisiting !value adjacent coordinates more than once. Also incorporated Stephen's clever double use of the Set add() method. This becomes very important as the area to flood becomes more maze-like (snaky/spidery).
  4. Kept the "==" for checking value, forcing a reference comparison. Value is defined to be a Java 1.5 Enum, and I didn't want to depend upon HotSpot both to inline the .equals() method call and to reduce it to a reference comparison. If Value were ever to change from being an Enum, this choice could come back to bite me. Tyvm to Stephen for pointing this out.

Petro's and Stephen's solutions visit the coordinates containing value just once, but revisit the coordinates containing !value more than once, which can result in quite a few duplicate fetches/value checks for areas consisting of long maze-like tunnels. While "long maze-like tunnels" may be considered a pathological case, it is typical of the particular domain for which I need this method. And my "2nd" attempted solution (the one with the poorly performing LinkedList contains() call) was questionable as a real answer ({nod} to Stephen on that one).

Thank you for all your feedback.

Next up, lots of empirical testing with single variations/changes over hundreds of millions of invocations. I'll update this answer with details sometime this weekend.

Perfectionist answered 7/12, 2010 at 4:54 Comment(4)
Since you are starting from a graph and building a tree from it, I'd recommend using a TreeSet for area. Order guaranteed.Rubie
Petro, interesting. I will check it out and see how it affects performance.Perfectionist
Generally you'll get O(log N) for searching/checking/inserting operations. HashSet is cheaper and has O(1) for all of these. The bigger N gets, the bigger the performance difference you'll see.Rubie
After extensive testing in a number of configurations, it turns out that Stephen C's implementation above is between 4-12% faster than any of the other solution variations.Perfectionist
