Which is more efficient: using removeAll() or using the following HashMap technique to retain only changed records in an ArrayList?
I have two ArrayLists, A and B, of the same data structure C (with hashCode() and equals() overridden). C represents a student's record. The two lists are the same size and represent new student records and old ones respectively (the students are the same in both lists; only the ordering might differ). I wish to keep only those records in A that have changed. So I do:

 A.removeAll(B)

As per the javadocs, this takes each record of A and compares it with each record of B, and if it finds the two equal, it removes the record from A. If a record of A is not equal to any record in B, then, since every student in A is also in B, that record of A must have changed. The problem is that this is easily of O(n²) complexity.

Another approach can be :

Map<Integer, C> map = new HashMap<Integer, C>(); // assuming getStudentId() returns an Integer
for (C record : B) {
    map.put(record.getStudentId(), record);
}
List<C> changedRecords = new ArrayList<C>();
for (C record : A) {
    if (!record.equals(map.get(record.getStudentId()))) { // changed records differ from their old version
        changedRecords.add(record);
    }
}

I think this might have a lower complexity than the solution above. Is that correct?

Diplodocus answered 3/4, 2012 at 7:27 Comment(1)
Forget about efficiency, your original solution is far more readable. Only if it turns out to be a bottleneck should you even consider the second. – Continuant
Yes, the latter algorithm is better than O(n²): you have two loops, one ranging over B and another over A, and you do (amortized) constant work in each iteration, so your new solution runs in O(|A| + |B|).

I suspect that you don't have any duplicate entries though. If this is the case, you could also go via a HashSet (change to LinkedHashSet if you want to preserve the order in A):

HashSet<C> tmp = new HashSet<C>(A);
tmp.removeAll(B);                     // linear operation, but see the caveat below
A = new ArrayList<C>(tmp);

(Or if order doesn't matter to you, you could use HashSets all the way through.)


As pointed out by @Daud in the comments below, HashSet.removeAll(Collection c) actually calls c.contains repeatedly if the size of the hash set is smaller than the collection, which affects the complexity (at least in OpenJDK). This is because the implementation always chooses to iterate over the smaller of the two collections.
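
For reference, here is a lightly annotated paraphrase of OpenJDK's AbstractSet.removeAll (the exact code varies by JDK version):

public boolean removeAll(Collection<?> c) {
    boolean modified = false;
    if (size() > c.size()) {
        // This set is larger: iterate c and call this.remove(),
        // which is amortized O(1) on a HashSet.
        for (Iterator<?> i = c.iterator(); i.hasNext(); )
            modified |= remove(i.next());
    } else {
        // This set is smaller (or equal): iterate the set and call
        // c.contains() per element, which is O(n) when c is an ArrayList.
        for (Iterator<?> i = iterator(); i.hasNext(); ) {
            if (c.contains(i.next())) {
                i.remove();
                modified = true;
            }
        }
    }
    return modified;
}

So in the snippet above, wrapping B first, as in tmp.removeAll(new HashSet<C>(B)), keeps the whole operation linear regardless of which branch is taken.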

Nominative answered 3/4, 2012 at 7:29 Comment(6)
do you mean a performance difference? I don't think so, because in Java HashSet is built on top of HashMap :) – Oscular
I saw the source code of HashSet, and it seems that for removeAll() it iterates through tmp and calls the contains() method on the argument passed to removeAll, with the current element of tmp as the parameter. Since the argument passed to removeAll() is an ArrayList, its contains() method takes O(n)... thus making the entire operation O(n²)? – Diplodocus
The contains method of HashSet runs in constant time (amortized). – Nominative
It's not HashSet whose contains() method is called; it's that of the Collection passed as the argument (an ArrayList in this case)... Perhaps tmp should be an ArrayList and the argument to removeAll a HashSet. – Diplodocus
I had a look at the code and you're right. This was very surprising to me. I'll update the answer with your find. – Nominative
Correct me if I am wrong... but perhaps even the contains() method of HashSet doesn't run in constant time. It ultimately uses the getEntry() method of HashMap, which uses a loop. So I think I should go with the second solution I mentioned in my question for a linear solution. – Diplodocus
What you may save in complexity you could be losing in memory allocation, so it isn't necessarily more efficient. ArrayList uses something similar to an in-place partitioning algorithm to run down the backing array and test each element against the incoming collection.

When comparing, it simply checks whether the incoming collection contains the value at the current index of the backing Object[] array. The algorithm maintains two indexes: one to iterate through the backing array and one as a write position for retained elements. In the case of a match it simply advances the index on the backing array and carries on to the next element; this is relatively cheap.

If it comes to a point where the incoming collection does not contain the value at the current index in the backing array, it overwrites the element at the write position with the element at the current index, without incurring a new memory allocation. This pattern repeats until all elements in the ArrayList have been compared against the incoming collection, hence the complexity you are concerned about.

For example, consider an ArrayList A containing 1, 2, 4, 5 and an incoming collection c containing 4, 1 that we match against; we want to remove the 4 and the 1. Here is each iteration of the loop for (; r < size; r++), where r is the read index into the backing array and w is the write index:

r = 0: does c contain 1? Yes, skip to the next element. A: 1,2,4,5, w = 0

r = 1: does c contain 2? No, copy the value at r into the slot pointed to by w++. A: 2,2,4,5, w = 1

r = 2: does c contain 4? Yes, skip. A: 2,2,4,5, w = 1

r = 3: does c contain 5? No, copy the value at r into the slot pointed to by w++. A: 2,5,4,5, w = 2

r = 4: stop.

Now w (2) is compared to the size of the list (4). Since they are not equal, the values from w to the end of the array are nulled out and the size is reset.

A: 2,5, size 2

The built-in removeAll also allows for ArrayLists that contain null, whereas your solution above could throw a NullPointerException on record.getStudentId(). Finally, removeAll protects against exceptions thrown from Collection.contains during the comparison: if that happens, it uses a finally block to do a native memory copy, which protects the backing array from corruption in a highly efficient manner.
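
For reference, here is a lightly annotated paraphrase of the OpenJDK ArrayList.batchRemove helper that removeAll delegates to (removeAll passes complement == false; the exact code varies by JDK version):

private boolean batchRemove(Collection<?> c, boolean complement) {
    final Object[] elementData = this.elementData;
    int r = 0, w = 0;   // read index and write index
    boolean modified = false;
    try {
        for (; r < size; r++)
            // keep the element if contains() matches the complement flag
            // (for removeAll, keep the elements that c does NOT contain)
            if (c.contains(elementData[r]) == complement)
                elementData[w++] = elementData[r];
    } finally {
        // if c.contains() threw an exception, copy the untested tail
        // behind the retained elements so nothing is lost
        if (r != size) {
            System.arraycopy(elementData, r, elementData, w, size - r);
            w += size - r;
        }
        // null out the slack at the end so the GC can reclaim removed elements
        if (w != size) {
            for (int i = w; i < size; i++)
                elementData[i] = null;
            size = w;
            modified = true;
        }
    }
    return modified;
}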

Stent answered 3/4, 2012 at 9:44 Comment(0)
Definitely the second 'algorithm' is better than the first under amortized analysis. But is it the best way? Do you need it? Will it cause any visible impact for the user in terms of performance? Will the number of items in the list grow so huge that it becomes a bottleneck in the system?

The first approach is more readable and conveys your intention to the people who maintain the code. It is also preferable to use a 'tested' API instead of reinventing the wheel (unless absolutely necessary). Computers have become so fast that we shouldn't do any premature optimization.

If it proves essential, I might go with a solution using a Set, similar to @aioob's.

Oscular answered 3/4, 2012 at 11:3 Comment(0)
I have encountered a performance bottleneck in removeAll in some instances (EMF model manipulation related). For ArrayList, as mentioned above, just use the standard removeAll, but if A is for instance an EList, O(n²) behavior can be encountered.

Hence, avoid relying on hidden good properties of specific implementations of List<T>; Set.contains() in O(1) is a guarantee (if you use a HashSet and have a decent hashCode; O(log n) for a TreeSet with an ordering relation), so use that to bound the algorithmic complexity.

I use the following code, which avoids useless copies; the intent is that you are scanning a data structure and finding irrelevant elements that you don't want, adding them to "todel".

For some reason, such as avoiding concurrent modifications or because you are navigating a tree, you cannot remove elements as you do this traversal. So we accumulate them into a HashSet "todel".

In the function, we need to modify "container" in place, since it is typically an attribute of the caller, but calling remove(int index) on "container" could induce repeated copying because of the left shift of elements. We iterate over a copy, "contents", to achieve this.

The Set<? extends T> argument is there because during the selection process I often get subtypes of C, but feel free to use plain <T> everywhere.

/**
 * Efficient O(n) operation to removeAll from an aggregation.
 * @param container a container for a set of elements (no duplicates), some of which we want to get rid of
 * @param todel some elements to remove, typically stored in a HashSet.
 */
public static <T> void removeAll(List<T> container, Set<? extends T> todel) {
    if (todel.isEmpty())
        return;
    List<T> contents = new ArrayList<T>(container);
    container.clear();
    // since container contains no duplicates, at most |todel| elements can match;
    // once they have all been found, skip further contains() tests
    int torem = todel.size();
    for (T elt : contents) {
        if (torem == 0 || !todel.contains(elt)) {
            container.add(elt);
        } else {
            torem--;
        }
    }
}

So in your case you would invoke it as removeAll(A, new HashSet<C>(B)), paying for one copy of B if you really cannot accumulate into a Set<C> during the selection phase.
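
A minimal usage sketch for the question's scenario (assuming, as the helper requires, that A contains no duplicate records):

// one O(|B|) copy so that lookups inside the helper are O(1)
Set<C> oldRecords = new HashSet<C>(B);
removeAll(A, oldRecords);   // A now retains only the changed records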

Place it in a utility class and statically import it for ease of use.

Nw answered 1/7, 2015 at 19:7 Comment(2)
Set.contains() is not O(1)-guaranteed at all. First of all, it is only the expected time for hash-based sets, and a bad hashCode() function can completely ruin that. For other sets (like TreeSet), it is not even expected to be O(1). – Gyve
Agreed, O(1) holds for HashSets with a semi-decent hashCode function, not for Set in general. Answer edited slightly. – Nw
