How to check if exists any duplicate in Java 8 Streams?
Asked Answered
S

7

69

In java 8, what's the best way to check if a List contains any duplicate?

My idea was something like:

list.size() != list.stream().distinct().count()

Is it the best way?

Stonemason answered 5/5, 2015 at 12:50 Comment(1)
if you are not interested in knowing what are those duplicates , then it's the best way !!Estriol
A
75

Your code would need to iterate over all elements. If you want to make sure that there are no duplicates simple method like

public static <T> boolean areAllUnique(List<T> list){
    Set<T> set = new HashSet<>();

    for (T t: list){
        if (!set.add(t))
            return false;
    }
    
    return true;
}

would be more efficient since it can give you false immediately when first non-unique element would be found.

This method could also be rewritten using Stream#allMatch which also is short-circuit (returns false immediately for first element which doesn't fulfill provided condition)
(assuming non-parallel streams and thread-safe environment)

public static <T> boolean areAllUnique(List<T> list){
    Set<T> set = new HashSet<>();
    return list.stream().allMatch(t -> set.add(t));
}

which can be farther shortened as @Holger pointed out in comment

public static <T> boolean areAllUnique(List<T> list){
    return list.stream().allMatch(new HashSet<>()::add);
}
Abseil answered 5/5, 2015 at 13:3 Comment(25)
This can be even a one-liner: return list.stream().allMatch(new HashSet<>()::add);Sisal
@Sisal Wow, I never though about that. Every day I learn something new :) Thanks!Abseil
Seems a bit dangerous to use a predicate with side effects here.Grogan
@StuartMarks Can you say something more about this subject? Are you referring to both allMatch solutions or only allMatch(new HashSet<>()::add)?Abseil
Both. The spec requires the predicate to be stateless. The usual caveat about running in parallel applies. ConcurrentHashMap might help. There may be other issues but I haven't had coffee yet. :-)Grogan
Yes, concurrency can make life harder. My answer is based on big assumption that it will be used by single thread without parallel stream, just like shown in first code example. This means that list will also not change while iterating.Abseil
@Stuart Marks: I think, for a one liner which spans the entire life cycle of the Stream, it’s acceptable to rely on the programmer to recognize whether that stream use is multi-threaded or not. Of course, if in doubt, using a ConcurrentMap might help. And a stateful predicate might not be the best thing, but is sometimes unavoidable, distinct tests being the common example. Maybe you remember this one ;^)Sisal
@Sisal Of course I remember. :-) But using a stateful predicate requires a certain amount of caution. I think it's likely that somebody will copy and paste this code into a different context, without understanding its restrictions, and introduce bugs that way.Grogan
Perhaps a brief note is due in the answer regarding @StuartMarks' concern.Narceine
thank you all. just for some context, I was thinking just in a simple case, no concurrency problems :)Stonemason
@Holger, @Pshemo, @StuartMarks, why not change to stream.seguential().allMatch(new HashSet<>()::add);? Will it be always correct? And I guess no overhead?Mines
whats wrong with return list.size() == new HashSet<>(list).size(); as suggested by @Sasha?Lotze
@MubasharAhmad It is not that it is wrong, but new HashSet<>(list) will iterate over entire list, while .allMatch(new HashSet<>()::add) is short-circuit, it will return false at first occurrence of duplicate element.Abseil
@Pshemo: Thanks this was the answer I was looking forLotze
by the way what if someone is looking for the inverse version of it e.g. it should return true if any value is duplicate? one way is to use the above with not sign at the start but should I try anyMatch function. Just curiousLotze
Using Set and anyMatch would mean Predicate should return false for every successful adding to set and true otherwise. So we can't use nice new HashSet<>()::add if we want to add negation there (unless we will explicitly cast it to Predicate<T> to be able to use negate() method like list.stream().anyMatch(((Predicate<T>) new HashSet<>()::add).negate()). Without it we would need to create Set before stream and use lambda like list.stream().anyMatch(elem -> !set.add(elem)).Abseil
If the list is sorted, you only have to look at two items next to each other, not the whole result.Casease
@Sisal Can you help me understand how the one-liner works? I can see how it would work if we had a final HashSet the predicate refers, but how come the elements are being added to the same HashSet with the current way?Edana
Instead of list.stream().allMatch(t -> set.add(t)); you can also do list.stream().allMatch(set::add))Edana
@KorayTugay a method reference of the form expression::name will evaluate expression and capture its result when the instance of the functional interface (here, a Predicate) is created and then reused throughout the entire lifetime of that object. See What is the equivalent lambda expression for System.out::println or this Q&ASisal
Btw if List is empty it will return "true" which is not correct....Obligate
@Obligate Hello. Regarding "if List is empty it will return "true" which is not correct...." could you clarify what makes you think that such result is not correct? Anyway this may interest you: Why does Stream.allMatch() return true for an empty stream?Abseil
Ok didn't know that, thanks for sharing. I assumed that allMatch in empty List should return false, so it's important to highlight that it isn't the case.Obligate
The allMatch solution isn't guaranteed to work (though it might for current stream implementations), since the API makes no guarantees that sequential streams will perform all operations in the same thread (though current implementations might), and HashSet.add is not thread safe.Seasoning
@M.Justin True, that is why before that part in answer there is clause "(assuming non-parallel streams and thread-safe environment)"Abseil
M
21

Out of the three following one-liners

  1. return list.size() == new HashSet<>(list).size();
  2. return list.size() == list.stream().distinct().count();
  3. return list.stream().allMatch(ConcurrentHashMap.newKeySet()::add);

the last one (#3) has the following benefits:

  • Performance: it short-circuits, i.e. it stops on the first duplicate (while #1 and #2 always iterate till the end) – as Pshemo commented.
  • Versatility: it allows to handle not only collections (e.g. lists), but also streams (without explicitly collecting them).

Additional notes:
 • ConcurrentHashMap.newKeySet() is a way to create a concurrent hash set.
 • Previous version of this answer suggested the following code: return list.stream().sequential().allMatch(new HashSet<>()::add); — but, as M. Justin commented, calling BaseStream#sequential() doesn't guarantee that the operations will be executed from the same thread and doesn't eliminate the need for synchronization.
 • Performance benefit from short-circuiting won't usually matter, because we typically use such code in assertions and other correctness checks, where items are expected to be unique; in contexts where duplicates are really a possibility, we typically not only check for their presence but also to find and handle them.
 • M. Justin's provides a clean (not using side-effects) short-circuiting solution for Java 22, but that's not a one-liner.

Mines answered 5/12, 2016 at 15:0 Comment(3)
#3 operates via side effects, which is potentially problematic with streams. Additionally, the API makes no guarantees that sequential streams will perform all operations in the same thread (though current implementations might), and HashSet.add is not thread safe.Seasoning
@M.Justin, thanks. I assume you're right (though I'm not expert in that; and, personally for me, appearing of an “unusual” implementation in future looks as probable as narrowing of the specification in future; but, of course, it's anyway better for the code to be theoretically correct, not just practically working). I've tried to correct the answer.Mines
That's a bunch of nonsense. If it were true, you would have to synchronize every forEach call with side-effects - which, by design, is most of them. You had better be able to safely assume List.stream() is run on the initiating thread, or the entire Java ecosystem would crumble. There's a lot of code out there that makes this assumption.Halfbeak
D
10

You can use the counting collector.

Stream.of(1, 3, 4, 6, 7, 5, 6)
            .collect(Collectors.groupingBy(
                    Function.identity(), Collectors.counting()))
            .entrySet().stream().anyMatch(e -> e.getValue() > 1)
Dosh answered 17/5, 2017 at 18:41 Comment(0)
O
4

Given array arr,

arr.length != Arrays.stream(arr).distinct().count()

will help check for duplicates

Oringa answered 17/11, 2019 at 19:16 Comment(1)
The question was not about arrays.Nephrolith
T
2

Started this class as a StreamTool, but I think there must be an even better way with reduce or similar:

public class StreamTool {

    /**
     * Whether stream records are unique in that stream.
     * @param <T> Type of records
     * @param records
     * @return true if there are no duplicates, false otherwise
     */
    public static <T> boolean isUnique(Stream<T> records) {
        return records.allMatch(new HashSet<>()::add);
    }
}
Toodleoo answered 23/12, 2015 at 3:43 Comment(0)
S
1

Here is a short-circuiting solution that uses a custom gatherer, using the JEP 461: Stream Gatherers Java 22 preview language feature:

List<Integer> list = List.of(1, 2, 4, 1, 3, 4, 4, 1);

// A parallel stream must be used until JDK-8328316 is fixed
boolean hasDuplicate =
        list.stream().parallel().gather(hasDuplicate()).findAny().orElseThrow();
static <T> Gatherer<T, ?, Boolean> hasDuplicate() {
    class State {
        boolean hasDuplicate = false;
        final Set<T> elements = new HashSet<>();
    }

    Gatherer.Integrator<State, T, Boolean> integrator =
            (state, element, downstream) -> {
                if (state.elements.add(element)) {
                    return true;
                } else {
                    state.hasDuplicate = true;
                    return false;
                }
            };
    return Gatherer.ofSequential(
            State::new,
            integrator,
            (state, downstream) -> downstream.push(state.hasDuplicate)
    );
}

This custom gatherer produces a stream with a single Boolean element indicating whether a duplicate was found. It keeps track of each element encountered in a Set, and whether a duplicate has been found as a boolean. It short circuits immediately with true when a duplicate element is encountered, or returns false if the end of the stream is reached without encountering any duplicates.

Note that this doesn't work on sequential streams in Java 22 due to JDK-8328316, but it does work in Java 23 starting with b16.

This gatherer can be made parallel (at the cost of some increased complexity) by adding a combiner which can merge two states together, e.g.:

BinaryOperator<State> combiner = (s1, s2) -> {
    if (s1.hasDuplicate || s2.hasDuplicate) {
        s1.elements.clear(); // No further need to keep these in memory
        s1.hasDuplicate = true;
    } else {
        int totalSize = s1.elements.size() + s2.elements.size();
        s1.elements.addAll(s2.elements);
        s1.hasDuplicate = s1.elements.size() != totalSize;
    }
    return s1;
};
return Gatherer.of(
        State::new,
        integrator,
        combiner,
        (state, downstream) -> downstream.push(state.hasDuplicate)
);

Gatherer:

An intermediate operation that transforms a stream of input elements into a stream of output elements, optionally applying a final action when the end of the upstream is reached. […]

[…]

There are many examples of gathering operations, including but not limited to: grouping elements into batches (windowing functions); de-duplicating consecutively similar elements; incremental accumulation functions (prefix scan); incremental reordering functions, etc. The class Gatherers provides implementations of common gathering operations.

API Note:

A Gatherer is specified by four functions that work together to process input elements, optionally using intermediate state, and optionally perform a final action at the end of input. They are:

Stream.gather(Gatherer<? super T,?,R> gatherer):

Returns a stream consisting of the results of applying the given gatherer to the elements of this stream.

Gatherer.ofSequential(initializer, integrator, finisher)

Returns a new, sequential, Gatherer described by the given initializer, integrator, combiner, and finisher.

Gatherer.of(initializer, integrator, combiner, finisher)

Returns a new, sequential, Gatherer described by the given initializer, integrator, and finisher.

Seasoning answered 15/3, 2024 at 18:50 Comment(4)
The finisher is invoked even if the integrator short-circuits, so you'll always get false emitted. I'd probably define a duplicates-Gatherer like so: ``` static <T> Gatherer<T, ?, T> duplicates() { return Gatherer.ofSequential( HashSet::new, (state, element, downstream) -> state.add(element) || downstream.push(element) ); } ``` And then do: ``` var hasDuplicates = stream.gather(duplicates).findFirst().isPresent(); ```Pallbearer
@ViktorKlang You're right that it needs to be fixed, but it does currently work as I've coded it in Java 22 due to a bug in the JDK gatherer code: JDK-8328316. I'll get a fixed version up shortly.Seasoning
@ViktorKlang I've updated my solution to only emit a single element, though it does not (yet) work for sequential streams due to JDK-8328316.Seasoning
An alternative solution that apparently does work in Java 22 is to use a Collector (but not Stream.toList): list.stream().gather(hasDuplicate()).collect(Collectors.toList()).getFirst().Seasoning
C
-1

Use set.add() it is faster.

Set<T> items = new HashSet<>();
list.stream().filter(n -> !items.add(n)) 
            .collect(Collectors.toSet());
Capua answered 18/5, 2021 at 10:17 Comment(1)
What is the point of creating two Set objects (one via new HashSet and other via Collections.toSet()? Also above code could be reduce to new HashSet<>(list), but question is not about removing duplicates, it is about checking if duplicate elements exist.Abseil

© 2022 - 2025 — McMap. All rights reserved.