How to count the frequency of words of List in Java 8?
List <String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");
The result must be:
{ciao=2, hello=1, bye=2}
How to count the frequency of words of List in Java 8?
List <String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");
The result must be:
{ciao=2, hello=1, bye=2}
I want to share the solution I found because at first I expected to use map-and-reduce methods, but it was a bit different.
Map<String,Long> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ));
Or for Integer values:
Map<String,Integer> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.summingInt(e -> 1) ));
EDIT
I add how to sort the map by value:
LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
.stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.collect(Collectors.toMap(
Map.Entry::getKey,
Map.Entry::getValue,
(v1, v2) -> {
throw new IllegalStateException();
},
LinkedHashMap::new
));
(NOTE: See the edits below)
As an alternative to Mounas answer, here is an approach that does the word count in parallel:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class ParallelWordCount
{
public static void main(String[] args)
{
List<String> list = Arrays.asList(
"hello", "bye", "ciao", "bye", "ciao");
Map<String, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
EDIT In response to the comment, I ran a small test with JMH, comparing the
toConcurrentMap
and thegroupingByConcurrent
approach, with different input list sizes and random words of different lengths. This test suggested that thetoConcurrentMap
approach was faster. When considering how different these approaches are "under the hood", it's hard to predict something like this.As a further extension, based on further comments, I extended the test to cover all four combinations of
toMap
,groupingBy
, serial and parallel.The results are still that the
toMap
approach is faster, but unexpectedly (at least, for me) the "concurrent" versions in both cases are slower than the serial versions...:
(method) (count) (wordLength) Mode Cnt Score Error Units
toConcurrentMap 1000 2 avgt 50 146,636 ± 0,880 us/op
toConcurrentMap 1000 5 avgt 50 272,762 ± 1,232 us/op
toConcurrentMap 1000 10 avgt 50 271,121 ± 1,125 us/op
toMap 1000 2 avgt 50 44,396 ± 0,541 us/op
toMap 1000 5 avgt 50 46,938 ± 0,872 us/op
toMap 1000 10 avgt 50 46,180 ± 0,557 us/op
groupingBy 1000 2 avgt 50 46,797 ± 1,181 us/op
groupingBy 1000 5 avgt 50 68,992 ± 1,537 us/op
groupingBy 1000 10 avgt 50 68,636 ± 1,349 us/op
groupingByConcurrent 1000 2 avgt 50 231,458 ± 0,658 us/op
groupingByConcurrent 1000 5 avgt 50 438,975 ± 1,591 us/op
groupingByConcurrent 1000 10 avgt 50 437,765 ± 1,139 us/op
toConcurrentMap 10000 2 avgt 50 712,113 ± 6,340 us/op
toConcurrentMap 10000 5 avgt 50 1809,356 ± 9,344 us/op
toConcurrentMap 10000 10 avgt 50 1813,814 ± 16,190 us/op
toMap 10000 2 avgt 50 341,004 ± 16,074 us/op
toMap 10000 5 avgt 50 535,122 ± 24,674 us/op
toMap 10000 10 avgt 50 511,186 ± 3,444 us/op
groupingBy 10000 2 avgt 50 340,984 ± 6,235 us/op
groupingBy 10000 5 avgt 50 708,553 ± 6,369 us/op
groupingBy 10000 10 avgt 50 712,858 ± 10,248 us/op
groupingByConcurrent 10000 2 avgt 50 901,842 ± 8,685 us/op
groupingByConcurrent 10000 5 avgt 50 3762,478 ± 21,408 us/op
groupingByConcurrent 10000 10 avgt 50 3795,530 ± 32,096 us/op
I'm not so experienced with JMH, maybe I did something wrong here - suggestions and corrections are welcome:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Thread)
public class ParallelWordCount
{
@Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
public String method;
@Param({"2", "5", "10"})
public int wordLength;
@Param({"1000", "10000" })
public int count;
private List<String> list;
@Setup
public void initList()
{
list = createRandomStrings(count, wordLength, new Random(0));
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testMethod(Blackhole bh)
{
if (method.equals("toMap"))
{
Map<String, Integer> counts =
list.stream().collect(
Collectors.toMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("toConcurrentMap"))
{
Map<String, Integer> counts =
list.parallelStream().collect(
Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("groupingBy"))
{
Map<String, Long> counts =
list.stream().collect(
Collectors.groupingBy(
Function.identity(), Collectors.<String>counting()));
bh.consume(counts);
}
else if (method.equals("groupingByConcurrent"))
{
Map<String, Long> counts =
list.parallelStream().collect(
Collectors.groupingByConcurrent(
Function.identity(), Collectors.<String> counting()));
bh.consume(counts);
}
}
private static String createRandomString(int length, Random random)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++)
{
int c = random.nextInt(26);
sb.append((char) (c + 'a'));
}
return sb.toString();
}
private static List<String> createRandomStrings(
int count, int length, Random random)
{
List<String> list = new ArrayList<String>(count);
for (int i = 0; i < count; i++)
{
list.add(createRandomString(length, random));
}
return list;
}
}
The times are only similar for the serial case of a list with 10000 elements, and 2-letter words.
It could be worthwhile to check whether for even larger list sizes, the concurrent versions eventually outperform the serial ones, but currently don't have the time for another detailed benchmark run with all these configurations.
ConcurrentMap
though I wouldn’t call storing 1
s “counting”. So, counting thousand occurrences of the same word might give a different picture… –
Immature groupingBy
approach can work in parallel and it would be great to expand the benchmark to see the differences between all variants (groupingBy
vs toMap
)×(ordinary vs. Concurrent
)… –
Immature Find most frequent item in collection, with generics:
private <V> V findMostFrequentItem(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Functions.identity(), Collectors.counting()))
.entrySet()
.stream()
.max(Comparator.comparing(Entry::getValue))
.map(Entry::getKey)
.orElse(null);
}
Compute item frequencies:
private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
If you use Eclipse Collections, you can just convert the List
to a Bag
.
Bag<String> words =
Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();
Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));
You can also create a Bag
directly using the Bags
factory class.
Bag<String> words =
Bags.mutable.with("hello", "bye", "ciao", "bye", "ciao");
This code will work with Java 5+.
Note: I am a committer for Eclipse Collections
I'll present the solution here which I made (the one with grouping is much better :) ).
static private void test0(List<String> input) {
Set<String> set = input.stream()
.collect(Collectors.toSet());
set.stream()
.collect(Collectors.toMap(Function.identity(),
str -> Collections.frequency(input, str)));
}
Just my 0.02$
Here's a way to create a frequency map using map functions.
List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();
words.forEach(word ->
frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);
System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}
Or
words.forEach(word ->
frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);
you can use the Java 8 Streams
Arrays.asList(s).stream()
.collect(Collectors.groupingBy(Function.<String>identity(),
Collectors.<String>counting()));
I am surprised no one mentioned solution using Map.getOrDefault
:
Map<String, Integer> res = new HashMap<>();
wordsList.forEach(w -> res.put(w, res.getOrDefault(w, 0) + 1));
default V getOrDefault(Object key, V defaultValue)
Another 2 cent of mine, given an array:
import static java.util.stream.Collectors.*;
String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
Map<String, Integer> collected
= Arrays.stream(str)
.collect(groupingBy(Function.identity(),
collectingAndThen(counting(), Long::intValue)));
public class Main {
public static void main(String[] args) {
String testString ="qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx";
long java8Case2 = testString.codePoints().filter(ch -> ch =='a').count();
System.out.println(java8Case2);
ArrayList<Character> list = new ArrayList<Character>();
for (char c : testString.toCharArray()) {
list.add(c);
}
Map<Object, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
I think there is a more readable way:
var words = List.of("my", "more", "more", "more", "simple", "way");
var count = words.stream().map(x -> Map.entry(x, 1))
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
Similar to map-reduce approach, first map each word w to a (w, 1). Then aggregate(reduce part) all pairs' count(Map.Entry::getValue
) where their key(word w) is similar, (Map.Entry::getKey
) and calculate the sum( Integer::sum
).
The final terminal operation will return a HashMap<String, Integer>
:
{more=3, simple=1, my=1, way=1}
public static void main(String[] args) {
String str = "Hi Hello Hi";
List<String> s = Arrays.asList(str.split(" "));
Map<String, Long> hm =
s.stream().collect(Collectors.groupingBy(Function.identity(),
Collectors.counting()));
hm.entrySet().forEach(entry -> {
System.out.println(entry.getKey() + " " + entry.getValue());
});
}
Little late in the game, but I got another way using overloaded stream.collect(supplier, accumulator, combiner). This method is cool because it is generic. One doesn't need to memorize all the collectors. One can use this approach, and then modify it to be more concise using other kind of Collectors. Here remember, the combiner is required at runtime only in parallel stream. It is not used in non parallel. In non parallel, it is required only for compilation.
List <String> li = List.of("hello", "bye", "ciao", "bye", "ciao");
var mp = li.stream()
.parallel()
.collect(
HashMap<String, Integer>::new, // supplier
(HashMap<String, Integer>m, String s2) -> m.put(s2, m.getOrDefault(s2, 0) + 1), // accumulator
(HashMap<String, Integer>m1, HashMap<String, Integer>m2) -> { // combiner
for(var k: m2.keySet()){
m1.put(k, m1.getOrDefault(k, 0) + m2.get(k));
}
});
System.out.println(mp);
© 2022 - 2025 — McMap. All rights reserved.
Collectors.groupingByConcurrent(w->w, Collectors.counting())
will be more efficient. – Immature