Word frequency count Java 8
Asked Answered
T

13

79

How to count the frequency of words of List in Java 8?

List <String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");

The result must be:

{ciao=2, hello=1, bye=2}
Teodoor answered 18/3, 2015 at 12:43 Comment(0)
T
118

I want to share the solution I found because at first I expected to use map-and-reduce methods, but it was a bit different.

Map<String,Long> collect = wordsList.stream()
    .collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ));

Or for Integer values:

Map<String,Integer> collect = wordsList.stream()
     .collect( Collectors.groupingBy( Function.identity(), Collectors.summingInt(e -> 1) ));

EDIT

I add how to sort the map by value:

LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
            .stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .collect(Collectors.toMap(
                    Map.Entry::getKey,
                    Map.Entry::getValue,
                    (v1, v2) -> {
                        throw new IllegalStateException();
                    },
                    LinkedHashMap::new
            ));
Teodoor answered 18/3, 2015 at 12:43 Comment(0)
T
38

(NOTE: See the edits below)

As an alternative to Mounas answer, here is an approach that does the word count in parallel:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount
{
    public static void main(String[] args)
    {
        List<String> list = Arrays.asList(
            "hello", "bye", "ciao", "bye", "ciao");
        Map<String, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}

EDIT In response to the comment, I ran a small test with JMH, comparing the toConcurrentMap and the groupingByConcurrent approach, with different input list sizes and random words of different lengths. This test suggested that the toConcurrentMap approach was faster. When considering how different these approaches are "under the hood", it's hard to predict something like this.

As a further extension, based on further comments, I extended the test to cover all four combinations of toMap, groupingBy, serial and parallel.

The results are still that the toMap approach is faster, but unexpectedly (at least, for me) the "concurrent" versions in both cases are slower than the serial versions...:

             (method)  (count) (wordLength)  Mode  Cnt     Score    Error  Units
      toConcurrentMap     1000            2  avgt   50   146,636 ±  0,880  us/op
      toConcurrentMap     1000            5  avgt   50   272,762 ±  1,232  us/op
      toConcurrentMap     1000           10  avgt   50   271,121 ±  1,125  us/op
                toMap     1000            2  avgt   50    44,396 ±  0,541  us/op
                toMap     1000            5  avgt   50    46,938 ±  0,872  us/op
                toMap     1000           10  avgt   50    46,180 ±  0,557  us/op
           groupingBy     1000            2  avgt   50    46,797 ±  1,181  us/op
           groupingBy     1000            5  avgt   50    68,992 ±  1,537  us/op
           groupingBy     1000           10  avgt   50    68,636 ±  1,349  us/op
 groupingByConcurrent     1000            2  avgt   50   231,458 ±  0,658  us/op
 groupingByConcurrent     1000            5  avgt   50   438,975 ±  1,591  us/op
 groupingByConcurrent     1000           10  avgt   50   437,765 ±  1,139  us/op
      toConcurrentMap    10000            2  avgt   50   712,113 ±  6,340  us/op
      toConcurrentMap    10000            5  avgt   50  1809,356 ±  9,344  us/op
      toConcurrentMap    10000           10  avgt   50  1813,814 ± 16,190  us/op
                toMap    10000            2  avgt   50   341,004 ± 16,074  us/op
                toMap    10000            5  avgt   50   535,122 ± 24,674  us/op
                toMap    10000           10  avgt   50   511,186 ±  3,444  us/op
           groupingBy    10000            2  avgt   50   340,984 ±  6,235  us/op
           groupingBy    10000            5  avgt   50   708,553 ±  6,369  us/op
           groupingBy    10000           10  avgt   50   712,858 ± 10,248  us/op
 groupingByConcurrent    10000            2  avgt   50   901,842 ±  8,685  us/op
 groupingByConcurrent    10000            5  avgt   50  3762,478 ± 21,408  us/op
 groupingByConcurrent    10000           10  avgt   50  3795,530 ± 32,096  us/op

I'm not so experienced with JMH, maybe I did something wrong here - suggestions and corrections are welcome:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ParallelWordCount
{

    @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
    public String method;

    @Param({"2", "5", "10"})
    public int wordLength;

    @Param({"1000", "10000" })
    public int count;

    private List<String> list;

    @Setup
    public void initList()
    {
         list = createRandomStrings(count, wordLength, new Random(0));
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh)
    {

        if (method.equals("toMap"))
        {
            Map<String, Integer> counts =
                list.stream().collect(
                    Collectors.toMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("toConcurrentMap"))
        {
            Map<String, Integer> counts =
                list.parallelStream().collect(
                    Collectors.toConcurrentMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("groupingBy"))
        {
            Map<String, Long> counts =
                list.stream().collect(
                    Collectors.groupingBy(
                        Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
        else if (method.equals("groupingByConcurrent"))
        {
            Map<String, Long> counts =
                list.parallelStream().collect(
                    Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.<String> counting()));
            bh.consume(counts);
        }
    }

    private static String createRandomString(int length, Random random)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            int c = random.nextInt(26);
            sb.append((char) (c + 'a'));
        }
        return sb.toString();
    }

    private static List<String> createRandomStrings(
        int count, int length, Random random)
    {
        List<String> list = new ArrayList<String>(count);
        for (int i = 0; i < count; i++)
        {
            list.add(createRandomString(length, random));
        }
        return list;
    }
}

The times are only similar for the serial case of a list with 10000 elements, and 2-letter words.

It could be worthwhile to check whether for even larger list sizes, the concurrent versions eventually outperform the serial ones, but currently don't have the time for another detailed benchmark run with all these configurations.

Tetrahedron answered 18/3, 2015 at 16:18 Comment(7)
I guess, Collectors.groupingByConcurrent(w->w, Collectors.counting()) will be more efficient.Immature
@Immature I added an EDIT about this.Tetrahedron
Your should also care about the number of equal words. The contention for a single map entry might have a significant impact. Counting thousand distinct words might work without any contention in Java 8’s ConcurrentMap though I wouldn’t call storing 1s “counting”. So, counting thousand occurrences of the same word might give a different picture…Immature
@Immature Sure, that's what I tried to cover (as far as reasonably possible) with a list size of 10000 containing random words of length 2: There are between 6 and 32 (average: ~15) occurrances of the same words. A quick (!) test with 100000 times "aa" looks like the timings are more similar then, but this test does not tell much about the performance in a realistic application case.Tetrahedron
It would be great if you could post your benchmark code.Immature
@Immature I didn't want to bloat the answer with something that does not really answer the actual question, but I'll clean up and post the code later today.Tetrahedron
Understandable, but when discussing alternatives, having a comparison will be useful. After all, even the OP’s own groupingBy approach can work in parallel and it would be great to expand the benchmark to see the differences between all variants (groupingBy vs toMap)×(ordinary vs. Concurrent)…Immature
C
11

Find most frequent item in collection, with generics:

private <V> V findMostFrequentItem(final Collection<V> items)
{
  return items.stream()
      .filter(Objects::nonNull)
      .collect(Collectors.groupingBy(Functions.identity(), Collectors.counting()))
      .entrySet()
      .stream()
      .max(Comparator.comparing(Entry::getValue))
      .map(Entry::getKey)
      .orElse(null);
}

Compute item frequencies:

private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
  return items.stream()
      .filter(Objects::nonNull)
      .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
Correll answered 27/2, 2018 at 14:33 Comment(0)
P
4

If you use Eclipse Collections, you can just convert the List to a Bag.

Bag<String> words = 
    Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();

Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));

You can also create a Bag directly using the Bags factory class.

Bag<String> words = 
    Bags.mutable.with("hello", "bye", "ciao", "bye", "ciao");

This code will work with Java 5+.

Note: I am a committer for Eclipse Collections

Patti answered 19/3, 2015 at 2:59 Comment(0)
B
3

I'll present the solution here which I made (the one with grouping is much better :) ).

static private void test0(List<String> input) {
    Set<String> set = input.stream()
            .collect(Collectors.toSet());
    set.stream()
            .collect(Collectors.toMap(Function.identity(),
                    str -> Collections.frequency(input, str)));
}

Just my 0.02$

Bouchard answered 19/3, 2015 at 19:36 Comment(0)
M
3

Here's a way to create a frequency map using map functions.

List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();

words.forEach(word ->
        frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);

System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}

Or

words.forEach(word ->
       frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);
Misdirect answered 24/2, 2018 at 16:45 Comment(2)
I like map.merge, however not sure why you did all that work to create List<String> words why not just do : List<String> list = Arrays.asList("hello", "bye", "ciao", "bye", "ciao")Despain
@Despain Thanks! About the list creation code, I stay away from Arrays.asList because it returns a java.util.Arrays$ArrayList instead of the normal ArrayList. I prefer not using it because eventually if the data is going to be serialized in your app, it creates unwanted serialized structure with Arrays.asList.Misdirect
M
2

you can use the Java 8 Streams

    Arrays.asList(s).stream()
          .collect(Collectors.groupingBy(Function.<String>identity(), 
          Collectors.<String>counting()));
Mccomb answered 13/1, 2020 at 2:30 Comment(0)
Y
1

I am surprised no one mentioned solution using Map.getOrDefault:

Map<String, Integer> res = new HashMap<>();
wordsList.forEach(w -> res.put(w, res.getOrDefault(w, 0) + 1));

default V getOrDefault(Object key, V defaultValue)

  • Returns the value to which the specified key is mapped, or defaultValue if this map contains no mapping for the key. So, it's pretty easy to use it for frequency calculation.
Yesman answered 3/9, 2023 at 19:47 Comment(0)
I
0

Another 2 cent of mine, given an array:

import static java.util.stream.Collectors.*;

String[] str = {"hello", "bye", "ciao", "bye", "ciao"};    
Map<String, Integer> collected 
= Arrays.stream(str)
        .collect(groupingBy(Function.identity(), 
                    collectingAndThen(counting(), Long::intValue)));
Insurrection answered 13/11, 2017 at 21:3 Comment(0)
A
0
public class Main {

    public static void main(String[] args) {


        String testString ="qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx"; 
        long java8Case2 = testString.codePoints().filter(ch -> ch =='a').count();
        System.out.println(java8Case2);

        ArrayList<Character> list = new ArrayList<Character>();
        for (char c : testString.toCharArray()) {
          list.add(c);
        }
        Map<Object, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }

}
Avid answered 30/5, 2019 at 10:59 Comment(0)
S
0

I think there is a more readable way:

var words = List.of("my", "more", "more", "more", "simple", "way");
var count = words.stream().map(x -> Map.entry(x, 1))
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));

Similar to map-reduce approach, first map each word w to a (w, 1). Then aggregate(reduce part) all pairs' count(Map.Entry::getValue) where their key(word w) is similar, (Map.Entry::getKey) and calculate the sum( Integer::sum).

The final terminal operation will return a HashMap<String, Integer>:

{more=3, simple=1, my=1, way=1}
Stretchy answered 3/10, 2022 at 6:57 Comment(0)
B
0
  public static void main(String[] args) {
    String str = "Hi Hello Hi";
    List<String> s = Arrays.asList(str.split(" "));
    Map<String, Long> hm = 
              s.stream().collect(Collectors.groupingBy(Function.identity(), 
              Collectors.counting()));

              hm.entrySet().forEach(entry -> {

             System.out.println(entry.getKey() + " " + entry.getValue());
              });

}
Bikales answered 2/2, 2023 at 4:33 Comment(0)
D
0

Little late in the game, but I got another way using overloaded stream.collect(supplier, accumulator, combiner). This method is cool because it is generic. One doesn't need to memorize all the collectors. One can use this approach, and then modify it to be more concise using other kind of Collectors. Here remember, the combiner is required at runtime only in parallel stream. It is not used in non parallel. In non parallel, it is required only for compilation.

        List <String> li = List.of("hello", "bye", "ciao", "bye", "ciao");
    var mp = li.stream()
    .parallel()
        .collect(
            HashMap<String, Integer>::new, // supplier
            (HashMap<String, Integer>m, String s2) -> m.put(s2, m.getOrDefault(s2, 0) + 1), // accumulator
            (HashMap<String, Integer>m1, HashMap<String, Integer>m2) -> { // combiner
                for(var k: m2.keySet()){
                    m1.put(k, m1.getOrDefault(k, 0) + m2.get(k));
                }
            });
    System.out.println(mp);
Discussant answered 23/3, 2024 at 17:44 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.