Java: how to find top 10 most common String + frequency in ArrayList?
Asked Answered
J

4

5

I am trying to find the top 10 most common Strings in an ArrayList + their count (frequency of occurrence).

How may I do this with the best time complexity?

The below code finds the most common word + frequency in the form (String=int)

E.g. the=2

  public static Entry<String, Integer> get10MostCommon(WordStream words) {

    ArrayList<String> list = new ArrayList<String>();
    Map<String, Integer> stringsCount = new HashMap<>();
    Map.Entry<String, Integer> mostRepeated = null;

    for (String i : words) {
      list.add(i);
    }

    for (String s : list) {
      Integer c = stringsCount.get(s);
      if (c == null)
        c = new Integer(0);
      c++;
      stringsCount.put(s, c);
    }

    for (Map.Entry<String, Integer> e : stringsCount.entrySet()) {
      if (mostRepeated == null || mostRepeated.getValue() < e.getValue())
        mostRepeated = e;
    }
    return mostRepeated;
  }
Jetpropelled answered 14/3, 2016 at 16:26 Comment(0)
S
14

You could do it in two steps, using Java 8 streams:

Map<String, Long> map = list.stream()
        .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

List<Map.Entry<String, Long>> result = map.entrySet().stream()
        .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
        .limit(10)
        .collect(Collectors.toList());

The first stream maps words to their frequency by using Collectors.groupingBy() along with Collectors.counting().

This returns a map, whose entries are streamed and sorted by map entry value, in reverse order. Then, the stream is limited to keep only 10 elements, which are finally collected to a list.

Stinkhorn answered 14/3, 2016 at 17:26 Comment(7)
upvote for Collectors.groupingBy(), I totally forgot about this nice function.Osanna
That's the way to do it. Succinct and elegant.Imposture
@Jetpropelled Actually, this code is simple and succinct, but it would be better if the first map would be ordered by word frequency from the beggining (this is, when collecting). Then, we'd only need to keep its first 10 entries.Stinkhorn
@Imposture see above commentStinkhorn
@Mr.Yetti see above commentStinkhorn
@FedericoPeraltaSchaffner - Sorry, I still like the original solution. I don't think the first n entries is going to be hard and fast.Imposture
@Imposture Now that I think it, my solution is O(nlogn), and if I collect to an ordered list of entries (ordered by value), each insertion within the collect phase would be O(logn), and this would be O(nlogn) as well. So I'd rather keep things simple. Thanks for your input!Stinkhorn
O
3

You will have to iterate the words from start to end at least once, so you won't end better than O(n), where n is the words size. Then there's extraction of m top entries (10 in your case). Suppose you have k unique words in your n words in total, to find m top entries, you need to run max search m times on k entries, which results in m * k operations, giving you O(m * n) in worst case (when all words are unique). In total, this gives you O(n * (m + 1)) operations, or O(11 * n) in your case (10 times max search plus the initial grouping run).

Here's my try (JDK8+, not tested):

public static Collection<Map.Entry<String, Integer>> topOccurences(List<String> words, int topThreshold) {
    Map<String, Integer> occurences = new HashMap<>();
    words.stream().forEach((word) -> {
        int count = 1;
        if (occurences.containsKey(word)) {
            count = occurences.get(word) + 1;
        }
        occurences.put(word, count);
    });

    List<Map.Entry<String, Integer>> entries = new LinkedList<>(occurences.entrySet());
    List<Map.Entry<String, Integer>> tops = new LinkedList<>();
    Comparator<Map.Entry<String, Integer>> valueComp = Comparator.comparing((Map.Entry<String, Integer> t) -> t.getValue());
    int topcount = 0;
    while (topcount < topThreshold && !entries.isEmpty()) {
        Map.Entry<String, Integer> max = Collections.max(entries, valueComp);
        tops.add(max);
        entries.remove(max);
        topcount++;
    }
    return tops;
}
Osanna answered 14/3, 2016 at 17:3 Comment(0)
I
1

I would decompose this into two methods.

The first would do nothing but create the Map of word frequencies.

The second would return the n highest frequency words.

What should your code do if you ask for n most frequent words but the Map has fewer than that number as keys?

It's your chance to try JDK 8 lambdas and filter the frequency Map efficiently.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Calculate word frequencies from a List of words
 * User: mduffy
 * Date: 3/14/2016
 * Time: 1:07 PM
 * @link https://mcmap.net/q/1869510/-java-how-to-find-top-10-most-common-string-frequency-in-arraylist/35993252#35993252
 */
public class WordFrequencyDemo {

    public static void main(String[] args) {
        List<String> words = Arrays.asList(args);
        Map<String, Integer> wordFrequencies = WordFrequencyDemo.getWordFrequencies(words);
        System.out.println(wordFrequencies);
    }

    public static Map<String, Integer> getWordFrequencies(List<String> words) {
        Map<String, Integer> wordFrequencies = new LinkedHashMap<String, Integer>();
        if (words != null) {
            for (String word : words) {
                if (word != null) {
                    word = word.trim();
                    if (!wordFrequencies.containsKey(word)) {
                        wordFrequencies.put(word, 0);
                    }
                    int count = wordFrequencies.get(word);
                    wordFrequencies.put(word, ++count);
                }
            }
        }
        return wordFrequencies;
    }
}
Imposture answered 14/3, 2016 at 16:42 Comment(0)
T
1

You would always use a hash to count the words first, which will of course use O(n) time and O(n) space. This is the first step.

Then it comes to how to select the top 10. You could use a sorting which takes at least O(nlogn) time. But there is a better way, which is to use a heap. Let's say k = 10 in your case. You need to add pair object of word and its frequency into a min heap of size k where we use the frequency as the key for the min heap. If the heap is full then remove the minimum element (top) from the heap and add the new word-frequency pair only if the frequency of this word has frequency greater than the top word in the heap. Once we scanned all the words in the map and the heap is properly updated then the elements contained in the min heap are the top k most frequents. Below is the example code. Just modify the code a little bit to take a ArrayList rather than an array will do your work.

class Pair {
    String key;
    int value;

    Pair(String key, int value) {
        this.key = key;
        this.value = value;
    }
}

public class Solution {
    /**
     * @param words an array of string
     * @param k an integer
     * @return an array of string
     */

    private Comparator<Pair> pairComparator = new Comparator<Pair>() {
        public int compare(Pair left, Pair right) {
            if (left.value != right.value) {
                return left.value - right.value;
            }
            return right.key.compareTo(left.key);
        }
    };

    public String[] topKFrequentWords(String[] words, int k) {
        if (k == 0) {
            return new String[0];
        }

        HashMap<String, Integer> counter = new HashMap<>();
        for (String word : words) {
            if (counter.containsKey(word)) {
                counter.put(word, counter.get(word) + 1);
            } else {
                counter.put(word, 1);
            }
        }

        PriorityQueue<Pair> Q = new PriorityQueue<Pair>(k, pairComparator);
        for (String word : counter.keySet()) {
            Pair peak = Q.peek();
            Pair newPair = new Pair(word, counter.get(word));
            if (Q.size() < k) {
                Q.add(newPair);
            } else if (pairComparator.compare(newPair, peak) > 0) {
                Q.poll();
                Q.add(new Pair(word, counter.get(word)));
            }
        }

        String[] result = new String[k];
        int index = 0;
        while (!Q.isEmpty()) {
            result[index++] = Q.poll().key;
        }

        // reverse
        for (int i = 0; i < index / 2; i++) {
            String temp = result[i];
            result[i] = result[index - i - 1];
            result[index - i - 1] = temp;
        }

        return result;
    }
}
Thwart answered 14/3, 2016 at 16:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.