Which data structure would you use: TreeMap or HashMap? (Java) [duplicate]
Asked Answered
O

14

56

Description | A Java program to read a text file and print each of the unique words in alphabetical order together with the number of times the word occurs in the text.

The program should declare a variable of type Map<String, Integer> to store the words and corresponding frequency of occurrence. Which concrete type, though? TreeMap<String, Number> or HashMap<String, Number> ?

The input should be converted to lower case.

A word does not contain any of these characters: \t\t\n]f.,!?:;\"()'

Example output |

 Word            Frequency
  a                 1
  and               5
  appearances       1
  as                1
         .
         .
         .

Remark | I know, I've seen elegant solutions to this in Perl with roughly two lines of code. However, I want to see it in Java.

Edit: Oh yeah, it be helpful to show an implementation using one of these structures (in Java).

Odyl answered 19/11, 2008 at 15:55 Comment(4)
Wow, is this blatantly fishing for the answer to a homework question?Armored
Actually, the homework question was to actually implement the two versions--it asked nothing about which data structure is better. However, in effort to maintain the lost art of semi-sincere learning, I am beating around the bush...just trying to glean some insight here!Odyl
How the heck did I get six points for that comment above. I know...question for meta..Odyl
I have similar problem, everything the same, I only want to sort by values, not keys... what would be best approach?Apc
K
63

TreeMap seems a no-brainer to me - simply because of the "in alphabetical order" requirement. HashMap has no ordering when you iterate through it; TreeMap iterates in the natural key order.

EDIT: I think Konrad's comment may have been suggesting "use HashMap, then sort." This is good because although we'll have N iterations initially, we'll have K <= N keys by the end due to duplicates. We might as well save the expensive bit (sorting) until the end when we've got fewer keys than take the small-but-non-constant hit of keeping it sorted as we go.

Having said that, I'm sticking to my answer for the moment: because it's the simplest way of achieving the goal. We don't really know that the OP is particularly worried about performance, but the question implies that he's concerned about the elegance and brevity. Using a TreeMap makes this incredibly brief, which appeals to me. I suspect that if performance is really an issue, there may be a better way of attacking it than either TreeMap or HashMap :)

Kanpur answered 19/11, 2008 at 15:56 Comment(19)
Jon, don't overestimate sorting. In my experience, sorting even gigantic data sets can be neglected. O(1) insertion beats O(logn) hands down, given enough items. But as always, this is pure speculation without profiling.Unaccompanied
If you can achieve O(1) insertion, then you've got O(n) total insertion time. If you're also suggesting there's no more work involved before you iterate through, you've somehow achieved O(n) sorting of n (relatively arbitrary) items. How does that work?Kanpur
Konrad: I think I now see your point. Editing...Kanpur
Jon, I recognize this is ancient history ;P But here's the explanation. Hashtables actually do implement sorting in expected, amortized O(n) time. The proven lower bound of n log n on sort-related operations applies to the comparison model of computation. The hashtable defeats this by taking advantage of the sequential nature of random-access memory in a conventional computer. However! In practise, comparison-based sorting is almost always faster; anecdotally, a good hashtable implementation is a constant factor of 100 more expensive than a mediocre quicksort implementation.Loudhailer
@Mark: I'm not quite sure what you're explaining or how the sequential nature of memory can avoid N log N complexity...Kanpur
Here's an edge-on answer to your second question: en.wikipedia.org/wiki/Radix_sortLoudhailer
@Mark: But surely we can't do a radix sort with strings, as we don't have integer keys. We have hash codes, but comparing those doesn't help us compare keys.Kanpur
Er, let me add to that: Briefly, if I can compute a memory offset in constant time from each value to be sorted in such a way as to guarantee that if X < Y, then addr(X) < addr(Y), then I can sort a list by scanning it, writing each value into its value-derived memory address, then doing a linear scan of the output memory region. Total complexity is O(n+m), where n is the element count and m is the size of the output memory region. A hashtable is essentially a more involved version of the above that makes m smaller.Loudhailer
@Mark: Nope, still not following it. How are you going to compute that memory offset? hashCode() certainly doesn't give you anything like that. Given that for any two strings I can construct another string between the two of them, I can't see how you can compute a memory offset...Kanpur
Radix sort on ASCII strings uses a radix of 256. If the strings are language-bound you can do it using the language character domain, which is usually better. Radix sort on unicode strings would have a radix of 64k, which is memory-prohibitive but still theoretically O(n), which is all I'm actually saying.Loudhailer
@JonSkeet let us continue this discussion in chatLoudhailer
@Mark: But the strings are still an arbitrary length, which stumps things I think. From the Wikipedia article you referenced: "Radix sort's efficiency is O(k·n) for n keys which have k or fewer digits." That's pretty significant unless you limit k, which we haven't here. I don't think you can just ignore the value of k in this case.Kanpur
Let me recap. All I'm saying is that you sort in O(n) using the nature of the random access model--not that it can be done in all cases, nor that it is a good idea. To touch on the last point, k is in practise usually quite limited in problems like these (the average word length in the "average" English document is about 5 characters; so k would be about 5 on average, making this O(n) in the expected case). Also, to touch on the hashtable argument, whether your memory addresses are sort-stable depends on your choice of hashing function. hashCode() isn't suitable, but others are.Loudhailer
@JonSkeet How do we sort HashMap? You mean, something like this? new TreeSet<MyType>(myMap.keySet())Groscr
@Jenix: I'm not sure what you're asking, to be honest. You don't sort a HashMap itself.Kanpur
That's why I'm asking.. You said 'EDIT: I think Konrad's comment may have been suggesting "use HashMap, then sort."Groscr
@Jenix: Right, But "use a hash map, then sort" isn't the same as "sorting a hash map". But yes, that would be using a HashMap then creating a TreeMap from it - but as I say, it's not clear that it's worth it.Kanpur
Yeah that's what I'm thinking but not just one, but a few more people said so on StackOverFlow, so I had to ask you. But anyway, if TreeMap is what everyone here is saying about 'use a hash map and then sort it', then I think, as I said above, creating a new TreeSet (not TreeMap) and then iterrating through the original HashMap using keys from that TreeSet is better. Because we can use less memory and also still can access faster than TreeMap.Groscr
@Jenix: I'd be surprised if there was much difference to be honest. I wouldn't want to guess without measuring.Kanpur
A
18

TreeMap beats HashMap because TreeMap is already sorted for you.

However, you might want to consider using a more appropriate data structure, a bag. See Commons Collections - and the TreeBag class:

This has a nice optimised internal structure and API:

bag.add("big")
bag.add("small")
bag.add("big")
int count = bag.getCount("big")

EDIT: The question of HashMap vs TreeMap performance was answered by Jon - HashMap and sort may be quicker (try it!), but TreeBag is easier. The same is true for bags. There is a HashBag as well as a TreeBag. Based on the implementation (uses a mutable integer) a bag should outperform the equivalent plain map of Integer. The only way to know for sure is to test, as with any performance question.

Astra answered 19/11, 2008 at 16:6 Comment(2)
Yes, but would it be faster to sort once at the end instead of incurring the cost of keeping things sorted the whole time?Hydraulics
Herms, it would depend on the efficiency of the datastructures involved. In-place sorting an array using random-pivot quicksort is a FAAAST n log n, but Java's TreeMap is an also pretty fast n log n. Java's HashMap works at a pretty slow n. Since we need random access to the elements we are choosing between a fast O(n log t) once, and a slow O(n) once plus a really fast O(t log t); where t < n. The latter would eventually surpass the former as n got large but I think n would have to get very large first. For Google, hash+sort would be better; for school, probably treemap.Loudhailer
A
11

I see quite a few people saying "TreeMap look-up takes O(n log n)"!! How come?

I don't know how it has been implemented but in my head it takes O(log n).

This is because look-up in a tree can be done in O(log n). You don't sort the entire tree every time you insert an item in it. That's the whole idea of using a tree!

Hence, going back to the original question, the figures for comparison turn out to be:

HashMap approach: O(n + k log k) average case, worst case could be much larger

TreeMap approach: O(k + n log k) worst case

where n = number of words in the text , k = number of distinct words in the text.

Aspasia answered 14/4, 2011 at 14:21 Comment(0)
N
3

Hash map should be much faster. You should not choose a container based on how you want the items to be arranged eventually; Just sort the list of (word, frequency)-pairs at the end. There will usually be less such pairs to be sorted than words in the files, so asymptotic (and real) performance with a hash map will be better.

Narine answered 19/11, 2008 at 17:35 Comment(2)
Not necessarily. In the worst case, hash tables have O(n) access time. Whereas self-balancing binary search trees are always O(log n) even in worst case.Kipton
A hashtable really only experiences worst-case performance with poor loading factors and/or a poor hashing function. Good enough hashes and good enough factors are available by default in Java, so this isn't a realistic worry.Loudhailer
E
3
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.ObjectInputStream.GetField;
import java.util.Iterator;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class TreeMapExample {

    public static void main (String args[]){
        Map<String,Integer> tm = new TreeMap<String,Integer>();
        try {

            FileInputStream fis = new FileInputStream("Test.txt");
            DataInputStream in = new DataInputStream(fis);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String line;
            int countValue = 1;
            while((line = br.readLine())!= null ){
                line = line.replaceAll("[-+.^:;,()\"\\[\\]]","");
                StringTokenizer st = new StringTokenizer(line, " ");    
                while(st.hasMoreTokens()){
                    String nextElement = (String) st.nextElement();

                    if(tm.size()>0 && tm.containsKey(nextElement)){
                        int val = 0;
                        if(tm.get(nextElement)!= null){
                        val = (Integer) tm.get(nextElement);
                        val = val+1;
                        }
                        tm.put(nextElement, val);
                    }else{
                    tm.put(nextElement, 1);
                    }

                }
            }
            for(Map.Entry<String,Integer> entry : tm.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
            }

        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

}
Extensometer answered 3/5, 2012 at 13:2 Comment(0)
R
2

You can't assign a TreeMap<String,Number> to a variable with the type Map<String,Integer>. Double, Long, etc. can be "put" into a TreeMap<String,Number>. When I "get" a value from a Map<String,Integer>, it must be an Integer.

Completely ignoring any i18n issues, memory constraints, and error handling, here goes:

class Counter {

  public static void main(String... argv)
    throws Exception
  {
    FileChannel fc = new FileInputStream(argv[0]).getChannel();
    ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
    CharBuffer cb = Charset.defaultCharset().decode(bb);
    Pattern p = Pattern.compile("[^ \t\r\n\f.,!?:;\"()']+");
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    Matcher m = p.matcher(cb);
    while (m.find()) {
      String word = m.group();
      Integer count = counts.get(word);
      count = (count == null) ? 1 : count + 1;
      counts.put(word, count);
    }
    fc.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.printf("%s: %d%n", e.getKey(), e.getValue());
    }
  }

}
Recife answered 19/11, 2008 at 16:17 Comment(1)
Yes, you're right--I'll go back and edit that. What I mean to say is declare it as type Map<String,Integer> instead of Map<String,Number>, then TreeMap<String,Integer> is a subtype of Map<String,Integer>.Odyl
K
2

"When a key already exists it has the same performance as a HashMap." - That is just plain wrong. HashMap has O(1) insertion and TreeMap O(n log n). It'll take at least n log n checks to find out if it's in the table!

Kliber answered 14/8, 2010 at 3:46 Comment(0)
G
2

For this way, in my opinion, better use HashBag from Apache Commons Collections or HashMultiset from Guava or HashBag from Eclipse Collections (formaly GS Collections) or any following classes:

    Order    |  Guava           |   Apache  | Eclipse(GS) | JDK analog
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Not define   | HashMultiset     |   HashBag | HashBag     | HashMap<String, Integer>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Sorted       | TreeMultiset     |   TreeBag | TreeBag     | TreeMap<String, Integer>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Linked       |LinkedHashMultiset|     -     |     -       | LinkedHashMap<String, Integere>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Concurrent & | ConcurrentHash-  |Synchroniz-|Synchroniz-  | Collections.synchronizedMap(
not define   | Multiset         |   edBag   | edBag       |       HashMap<String, Integer>)
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Concurrent   |         -        |Synchroniz-|Synchroniz-  | Collections.synchronizedSorted-
and sorted   |                  |edSortedBag| edSortedBag |       Map(TreeMap<>))
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Immutable and| ImmutableMultiset|Unmodifiab-|Unmodifiab-  | Collections.unmodifiableMap(
not define   |                  |   leBag   | leBag       | HashMap<String, Integer>)
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Immutable and| ImmutableSorted- |Unmodifiab-|Unmodifiab-  | Collections.unmodifiableSorted-
sorted       | Multiset         |leSortedBag| leSortedBag | Map(TreeMap<String, Integer>))
────────────────────────────────────────────────────────────────────────

Examples:

1. Using SynchronizedSortedBag from Apache:

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    Bag bag = SynchronizedSortedBag.synchronizedBag(new TreeBag(Arrays.asList(INPUT_TEXT.split(" "))));

    // Print count words
    System.out.println(bag); // print [1:All!,2:Hello,1:Hi,2:World!]- in natural (alphabet) order
    // Print all unique words
    System.out.println(bag.uniqueSet());    // print [All!, Hello, Hi, World!]- in natural (alphabet) order


    // Print count occurrences of words
    System.out.println("Hello = " + bag.getCount("Hello"));    // print 2
    System.out.println("World = " + bag.getCount("World!"));    // print 2
    System.out.println("All = " + bag.getCount("All!"));    // print 1
    System.out.println("Hi = " + bag.getCount("Hi"));    // print 1
    System.out.println("Empty = " + bag.getCount("Empty"));    // print 0

    // Print count all words
    System.out.println(bag.size());    //print 6

    // Print count unique words
    System.out.println(bag.uniqueSet().size());    //print 4

2. Using TreeBag from Eclipse(GC):

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    MutableSortedBag<String> bag =  TreeBag.newBag(Arrays.asList(INPUT_TEXT.split(" ")));

    // Print count words
    System.out.println(bag); // print [All!, Hello, Hello, Hi, World!, World!]- in natural order
    // Print all unique words
    System.out.println(bag.toSortedSet());    // print [All!, Hello, Hi, World!]- in natural order

    // Print count occurrences of words
    System.out.println("Hello = " + bag.occurrencesOf("Hello"));    // print 2
    System.out.println("World = " + bag.occurrencesOf("World!"));    // print 2
    System.out.println("All = " + bag.occurrencesOf("All!"));    // print 1
    System.out.println("Hi = " + bag.occurrencesOf("Hi"));    // print 1
    System.out.println("Empty = " + bag.occurrencesOf("Empty"));    // print 0

    // Print count all words
    System.out.println(bag.size());    //print 6

    // Print count unique words
    System.out.println(bag.toSet().size());    //print 4

3. Using LinkedHashMultiset from Guava:

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    Multiset<String> multiset = LinkedHashMultiset.create(Arrays.asList(INPUT_TEXT.split(" ")));

    // Print count words
    System.out.println(multiset); // print [Hello x 2, World! x 2, All!, Hi]- in predictable iteration order
    // Print all unique words
    System.out.println(multiset.elementSet());    // print [Hello, World!, All!, Hi] - in predictable iteration order

    // Print count occurrences of words
    System.out.println("Hello = " + multiset.count("Hello"));    // print 2
    System.out.println("World = " + multiset.count("World!"));    // print 2
    System.out.println("All = " + multiset.count("All!"));    // print 1
    System.out.println("Hi = " + multiset.count("Hi"));    // print 1
    System.out.println("Empty = " + multiset.count("Empty"));    // print 0

    // Print count all words
    System.out.println(multiset.size());    //print 6

    // Print count unique words
    System.out.println(multiset.elementSet().size());    //print 4

More examples you can find in my github projects

Geometrid answered 4/3, 2016 at 17:39 Comment(0)
S
1

I would definitely choose a TreeMap:

  • TreeMap automatically sorts new keys on insertion, no sorting afterwards is needed.
  • When a key already exists it has the same performance as a HashMap.

A TreeSet internally uses a TreeMap so why not use TreeMap directly.

Stultify answered 19/11, 2008 at 18:44 Comment(0)
G
0

Depending on what the speed requirements are, you could also use a Trie. But there's no point in implementing one of those if a TreeMap is quick enough.

Gumm answered 19/11, 2008 at 16:6 Comment(0)
D
0

consider the frequency of addition or deletion to the data structure. TreeMap would not be ideal if it is high. Apart from the search for existing entry nLn it also undergoes frequent rebalancing.

on the other hand Hash structures are bit flamboyant on memory (over allocates). If you can bite that bullet then go for hash structure and sort when required.

Directive answered 19/8, 2010 at 11:40 Comment(0)
E
0

Here is the java example for reading a text file, sorting based on key, then upon values; depending on the number of occurrence of a words in the file.

public class SortFileWords {

    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        ValueCompare vc = new ValueCompare(map);
        TreeMap<String, Integer> sorted_map = new TreeMap<String, Integer>(map);
        List<String> list = new ArrayList<>();
        Scanner sc;
        try {
            sc = new Scanner(new File("c:\\ReadMe1.txt"));
            while (sc.hasNext()) {
                list.add(sc.next());
            }
            sc.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        for (String s : list) {
            if (map.containsKey(s)) {
                map.put(s, map.get(s) + 1);
            } else
                map.put(s, 1);
        }

        System.out.println("Unsorted map: " + map);
        sorted_map.putAll(map);
        System.out.println("Sorted map on keys: " + sorted_map);

        TreeMap<String, Integer> sorted_value_map = new TreeMap<>(vc);
        sorted_value_map.putAll(map);
        System.out.println("Sorted map on values: " + sorted_value_map);
    }
}

class ValueCompare implements Comparator<String> {

    Map<String, Integer> map;

    public ValueCompare(Map<String, Integer> map) {
        this.map = map;
    }

    @Override
    public int compare(String s1, String s2) {
        if (map.get(s1) >= map.get(s2))
            return -1;
        else
            return 1;
    }
}
Eldin answered 28/6, 2015 at 12:54 Comment(0)
V
-2

Why not use TreeSet?

Same ordering concept as a TreeMap, except it's a Set - which, by definition, is "A collection that contains no duplicate elements".

From your problem description, it sounds as if you need a Set, I don't see what keys and values you are mapping together.

This class implements the Set interface, backed by a TreeMap instance. This class guarantees that the sorted set will be in ascending element order, sorted according to the natural order of the elements (see Comparable), or by the comparator provided at set creation time, depending on which constructor is used.

Video answered 19/11, 2008 at 16:14 Comment(1)
Also a Java TreeSet is a Java TreeMap, so performance considerations are identical.Loudhailer
J
-3

Basically it depend on the requirement. Sometimes hash map is good sometimes treemap. but hash map is better to use only their is some constraint for overhead to sort it.

Junto answered 5/4, 2011 at 13:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.