Is there a better way to calculate the frequency of all symbols in a file?

Asked 4/10, 2011 at 19:2 Answered 4/10, 2011 at 19:13

Okay, so, say I have a text file (not necessarily containing every possible symbol) and I'd like to calculate the frequency of each symbol and, after calculating the frequency, I then need to access each symbol and its frequency from most frequent to least frequent. The symbols are not necessarily ASCII characters, they could be arbitrary byte sequences, albeit all of the same length.

I was considering doing something like this (in pseudocode):

function add_to_heap (symbol)
    freq = heap.find(symbol).frequency
    if (freq.exists? == true)
        freq++
    else
        symbol.freq = 1
        heap.insert(symbol)

MaxBinaryHeap heap
while somefile != EOF
    symbol = read_byte(somefile)
    heap.add_to_heap(symbol)
heap.sort_by_frequency()

while heap.root != empty
    root = heap.extract_root()
    do_stuff(root)

I was wondering: is there a better, simpler way to calculate and store how many times each symbol occurs in a file?

Sparteine answered 4/10, 2011 at 19:2 Comment(2)

Seems you have two choices, hashmap giving you O(1) frequency retrieval but no ordered (most frequent to least frequent) result OR O(lg n) insert and search using search trees/heap but giving you a ordered (most frequent to least frequent) result. – Foolhardy 4/10, 2011 at 20:43

A binary heap isn't a particularly good data structure for this, as finding an arbitrary node in the heap is rather expensive. You'd do better with a binary tree or, as others have pointed out, a hash table of some kind. – Kajdan 4/10, 2011 at 20:47

You can always use a HashMap isntead of the Heap. Like this you'll be performing operations that are in O(1) for each symbol found instead of O(log n) wheres n is the number of items currently on the heap.

However, if te number of distinct symbols is bounded by a reasonable number (1 Byte is ideal, 2 Byte should be still fine), you can just use an array of that size and again have O(1) but with a significantly lower constant cost.

Dungeon answered 4/10, 2011 at 19:9 Comment(0)

If you're looking for a "best" solution based on running times, here's what I'd suggest:

When you're reading the file, you should have your symbols sorted (or hashed) by the value of the symbols themselves, not their frequencies. This'll let you find the current symbol in your list of already seen symbols quickly, rather than having to search through your entire list. You should also have that initial structure be able to perform fast inserts - I'd recommend a binary tree of a hash.

Once you've read all your symbols, then you should switch your ordering to be based on the frequency counts. I'd read everything into an array and then perform an in-place sort, but there are a bunch of equivalent ways to do this.

Hope this helps!

Childe answered 4/10, 2011 at 19:13 Comment(0)

Recommended topics

Hot tags