Algorithms for compression of set tries
I have a collection of sets that I'd like to place in a trie.

Normal tries are made of strings of elements - that is, the order of the elements is important. Sets lack a defined order, so there's the possibility of greater compression.

For example, given the strings "abc", "bc", and "c", I'd create the trie:

(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
      -> ('b',1) -> ('c',1)
      -> ('c',1)

But given the sets { 'a', 'b', 'c' }, { 'b', 'c' }, { 'c' }, I could create the above trie, or any of these eleven:

(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
      -> ('c',2) -> ('a',1)

(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
      -> ('b',1) -> ('c',1)
      -> ('c',1)

(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
      -> ('c',2) -> ('a',1)

(*,3) -> ('b',2) -> ('a',1) -> ('c',1)
                 -> ('c',1)
      -> ('c',1)

(*,3) -> ('b',1) -> ('a',1) -> ('c',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('b',2) -> ('c',2) -> ('a',1)
      -> ('c',1)

(*,3) -> ('b',1) -> ('c',1) -> ('a',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('c',2) -> ('a',1) -> ('b',1)
      -> ('b',1) -> ('c',1)

(*,3) -> ('c',2) -> ('a',1) -> ('b',1)
                 -> ('b',1)

(*,3) -> ('c',2) -> ('b',1) -> ('a',1)
      -> ('b',1) -> ('c',1)

(*,3) -> ('c',3) -> ('b',2) -> ('a',1)

So there's obviously room for compression (7 nodes to 4).

I suspect defining a local order at each node dependent on the relative frequency of its children would do it, but I'm not certain, and it might be overly expensive.

So before I hit the whiteboard, and start cracking away at my own compression algorithm, is there an existing one? How expensive is it? Is it a bulk process, or can it be done per-insert/delete?

Vestry answered 22/2, 2012 at 23:30 Comment(5)
I think a trie isn't a very good structure to represent sets. Wouldn't a collection of bit arrays be better? What operations are you expecting to do? Why do you worry about the memory so much? – Kentkenta
@svick: Perhaps, but my sets draw from a large universe of elements, so bit arrays may not be very efficient. The operations: iterate through (subset, frequency) pairs. And I worry about memory because I have lots of data. – Vestry
What operations do you intend to do? A traditional trie can efficiently tell you whether or not a given string is contained in the set of strings it represents. If your trie re-orders its strings to minimize the structure size, how can you actually test whether a given set of characters is contained in the trie? It seems like you'd need to search for every permutation. – Discord
@Weeble: You can store the local order at each node along with its children. When looking up at a node, iterate through its children and traverse to the first one that matches an element you contain. – Vestry
Are you looking for just compression for later decompression, or do you also want to perform set operations on the compressed structure? – Apostatize

I think you should sort each set by item frequency; as you suspect, this gives a good heuristic. The same approach is used in FP-growth (frequent pattern mining) to represent item sets compactly.
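As a rough sketch of this heuristic (the function names and the `"EOS"` end-of-set marker are my own choices, not from any library), one can count global item frequencies, sort each set by them, and insert into a plain dict-of-dicts trie:

```python
from collections import Counter

def build_trie(sets):
    """Build a nested-dict trie, ordering each set by global item frequency.

    Items that appear in many sets sort first, so common prefixes are
    shared. The "EOS" key marks where a complete set ends.
    """
    freq = Counter(item for s in sets for item in s)
    root = {}
    for s in sets:
        # Most frequent items first; ties broken lexicographically.
        ordered = sorted(s, key=lambda x: (-freq[x], x))
        node = root
        for item in ordered:
            node = node.setdefault(item, {})
        node["EOS"] = {}
    return root

def count_nodes(trie):
    """Count element nodes, excluding the root and EOS markers."""
    return sum(1 + count_nodes(child)
               for key, child in trie.items() if key != "EOS")

trie = build_trie([{'a', 'b', 'c'}, {'b', 'c'}, {'c'}])
```

For the question's three sets this produces the single chain c → b → a with end-of-set markers at each level, i.e. the maximally compressed trie from the question's last example (3 element nodes, 4 counting the root).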

Beforetime answered 23/2, 2012 at 16:10 Comment(2)
Full circle! I'm actually looking at this because I think the global order used in FP-growth is insufficient. – Vestry
Possibly you could rebuild each sub-tree according to the item frequencies within that sub-tree; that gives better compression, but requires more computation. – Beforetime

Basically, you should construct a dependence graph. If element y occurs only if element x occurs, draw an edge from x to y (in case of equality, order the pair lexicographically). The resulting graph is a DAG. Now do a topological sort of this graph to get the element order, with a twist: whenever you can choose among two or more elements, pick the one with the higher number of occurrences.
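A hedged sketch of this idea (all names are mine, and Kahn's algorithm stands in for the "topological sorting with a twist"):

```python
from collections import Counter

def element_order(sets):
    """Order elements via a dependence graph, as the answer describes.

    Draw an edge x -> y when every set containing y also contains x
    (equal-support ties broken lexicographically), then topologically
    sort, always preferring the most frequent available element.
    """
    freq = Counter(e for s in sets for e in s)
    elems = sorted(freq)
    edges = {e: set() for e in elems}
    indeg = {e: 0 for e in elems}
    for x in elems:
        for y in elems:
            if x == y:
                continue
            # "y occurs only if x occurs": every set containing y
            # also contains x.
            if all(x in s for s in sets if y in s):
                # Orient the edge x -> y; on equal support (mutual
                # implication), break the tie lexicographically.
                if freq[x] > freq[y] or (freq[x] == freq[y] and x < y):
                    edges[x].add(y)
                    indeg[y] += 1
    order = []
    ready = [e for e in elems if indeg[e] == 0]
    while ready:
        # The "twist": among available elements, pick the most frequent.
        ready.sort(key=lambda e: (-freq[e], e))
        x = ready.pop(0)
        order.append(x)
        for y in edges[x]:
            indeg[y] -= 1
            if indeg[y] == 0:
                ready.append(y)
    return order
```

For the question's sets this yields the order ['c', 'b', 'a'], which produces the most compact trie.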

Priggery answered 23/2, 2012 at 0:29 Comment(0)

My suspicion is that maximum compression would keep the most common elements at the top (as in your last example).

The compression algorithm starts with the whole collection of sets and the root node, and recursively creates a node for each subset containing the most common remaining element:

Compress(collection, node):
    while NOT collection.isEmpty?
      e = collection.find_most_common_element
      if e == NIL                    // only empty sets remain
        node[END_OF_SET] = new Node  // end-of-set marker
        return
      end
      c2 = collection.find_all_containing(e)
      collection = collection - c2
      c2.each { |set| set.remove(e) }
      node[e] = new Node
      Compress(c2, node[e])
    end

The resulting tree would have a special End-of-set marker to signify that a complete set ends at that node. For your example it would be

 *->(C,3)->(B,2)->(A,1)->EOS
                ->EOS
          ->EOS

Deleting a set is easy: just remove its EOS marker (and any parent nodes that become empty). You could insert on the fly: at each node, descend to the matching element with the most children until there are no matches, then use the algorithm above. But keeping the trie maximally compressed would be tricky: when element B gained more children than element A, you'd have to move all sets containing both A and B into the B node, which would involve a full search of all of A's children. And if you don't keep it compressed, the inclusion searches are no longer linear in the set size.

Apostatize answered 23/2, 2012 at 23:43 Comment(0)
