Yes, you can do Huffman decoding in parallel and gain an advantage on a GPU, provided memory is not an issue.
For the discussion below I'll talk about the Huffman tree and the Huffman output; the output is the stream of compressed symbols that must be looked up in the Huffman tree to be decoded.
The Huffman algorithm requires a Huffman tree for decoding, and that tree can be large. You can get around this by using a small Huffman tree that fits in local memory on the GPU, at some cost in compression efficiency. E.g. you could limit the tree to the best 2^n nodes, for whatever n your GPU processors allow (e.g. use a tree limited to say 1024 nodes).
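A common way to exploit that size limit is to length-limit the code to n bits and replace the tree walk entirely with a single lookup in a 2^n-entry table. Here's a minimal sketch of that idea in CUDA C++; `TableEntry`, `decodeOne`, and `MAX_BITS` are illustrative names, the table construction (a canonical code with each codeword's entry replicated across all of its bit suffixes) is not shown, and the input buffer is assumed to be padded so the peek never reads past the end:

```
// One entry per possible MAX_BITS-bit prefix of the stream.
#define MAX_BITS   10                 // 2^10 = 1024-entry table
#define TABLE_SIZE (1 << MAX_BITS)

struct TableEntry {
    unsigned short symbol;            // decoded symbol
    unsigned char  length;            // true codeword length, <= MAX_BITS
};

// Decode one symbol: peek MAX_BITS bits, look them up, then consume only
// as many bits as the codeword actually used. Usable on host or device.
__host__ __device__ inline unsigned short decodeOne(
        const TableEntry *table, const unsigned char *in,
        unsigned long long *bitPos)
{
    unsigned int bits = 0;
    for (int i = 0; i < MAX_BITS; ++i) {   // slow bit-gather, kept clear
        unsigned long long p = *bitPos + i;
        bits = (bits << 1) | ((in[p >> 3] >> (7 - (p & 7))) & 1u);
    }
    TableEntry e = table[bits];
    *bitPos += e.length;                   // advance by the real length
    return e.symbol;
}
```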
If you don't limit the Huffman tree so that one copy fits in local storage on each GPU processor, you won't really get the parallelism you expect, because all the GPU processors will block on memory access while reading the same shared tree.
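In CUDA terms, "local storage" maps naturally to `__shared__` memory: each thread block stages its own copy of the table once, and every lookup after that stays on-chip. A 1024-entry table of 4-byte entries is only 4 KB. Continuing the sketch above (`decodeKernel` is a hypothetical name):

```
__global__ void decodeKernel(const TableEntry *globalTable,
                             const unsigned char *in,
                             unsigned short *out)
{
    __shared__ TableEntry table[TABLE_SIZE];   // 4 KB on-chip copy

    // Cooperative load: each thread copies a strided slice of the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        table[i] = globalTable[i];
    __syncthreads();                           // table ready for all threads

    // ... decode `in` into `out` using `table`, never touching the
    //     shared global tree again ...
}
```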
In the Huffman output, the symbols are packed in a variable number of bits, so if you start in the middle of the output there is no way to know whether you are on a symbol boundary. But you can create your own boundaries: for example, force the output to restart on a word boundary every x words. Then you know you can start decoding at any multiple of x words in the output, and send that block to a GPU processing node along with the appropriate tree (see the sketch below).
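With that alignment rule, each block's starting bit offset is a pure function of its index, so no scan over earlier data is needed. A sketch, assuming the encoder realigns every `BLOCK_WORDS` 32-bit words and also emits two small side arrays, the output offset and symbol count per block (all three names are illustrative; without them you wouldn't know where a block's decoded symbols belong):

```
#define BLOCK_WORDS 256   // illustrative: realign every 256 words = 1 KB

__global__ void decodeBlocks(const TableEntry *globalTable,
                             const unsigned char *in,
                             unsigned short *out,
                             const unsigned int *outOffset,    // per block
                             const unsigned int *symbolCount)  // per block
{
    // The forced alignment makes the starting bit position trivial.
    unsigned long long bitPos =
        (unsigned long long)blockIdx.x * BLOCK_WORDS * 32ull;

    // Decoding within a block is inherently serial, so one thread walks
    // the whole block; parallelism comes from the many blocks in flight.
    if (threadIdx.x == 0) {
        unsigned int base = outOffset[blockIdx.x];
        for (unsigned int s = 0; s < symbolCount[blockIdx.x]; ++s)
            out[base + s] = decodeOne(globalTable, in, &bitPos);
    }
}
```

A smaller x gives you more blocks to decode in parallel but wastes more bits on alignment padding; it's the same kind of trade-off as the tree-size limit above.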
You don't have to use just one tree, but one tree per block may be overkill too: if you store a tree per block and the blocks are small, you'll severely cut into the compression efficiency.
Instead you can look at the similarity of blocks, encode similar blocks with the same tree, and store a tree index per block. E.g. you might have 10000 blocks in the output but only 50 1024-node trees. Then you send one block and one tree to each GPU processing node to decode in parallel.
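Dispatch then amounts to each thread block picking its tree by index and staging it locally before decoding; a hypothetical sketch with all the trees stored back to back:

```
__global__ void decodeWithTreeIndex(const TableEntry *allTables,    // e.g. 50 x 1024 entries
                                    const unsigned char *treeIndex, // one index per block
                                    const unsigned char *in,
                                    unsigned short *out)
{
    __shared__ TableEntry table[TABLE_SIZE];

    // Select this block's tree and copy it into fast local memory.
    const TableEntry *src =
        allTables + (unsigned long long)treeIndex[blockIdx.x] * TABLE_SIZE;
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        table[i] = src[i];
    __syncthreads();

    // ... then decode this block against `table` exactly as in the
    //     block-aligned sketch above ...
}
```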
The key to making it fast is that each GPU processing node works on local memory only.