Yes, you can do Huffman decoding in parallel and gain an advantage on a GPU, provided memory is not an issue.
For the discussion below I'll talk about the Huffman tree and the Huffman output; the output is the stream of compressed symbols that must be looked up in the Huffman tree to be decoded.
The Huffman algorithm requires a Huffman tree for decoding, and that tree can be large. You can get around this by using a small Huffman tree that fits in local memory on the GPU, at some cost in compression efficiency. E.g. you could limit the tree to the best 2^n nodes, for whatever n your GPU processors allow (e.g. use a tree limited to say 1024 nodes).
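A common way to exploit that size limit is to length-limit the code to n bits and replace the tree walk entirely with a single lookup in a 2^n-entry table. Here's a minimal sketch of that idea in CUDA C++; `TableEntry`, `decodeOne`, and `MAX_BITS` are illustrative names, the table construction (a canonical code with each codeword's entry replicated across all of its bit suffixes) is not shown, and the input buffer is assumed to be padded so the peek never reads past the end:

```
// One entry per possible MAX_BITS-bit prefix of the stream.
#define MAX_BITS   10                 // 2^10 = 1024-entry table
#define TABLE_SIZE (1 << MAX_BITS)

struct TableEntry {
    unsigned short symbol;            // decoded symbol
    unsigned char  length;            // true codeword length, <= MAX_BITS
};

// Decode one symbol: peek MAX_BITS bits, look them up, then consume only
// as many bits as the codeword actually used. Usable on host or device.
__host__ __device__ inline unsigned short decodeOne(
        const TableEntry *table, const unsigned char *in,
        unsigned long long *bitPos)
{
    unsigned int bits = 0;
    for (int i = 0; i < MAX_BITS; ++i) {   // slow bit-gather, kept clear
        unsigned long long p = *bitPos + i;
        bits = (bits << 1) | ((in[p >> 3] >> (7 - (p & 7))) & 1u);
    }
    TableEntry e = table[bits];
    *bitPos += e.length;                   // advance by the real length
    return e.symbol;
}
```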
If you don't limit the Huffman tree so that one copy fits in local storage on each GPU processor, you won't really get the parallelism you expect, because all the GPU processors will block on memory access while reading the same shared tree.
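In CUDA terms, "local storage" maps naturally to `__shared__` memory: each thread block stages its own copy of the table once, and every lookup after that stays on-chip. A 1024-entry table of 4-byte entries is only 4 KB. Continuing the sketch above (`decodeKernel` is a hypothetical name):

```
__global__ void decodeKernel(const TableEntry *globalTable,
                             const unsigned char *in,
                             unsigned short *out)
{
    __shared__ TableEntry table[TABLE_SIZE];   // 4 KB on-chip copy

    // Cooperative load: each thread copies a strided slice of the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        table[i] = globalTable[i];
    __syncthreads();                           // table ready for all threads

    // ... decode `in` into `out` using `table`, never touching the
    //     shared global tree again ...
}
```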
In the Huffman output, the symbols are packed in a variable number of bits, so if you start in the middle of the output there is no way to know whether you are on a symbol boundary. But you can create your own boundaries: for example, force the output to restart on a word boundary every x words. Then you know you can start decoding at any multiple of x words in the output, and send that block to a GPU processing node along with the appropriate tree (see the sketch below).
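With that alignment rule, each block's starting bit offset is a pure function of its index, so no scan over earlier data is needed. A sketch, assuming the encoder realigns every `BLOCK_WORDS` 32-bit words and also emits two small side arrays, the output offset and symbol count per block (all three names are illustrative; without them you wouldn't know where a block's decoded symbols belong):

```
#define BLOCK_WORDS 256   // illustrative: realign every 256 words = 1 KB

__global__ void decodeBlocks(const TableEntry *globalTable,
                             const unsigned char *in,
                             unsigned short *out,
                             const unsigned int *outOffset,    // per block
                             const unsigned int *symbolCount)  // per block
{
    // The forced alignment makes the starting bit position trivial.
    unsigned long long bitPos =
        (unsigned long long)blockIdx.x * BLOCK_WORDS * 32ull;

    // Decoding within a block is inherently serial, so one thread walks
    // the whole block; parallelism comes from the many blocks in flight.
    if (threadIdx.x == 0) {
        unsigned int base = outOffset[blockIdx.x];
        for (unsigned int s = 0; s < symbolCount[blockIdx.x]; ++s)
            out[base + s] = decodeOne(globalTable, in, &bitPos);
    }
}
```

A smaller x gives you more blocks to decode in parallel but wastes more bits on alignment padding; it's the same kind of trade-off as the tree-size limit above.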
You don't have to use just one tree, but one tree per block may be overkill too: if you store a tree per block and the blocks are small, you'll severely cut into the compression efficiency.
Instead you can look at the similarity of blocks, encode similar blocks with the same tree, and store a tree index per block. E.g. you might have 10000 blocks in the output but only 50 1024-node trees. Then you send one block and one tree to each GPU processing node to decode in parallel.
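Dispatch then amounts to each thread block picking its tree by index and staging it locally before decoding; a hypothetical sketch with all the trees stored back to back:

```
__global__ void decodeWithTreeIndex(const TableEntry *allTables,    // e.g. 50 x 1024 entries
                                    const unsigned char *treeIndex, // one index per block
                                    const unsigned char *in,
                                    unsigned short *out)
{
    __shared__ TableEntry table[TABLE_SIZE];

    // Select this block's tree and copy it into fast local memory.
    const TableEntry *src =
        allTables + (unsigned long long)treeIndex[blockIdx.x] * TABLE_SIZE;
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        table[i] = src[i];
    __syncthreads();

    // ... then decode this block against `table` exactly as in the
    //     block-aligned sketch above ...
}
```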
The key to making it fast is that each GPU processing node works on local memory only.