How to write Huffman coding to a file using Python?
Asked Answered
R

1

6

I created a Python script to compress text by using the Huffman algorithm. Say I have the following string:

string = 'The quick brown fox jumps over the lazy dog'

Running my algorithm returns the following 'bits':

result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'

By comparing the amount of bits of the result with the input string, the algorithm seems to work:

>>> print len(result), len(string) * 8
194 344

But now comes the question: how do I write this to a file, while still being able to decode it. You can only write to a file per byte, not per bit. By writing the 'codes' as bytes, there is no compression at all!

I am new at computer science, and the online resources just don't cut it for me. All help is much appreciated!

Edit: note that I had my codes something like this (in case of another input string 'xxxxxxxyzz'):

{'y': '00', 'x': '1', 'z': '10'}

The way I create the resulting string is by concatenating these codes in order of the input string:

result = '1111111001010'

How to get back to the original string from this result? Or am I getting this completely wrong? Thank you!

Rossetti answered 19/7, 2018 at 14:46 Comment(7)
This might be useful to you: #21221416Susuable
Of course a string like that can be converted, but is that really what you want to do? The temporary result would still be big that way.Durman
@harold storing the file as 25 bytes instead of 43 is a ~40% improvement, that seems worthwhile. What temporary result are you referring to?Thereupon
@Thereupon the string of "bits", before converting every group of 8 characters of it to a byte.Durman
@harold gotcha. so you're just suggesting OP do this in a way that doesn't create that string all at once?Thereupon
@Thereupon yes, piece by piece, while he's appending the individual codes - then the string being held in memory never gets really big. Actually it can be done with integer arithmetic too instead of strings, then there is no conversion at all any moreDurman
Hey guys, thanks for your answers. I edited my post to clarify my question a bit.Rossetti
M
9

First you need to convert your input string to bytes:

def _to_Bytes(data):
  b = bytearray()
  for i in range(0, len(data), 8):
    b.append(int(data[i:i+8], 2))
  return bytes(b)

Then, open a file to write in binary mode:

result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'
with open('test.bin', 'wb') as f:
  f.write(_to_Bytes(result))

Now, writing the original string to a file, a comparison of bytes can take place:

import os
with open('test_compare.txt', 'a') as f:
  f.write('The quick brown fox jumps over the lazy dog')

_o = os.path.getsize('test_compare.txt')
_c = os.path.getsize('test.bin')
print(f'Original file: {_o} bytes')
print(f'Compressed file: {_c} bytes')
print('Compressed file to about {}% of original'.format(round((((_o-_c)/_o)*100), 0)))

Output:

Original file: 43 bytes
Compressed file: 25 bytes
Compressed file to about 42.0% of original

To get back to the original, you can write a function that determines the possible ordering of characters:

d = {'y': '00', 'x': '1', 'z': '10'}
result = '1111111001010'
from typing import Generator
def reverse_encoding(content:str, _lookup) -> Generator[str, None, None]:
  while content:
    _options = [i for i in _lookup if content.startswith(i) and (any(content[len(i):].startswith(b) for b in _lookup) or not content[len(i):])]
    if not _options:
      raise Exception("Decoding error")
    yield _lookup[_options[0]]
    content = content[len(_options[0]):]

print(''.join(reverse_encoding(result, {b:a for a, b in d.items()})))

Output:

'xxxxxxxyzz'
Mita answered 19/7, 2018 at 14:52 Comment(3)
Thank you for your answer! I edited my post to clarify my question concerning the decoding of the file.Rossetti
Thank you, this is what I was looking for!Rossetti
@PimKlaassen Glad to helpMita

© 2022 - 2024 — McMap. All rights reserved.