How can I print a bit instead of byte in a file?

Asked 28/6, 2012 at 21:24 Answered 28/6, 2012 at 21:41

I am using huffman algorithm to develop a file compressor and right now I am facing a problem which is:

By using the algorithm to the word: stackoverflow, i get the following result:

a,c,e,f,k,l,r,s,t,v,w = 1 time repeated
o = 2 times repeated

a,c,e,f,k,l,r,s,t,v,w = 7.69231%
and
o = 15.3846%

So I start inserting then into a Binary Tree, which will get me the results:

o=00
a=010
e=0110
c=0111
t=1000
s=1001
w=1010
v=1011
k=1100
f=1101
r=1110
l=1111

which means the path for the character in the tree, considering 0 to be left and 1 to right.

then the word "stackoverflow" will be: 100110000100111010011111000010110110111011011111001010

and well, I want to put that whole value into a binary file to be in bits, which will result in 47bits, which would happen to be 6bytes, but instead I can only make it 47bytes because the minimun to put into a file with fwrite or fprintf is 1byte, by using sizeof(something).

Than my question is: how can I print in my file only a single bit?

Marozas answered 28/6, 2012 at 21:24 Comment(6)

Files are sized in bytes, not bits. You'll need to find a way to pack the extra bits into a byte. – Scarlet 28/6, 2012 at 21:27

I tried that also, by considering that I would pack up to 8bits in the sequence to make it into a byte, but then how can I make a notation to say that my file is ending? and whatif i don't actually need to use 8bits, lets say it was the word test, which would be 101001, it don't complete a byte, and if I fill the other 2 bits with 00 would represent another character for me to read them, x-x – Marozas 28/6, 2012 at 21:32

Even if you wrote the file using bit wise operators, reading it back will be quite difficult as all the bits will be together. – Envious 28/6, 2012 at 21:36

Unfortunately file size on most systems is in bytes. So you need a way to pad the end of the file. Unfortunately padding with zeros results in stackoverflowo. Do you add a new <end> symbol or start with a length. (length only need be 3 bits) as you can get byte count from file length, oh and 000 would be 8). Don't worry about the waist of bits. Operating systems often allocate disk space in chunks of 512 bytes or bigger, though sometimes smaller. There is a bigger overhead needed, as if you don't know the word is stackoverflow, the can not know the alphabet you are using. – Lexi 28/6, 2012 at 21:38

possible duplicate of writing 'bits' to c++ file streams – Juneberry 28/6, 2012 at 21:45

panickal: there would be no problem at all, Huffman decode is very great, and it will work if all bits are together – Marozas 28/6, 2012 at 22:43

Just write the "header" to the file: the number of bits and then "pack" the bits into bytes padding the last one. Here's a sample.

#include <stdio.h>

FILE* f;

/* how many bits in current byte */
int bit_counter;
/* current byte */
unsigned char cur_byte;

/* write 1 or 0 bit */
void write_bit(unsigned char bit)
{
    if(++bit_counter == 8)
    {
        fwrite(&cur_byte,1,1,f);
        bit_counter = 0;
        cur_byte = 0;
    }

    cur_byte <<= 1;
    cur_byte |= bit;
}

int main()
{
    f = fopen("test.bits", "w");

    cur_byte = 0;
    bit_counter = 0;

    /* write the number of bits here to decode the bitstream later (47 in your case) */
    /* int num = 47; */           
    /* fwrite(num, 1, 4, f); */

    write_bit(1);
    write_bit(0);
    write_bit(0);
    /* etc...  - do this in a loop for each encoded character */
    /* 100110000100111010011111000010110110111011011111001010 */

    if(bit_counter > 0)
    {
         // pad the last byte with zeroes
         cur_byte <<= 8 - bit_counter;
         fwrite(&cur_byte, 1, 1, f);
    }

    fclose(f);

    return 0;
}

To do the full Huffman encoder you'll have to write the bit codes at the beginning, of course.

Pipe answered 28/6, 2012 at 21:41 Comment(2)

Same as roddy post, I would happen to need extra bits in the end to complete the last byte, which would provoke in the add of another character – Marozas 28/6, 2012 at 22:38

I've written: "write the number of bits here to decode the bitstream later". Yes, you just need a header at the beginning to know where the stream ends (I propose a simple bit counter, 4-byte integer). Sorry, that's just too obvious for me, hope it is clear for you now. – Pipe 28/6, 2012 at 22:46

This is sort of an encoding issue. The problem is that files can only contain bytes - so 1 and 0 can only be '1' and '0' in a file - the characters for 1 and 0, which are bytes.

What you'll have to do is to pack the bits into bytes, creating a file that contains the bits in a set of bytes. You won't be able to open the file in a text editor - it doesn't know you want to display each bit as a 1 or 0 char, it will display whatever each packed byte turns out to be. You could open it with an editor that understands how to work with binary files, though. For instance, vim can do this.

As far as extra trailing bytes or an end-of-file marker, you're going to have to create some sort of encoding convention. For example, you can pack and pad with extra zeros, like you mention in your comments, but then by convention have the first N bytes be metadata - e.g. the data length, how many bits are interesting in your file. This sort of thing is very common.

Baulk answered 28/6, 2012 at 21:35 Comment(0)

You'll need to manage this yourself, by buffering the bits to write and only actually writing data when you have a complete byte. Something like...

 void writeBit(bool b)
 {
   static char buffer=0;
   static int bitcount=0;

   buffer = (buffer << 1) | (b ? 1:0);

   if (++bitcount == 8)
   {
     fputc(buffer); // write out the byte
     bitcount = 0;
     buffer = 0;
   }
 }

The above isn't reentrant (and is likely to be pretty inefficient) - and you need to make sure you somehow flush any half-written byte at the end, (write an extra 7 zero bits, maybe) but you should get the general idea.

Airstrip answered 28/6, 2012 at 21:37 Comment(2)

Yes, i have tought of that but still, how can I manage to know where the end is? there will be a increased amount of data which is not indeed needed. – Marozas 28/6, 2012 at 22:35

You can decide that by convention, the last byte of a file is actually the number of padded bits. So let's assume you have 30bits. you start by padding this into a 32bits (4 bytes) and append one lasst byte of value 2 (the number of paddings) resulting on a 5 bytes sized file. The maximum overhead of this method is 2 Bytes which is nothing. I still think that any compression algorithm should at least present a metadata.(algorithm version, original file size, compression date, checksum, encryption method...). – Glanders 29/6, 2012 at 2:12

Recommended topics

Hot tags