Read and write array to compressed file with boost iostreams
I want to write an array to a file, compressing it as I go.

Later, I want to read the array from that file, decompressing it as I go.

Boost's Iostreams seems like a good way to go, so I built the following code. Unfortunately, the output and input data do not compare equal at the end. But they very nearly do:

Output         Input
0.8401877284   0.8401880264
0.3943829238   0.3943830132
0.7830992341   0.7830989957
0.7984400392   0.7984399796
0.9116473794   0.9116470218
0.1975513697   0.1975509971
0.3352227509   0.3352229893

This suggests that the least significant byte of each float is getting changed, or something. The compression should be lossless, though, so this is not expected or desired. What gives?

//Compile with: g++ test.cpp --std=c++11 -lz -lboost_iostreams
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <cstdlib>
#include <vector>
#include <iomanip>

int main() 
{
    using namespace std;
    using namespace boost::iostreams;

    const int NUM = 10000;

    std::vector<float> data_out;
    std::vector<float> data_in;
    data_in.resize(NUM);
    for(int i=0;i<NUM;i++)
      data_out.push_back(rand()/(float)RAND_MAX);

    {
      ofstream file("/z/hello.z", ios_base::out | ios_base::binary);
      filtering_ostream out;
      out.push(zlib_compressor());
      out.push(file);

      for(const auto d: data_out)
        out<<d;
    }

    {
      ifstream file_in("hello.z", ios_base::in | ios_base::binary);
      filtering_istream in;
      in.push(zlib_decompressor());
      in.push(file_in);

      for(float i=0;i<NUM;i++)
        in>>data_in[i];
    }

    bool all_good=true;
    for(int i=0;i<NUM;i++){
      cout<<std::setprecision(10)<<data_out[i]<<"   "<<data_in[i]<<endl;
      all_good &= (data_out[i]==data_in[i]);
    }

    cout<<"Good? "<<(int)all_good<<endl;
}

And, yes, I very much prefer to use the stream operators in the way I do, rather than pushing or pulling an entire vector block at once.

Eritrea asked 19/5, 2016 at 17:50 Comment(8)
Does the problem also occur when you leave the compression out of the sample?Clothilde
Yes, it does, just tried it.Eritrea
You're printing the floats into the file as strings, with no separators. It's only because all your values are less than 1 that you get anything remotely sensible; otherwise it would be unparseable.Clothilde
I wouldn't think separators should be necessary for compression/decompression: I imagine (perhaps erroneously) that the floats get pushed into an internal byte buffer and compressed as sufficient data becomes available. Decompression would work similarly: the data is read into an internal buffer and pulled out of it as sufficient information becomes available.Eritrea
Put another way: printing without compression leaves ambiguity as to where the boundaries are, because a 4-byte float is represented by a variable number of bytes in the output. With compression there would be no ambiguity, because reading/writing would continue until four bytes' worth of data had been translated.Eritrea
Well, they are if you're storing it as text -- unless that was not what you intended, or you wish to print it as text using a consistent number of characters (but then you'll probably need something different than the default operator >> to read it)Clothilde
"printing without compression leaves ambiguity" + "With compression there is no ambiguity" -- that doesn't make sense. You're writing the text representation of the floating point value into the stream, and then compressing the contents of the stream. You dropped the 4-byte representation the moment you called the operator <<.Clothilde
It makes perfect sense, if you're not expecting the << operator to be doing a conversion to text. Hmmm, that is unfortunate.Eritrea
Using boost serialization here is undesirable because, as I understand that library, I then end up with double the memory requirements of the current approach.Eritrea

As Dan Mašek pointed out in their answer, the << stream operator I was using was converting my floating-point data into a textual representation prior to compression. For some reason, I hadn't expected this.

Using the serialization library is one way to avoid this, but it would introduce an additional dependency as well as possible overhead.

Therefore, I have used a reinterpret_cast on the floating-point data and the ostream::write() method to write the data without conversion, one character at a time. Reading uses a similar method. Efficiency could be improved by writing more than one character at a time, as sketched after the code below.

#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <cstdlib>
#include <vector>
#include <iomanip>

int main() 
{
    using namespace std;
    using namespace boost::iostreams;

    const int NUM = 10000;

    std::vector<float> data_out;
    std::vector<float> data_in;
    data_in.resize(NUM);
    for(int i=0;i<NUM;i++)
      data_out.push_back(233*(rand()/(float)RAND_MAX));

    {
      ofstream file("/z/hello.z", ios_base::out | ios_base::binary);
      filtering_ostream out;
      out.push(zlib_compressor());
      out.push(file);

      // Reinterpret the float array as raw bytes and write it byte by byte
      char *dptr = reinterpret_cast<char*>(data_out.data());

      for(size_t i=0;i<sizeof(float)*NUM;i++)
        out.write(&dptr[i],1);
    }

    {
      ifstream file_in("hello.z", ios_base::in | ios_base::binary);
      filtering_istream in;
      in.push(zlib_decompressor());
      in.push(file_in);

      // Read raw bytes directly back into the float array, one byte at a time
      char *dptr = reinterpret_cast<char*>(data_in.data());

      for(size_t i=0;i<sizeof(float)*NUM;i++)
        in.read(&dptr[i],1);
    }

    bool all_good=true;
    for(int i=0;i<NUM;i++){
      cout<<std::setprecision(10)<<data_out[i]<<"   "<<data_in[i]<<endl;
      all_good &= (data_out[i]==data_in[i]);
    }

    cout<<"Good? "<<(int)all_good<<endl;
}
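As a minimal sketch of the efficiency improvement mentioned above, the byte-by-byte loops can be replaced with a single bulk write() and read() per array; the filter chain still compresses and decompresses incrementally as the bytes pass through. (This variant is an untested illustration, not part of the original answer; the file name hello.z is assumed as above.)

//Compile with: g++ bulk.cpp --std=c++11 -lz -lboost_iostreams
#include <fstream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <cstdlib>
#include <vector>

int main()
{
    using namespace std;
    using namespace boost::iostreams;

    const int NUM = 10000;

    std::vector<float> data_out;
    std::vector<float> data_in(NUM);
    for(int i=0;i<NUM;i++)
      data_out.push_back(rand()/(float)RAND_MAX);

    {
      ofstream file("hello.z", ios_base::out | ios_base::binary);
      filtering_ostream out;
      out.push(zlib_compressor());
      out.push(file);

      // One call hands the whole buffer to the compressor
      out.write(reinterpret_cast<const char*>(data_out.data()),
                sizeof(float)*NUM);
    }

    {
      ifstream file_in("hello.z", ios_base::in | ios_base::binary);
      filtering_istream in;
      in.push(zlib_decompressor());
      in.push(file_in);

      // ...and one call pulls all the decompressed bytes back out
      in.read(reinterpret_cast<char*>(data_in.data()),
              sizeof(float)*NUM);
    }

    return (data_out==data_in) ? 0 : 1;  // exit 0 on a lossless round trip
}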
Eritrea answered 19/5, 2016 at 19:19 Comment(2)
Just out of curiosity, on what do you base the conclusion that using boost::serialization would double the memory requirements? From my understanding it writes to the underlying stream pretty much like you do. If anything, the impact of using zlib on your memory footprint will be much more significant (for all its internal data structures).Clothilde
My bad, it's been a while since I've worked with serialization. When I did, the serialized representation needed to be held in memory, which doubled the requirement. I suppose there is probably a way to use serialization in a streaming fashion. I moved away from Boost serialization, though, to Cereal, to reduce dependencies and for speed (link).Eritrea

The problem is not with compression, but in the way you serialize the values of the vector.

If you disable the compression and limit the size to 10 elements for easier inspection, you can see that the produced file looks something like this:

0.001251260.5635850.1933040.808740.5850090.4798730.3502910.8959620.822840.746605

As you can see, the numbers are represented as text, with a limited number of decimal places, and no separator. It is by sheer chance (since you're only working with values < 1.0) that your program was able to produce a remotely sensible result.

This happens because you use the stream operator <<, which formats numeric types as text.
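A standalone snippet (an illustration added here, not from the original post) makes the conversion obvious:

#include <iostream>
#include <sstream>

int main()
{
    std::ostringstream ss;
    ss << 0.8401877284f;            // operator<< formats the float as text
    std::cout << ss.str() << "\n";  // prints "0.840188": six significant
                                    // digits by default, not 4 raw bytes
}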


The simplest solution would seem to be using boost::serialization to handle the reading and writing (and boost::iostreams as the underlying compressed stream). I used a binary archive, but you could conceivably use a text archive as well (just replace the binary_ with text_).

Sample code:

#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>

#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/vector.hpp>

#include <cstdlib>
#include <vector>
#include <iomanip>

int main() 
{
    using namespace std;
    using namespace boost::iostreams;

    const int NUM = 10;

    std::vector<float> data_out;
    for (int i = 0; i < NUM; i++) {
        data_out.push_back(rand() / (float)RAND_MAX);
    }

    {
        ofstream file("hello.z", ios_base::out | ios_base::binary);
        filtering_ostream out;
        out.push(zlib_compressor());
        out.push(file);

        boost::archive::binary_oarchive oa(out);
        oa & data_out;
    }

    std::vector<float> data_in;
    {
        ifstream file_in("hello.z", ios_base::in | ios_base::binary);
        filtering_istream in;
        in.push(zlib_decompressor());
        in.push(file_in);

        boost::archive::binary_iarchive ia(in);
        ia & data_in;
    }

    bool all_good=true;
    for(int i=0;i<NUM;i++){
      cout<<std::setprecision(10)<<data_out[i]<<"   "<<data_in[i]<<endl;
      all_good &= (data_out[i]==data_in[i]);
    }

    cout<<"Good? "<<(int)all_good<<endl;
}

Console Output:

0.001251258887   0.001251258887
0.563585341   0.563585341
0.1933042407   0.1933042407
0.8087404966   0.8087404966
0.5850093365   0.5850093365
0.4798730314   0.4798730314
0.3502914608   0.3502914608
0.8959624171   0.8959624171
0.822840035   0.822840035
0.7466048002   0.7466048002
Good? 1

A minor problem with writing the raw values yourself is that you do not serialize the size of the vector, so when reading you have to keep reading until the end of the stream (the serialization archives store the element count for you).
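If you do write raw floats yourself, a minimal sketch of "read until end of stream" could look like the following (a hypothetical helper, assuming a file produced by raw write() calls as in the asker's answer):

#include <fstream>
#include <string>
#include <vector>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>

// Read floats from a zlib-compressed file until EOF, since the
// stream does not store the element count.
std::vector<float> read_all_floats(const std::string &path)
{
    using namespace boost::iostreams;
    std::ifstream file(path, std::ios_base::in | std::ios_base::binary);
    filtering_istream in;
    in.push(zlib_decompressor());
    in.push(file);

    std::vector<float> result;
    float f;
    while(in.read(reinterpret_cast<char*>(&f), sizeof(f)))
      result.push_back(f);
    return result;
}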

Thanatopsis answered 19/5, 2016 at 19:0 Comment(1)
Thanks, Dan. Your apt observation was sufficient for me to figure everything else out. I'm not sure why I was expecting the << operator not to do a conversion.Eritrea