How to read a file into unsigned char array from std::ifstream?

So normally I do something like:

    std::ifstream stream;
    int buff_length = 8192;
    boost::shared_array<char> buffer( new char[buff_length]);
    stream.open( path.string().c_str(), std::ios_base::binary);
    while (stream)
    {
        stream.read(buffer.get(), buff_length);
        //boost::asio::write(*socket, boost::asio::buffer(buffer.get(), stream.gcount()));
    }
    stream.close();

I wonder how to read into an unsigned char buffer instead (boost::shared_array<unsigned char> buffer(new unsigned char[buff_length]);).

Pushover answered 26/4, 2012 at 14:11 Comment(2)
One of the cases where reinterpret_cast<> is actually the correct approach. – Elocution
Also, I'd prefer a shared_ptr<std::vector<uint8_t>> to shared_array. – Tangential
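
The first comment's suggestion can be sketched roughly as follows. This is only an illustration: the helper name count_bytes_chunked is hypothetical, and a std::vector<unsigned char> stands in for boost::shared_array so the snippet needs no Boost.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the file at `path` in fixed-size chunks into an
// unsigned char buffer and returns the total number of bytes read.
// buff_length must be non-zero.
std::size_t count_bytes_chunked(const std::string& path, std::size_t buff_length = 8192) {
    std::ifstream stream(path.c_str(), std::ios_base::binary);
    std::vector<unsigned char> buffer(buff_length);
    std::size_t total = 0;

    while (stream) {
        // std::ifstream::read takes a char*, so the unsigned char buffer is
        // reinterpreted; narrow character types are allowed to alias each other.
        stream.read(reinterpret_cast<char*>(buffer.data()),
                    static_cast<std::streamsize>(buffer.size()));
        // gcount() is the number of bytes the last read actually stored; it is
        // smaller than buffer.size() for the final, partial chunk.
        total += static_cast<std::size_t>(stream.gcount());
        // The first gcount() bytes of buffer are valid here, e.g. for sending.
    }
    return total;
}
```

A file that cannot be opened leaves the stream in a failed state, so the loop body never runs and the function returns 0.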

In its simplest form:

std::vector<unsigned char> vec(
      std::istreambuf_iterator<char>(std::cin)
    , std::istreambuf_iterator<char>()
    );

Replace std::cin with your actual stream.
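
One caveat when substituting a named stream: if the first argument is an unqualified local variable such as stream, the declaration is parsed as a function declaration (the "most vexing parse"). A minimal sketch that sidesteps this with an extra pair of parentheses, assuming a hypothetical helper named read_all:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file into a vector of unsigned char.
std::vector<unsigned char> read_all(const std::string& path) {
    std::ifstream stream(path.c_str(), std::ios_base::binary);
    // The extra parentheses around the first argument keep the compiler from
    // treating this line as the declaration of a function named vec.
    std::vector<unsigned char> vec(
          (std::istreambuf_iterator<char>(stream))
        , std::istreambuf_iterator<char>()
        );
    return vec;
}
```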

The above is likely to perform more than one memory allocation (for any file larger than a few bytes) because std::istreambuf_iterator<> is an input iterator, not a forward or random-access iterator, so the length of the file cannot be obtained by subtracting the iterators (end - begin) or by calling std::distance(begin, end) without consuming the stream. This can be reduced to a single allocation: create the vector empty, call std::vector<>::reserve() to allocate memory for the whole file, then call the range insert vec.insert(vec.end(), beg, end), with beg and end being the std::istreambuf_iterator<> pair shown above, to read the entire file.

Unlike the constructor or resize(), std::vector<>::reserve() does not zero-initialize the vector's elements, which would be wasted work because they are about to be overwritten with the file content.


Years after the original 2012 answer, the C++17 standard library makes it possible to do exactly that:

#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

std::vector<unsigned char> load_file(std::filesystem::path file_path) {
    using namespace std;

    vector<unsigned char> file_content;
    // Allocate memory for the entire file once. 
    // But don't waste any CPU cycles initializing it.
    file_content.reserve(file_size(file_path)); // C++17 standard library provides file_size.

    // Open the file in binary mode and read its bytes into the vector.
    ifstream file_stream(file_path, ios_base::in | ios_base::binary);
    file_content.insert(file_content.end(), istreambuf_iterator<char>{file_stream}, {});

    return file_content;
}

int main() {
    // Load a file.
    auto file_content = load_file("/usr/share/doc/gcc/copyright");

    // Output file contents into std::cout.
    std::copy(file_content.begin(), file_content.end(), std::ostreambuf_iterator<char>{std::cout});
}

There is a race window between querying the file size and reading the file. A malicious user could exploit it by growing the file in between, so that file_content.insert in load_file tries to allocate more memory than is available and the process either throws std::bad_alloc or receives SIGSEGV (when overcommit_memory is enabled), which destabilises it. Keep this in mind when loading files from hostile environments.
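
One possible mitigation, which is not part of the original answer, is to cap how much the function is willing to read regardless of what the file size query reported. A rough sketch, where load_file_bounded, max_bytes and the 4096-byte chunk size are all hypothetical choices:

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical variant of load_file that refuses to grow past max_bytes, no
// matter how large the file becomes while it is being read.
std::vector<unsigned char> load_file_bounded(const std::string& path, std::size_t max_bytes) {
    std::ifstream stream(path.c_str(), std::ios_base::binary);
    std::vector<unsigned char> content;
    char chunk[4096];

    while (stream) {
        stream.read(chunk, sizeof chunk);
        const std::size_t got = static_cast<std::size_t>(stream.gcount());
        // content.size() never exceeds max_bytes, so the subtraction cannot wrap.
        if (got > max_bytes - content.size())
            throw std::length_error("file exceeds the configured size limit");
        content.insert(content.end(), chunk, chunk + got);
    }
    return content;
}
```

This trades the single up-front allocation for a hard upper bound on memory use.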


Ideally, though, you would read the entire file with a single read syscall or std::istream::read call, which std::istreambuf_iterator cannot do. Those calls require resizing the vector first, which performs the unnecessary initialization. Avoiding that initialization is possible, but it changes the vector type (for example, through a custom allocator).
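
A minimal sketch of the single-read approach, accepting the zero-initialization cost. The helper name load_file_single_read is hypothetical, and the size is obtained with seekg/tellg rather than file_size so that the snippet also builds before C++17:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: sizes the vector once, then fills it with a single
// std::istream::read call.
std::vector<unsigned char> load_file_single_read(const std::string& path) {
    // ios_base::ate positions the stream at the end, so tellg() yields the size.
    std::ifstream stream(path.c_str(), std::ios_base::binary | std::ios_base::ate);
    if (!stream) return {};

    const std::streamoff size = stream.tellg();
    if (size <= 0) return {};
    stream.seekg(0, std::ios_base::beg);

    // This constructor zero-initializes every byte before read() overwrites it;
    // that is the unnecessary work mentioned above.
    std::vector<unsigned char> content(static_cast<std::size_t>(size));
    stream.read(reinterpret_cast<char*>(content.data()),
                static_cast<std::streamsize>(content.size()));

    // Keep only what was actually read, in case the file shrank in the meantime.
    content.resize(static_cast<std::size_t>(stream.gcount()));
    return content;
}
```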

If the file is larger than a few kilobytes, it may be most efficient to map it into the process's address space, which avoids copying the data from the kernel to user space.


std::istreambuf_iterator<char> is used because its implementation relies on std::char_traits<>, which the standard library specializes only for character types such as char and wchar_t, not for unsigned char. Regardless, the C and C++ standards require all char types to have the same binary layout with no padding bits, so conversions between char, unsigned char and signed char (which are three distinct types, unlike signed int and int, which are the same type) preserve bit patterns and are therefore safe.

[basic.fundamental/1]

Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements; that is, they have the same object representation... For narrow character types, all bits of the object representation participate in the value representation... For unsigned narrow character types, each possible bit pattern of the value representation represents a distinct number. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined. For each value i of type unsigned char in the range 0 to 255 inclusive, there exists a value j of type char such that the result of an integral conversion from i to char is j, and the result of an integral conversion from j to unsigned char is i.
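
The last sentence of the quote can be checked directly. A small sketch, assuming a hypothetical function name; on mainstream two's-complement implementations it is expected to return true:

```cpp
#include <climits>

// Hypothetical check: returns true if every unsigned char value survives a
// round trip through plain char unchanged.
bool char_round_trip_is_lossless() {
    for (int i = 0; i <= UCHAR_MAX; ++i) {
        const unsigned char original = static_cast<unsigned char>(i);
        const char as_char = static_cast<char>(original);                // i -> j
        const unsigned char back = static_cast<unsigned char>(as_char);  // j -> i
        if (back != original) return false;
    }
    return true;
}
```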

Faceoff answered 26/4, 2012 at 15:35 Comment(6)
The standards don't require that char c = -1; unsigned char u = c; results in c and u having the same bit pattern. In theory signed chars could use 1's complement or sign-magnitude. – Chit
Non-negative values of signed char must have the same representation as the corresponding unsigned char. But in one's complement for example, char c = -1; has bit pattern 11111110 whereas unsigned char u = c; has bit pattern 11111111. This is mostly academic as I'm not aware of any C++ implementation ever that didn't use 2's complement. – Chit
@Chit Added the quote for you. – Faceoff
They clearly don't have the same value representation; the value -1 does have a value representation in signed char but not in unsigned char. – Chit
@Chit Added a longer quote for you. Value representation is different (the sign bit). Object representation is the same. – Faceoff
@MaximEgorushkin that text (added in C++14) appears to require that if plain char is signed it must follow 2's complement. But firstly, there is no equivalent requirement in C as you claim, and secondly, it still permits signed char c = -1; to use one's complement (i.e. get bit pattern 11111110) and have plain char be unsigned. – Chit
