vector <unsigned char> vs string for binary data
A

9

33

Which is the better C++ container for holding and accessing binary data?

std::vector<unsigned char>

or

std::string

Is one more efficient than the other?
Is one a more 'correct' usage?

Allrud answered 12/10, 2009 at 18:45 Comment(2)
Have a look at this post about using char vs unsigned char for binary data: #278155Gens
For an example of when std::string is used for binary data, see Google ProtobufMaebashi
K
32

You should prefer std::vector over std::string. In common cases both solutions can be almost equivalent, but std::strings are designed specifically for strings and string manipulation and that is not your intended use.

Kotz answered 12/10, 2009 at 20:23 Comment(5)
"Say that the default character traits determine that 'a' and 'á' are equivalent" That is a bad assumption. See the answer I wrote as continuation to this comment.Gens
I rechecked, and you are right in that the standard does define the specialization char_traits<char> and with the standard specialization, assignment, comparisons and ordering are defined as the equivalent for the built-in char type.Disrelish
So with default char_traits std::string would compare no differently than the corresponding std::vector?Allrud
@kalaxy: correct. Anyway, each class was meant for a purpose, and std::vector better suits what you want from a buffer, so if only because the intention is clearer (as fnieto points out in his answer) I would prefer std::vectorDisrelish
@DavidRodríguez-dribeas: I edited your answer, since I understand (from the comments) that the previous version was incorrect.Emboss
G
15

Both are correct and equally efficient. Using one of those instead of a plain array is only to ease memory management and passing them as argument.

I use vector because the intention is more clear than with string.

Edit: C++03 standard does not guarantee std::basic_string memory contiguity. However from a practical viewpoint, there are no commercial non-contiguous implementations. C++0x is set to standardize that fact.

Gens answered 12/10, 2009 at 18:49 Comment(3)
from Sgi: "The basic_string class represents a Sequence of characters. It contains all the usual operations of a Sequence, and, additionally, it contains standard string operations such as search and concatenation.". Why is that incorrect? I agree it is not the best approach (as I state in my answer) but it is not incorrect.Gens
So string works just as well as the vector because it in a sense extends the functionality of a vector yet the only functionality I will need ([] or the like) is contained in both? (Yes I realize that string doesn't actually inherit from vector.)Allrud
Yes, but conceptually it is a worse option and has methods that do not make sense for a buffer. If you only want memory management and operator[], why use a class as complex as std::string?Gens
T
3

Is one more efficient than the other?

This is the wrong question.

Is one a more 'correct' usage?

This is the correct question.
It depends. How is the data being used? If you are going to use the data in a string-like fashion then you should opt for std::string, as using a std::vector may confuse subsequent maintainers. If on the other hand most of the data manipulation looks like plain maths or is vector-like, then a std::vector is more appropriate.

Terina answered 12/10, 2009 at 18:57 Comment(0)
B
3

For the longest time I agreed with most answers here. However, just today it hit me why it might be more wise to actually use std::string over std::vector<unsigned char>.

As most agree, using either one will work just fine. But often times, file data can actually be in text format (more common now with XML having become mainstream). This makes it easy to view in the debugger when it becomes pertinent (and these debuggers will often let you navigate the bytes of the string anyway). But more importantly, many existing functions that can be used on a string, could easily be used on file/binary data. I've found myself writing multiple functions to handle both strings and byte arrays, and realized how pointless it all was.

Boogie answered 28/11, 2017 at 17:0 Comment(0)
G
1

This is a comment to dribeas answer. I write it as an answer to be able to format the code.

This is the char_traits compare function, and the behaviour is quite healthy:

static bool
lt(const char_type& __c1, const char_type& __c2)
{ return __c1 < __c2; }

template<typename _CharT>
int
char_traits<_CharT>::
compare(const char_type* __s1, const char_type* __s2, std::size_t __n)
{
  for (std::size_t __i = 0; __i < __n; ++__i)
    if (lt(__s1[__i], __s2[__i]))
      return -1;
    else if (lt(__s2[__i], __s1[__i]))
      return 1;
  return 0;
}
Gens answered 13/10, 2009 at 8:1 Comment(2)
Is this behavior well defined in the standard?Decadence
+1: @gnud: Not in general, but fnieto is right (I just checked it) in that the standard defines the specialization of traits for char, where assign, eq and lt must be defined as builtin operators =, == and < for type char.Disrelish
L
0

As far as readability is concerned, I prefer std::vector. std::vector should be the default container in this case: the intent is clearer and as was already said by other answers, on most implementations, it is also more efficient.

On one occasion I did prefer std::string over std::vector though. Let's look at the signatures of their move constructors in C++11:

vector (vector&& x);

string (string&& str) noexcept;

On that occasion I really needed a noexcept move constructor. std::string provides it and std::vector does not.

Langmuir answered 24/6, 2014 at 8:19 Comment(0)
B
-1

If you just want to store your binary data, you can use bitset which optimizes for space allocation. Otherwise go for vector, as it's more appropriate for your usage.

Bottomry answered 12/10, 2009 at 18:47 Comment(16)
bitset is not a good choice. How are you going to get the data back out without casting? How do you easily read a byte out of a bitset? This isn't the right application for bitset.Pyriphlegethon
Hence, "if you just want to store your binary data". This is important in some memory intensive processes - for e.g. when working with binary images, you'd want to store them temporarily and then reuse them later.Bottomry
How often do you actually "just store data" though? If I was going to store it I would use a file or just an array or vector. What advantages does bitset have for storage? How do you even get your binary data into a bitset? Bitset has really lousy constructors for that purpose. Have you actually tried to do this? Bitset has a default constructor, a constructor that takes an unsigned long, and one that takes a string. Not real convenient for this purpose.Pyriphlegethon
Storing it in an array or a vector would defeat the purpose of storage since we're using bitset for its optimized allocation of bits. Passing a string of bits is not that difficult. As for applications, binary images are one: an RGB 1024x768 is 2.25MB stored as uchars - imagine storing a small batch of frames (which is not unrealistic). Also, r/w to files is much slower than storing it temporarily as a bitset. Additionally, I did mention that if storage wasn't the prime motivation, vector is better.Bottomry
Bitset is not optimized for storage of bits. In fact, the standard makes no guarantees on how the bits are actually stored. Bitset is used when you need, what else, a set of bits, as for example, flag manipulation. Please tell me how you are going to store a binary image 2.25 MB in size in a bit set. There is nothing more optimized for space allocation than an array of unsigned char.Pyriphlegethon
Read the line about optimizing space allocation: cplusplus.com/reference/stl/bitsetBottomry
Jacob, this is silly. You claim that bitset is useful for storing binary data. This is absurd. Bitset is not a container, and it has no suitable constructors for being initialized from raw data, unlike vector or string. Are you seriously telling me you would construct a string of ASCII 1's and 0's from 2.25 MB of binary data in order to construct a bitset??? That's a pretty big string. Think about it. Bitset was not meant for this purpose. The C++ standard does not even specify how bitset internally stores data, unlike vector, which the standard guarantees to be contiguous.Pyriphlegethon
There is no more compact way to store data in memory in C++ than with an array of unsigned char. The standard guarantees that you can treat the memory inside of vector<unsigned char> as a contiguous array. You cannot (portably) do that with bitset. You can't (portably) memcpy raw data into a bitset either.Pyriphlegethon
bitset is efficient at storing binary data - I never said bitset was an STL container. And creating that "pretty big string" (which would use unsigned char, btw) is trivial. Also, everything I've seen till now (sample code on my compiler, Googling and Effective STL (pg.70)) indicates that bitset does store binary data effectively. And yes, there is a better way to store binary data, and it's bitset - have you tried it out on your compiler? It's only two lines of code.Bottomry
To initialize a 2.25 MB bitset, you need a 10 MB string; each character in the string represents just one bit in the bitset. Also, you need to know how many bits you'll need at compile time. There are just two ways of extracting a bitset's contents en masse: to_ulong is useless if you have more bits than fit in a long, and to_string returns a string of zeroes and ones that can't easily be used in any other data type. So, yes, if all you want to do is store a preset amount of data, bitset might be OK. If you want the data back, or if the size is uncertain, then it's a lousy choice.Luminal
Agreed, if the size is uncertain, it's lousy, but getting the data back is not since it's the same as storing the data, you can use bitset::to_string. And yes, you need a 10 MB string - that's the whole point of using bitset. Suppose you have an array of bits which you've obtained as unsigned chars after some logical operation perhaps, and it's 10MB and you want to store it in memory - what do you do? bitset!Bottomry
Ha-ha, you keep messing with your 10 MB string and I'll use my 2 MB vector<unsigned char>. I still have absolutely no clue why you feel bitset is good for "storing" data. Why is it better than vector? And what the heck are you supposed to do with it while it is in bitset? And yes I have tried to use bitset for binary data. I actually wrote my own implementation of bitset and gave it constructors and accessors to get the raw data in and back out for embedded systems. But I need it because I was using it as it was intended, as a set of bit flags, not storage.Pyriphlegethon
The fact that bitset doesn't provide (begin, end) constructors and raw data accessors makes it absolutely terrible for storing data. Your only way in or out for large numbers of bits is string? You also cannot say it is optimized for storage. As I have said several times, the standard does not guarantee how bitset should store data, unlike vector. For all you know, your bitset may store 1 bit in every byte for speed. I know of no implementation that actually does this, but that's why you can't count on it or portably memcpy it around. P.S. Don't rely on cplusplus.com for everything.Pyriphlegethon
I don't think you understand what I'm saying. Your 2MB vector<unsigned char> which is supposed to represent 2Mbits can be more efficiently stored on most implementations (could you point out an implementation which performs so poorly? I can't find one!) using bitset. How? You throw it in to the constructor and poof! you get a bitset which has stored your data by possibly a factor of 8. Also, all I've said, repeatedly is, storage. Nothing about accessors, etc. etc.Bottomry
@Jacob: I think you have a communications problem here with Brian. If you read a 1024x768@24 bit raw image you will have 2.25MBytes of information. The most that a bitset can pack the data is one bit for each element, and at that level it will require exactly 2.25MBytes of memory, just as a vector of bytes. Bitset will be an advantage if each of your original elements is a bit (at this point you can note that std::vector<bool> is a specialization that is optimized for space, not that the standard committee is happy about it), so at that point it won't even take more memory than a bitset.Disrelish
... Now, if your intended use is testing flags, using a vector of bytes will be more cumbersome as it will require extracting each byte and then testing each bit for reading, extracting the byte, setting the bit and inserting the result back for setting a bit. At that point using a bitset or vector<bool> will simplify user code. But the thing is that if the elements you work with are not bits but rather bytes, then a vector is more efficient cpu wise than a bitset and is not less efficient memory wise. In most cases, when people talk about storing binary data they refer to bytes, not bits.Disrelish
P
-1

Compare these two and choose for yourself which is more specific to your needs. Both are very robust and work with STL algorithms ... Choose for yourself which is more effective for your task.

Piecedyed answered 12/10, 2009 at 19:30 Comment(0)
P
-1

Personally I prefer std::string because string::data() is much more intuitive for me when I want my binary buffer back in C-compatible form. I know that vector elements are guaranteed to be stored contiguously, but exercising this in code feels a little bit unsettling.

This is a style decision that an individual developer or a team should make for themselves.

Prevalent answered 12/10, 2009 at 21:47 Comment(10)
You prefer using a string for non-string data? Rather than using the container designed for contiguous storage of data of any type?Jewish
Let's not forget that this is a matter of style. Perfectly workable and standard compliant code for binary buffers can be created with either of these classes. I would argue that vector is not designed to be a binary buffer either. It is compatible, but you will have to revert to algorithms or C tricks to get the job done. Not all string operations are safe, but some of them are quite useful to make the code cleaner and more maintainable.Prevalent
Vector is quite suited to store binary data, e.g. vector<unsigned char> v(256). I don't consider &v[0] a "C trick".Pyriphlegethon
No, &v[0] is fine, and so is s.data(). What is vector's alternative for string s; s.assign(BinaryBuffer, BinaryBufferSize); ?Prevalent
vector<unsigned char> v; v.assign(BinaryBuffer, BinaryBuffer + BinaryBufferSize);Pyriphlegethon
Of course vector has a constructor explicitly for that purpose too: vector<unsigned char> v(first, last);Pyriphlegethon
Thus you have to explicitly parametrize vector with unsigned char and make sure pointer arithmetic works correctly in BinaryBuffer + BinaryBufferSize. Looks like more pitfalls than the string option to me. As I said in the beginning, this is clearly a style issue. There's no such thing as "universal style". Teams or individual developers should decide which option they like better and adhere to that.Prevalent
Um, string is already parameterized by char, did you notice? So typedef your vector<unsigned char> if that makes you feel weird. String is meant for strings of characters, not raw binary data. String is a much more heavy-weight solution.Pyriphlegethon
And what do you mean by making sure pointer arithmetic works correctly? Vector uses the 2-iterator (begin, end) idiom like the rest of the STL (and string). Hardly more pitfalls than string.Pyriphlegethon
Pointer arithmetic may play tricks if BinaryBuffer is not (unsigned char*). Could you please elaborate on what makes string much more heavyweight?Prevalent
