Inserters and Extractors reading/writing binary data vs text

I've been trying to read up on iostreams and understand them better. Occasionally I find it stressed that inserters (<<) and extractors (>>) are meant to be used for textual serialization. The point comes up in a few places, but this article is a good example:

http://spec.winprog.org/streams/

Outside of the <iostream> universe, there are cases where the << and >> are used in a stream-like way yet do not obey any textual convention. For instance, they write binary encoded data when used by Qt's QDataStream:

http://doc.qt.nokia.com/latest/qdatastream.html#details

At the language level, the << and >> operators belong to your project to overload (hence what QDataStream does is clearly acceptable). My question would be whether it is considered a bad practice for those using <iostream> to use the << and >> operators to implement binary encodings and decodings. Is there (for instance) any expectation that if written to a file on disk that the file should be viewable and editable with a text editor?

Should one always be using other method names and base them on read() and write()? Or should textual encodings be considered merely a default behavior that classes integrating with the standard library iostream can elect to ignore?


UPDATE A key terminology issue on this seems to be the distinction of I/O that is "formatted" vs "unformatted" (as opposed to the terms "textual" vs "binary"). I found this question:

writing binary data (std::string) to an std::ofstream?

It has a comment from @TomalakGeret'kal saying "I'd not want to use << for binary data anyway, as my brain reads it as "formatted output" which is not what you're doing. Again, it's perfectly valid, but I just would not confuse my brain like that."

The accepted answer to the question says it's fine as long as you use ios::binary. That seems to bolster the "there's nothing wrong with it" side of the debate...but I still don't see any authoritative source on the issue.

Bukavu answered 22/11, 2011 at 17:10 Comment(9)
"Textual encoding" is a misleading term. "Formatted I/O" is more appropriate, I would say.Evoy
Do whatever your framework does.Norland
@KerrekSB I have a clearer sense for what "Textual encoding" rules out than what "Formatted I/O" rules out. If I have an object with N 32-bit integers in it, then using write() to output 4 bytes for N followed by 4*N bytes corresponding to the values...is that still "formatted"?Bukavu
No, write() outputs unformatted data, so the actual binary (implementation) representation of your integer is inserted into the output stream as-is. By contrast, "formatting" may be something like creating a textual representation of the value of the integer and then inserting the (binary representation of the) text into the output.Evoy
@KerrekSB I still can't tell if your objection to the phrase "textual encoding" is a serious or minor issue, compared to the bigger question I'm trying to get at. But rather than chat here, would you mind reading-between-the-lines of my question and posting a response which (to the best of your ability) cites sources and addresses it? I'm talking about a solid "don't do that" argument to help with people suggesting things like this: https://mcmap.net/q/1176347/-load-binary-file-using-fstream/211160Bukavu
You're confusing formatted/unformatted access with textual/binary [mode] streams. And you'll be lucky to find any "authoritative source" on a completely subjective issue.Trainbearer
@TomalakGeret'kal Regardless of if I can find a whole answer, I'd consider an authoritative source on what formatted and unformatted access are to be at least some progress on the matter!!! :-/ I'd also, by the way, accept an authoritative "it's subjective" answer if it could be found...Bukavu
A formatted output function is simply one that encodes values in a well-defined way before sending them to the lower layer. OTOH, unformatted output functions take the bytes given by the higher level and pass them unchanged to the lower layer. It is important to remember that "unformatted" applies only to the function, not the data: usually the data written using an unformatted output function is formatted, but before being sent to the function, whereas data given to a formatted output function is formatted by the function itself.Alphorn
(...) This formatted/unformatted output function opposition is not related to the binary/text format opposition.Alphorn
B
9

Strictly speaking, << and >> are bit-shift operators; using them for I/O is already a misuse. However, that misuse is about as old as operator overloading itself, and I/O is today their most common use, so they are widely regarded as I/O insertion/extraction operators. I'm pretty sure that without the precedent of iostreams, nobody would use those operators for I/O (especially with C++11, whose variadic templates solve, in a much cleaner way, the main problem that those operators solved for iostreams). On the other hand, from the language's point of view, overloaded operator<< and operator>> can mean whatever you want them to mean.

So the question boils down to what would be an acceptable use of those operators. For this, I think one has to distinguish two cases: First, new overloads working on iostream classes, and second, new overloads working on other classes, possibly designed to work like iostreams.

Let's first consider new operators on iostream classes. Let me start with the observation that the iostream classes are all about formatting (and the reverse process, which could be called "deformatting"; "lexing" IMHO wouldn't be quite the right term here, because the extractors don't determine the type but only try to interpret the data according to the type given). The classes responsible for the actual I/O of raw data are the streambufs. Note, however, that a proper binary file is not a file where you just dump internal raw data. Just like a text file (actually even more so), a binary file should have a well-specified encoding of the data it contains, especially if the files are expected to be read on different systems. Therefore the concept of formatted output makes perfect sense for binary files too; only the formatting is different (e.g. writing a pre-determined number of bytes, most significant first, for an integer value).

The iostreams themselves are classes intended to work on text files, that is, on files whose content is interpreted as a textual representation of data. A lot of built-in behaviour is optimized for that and may cause problems if used on binary files. An obvious example is that, by default, whitespace is skipped before any formatted input is attempted. For a binary file, this would clearly be the wrong behaviour. The use of locales doesn't make sense for binary files either (one might argue that there could be a "binary locale", but I don't think locales as defined for iostreams provide a suitable interface for that). Therefore I'd say writing binary operator<< or operator>> for iostream classes would be wrong.
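The whitespace-skipping problem can be seen in a small sketch. The two function names are hypothetical; the stream calls (`operator>>` on `char` and `istream::get`) are standard. In binary data, bytes such as 0x20 or 0x09 are ordinary payload, yet the formatted extractor silently discards them.

```cpp
#include <sstream>
#include <string>

// Extract one char with the formatted extractor. Leading whitespace is
// skipped because the skipws flag is set by default.
char first_char_formatted(const std::string& data) {
    std::istringstream in(data);
    char c = '\0';
    in >> c;
    return c;
}

// Extract one char with the unformatted get(). No bytes are skipped.
char first_char_unformatted(const std::string& data) {
    std::istringstream in(data);
    char c = '\0';
    in.get(c);
    return c;
}
```

For the input `" \tA"`, the formatted version yields `'A'` while the unformatted version yields the leading space. Clearing the flag with `std::noskipws` avoids this particular surprise, but it is one of several text-oriented defaults a binary extractor would have to work around.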

The other case is where you define a separate class for binary input/output (possibly reusing the streambuf layer for doing the actual I/O). Since we are now speaking about different classes, the argumentation above doesn't apply any more. So the question now is: Should operator<< and operator>> on I/O be regarded as "text insertion/extraction operators" or more generally as "formatted data insertion/extraction operators"? The standard classes only use them for text, but then, there are no standard classes for binary I/O insertion/extraction at all, so the standard usage cannot distinguish between the two.

I personally would say that binary insertion/extraction is close enough to textual insertion/extraction that this usage is justified. Note that you also could make meaningful binary I/O manipulators, e.g. bigendian, littleendian and intwidth(n) to determine the format in which integers are to be output.

Beyond that there's also the use of those operators for things which are not really I/O (and where you wouldn't even think of using the streambuf layer), like reading from or inserting into a container. In my opinion, that already constitutes misuse of the operators, because there the data isn't translated into or out of a different format. It is just stored in a container.

Bobker answered 29/11, 2011 at 18:53 Comment(5)
Thank you for the lengthy answer. Among the points you make, it sounds as if you are suggesting that a "proper" use of iostream inserters/extractors would produce a file that can be transferred across different platform architectures or compiler implementations...and yet have the same meaning. Is there a reliable source that defines this to be what "formatted" means in this context? If so, are << and >> somehow uniquely tied to that responsibility (as opposed to a project that only uses .read() and .write())?Bukavu
@HostileFork: Actually I'd say a proper binary file has a well-defined format; the fact that you can then read it from another platform follows automatically. A project that uses .read() and .write() exclusively hopefully has a well-defined binary format too. An exception to this rule would be a temporary file which is effectively used as swap; such a file does not survive the current process and will never be read by another one, so it may well just contain memory dumps. It is obviously a good idea for a facility meant to write binary files to help the user write proper binary files.Bobker
Bounty awarded for detailed response. I'm still a bit unsettled on "the answer" and I may hold off on declaring that and perhaps write my own just because I still feel the water is a little muddy here. It's hard to really distill out guidance to newbies wanting to grasp iostream, as no one has stepped forward to a commitment to "this is good code, do it this way" or "this is bad code, don't do it this way".Bukavu
@HostileFork: Thank you for awarding the bounty.Bobker
I'm still unsure if I really think the question is answered, it may not have an answer. But I'm looking to close old questions, so am accepting. If you have any comments on these slides feel free to make them.Bukavu
A
4

The abstraction of the iostreams in the standard is that of a textually formatted stream of data; there is no support for any non-text format. There's nothing wrong with defining a different stream class whose abstraction is a binary format, but doing so in an iostream will likely break existing code, and will not work.

Almagest answered 22/11, 2011 at 17:16 Comment(9)
My question was predicated on acceptance that the << and >> operators belong to your project to overload (hence what QDataStream was doing is acceptable). I suppose my question is more where-to-point people who want to implement iostream inserters and extractors to say that there's an expectation that if written to a file on disk that the file should be viewable and editable with a text editor...Bukavu
There's no problem with using << and >> for binary formats on streams other than iostream. The problem is changing their meaning on iostream.Almagest
But...says who? Note my comments with @KerrekSB above. He suggests that write() of an implementation integer is not "formatted"...but what if I corrected for endianness and wrote out a very specific 4-byte pattern that would work cross-platform? Where in the specification or canon can we find it written that objects should not serialize themselves in that way with these operators for iostream...?Bukavu
@HostileFork Says the authors of iostream. The authors of the standard have pretty much followed suit, and there isn't really any support for non-text formats. All of the standard >> operators read text formats, and expect to be surrounded by text. You can insert a binary format in a character buffer, and output it using ostream::write, but it's not really what iostreams were designed for. It's appropriate if you have basically a text stream, with a small bit of binary data inserted into it, but if the entire stream is binary, it's far better to use a binary stream.Almagest
@JamesKanze I do appreciate what might be called your "argument from intuition". It's what I also observed...that there's no binary serialization methods in iostream as-shipped, and since everything else is done with text then one appears to rock the boat by doing otherwise. I gave the Qt example as precedent for non-iostream, but I'm wondering if there are any counter-examples in iostream canon, such as in boost, for instance? (Basically is there anyplace other than questionable code samples or backwoods codebases that << and >> are used to serialize binary formats to iostreams?)Bukavu
@HostileFork "I do appreciate what might be called your "argument from intuition"" I don't get it: what is James' argument from intuition?Alphorn
@curiousguy: I said "says who?" and he said "Says the authors of iostream. The authors of the standard have pretty much followed [suit], and there isn't really any support for non-text formats. All of the standard >> read text formats, and expect to be surrounded by text." But there's a difference between saying "out of the box, all the << and >> operators for standard types use textual formats on iostream" and an authoritative source saying "when you serialize your types using << and >>, you should use text too"...Bukavu
@HostileFork Over the years, some bstream classes have been implemented, following the insertion/extraction syntax of iostream. There is nothing inherently wrong with non-text << and >>. James rightly mentions that istream formatted input functions are defined to do text extraction, so istream& >> int& is going to parse an integer represented as text (and trying to use a customised locale to indicate non-text is going to cause problems). But that doesn't imply that ibstream& >> int& couldn't parse a binary representation.Alphorn
@Alphorn In my question I give the example of Qt's QDataStream as understanding that established "authorities" (well...to the extent Qt is an authority) have endorsed these operators for binary formats. I'm just trying to get an answer rather specifically for the iostream. An example of a relevant question I cite which provoked "uneasiness" but not a reaction of "incorrectness" was this one: #5355663Bukavu
S
3

The overloaded operators >> and << perform formatted I/O. The remaining I/O functions (put, get, read, write, etc.) perform unformatted I/O. Unformatted I/O means that the I/O library only accepts a buffer, a sequence of characters, as its input. This buffer might contain a textual message or binary content; it's the application's responsibility to interpret it. Formatted I/O, however, takes the locale into consideration. In the case of text files, depending on the environment where the application runs, some special character conversion may occur in input/output operations to adapt them to a system-specific text file format. In many environments, such as most UNIX-based systems, it makes no difference whether a file is opened as a text file or a binary file. Note that you can overload operator >> and << for your own types. That means you can apply formatted I/O without locale information to your own types, though that's a bit tricky.

Sosthenna answered 28/11, 2011 at 6:41 Comment(1)
It is interesting that you bring up the issue of locale as an important distinction in the definition of "formatting". But I'm still trying to understand exactly what a "good" or "bad" practice would be. Do you think you can find or create some example code to illustrate the contrast?Bukavu
