Does the C++ standard mandate poor performance for iostreams, or am I just dealing with a poor implementation?

Every time I mention slow performance of C++ standard library iostreams, I get met with a wave of disbelief. Yet I have profiler results showing large amounts of time spent in iostream library code (full compiler optimizations), and switching from iostreams to OS-specific I/O APIs and custom buffer management does give an order of magnitude improvement.

What extra work is the C++ standard library doing, is it required by the standard, and is it useful in practice? Or do some compilers provide implementations of iostreams that are competitive with manual buffer management?

Benchmarks

To get matters moving, I've written a couple of short programs (hosted on ideone) to exercise the iostreams' internal buffering.

Note that the ostringstream and stringbuf versions run fewer iterations because they are so much slower.

On ideone, the ostringstream is about 3 times slower than std::copy + back_inserter + std::vector, and about 15 times slower than memcpy into a raw buffer. This feels consistent with before-and-after profiling when I switched my real application to custom buffering.

These are all in-memory buffers, so the slowness of iostreams can't be blamed on slow disk I/O, too much flushing, synchronization with stdio, or any of the other things people use to excuse observed slowness of the C++ standard library iostream.

It would be nice to see benchmarks on other systems, along with commentary on what common implementations do (such as gcc's libstdc++, Visual C++, Intel C++) and how much of the overhead is mandated by the standard.

Rationale for this test

A number of people have correctly pointed out that iostreams are more commonly used for formatted output. However, they are also the only modern API provided by the C++ standard for binary file access. But the real reason for doing performance tests on the internal buffering applies to the typical formatted I/O: if iostreams can't keep the disk controller supplied with raw data, how can they possibly keep up when they are responsible for formatting as well?

Benchmark Timing

All times below are per iteration of the outer (k) loop.

On ideone (gcc-4.3.4, unknown OS and hardware):

  • ostringstream: 53 ms
  • stringbuf: 27 ms
  • vector<char> and back_inserter: 17.6 ms
  • vector<char> with ordinary iterator: 10.6 ms
  • vector<char> iterator and bounds check: 11.4 ms
  • char[]: 3.7 ms

On my laptop (Visual C++ 2010 x86, cl /Ox /EHsc, Windows 7 Ultimate 64-bit, Intel Core i7, 8 GB RAM):

  • ostringstream: 73.4 ms, 71.6 ms
  • stringbuf: 21.7 ms, 21.3 ms
  • vector<char> and back_inserter: 34.6 ms, 34.4 ms
  • vector<char> with ordinary iterator: 1.10 ms, 1.04 ms
  • vector<char> iterator and bounds check: 1.11 ms, 0.87 ms, 1.12 ms, 0.89 ms, 1.02 ms, 1.14 ms
  • char[]: 1.48 ms, 1.57 ms

Visual C++ 2010 x86, with Profile-Guided Optimization cl /Ox /EHsc /GL /c, link /ltcg:pgi, run, link /ltcg:pgo, measure:

  • ostringstream: 61.2 ms, 60.5 ms
  • vector<char> with ordinary iterator: 1.04 ms, 1.03 ms

Same laptop, same OS, using Cygwin gcc 4.3.4, g++ -O3:

  • ostringstream: 62.7 ms, 60.5 ms
  • stringbuf: 44.4 ms, 44.5 ms
  • vector<char> and back_inserter: 13.5 ms, 13.6 ms
  • vector<char> with ordinary iterator: 4.1 ms, 3.9 ms
  • vector<char> iterator and bounds check: 4.0 ms, 4.0 ms
  • char[]: 3.57 ms, 3.75 ms

Same laptop, Visual C++ 2008 SP1, cl /Ox /EHsc:

  • ostringstream: 88.7 ms, 87.6 ms
  • stringbuf: 23.3 ms, 23.4 ms
  • vector<char> and back_inserter: 26.1 ms, 24.5 ms
  • vector<char> with ordinary iterator: 3.13 ms, 2.48 ms
  • vector<char> iterator and bounds check: 2.97 ms, 2.53 ms
  • char[]: 1.52 ms, 1.25 ms

Same laptop, Visual C++ 2010 64-bit compiler:

  • ostringstream: 48.6 ms, 45.0 ms
  • stringbuf: 16.2 ms, 16.0 ms
  • vector<char> and back_inserter: 26.3 ms, 26.5 ms
  • vector<char> with ordinary iterator: 0.87 ms, 0.89 ms
  • vector<char> iterator and bounds check: 0.99 ms, 0.99 ms
  • char[]: 1.25 ms, 1.24 ms

EDIT: Ran all twice to see how consistent the results were. Pretty consistent IMO.

NOTE: On my laptop, since I can spare more CPU time than ideone allows, I set the number of iterations to 1000 for all methods. This means that ostringstream and vector reallocation, which takes place only on the first pass, should have little impact on the final results.

EDIT: Oops, I found a bug in the vector-with-ordinary-iterator version: the iterator wasn't being advanced, so there were too many cache hits. I had been wondering how vector<char> was outperforming char[]. It didn't make much difference, though; vector<char> is still faster than char[] under VC++ 2010.

Conclusions

Buffering of output streams requires three steps each time data is appended:

  • Check that the incoming block fits the available buffer space.
  • Copy the incoming block.
  • Update the end-of-data pointer.

The latest code snippet I posted, "vector<char> simple iterator plus bounds check", not only does this but also allocates additional space and moves the existing data when the incoming block doesn't fit. As Clifford pointed out, buffering in a file I/O class wouldn't have to do that; it would just flush the current buffer and reuse it. So this should be an upper bound on the cost of buffering output, and it's exactly what is needed to make a working in-memory buffer.
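
For concreteness, here is a minimal sketch of those three steps, plus the grow-and-move case the bounds-checked vector variant has to handle. This is my own illustration, not the original ideone snippet; the class and member names are invented for the example.

    #include <algorithm>
    #include <cstring>
    #include <vector>

    // Hypothetical append-only in-memory buffer illustrating the three steps:
    // check for space, copy the block, advance the end-of-data marker.
    class AppendBuffer {
    public:
        void append(const char* data, std::size_t n) {
            // 1. Check that the incoming block fits; if not, grow (which may
            //    move the existing data, the extra cost noted above).
            if (used_ + n > storage_.size())
                storage_.resize(std::max(storage_.size() * 2, used_ + n));
            // 2. Copy the incoming block.
            std::memcpy(storage_.data() + used_, data, n);
            // 3. Update the end-of-data position.
            used_ += n;
        }
        std::size_t size() const { return used_; }
    private:
        std::vector<char> storage_;
        std::size_t used_ = 0;
    };

    int main() {
        AppendBuffer buf;
        for (int i = 0; i < 1000000; ++i)
            buf.append(reinterpret_cast<const char*>(&i), sizeof i);
        return buf.size() == 1000000 * sizeof(int) ? 0 : 1;
    }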

So why is stringbuf 2.5x slower on ideone, and at least 10 times slower when I test it? It isn't being used polymorphically in this simple micro-benchmark, so that doesn't explain it.

Desalvo answered 2/12, 2010 at 21:57 Comment(26)
What's your compiler? I think we can all agree that not all C++ compilers are born equal... Also, what options did you pass to it?Thief
You're writing a million characters one-at-a-time, and wondering why it's slower than copying to a preallocated buffer?Axon
@Anon: I'm buffering four million bytes four-at-a-time, and yes I'm wondering why that's slow. If std::ostringstream isn't smart enough to exponentially increase its buffer size the way std::vector does, that's (A) stupid and (B) something people thinking about I/O performance should think about. Anyway, the buffer gets reused, it doesn't get reallocated every time. And std::vector is also using a dynamically growing buffer. I'm trying to be fair here.Desalvo
ostream::write() does not need to reallocate and move data like ostringstream::write() does, so if it is disk I/O you are concerned about why would you not test that? The kind of buffer management taking place here does not occur with a disk write.Infante
@Ben Voigt - Yeah, I'd assume as much.Sedge
Well, I'll have to agree with your premise in that iostream isn't designed to be fast. Instead, it's designed to be flexible. streambuf on the other hand was designed to be closer to the metal and more performant. However, I think your tests are a bit unfair.Gowen
@Clifford: There ought to be buffer management going on with disk writes. I may add another benchmark that does disk I/O, it's just that the non-iostream version won't be as portable.Desalvo
What task are you actually trying to benchmark? If you're not using any of the formatting features of ostringstream and you want as fast performance as possible then you should consider going straight to stringbuf. The ostream classes are supposed to tie together locale-aware formatting functionality with a flexible buffer choice (file, string, etc.) through rdbuf() and its virtual function interface. If you're not doing any formatting then that extra level of indirection is certainly going to look proportionally expensive compared with other approaches.Van
+1 for truth, OP. We've gotten order-of-magnitude speed-ups by moving from ofstream to fprintf when outputting logging info involving doubles. MSVC 2008 on WinXP SP3. iostreams is just dog slow.Statist
Here is a test on the committee site: open-std.org/jtc1/sc22/wg21/docs/D_5.cppBiathlon
@Ben: yes buffer management, but not dynamic buffer resizing, the data is streamed to the disk not to memory.Infante
@Clifford: I was consciously trying to minimize the effect of buffer resizing, it should grow to its maximum size on the first iteration and then the other 999 should reuse the same storage without needing any further allocations.Desalvo
@Alex: VS: Sampling has to be uncorrelated with what the program is doing. Sampling triggered by a system call event will not show percent of time in system calls, because the samples are not random. CodeAnalyst: Doesn't sample the stack on wall-clock time, only the IP. People seem to think just sort of any old sampling or timing, whatever it is, is good enough. It's not. Here's more: https://mcmap.net/q/15226/-alternatives-to-gprof-closed/…Reardon
@Mike, actually, CodeAnalyst does sample the stack if you ask it to. See "Call stack sampling" in its docs.Casual
@Alex: quoting: 2.7.3.3. Enable Call Stack Sampling (CSS) NOTE: This feature requires a specialized OProfile daemon and OProfile kernel module not available publicly at the moment. Call Stack Depth - specify the maximum depth of call stack unwinding. SO, it's hard to tell from the doc alone if you can make it get random wall-time stack samples and get percent-by-line summary. It seems to be based on OProfile, and I'm told OProfile can do the right thing, so maybe CodeAnalyst can do it after all. If so, I stand corrected. (Note: limited stack depth is a bad limitation.)Reardon
Note that an interesting discussion related to this can be found at bytes.com/topic/c/answers/…Portwin
Also groups.google.com/group/comp.lang.c++/browse_frm/thread/…Portwin
@beldaz: Which just makes it even more embarrassing that, almost a decade after these problems were well documented, absolutely no progress has been made. If anything, the performance gap between stdio and iostreams has gotten bigger since then.Desalvo
Dietmar Kühl (the guy who wrote most of the streams part of Josuttis' std lib book) said that C++' IO streams, "knowing" the types of the objects they operate on, should be much faster than C's IO, and that the fact that they aren't in all known implementations, is due to sloppiness of the vendors implementing them. He used to have an implementation of IO streams that he claimed were very fast. Unfortunately, it seems this got lost and now the only thing I can find of him is a defunct home page with an old (2002) implementation of the std lib: dietmar-kuehl.de/cxxrt.Stockbroker
Anyway, if Dietmar is still doing C++, he would be the person to ask about this. He and James Kanze (who appeared here on SO just two months ago) were who you hoped would answer your streams question in comp.lang.c++.moderated a decade ago, because their answers usually turned out to be definitive.Stockbroker
Awesome! finally someone has worked out that iostream really is very slow. That's a big reason I generally don't use it.Ataman
These comments need cleaning up. Anyway, none of the ideone links are working for me at the moment.Jointress
@BenVoigt do you still have the code snippets? I'd suggest either editing them into the post, or maybe putting them in a new post that will be closed but can still be linked to, or something.Jointress
@MattMcNabb: Yes I'm aware, ideone's policy is to store code "Forever." but they don't honor it.Desalvo
@BenVoigt, do you still have the snippets?Postcard
As the last time someone asked this was in 2015, here's a reminder, if you still have the snippets ;)Eoin

Not answering the specifics of your question so much as the title: the 2006 Technical Report on C++ Performance has an interesting section on IOStreams (p.68). Most relevant to your question is in Section 6.1.2 ("Execution Speed"):

Since certain aspects of IOStreams processing are distributed over multiple facets, it appears that the Standard mandates an inefficient implementation. But this is not the case — by using some form of preprocessing, much of the work can be avoided. With a slightly smarter linker than is typically used, it is possible to remove some of these inefficiencies. This is discussed in §6.2.3 and §6.2.5.

Since the report was written in 2006 one would hope that many of the recommendations would have been incorporated into current compilers, but perhaps this is not the case.

As you mention, facets may not feature in write() (but I wouldn't assume that blindly). So what does feature? Running GProf on your ostringstream code compiled with GCC gives the following breakdown:

  • 44.23% in std::basic_streambuf<char>::xsputn(char const*, int)
  • 34.62% in std::ostream::write(char const*, int)
  • 12.50% in main
  • 6.73% in std::ostream::sentry::sentry(std::ostream&)
  • 0.96% in std::string::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int)
  • 0.96% in std::basic_ostringstream<char>::basic_ostringstream(std::_Ios_Openmode)
  • 0.00% in std::fpos<int>::fpos(long long)

So the bulk of the time is spent in xsputn, which eventually calls std::copy() after lots of checking and updating of cursor positions and buffers (have a look in c++\bits\streambuf.tcc for the details).

My take on this is that you've focused on the worst-case situation. All the checking that is performed would be a small fraction of the total work done if you were dealing with reasonably large chunks of data. But your code is shifting data four bytes at a time, and incurring all the extra costs each time. Clearly one would avoid doing so in a real-life situation - consider how negligible the penalty would have been if write was called once on an array of 1m ints instead of 1m times on one int. And in a real-life situation one would really appreciate the important features of IOStreams, namely its memory-safe and type-safe design. Such benefits come at a price, and you've written a test which makes these costs dominate the execution time.
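
To make that contrast concrete, here is roughly what the two usage patterns look like. This is my own illustration, not code from the question or the answer; the timing harness is omitted.

    #include <sstream>
    #include <vector>

    int main() {
        const std::size_t N = 1000000;
        std::vector<int> values(N, 42);
        std::ostringstream per_int, bulk;

        // One million calls: each one pays for the sentry, the virtual
        // dispatch into the streambuf, and the fit/copy/advance bookkeeping.
        for (std::size_t i = 0; i < N; ++i)
            per_int.write(reinterpret_cast<const char*>(&values[i]), sizeof(int));

        // One call: the same per-call overhead is paid exactly once and
        // amortized over all four million bytes.
        bulk.write(reinterpret_cast<const char*>(values.data()),
                   static_cast<std::streamsize>(N * sizeof(int)));
    }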

Portwin answered 2/12, 2010 at 23:40 Comment(7)
Sounds like great information for a future question on performance of formatted insertion/extraction of iostreams which I'll probably ask soon. But I don't believe there are any facets involved with ostream::write().Desalvo
+1 for profiling (that's a Linux machine I presume?). However, I'm actually adding four bytes at a time (actually sizeof i, but all compilers I'm testing with have 4-byte int). And that doesn't seem all that unrealistic to me, what size chunks do you think get passed in each call to xsputn in typical code like stream << "VAR: " << var.x << ", " << var.y << endl;.Desalvo
@Ben: Window XP, MinGW GCC 4.4.0. I take your point about shifting 4 bytes each time, but the 1m loop still dominates, whereas your typical code example only calls xsputn twice (on the var members, maybe 5 times all up). Clearly writing to a stream is best done with the minimum number of calls, and an optimized version of your test would be to chunk your data up and then write out each chunk.Portwin
@beldaz: That "typical" code example that only calls xsputn five times could very well be inside a loop that writes a 10 million line file. Passing data to iostreams in large chunks is a lot less of a real-life scenario than my benchmark code. Why should I have to write to a buffered stream with the minimum number of calls? If I have to do my own buffering, what's the point of iostreams anyway? And with binary data, I have the option to buffer it myself, when writing millions of numbers to a text file, the bulk option just doesn't exist, I HAVE to call operator << for each one.Desalvo
@Ben: Yes, your point is valid, but once you start creating such large files won't I/O start to dominate? (I don't know the answer to that). What I'm getting at is that the comparatively high cost of write is fixed, and so only comes to dominate when called frequently on small pieces of data. If anything I think this identifies a limitation of IOStreams, rather than condemning IOStreams to be uniformly poor in performance (implied by the title).Portwin
@beldaz: One can estimate when I/O starts to dominate with a simple calculation. At a 90 MB/s average write rate which is typical of current consumer grade hard disks, flushing the 4MB buffer takes <45ms (throughput, latency is unimportant because of OS write cache). If running the inner loop takes longer than that to fill the buffer, then the CPU will be the limiting factor. If the inner loop runs faster, then I/O will be the limiting factor, or at least there's some CPU time left over to do the real work.Desalvo
Of course, that doesn't mean that using iostreams necessarily means a slow program. If I/O is a very small part of the program, then using an I/O library with poor performance isn't going to have much overall impact. But not being called often enough to matter isn't the same as good performance, and in I/O heavy applications, it does matter.Desalvo

I'm rather disappointed in the Visual Studio users out there, who rather had a gimme on this one:

  • In the Visual Studio implementation of ostream, the sentry object (which is required by the standard) enters a critical section protecting the streambuf (which is not required). This doesn't seem to be optional, so you pay the cost of thread synchronization even for a local stream used by a single thread, which has no need for synchronization.

This hurts code that uses ostringstream to format messages pretty severely. Using the stringbuf directly avoids the use of sentry, but the formatted insertion operators can't work directly on streambufs. For Visual C++ 2010, the critical section is slowing down ostringstream::write by a factor of three vs the underlying stringbuf::sputn call.

Looking at beldaz's profiler data on newlib, it seems clear that gcc's sentry doesn't do anything crazy like this. ostringstream::write under gcc only takes about 50% longer than stringbuf::sputn, but stringbuf itself is much slower than under VC++. And both still compare very unfavorably to using a vector<char> for I/O buffering, although not by the same margin as under VC++.
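
For reference, the two code paths being compared look roughly like this. This is a reconstruction for illustration (the original ideone snippets are no longer reachable), with the timing loop simplified:

    #include <sstream>

    int main() {
        const int value = 42;
        const char* bytes = reinterpret_cast<const char*>(&value);

        // Goes through ostream::write(): a sentry is constructed on every
        // call, and in the VC++ implementation described above the sentry
        // also enters a critical section.
        std::ostringstream oss;
        for (int i = 0; i < 1000000; ++i)
            oss.write(bytes, sizeof value);

        // Talks to the buffer directly: sputn() forwards to the protected
        // virtual xsputn() with no sentry (and hence no lock) involved.
        std::stringbuf buf;
        for (int i = 0; i < 1000000; ++i)
            buf.sputn(bytes, sizeof value);
    }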

Desalvo answered 2/12, 2010 at 21:57 Comment(2)
Is this information still up to date? AFAIK, C++11 implementation shipped with GCC performs this 'crazy' lock. Certainly, VS2010 still does it too. Could anyone clarify this behaviour and if 'which is not required' still holds in C++11?Lyra
@mloskot: I see no thread-safety requirement on sentry... "The class sentry defines a class that is responsible for doing exception safe prefix and suffix operations." and a note "The sentry constructor and destructor can also perform additional implementation-dependent operations." One can also surmise from the C++ principle of "you don't pay for what you don't use" that the C++ committee would never approve such a wasteful requirement. But feel free to ask a question about iostream thread safety.Desalvo

The problem you see is all in the overhead around each call to write(). Each level of abstraction that you add (char[] -> vector -> string -> ostringstream) adds a few more function calls/returns and other housekeeping guff that, if you call it a million times, adds up.

I modified two of the examples on ideone to write ten ints at a time. The ostringstream time went from 53 ms to 6 ms (almost a 10x improvement) while the char loop improved from 3.7 ms to 1.5 ms - useful, but only by a factor of two.

If you're that concerned about performance then you need to choose the right tool for the job. ostringstream is useful and flexible, but there's a penalty for using it the way you're trying to. char[] is harder work, but the performance gains can be great (remember that gcc will probably inline the memcpy calls for you as well).

In short, ostringstream isn't broken, but the closer you get to the metal the faster your code will run. Assembler still has advantages for some folk.
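
The ten-at-a-time change amounts to something like the following. This is my reconstruction rather than the exact ideone edit; the values written are placeholders.

    #include <sstream>

    int main() {
        std::ostringstream os;
        int chunk[10];
        for (int i = 0; i < 1000000; i += 10) {
            for (int j = 0; j < 10; ++j)
                chunk[j] = i + j;   // placeholder values
            // One write() call per ten ints instead of one per int, so the
            // per-call overhead is paid a tenth as often.
            os.write(reinterpret_cast<const char*>(chunk), sizeof chunk);
        }
    }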

Audrit answered 2/12, 2010 at 22:42 Comment(7)
What does ostringstream::write() have to do that vector::push_back() doesn't? If anything, it should be faster since it's handed a block instead of four individual elements. If ostringstream is slower than std::vector without providing any additional features, then yeah I would call that broken.Desalvo
@Ben Voigt: On the contrary, it's something vector has to do that ostringstream DOESN'T have to do that makes vector more performant in this case. Vector is guaranteed to be contiguous in memory, while ostringstream is not. Vector is one of the classes designed to be performant, while ostringstream is not.Gowen
@Dragontamer5788: That's true, insofar as ostringstream has to support ostream::rdbuf(streambuf*) and therefore can't assume that the internal buffer actually is a stringbuf. I'm going to run some performance tests using stringbuf directly, which will also get rid of virtual calls that prevent inlining (I tried using PGO to let the compiler inline anyway, it doesn't seem to have worked).Desalvo
@Ben Voigt: Using stringbuf directly is not going to remove all function calls, as stringbuf's public interface consists of public non-virtual functions in the base class which then dispatch to protected virtual functions in the derived class.Van
@Charles: On any decent compiler it should, since the public function call will get inlined into a context where the dynamic type is known to the compiler, it can remove the indirection and even inline those calls.Desalvo
@Ben, a lot of the calls will be across compilation unit (stream/string/etc) boundaries, so you're going to need the linker (not the compiler) to do the inlining/optimization. Not sure how well they realistically do that yet...Audrit
@Roddy: I should think that this is all inline template code, visible in every compilation unit. But I guess that could vary by implementation. For certain I would expect the call under discussion, the public sputn function which calls the virtual protected xsputn, to be inlined. Even if xsputn isn't inlined, the compiler can, while inlining sputn, determine the exact xsputn override needed and generate a direct call without going through the vtable.Desalvo

To get better performance you have to understand how the containers you are using work. In your char[] array example, the array of the required size is allocated in advance. In your vector and ostringstream example you are forcing the objects to repeatedly allocate and reallocate and possibly copy data many times as the object grows.

With std::vector this is easily resolved by initialising the size of the vector to the final size, as you did with the char array; instead you rather unfairly cripple the performance by resizing to zero! That is hardly a fair comparison.
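
For the vector case, the preallocation being suggested looks like this (my own illustration, with sizes assumed from the benchmark description):

    #include <vector>

    int main() {
        const int N = 1000000;
        const int value = 42;
        const char* bytes = reinterpret_cast<const char*>(&value);

        // Reserve the final size up front, as the char[] version effectively
        // does; the loop then never triggers a reallocation or a copy of the
        // existing contents.
        std::vector<char> buf;
        buf.reserve(N * sizeof value);
        for (int i = 0; i < N; ++i)
            buf.insert(buf.end(), bytes, bytes + sizeof value);
    }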

With respect to ostringstream, preallocating the space is not possible; I would suggest that it is an inappropriate use. The class has far greater utility than a simple char array, but if you don't need that utility then don't use it, because you will pay the overhead in any case. Instead it should be used for what it is good for - formatting data into a string. C++ provides a wide range of containers, and an ostringstream is amongst the least appropriate for this purpose.

In the case of the vector and ostringstream you get protection from buffer overrun; you don't get that with a char array, and that protection does not come for free.

Infante answered 2/12, 2010 at 22:43 Comment(13)
Allocation doesn't seem to be the issue for ostringstream. He just seeks back to zero for subsequent iterations. No truncation. Also I tried ostringstream.str.reserve(4000000) and it made no difference.Audrit
I think with ostringstream, you could "reserve" by passing in a dummy string, i.e.: ostringstream str(string(1000000 * sizeof(int), '\0')); With vector, the resize doesn't deallocate any space, it only expands if it needs to.Inarticulate
"vector .. protection from buffer overrun". A common misconception - vector[] operator is typically NOT checked for bounds errors by default. vector.at() is however.Audrit
vector<T>::resize(0) doesn't usually reallocate the memorySedge
@Roddy: Not using operator[], but push_back() (by way of back_inserter), which definitely DOES test for overflow. Added another version that doesn't use push_back.Desalvo
@Nim: Good point, but on a resize the destructor is called for each dropped object. In this case that has no effect, but in general may be a consideration. It is part of the 'value added' functionality that makes std::vector useful, but not necessarily appropriate in all cases.Infante
@Roddy: Well fgets() is not overrun safe if you give it the wrong length! std::vector provides intrinsic features enabling 'safe' code. For a plain array, you always have to code the safety yourself.Infante
@Clifford: Yes, vectors are much safer than arrays, BUT many folk assume that all accesses are bounds checked, like Pascal arrays. Sadly, IMO, they aren't...Audrit
@Roddy: One of the design principles of C++ is "you don't pay for what you don't use". This precludes bounds checking for access to vectors in an optimized build. Thankfully, many library vendors provide an alternate implementation that is checked, for use during debugging. It would probably be even better to make that available in even optimized builds under a different name, such as checked_vector, for use with tainted data. But imposing bounds checking on all vectors is against the spirit of C++.Desalvo
@Ben, checked_vector would be great. Using 'complete' STL debugging is totally OTT for production code (consider, every container maintains a list of all currently valid iterators, to detect when an invalid iterator is used) and also (IIRC) changes some function signatures so you can't easily switch between debug/non-debug builds. my opinion (just that - but based in part on many years developing embedded real-time systems in Pascal) is that vector bounds-checking should typically be left enabled in 99% of code, even in shipped, production systems. The benefits are significant.Audrit
@Roddy: iterator debugging is a little more efficient than that, it just consists of a version number stored in both container and iterator, the iterator is invalid if the versions don't match. If optimizing compilers get smart enough to optimize away bounds checks in safe contexts such as this usage pattern for ( auto it = c.begin(); it != c.end(); ++it ) then having checking everywhere wouldn't be such a problem. But bounds checking isn't sufficient to assure safety, due to iterator invalidation... C++0x lambdas ought to help, because iteration will be safely done by a tested algorithm.Desalvo
@Ben: iterator/container version numbering. Very neat idea, shame my STL (dinkumware) doesn't use it :-( #2731919Audrit
@Roddy: You're absolutely right, VC2010 _HAS_ITERATOR_DEBUGGING does in fact maintain a list of iterators. I guess it's changed since iterator debugging first appeared, some of my code took a huge performance hit following the compiler upgrade, and I stepped through to find out what was so insanely slow.Desalvo
