How to get IOStream to perform better?

Most C++ users who learned C first prefer to use the printf / scanf family of functions, even when they're coding in C++.

Although I admit that I find the interface way better (especially POSIX-like format and localization), it seems that an overwhelming concern is performance.

Taking a look at this question:

How can I speed up line by line reading of a file

It seems that the best answer is to use fscanf and that the C++ ifstream is consistently 2-3 times slower.

I thought it would be great if we could compile a repository of "tips" to improve IOStreams performance, what works, what does not.

Points to consider

  • buffering (rdbuf()->pubsetbuf(buffer, size))
  • synchronization (std::ios_base::sync_with_stdio)
  • locale handling (Could we use a trimmed-down locale, or remove it altogether?)

Of course, other approaches are welcome.

Note: a "new" implementation, by Dietmar Kuhl, was mentioned, but I was unable to locate many details about it. Previous references seem to be dead links.

Yaker answered 2/3, 2011 at 10:39 Comment(19)
I'm making this an FAQ question. Feel free to revert if you think this is wrong.Jittery
@Matthieu: Dietmar once said that his work got abandoned, though I can't find where. (In general, you need to search the newsgroups to find this stuff. comp.lang.c++.moderated was where all the interesting C++ discussions took place in the 90s.)Jittery
Is this factor also true for g++? I seem to remember that there has been work in the GNU stdlib implementation in order to remove the unneeded performance hit. (I rarely do performance-sensitive formatted IO, so I don't know).Wickiup
@sbi, I'm pretty sure he stopped working on it. The issue recently resurfaced on clc++m and he did participate.Wickiup
@Wickiup The performance difference is essentially an urban legend, fed by two facts: (1) Legacy implementations of the C++ stdlib were slower. (2) Many people don’t know about std::ios_base::sync_with_stdio.Polity
@AProgrammer: I only observed a 17% performance hit using gcc 3.4.2 on unix, after increasing the buffer size.Yaker
@Matthieu, thanks for the data point.Wickiup
@AProgrammer: I've provided the code I used for benchmarking (in full); I am interested in results on other platforms if you have the occasion. From my measurements it seems the default behavior on gcc/unix is already good to go, and no extra tuning is necessary.Yaker
@Konrad: If I debug into Dinkumware's streams implementation (one of the most widely distributed ones) of the input operators, I will ultimately arrive at scanf(). Of course, since this is sharing all the disadvantages of scanf(), and adding a few layers on top, this stream implementation will, ultimately, be slower. And I'm not talking disk IO here, but pure parsing. In theory, streams might even be faster than printf()/scanf(), but I've yet to encounter such an implementation in the wild.Jittery
@AProgrammer: My comment was misleading. Yes, he stopped work on that many years ago. What I couldn't find was a posting of him where he explained why his work never got adopted.Jittery
@sbi: the same problem occurs regularly in C++, I've found. Normally template programming could move checks from runtime to compile-time, but most of the time the C++ lib is a thin wrapper around the C one, which performs all checking at runtime anyway...Yaker
Matthieu, I used your code as is, reduced the iterations to 1, used a large data file, and using "time" I see a 2x-3x difference between your C++ test and C test.Ezequiel
@sbi: do you still have his work around? I could not even find archives of it, and his website seems to have been moved / shut down.Yaker
@sbi, Here is the message I was thinking of: groups.google.com/group/comp.lang.c++.moderated/msg/…Wickiup
@Matthieu, the link in the message I referenced above is alive here.Wickiup
@Matthieu: It wasn't a workaround, but a full-blown streams implementation, which he claimed (I never tried it) to be faster than C IO. Google found it at dietmar-kuehl.de/cxxrt. However, most of the source files are timestamped 2002, some 2003, so it really is outdated.Jittery
@AProgrammer: That's not the message I was looking for, but it's pretty much the content I wanted. Thanks for posting it!Jittery
@sbi: I didn't say workaround but "work" "around", which can be translated as "production" "somewhere"; thanks for the link, I'll put it in my "things" to read :)Yaker
@Matthieu: Ah, sorry for misunderstanding this.Jittery

Here is what I have gathered so far:

Buffering:

If the default buffer is very small, increasing the buffer size can definitely improve performance:

  • it reduces the number of HDD hits
  • it reduces the number of system calls

The buffer can be set by accessing the underlying streambuf implementation:

char Buffer[N];

std::ifstream file("file.txt");

file.rdbuf()->pubsetbuf(Buffer, N);
// the pointer returned by rdbuf() is guaranteed
// to be non-null after successful construction

Warning, courtesy of @iavr: according to cppreference, it is best to call pubsetbuf before opening the file; otherwise, the various standard library implementations behave differently.
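
For illustration, here is a minimal sketch of that advice (a hypothetical example, assuming an implementation such as libstdc++ that ignores pubsetbuf once the file is open): the buffer is installed first, and only then is the file opened.

#include <fstream>
#include <iostream>
#include <string>

int main()
{
  const int N = 1 << 16;               // 64 KiB, larger than the typical default
  static char buffer[N];

  std::ifstream file;                  // default-construct: nothing is opened yet
  file.rdbuf()->pubsetbuf(buffer, N);  // install the buffer first...
  file.open("file.txt");               // ...then open the file

  std::string word;
  long count = 0;
  while (file >> word) { ++count; }
  std::cout << count << " words\n";
}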

Locale Handling:

Locales can perform character conversion, filtering, and cleverer tricks where numbers or dates are involved. They go through a complex system of dynamic dispatch and virtual calls, so removing them can help trim down the penalty.

The default C locale is designed to perform no conversion and to be uniform across machines. It's a good default to use.
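
As a small sketch (independent of the benchmark below), a stream can be forced to use the classic "C" locale, whatever the global locale is, by imbuing it explicitly:

#include <fstream>
#include <locale>

int main()
{
  std::ifstream file("file.txt");
  // std::locale::classic() is the "C" locale: no character conversion,
  // no digit grouping, identical behavior on every machine.
  file.imbue(std::locale::classic());
  // ... read from the file as usual ...
}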

Synchronization:

I could not see any performance improvement using this facility.

The setting is global (a static member of std::ios_base) and is toggled with the sync_with_stdio static function.
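
For reference, a minimal sketch; the setting should be changed before any I/O on the standard streams, and the call returns the previous value:

#include <iostream>

int main()
{
  // false decouples the C++ standard streams from the C stdio streams;
  // the previous setting (true by default) is returned.
  const bool wasSynced = std::ios_base::sync_with_stdio(false);
  std::cout << std::boolalpha << "was synced: " << wasSynced << '\n';
}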

Measurements:

Playing with this, I put together a simple program, compiled with gcc 3.4.2 on SUSE 10p3 with -O2.

C : 7.76532e+06
C++: 1.0874e+07

This represents a slowdown of roughly 40%... for the default code. Indeed, tampering with the buffer (in either C or C++) or with the synchronization parameters (C++) did not yield any improvement.

Results by others:

@Irfy on g++ 4.7.2-2ubuntu1, -O3, virtualized Ubuntu 11.10, 3.5.0-25-generic, x86_64, enough RAM/CPU, a 196 MB file built from several "find / >> largefile.txt" runs

C : 634572
C++: 473222

C++ 25% faster

@Matteo Italia on g++ 4.4.5, -O3, Ubuntu Linux 10.10 x86_64 with a random 180 MB file

C : 910390
C++: 776016

C++ 17% faster

@Bogatyr on g++ i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664), Mac mini, 4 GB RAM, idle except for this test, with a 168 MB data file

C : 4.34151e+06
C++: 9.14476e+06

C++ 111% slower

@Asu on clang++ 3.8.0-2ubuntu4, Kubuntu 16.04, Linux 4.8-rc3, 8 GB RAM, i5 Haswell, Crucial SSD, 88 MB data file (tar.xz archive)

C : 270895
C++: 162799

C++ 66% faster

So the answer is: it's a quality of implementation issue, and really depends on the platform :/

Here is the code in full, for those interested in benchmarking:

#include <fstream>
#include <iostream>
#include <iomanip>
#include <string>

#include <cmath>
#include <cstdio>
#include <cstdlib>   // atoi
#include <cstring>   // strcmp
#include <clocale>   // setlocale

#include <sys/time.h>

template <typename Func>
double benchmark(Func f, size_t iterations)
{
  f();

  timeval a, b;
  gettimeofday(&a, 0);
  for (; iterations --> 0;)
  {
    f();
  }
  gettimeofday(&b, 0);
  return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
}


struct CRead
{
  CRead(char const* filename): _filename(filename) {}

  void operator()() {
    FILE* file = fopen(_filename, "r");
    if (!file) { return; } // guard against a missing file

    int count = 0;
    while ( fscanf(file,"%s", _buffer) == 1 ) { ++count; }

    fclose(file);
  }

  char const* _filename;
  char _buffer[1024];
};

struct CppRead
{
  CppRead(char const* filename): _filename(filename), _buffer() {}

  enum { BufferSize = 16184 };

  void operator()() {
    std::ifstream file(_filename, std::ifstream::in);

    // comment out to remove the extended buffer
    file.rdbuf()->pubsetbuf(_buffer, BufferSize);

    int count = 0;
    std::string s;
    while ( file >> s ) { ++count; }
  }

  char const* _filename;
  char _buffer[BufferSize];
};


int main(int argc, char* argv[])
{
  size_t iterations = 1;
  if (argc > 1) { iterations = atoi(argv[1]); }

  char const* oldLocale = setlocale(LC_ALL,"C");
  if (strcmp(oldLocale, "C") != 0) {
    std::cout << "Replaced old locale '" << oldLocale << "' by 'C'\n";
  }

  char const* filename = "largefile.txt";

  CRead cread(filename);
  CppRead cppread(filename);

  // comment out to use the default setting
  bool oldSyncSetting = std::ios_base::sync_with_stdio(false);

  double ctime = benchmark(cread, iterations);
  double cpptime = benchmark(cppread, iterations);

  // comment out if oldSyncSetting's declaration is commented out
  std::ios_base::sync_with_stdio(oldSyncSetting);

  std::cout << "C  : " << ctime << "\n"
               "C++: " << cpptime << "\n";

  return 0;
}
Yaker answered 2/3, 2011 at 13:52 Comment(21)
Actually I found out that C++ is faster (g++ 4.4.5, -O3, Ubuntu Linux 10.10 x86_64): with a random 180 MB file I got C: 910390 C++: 776016.Broadcaster
@Matteo: Ah that's great. I need to try with g++4.3.2 as well.Yaker
The question that led to this one has nothing to do with preference; it has to do with concrete measurements of "typical" case input processing. Your benchmark is not really interesting, since it doesn't reflect a real-world case. Instead, why don't you write a shell script that runs your program through 1 iteration on a set of large files, and measure the aggregate wallclock time.Ezequiel
And second, you need to break up the runs: one run for the C case, one run for the C++ case, not putting them both together in the same executable.Ezequiel
OK I ran your code as is, with the results (3 iterations): C : 4.34151e+06 C++: 9.14476e+06, g++ i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664), mac mini, 4GB ram, idle except for this test. My data file is 168MBEzequiel
@Ezequiel gettimeofday is, if anything, more precise than time. Furthermore, this is a good approximation of a real-world case: reading data. After all, we don’t want to measure other things, only the reading of data. So this benchmark is good. And putting both codes in the same executable is perfectly fine, too. Just make sure that enough iterations of the benchmark are run to offset warming-up slowdowns (or run it once at the beginning, which Matthieu does). This benchmark is much superior to your suggested “improvements”.Polity
@Konrad At one iteration, it's OK, if the file is of a certain size. And my interest in this subject comes from a case where my "improvements" were the scenario -- algorithm competitions, where you have a very limited time to read in different, large-ish data sets, not the same data set over and over again. The fact is, on that site at least on that day, "cin >> s" lost severely to "scanf". On my mac mini with the stated g++, scanf wins big, too. However, on my ubuntu linux vmware on a windows 7 laptop with 4.4.1 "cin >> s" beats "scanf". So go figure, I'll agree it "does depend."Ezequiel
@Bogatyr: I suspect that the difference is due to the improvements in g++; I don't see changes in the iostream implementation between g++ 4.2 and 4.4, but I notice that they improved many things in the optimizer, especially regarding inlining; with all the layers involved in iostream I think that changes to the inlining algorithms can really make a significant difference.Broadcaster
I just tested on 3 linux machines, compiled with g++ from 4.5.4 to 4.7.2, differences from 25% faster C++ to 40% faster C++.Landmark
The program always runs cread before cppread, and they read the same file. Then the second one will benefit from the disk cache populated by the first one.Discursive
@musiphil: note how benchmark is implemented: there is a first (not timed) dry run to warm up the cache, and only then are the N timed runs performed.Yaker
@musiphil: no complaint from me; it's so easy to have a meaningless benchmark program (because of optimization, cache warmup, ...) that I am grateful for an additional pair of eyes scrutinizing this code.Yaker
@Matthieu Nice work. I was just experimenting with reading a large binary file, and looking for ways to control the buffer. I realized using strace that file.rdbuf()->pubsetbuf() was ignored in my case. Then I saw here that it should be called before opening the file, which you don't do in your benchmark.Winona
@iavr: Interesting, it looks like a limitation of libstdc++. I am mildly annoyed by this, as RAII is all about opening first... Guess once wrapped properly it'll work better.Yaker
but cppreference says that file.rdbuf()->pubsetbuf(Buffer, N); does nothing in the base class - en.cppreference.com/w/cpp/io/basic_streambuf/pubsetbufPortage
@hg_git: Specifically, cppreference mentions that the implementation of std::basic_streambuf::pubsetbuf does nothing, however pubsetbuf is a virtual method and is there specifically so that derived classes can (if they so wish) make it do something useful. It turns out that ifstream will yield a derived version of basic_streambuf which overrides pubsetbuf.Yaker
@MatthieuM. Thank you :) Where can I find out more about basic_streambuf overriding pubsetbuf?Portage
~100 MB file, clang version 3.8.0-2ubuntu4, compiling with -Os: C : 278425, C++: 159543 - 75% improvement! Getting slightly worse results on gcc, speeding up C a bit and slowing down C++ a bit, but by a small margin.Huambo
@Asu: gcc and clang use different C++ standard libraries by default (libstdc++ and libc++ respectively) so this might be the cause of the difference you are observing. Thanks for this datapoint :)Yaker
@MatthieuM. - good point - I tried compiling with clang + libstdc++ and got C : 273557 - C++: 159604, which is actually, surprisingly, even better on the C++ side. g++ : C : 267510 - C++: 172379. Nice to see how clang evolves.Huambo
I actually removed the stdio sync and the buffering and didn't encounter significant performance impact.Huambo

Two more improvements:

1. Issue std::cin.tie(nullptr); before heavy input/output.

Quoting http://en.cppreference.com/w/cpp/io/cin:

Once std::cin is constructed, std::cin.tie() returns &std::cout, and likewise, std::wcin.tie() returns &std::wcout. This means that any formatted input operation on std::cin forces a call to std::cout.flush() if any characters are pending for output.

You can avoid flushing the buffer by untying std::cin from std::cout. This is relevant with multiple mixed calls to std::cin and std::cout. Note that calling std::cin.tie(nullptr); makes the program unsuitable for interactive use, since output may be delayed.

Relevant benchmark:

File test1.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);

  int i;
  while(cin >> i)
    cout << i << '\n';
}

File test2.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);
  cin.tie(nullptr);

  int i;
  while(cin >> i)
    cout << i << '\n';

  cout.flush();
}

Both compiled by g++ -O2 -std=c++11. Compiler version: g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4 (yeah, I know, pretty old).

Benchmark results:

work@mg-K54C ~ $ time ./test1 < test.in > test1.in

real    0m3.140s
user    0m0.581s
sys 0m2.560s
work@mg-K54C ~ $ time ./test2 < test.in > test2.in

real    0m0.234s
user    0m0.234s
sys 0m0.000s

(test.in consists of 1179648 lines, each consisting only of a single 5. It’s 2.4 MB, so sorry for not posting it here.)

I remember solving an algorithmic task where the online judge kept refusing my program without cin.tie(nullptr) but was accepting it with cin.tie(nullptr) or printf/scanf instead of cin/cout.

2. Use '\n' instead of std::endl.

Quoting http://en.cppreference.com/w/cpp/io/manip/endl :

Inserts a newline character into the output sequence os and flushes it as if by calling os.put(os.widen('\n')) followed by os.flush().

You can avoid flushing the buffer by printing '\n' instead of endl.

Relevant benchmark:

File test1.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);

  for(int i = 0; i < 1179648; ++i)
    cout << i << endl;
}

File test2.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);

  for(int i = 0; i < 1179648; ++i)
    cout << i << '\n';
}

Both compiled as above.

Benchmark results:

work@mg-K54C ~ $ time ./test1 > test1.in

real    0m2.946s
user    0m0.404s
sys 0m2.543s
work@mg-K54C ~ $ time ./test2 > test2.in

real    0m0.156s
user    0m0.135s
sys 0m0.020s
Tekla answered 11/2, 2016 at 13:29 Comment(1)
Ah yes, the endl situation is usually well known by aficionados, but so many tutorials use it by default (why????) that it trips up beginner/intermediate programmers regularly. As for tie: I am learning something today! I knew prompting the user would force a flush, but didn't know how it was controlled.Yaker

Interesting that you say C programmers prefer printf when writing C++, as I see a lot of code that is C except for using cout and iostream to write the output.

Users can often get better performance by using filebuf directly (Scott Meyers mentions this in Effective STL), but there is relatively little documentation on using filebuf directly, and most developers prefer std::getline, which is simpler most of the time.
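
As an illustration only (a sketch of the direct-filebuf approach, not the code Meyers gives), reading a file in raw chunks through a std::filebuf bypasses the istream formatting layer entirely:

#include <fstream>
#include <iostream>
#include <vector>

int main()
{
  std::filebuf buf;
  if (!buf.open("file.txt", std::ios::in | std::ios::binary)) {
    std::cerr << "cannot open file\n";
    return 1;
  }

  std::vector<char> chunk(1 << 16);   // read in 64 KiB chunks
  std::streamsize total = 0, n = 0;
  // sgetn() pulls raw characters straight out of the stream buffer,
  // with no sentry, no locale and no formatting involved.
  while ((n = buf.sgetn(&chunk[0], chunk.size())) > 0) {
    total += n;
  }
  std::cout << "read " << total << " bytes\n";
}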

With regard to locales: if you create facets, you will often get better performance by creating a locale once with all your facets, keeping it stored, and imbuing it into each stream you use.
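
A minimal sketch of that idea, using a hypothetical facet (a numpunct that groups digits by thousands): the locale is built once, then imbued into every stream that needs it.

#include <iostream>
#include <locale>
#include <sstream>

// Hypothetical facet: group digits by thousands, separated by ','.
struct GroupedNumpunct : std::numpunct<char>
{
  char do_thousands_sep() const { return ','; }
  std::string do_grouping() const { return "\3"; }
};

int main()
{
  // Build the locale once (the facet is owned and destroyed by the locale)...
  const std::locale grouped(std::locale::classic(), new GroupedNumpunct);

  // ...then imbue it into each stream, instead of rebuilding it every time.
  std::ostringstream a, b;
  a.imbue(grouped);
  b.imbue(grouped);

  a << 1234567;
  b << 7654321;
  std::cout << a.str() << " " << b.str() << "\n";  // 1,234,567 7,654,321
}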

I did see another topic on this here recently, so this is close to being a duplicate.

Chelate answered 2/3, 2011 at 11:29 Comment(2)
If you get better performance by using a file buffer directly, then that means it's the parsing code (for reading, anyway) that's the performance hog, since this is what std::istream wraps the buffer with. Unfortunately, widespread IO stream implementations use printf()/scanf() under the hood, which certainly must be slower than using C std lib IO directly. (Also see my comment to @Konrad on the question.)Jittery
"code that is C other than using cout and iostream" - we call it "C with iostreams" and it is what passes for C++ in many university courses.Correia
