Massive fprintf speed difference without "-std=c99"

Asked 20/12, 2012 at 10:59 Answered 20/12, 2012 at 14:7

Solved c performance locking stdio mingw32

I had been struggling for weeks with a poor-performing translator I had written. On the following simple bechmark

#include<stdio.h>

int main()
{
    int x;
    char buf[2048];
    FILE *test = fopen("test.out", "wb");
    setvbuf(test, buf, _IOFBF, sizeof buf);
    for(x=0;x<1024*1024; x++)
        fprintf(test, "%04d", x);
    fclose(test);
    return 0
}

we see the following result

bash-3.1$ gcc -O2 -static test.c -o test
bash-3.1$ time ./test

real    0m0.334s
user    0m0.015s
sys     0m0.016s

As you can see, the moment the "-std=c99" flag is added in, performance comes crashing down:

bash-3.1$ gcc -O2 -static -std=c99 test.c -o test
bash-3.1$ time ./test

real    0m2.477s
user    0m0.015s
sys     0m0.000s

The compiler I'm using is gcc 4.6.2 mingw32.

The file generated is about 12M, so this is a difference between of about 21MB/s between the two.

Running diff shows the the generated files are identical.

I assumed this has something to do with file locking in fprintf, of which the program makes heavy use, but I haven't been able to find a way to switch that off in the C99 version.

I tried flockfile on the stream I use at the beginning of the program, and an corresponding funlockfile at the end, but was greeted with compiler errors about implicit declarations, and linker errors claiming undefined references to those functions.

Could there be another explanation for this problem, and more importantly, is there any way to use C99 on windows without paying such an enormous performance price?

Edit:

After looking at the code generated by these options, it looks like in the slow versions, mingw sticks in the following:

_fprintf:
LFB0:
    .cfi_startproc
    subl    $28, %esp
    .cfi_def_cfa_offset 32
    leal    40(%esp), %eax
    movl    %eax, 8(%esp)
    movl    36(%esp), %eax
    movl    %eax, 4(%esp)
    movl    32(%esp), %eax
    movl    %eax, (%esp)
    call    ___mingw_vfprintf
    addl    $28, %esp
    .cfi_def_cfa_offset 4
    ret
    .cfi_endproc

In the fast version, this simply does not exist; otherwise, both are exactly the same. I assume __mingw_vfprintf seems to be the slowpoke here, but I have no idea what behavior it needs to emulate that makes it so slow.

Dora answered 20/12, 2012 at 10:59 Comment(8)

You need to isolate the problem down to a complete working example of code you can actually post, having people speculate about what you might have done is not very productive. – Discomfiture 20/12, 2012 at 11:2

Comment out the printf and retest to see if that's actually the source of the difference. – Gothar 20/12, 2012 at 11:5

@Discomfiture I apologize; I was so convinced the problem was in printf I broke the cardinal rule of this site. It didn't even occur to me the problem would be elsewhere. – Dora 20/12, 2012 at 11:17

And the assembler for the fast version? It would be useful for comparison. – Mullinax 20/12, 2012 at 11:56

Ok, so _fprintf does not exist in the fast version. But then, what does the main loop look like? – Mullinax 20/12, 2012 at 12:17

@Mullinax The fast one doesn't redefine fprintf the main loops are identical. I'm trying to look at the mingw source now to see if there's anything that seems bottleneck-ey. – Dora 20/12, 2012 at 12:19

_fprintf is probably the name of the fprintf() function from Microsoft C Runtime, which is not C99 compliant, so it's overloaded by mingw32's one – Zeal 20/12, 2012 at 13:35

see sourceforge.net/project/shownotes.php?release_id=24832 – Zeal 20/12, 2012 at 13:41

After some digging in the source code, I have found why the MinGW function is so terribly slow:

At the beginning of a [v,f,s]printf in MinGW, there is some innocent-looking initialization code:

__pformat_t stream = {
    dest,                   /* output goes to here        */
    flags &= PFORMAT_TO_FILE | PFORMAT_NOLIMIT, /* only these valid initially */
    PFORMAT_IGNORE,             /* no field width yet         */
    PFORMAT_IGNORE,             /* nor any precision spec     */
    PFORMAT_RPINIT,             /* radix point uninitialised  */
    (wchar_t)(0),               /* leave it unspecified       */
    0,                          /* zero output char count     */
    max,                        /* establish output limit     */
    PFORMAT_MINEXP          /* exponent chars preferred   */
};

However, PFORMAT_MINEXP is not what it appears to be:

#ifdef _WIN32
# define PFORMAT_MINEXP    __pformat_exponent_digits() 
# ifndef _TWO_DIGIT_EXPONENT
#  define _get_output_format()  0 
#  define _TWO_DIGIT_EXPONENT   1
# endif
static __inline__ __attribute__((__always_inline__))
int __pformat_exponent_digits( void )
{
  char *exponent_digits = getenv( "PRINTF_EXPONENT_DIGITS" );
  return ((exponent_digits != NULL) && ((unsigned)(*exponent_digits - '0') < 3))
    || (_get_output_format() & _TWO_DIGIT_EXPONENT)
    ? 2
    : 3
    ;
}

This winds up getting called every time I want to print, and getenv on windows must not be very quick. Replacing that define with a 2 brings the runtime back to where it should be.

So, the answer comes down to this: when using -std=c99 or any ANSI-compliant mode, MinGW switches the CRT runtime with its own. Normally, this wouldn't be an issue, but the MinGW lib had a bug which slowed its formatting functions down far beyond anything imaginable.

Dora answered 20/12, 2012 at 14:7 Comment(7)

You might to report a bug to mingw32 developers. – Zeal 20/12, 2012 at 14:10

I intend to. I'm putting together a patch, in the hopes that they will be more receptive if I have a solution already ready. – Dora 20/12, 2012 at 14:11

This behavor was documented in the changelog I linked in my other answer – Zeal 20/12, 2012 at 14:12

the getenv() is probably fine if done once at the runtime ininitialisation – Zeal 20/12, 2012 at 14:13

But it's not. It's done every single time a format function is called. It should be a hook in the pre-main libc init goo. – Dora 20/12, 2012 at 14:15

that was an advice for your patch – Zeal 20/12, 2012 at 14:18

I'm wondering if you could have found this problem quicker by this method. I tried it but was unable to link test because of missing a definition for __mingw_vfprintf, when using c99. – Cellule 20/12, 2012 at 16:23

Using -std=c99 disable all GNU extensions.

With GNU extensions and optimization, your fprintf(test, "B") is probably replaced by a fputc('B', test)

Note this answer is obsolete, see https://mcmap.net/q/276339/-massive-fprintf-speed-difference-without-quot-std-c99-quot and https://mcmap.net/q/276339/-massive-fprintf-speed-difference-without-quot-std-c99-quot

Zeal answered 20/12, 2012 at 11:28 Comment(7)

So a `-std=gnu99' would be a fair comparison, wouldn't it? – Mullinax 20/12, 2012 at 11:31

The plot thickens: gnu99 is fast, but fprintf(test, "%04d\n", x) is still fast on gnu99 and slow in c99. – Dora 20/12, 2012 at 11:34

@Dora - Then disassemble and look at the generated code. The difference should be obvious there. – Mullinax 20/12, 2012 at 11:36

@Dora @Mullinax no need to dissamble, just run gcc -std=c99 -O2 -S file.c, it will produce the file.s assembly file. You could also use gcc -v -std=c99 -O2 file.c -o file to check linker parameters – Zeal 20/12, 2012 at 11:38

all 3 gave the same timings and the same assembly for me (gcc 4.4.3) – Forras 20/12, 2012 at 11:41

@ydroneaud - Yeah, but I find objdump -S easier to read (because the asm is intermixed with the source). Anyway, I've done the test c99 vs. gnu99 vs nothing, with "%04d", GCC 4.7.2 and there is no measurable performance difference. – Mullinax 20/12, 2012 at 11:41

@Mullinax @peter-schneider to specifying -std= is equivalent to -std=gnu99 – Zeal 20/12, 2012 at 11:45

After some consideration of your assembler, it looks like the slow version is using the *printf() implementation of MinGW, based undoubtedly in the GCC one, while the fast version is using the Microsoft implementation from msvcrt.dll.

Now, the MS one is notably for lacking a lot of features, that the GCC one does implement. Some of these are GNU extensions but some others are for C99 conformance. And since you are using -std=c99 you are requesting the conformance.

But why so slow? Well, one factor is simplicity, the MS version is far simpler so it is expected that it will run faster, even in the trivial cases. Other factor is that you are running under Windows, so it is expected that the MS version be more efficient that one copied from the Unix world.

Does it explain a factor of x10? Probably not...

Another thing you can try:

Replace fprintf() with sprintf(), printing into a memory buffer without touching the file at all. Then you can try doing fwrite() without printfing. That way you can guess if the loss is in the formatting of the data or in the writing to the FILE.

Mullinax answered 20/12, 2012 at 13:31 Comment(0)

Since MinGW32 3.15, compliant printf functions are available to use instead of those found in Microsoft C runtime (CRT). The new printf functions are used when compiling in strict ANSI, POSIX and/or C99 modes.

For more information see the mingw32 changelog

You can use __msvcrt_fprintf() to use the fast (non compliant) function.

Zeal answered 20/12, 2012 at 13:46 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags