Buffering of standard I/O library

In the book Advanced Programming in the UNIX Environment (2nd edition), the author writes in Section 5.5 (stream operations of the standard I/O library):

When a file is opened for reading and writing (the plus sign in the type), the following restrictions apply.

  • Output cannot be directly followed by input without an intervening fflush, fseek, fsetpos, or rewind.
  • Input cannot be directly followed by output without an intervening fseek, fsetpos, or rewind, or an input operation that encounters an end of file.

I got confused about this. Could anyone explain a little about this? For example, in what situation the input and output function calls violating the above restrictions will cause unexpected behavior of the program? I guess the reason for the restrictions may be related to the buffering in the library, but I'm not so clear.

Woodenware answered 16/1, 2013 at 8:44 Comment(0)
G
3

It's not clear what you're asking.

Your basic question is "Why does the book say I can't do this?" Well, the book says you can't do it because the POSIX/SUS/etc. standard says it's undefined behavior in the fopen specification, which it does to align with the ISO C standard (N1124 working draft, because the final version is not free), 7.19.5.3.

Then you ask, "in what situation the input and output function calls violating the above restrictions will cause unexpected behavior of the program?"

Undefined behavior will always cause unexpected behavior, because the whole point is that you're not allowed to expect anything. (See 3.4.3 and 4 in the C standard linked above.)

But on top of that, it's not even clear what they could have specified that would make any sense. Look at this:

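#include <stdio.h>

int main(int argc, char *argv[]) {
  FILE *fp = fopen("foo", "r+");
  if (fp == NULL) {  /* "r+" requires that the file already exists */
    perror("foo");
    return 1;
  }
  fseek(fp, 0, SEEK_SET);
  fwrite("foo", 1, 3, fp);
  fseek(fp, 0, SEEK_SET);
  fwrite("bar", 1, 3, fp);
  char buf[4] = { 0 };
  /* Undefined: input directly follows output with no fflush/fseek. */
  size_t ret = fread(buf, 1, 3, fp);
  printf("%d %s\n", (int)ret, buf);
  fclose(fp);
  return 0;
}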

So, should this print out 3 foo because that's what's on disk, or 3 bar because that's what's in the "conceptual file", or 0 because there's nothing after what's been written so you're reading at EOF? And if you think there's an obvious answer, consider the fact that it's possible that bar has been flushed already—or even that it's been partially flushed, so the disk file now contains boo.

If you're asking the more practical question "Can I get away with it in some circumstances?", well, I believe on most Unix platforms, the above code will give you an occasional segfault, but 3 xyz (either 3 uninitialized characters, or in more complicated cases 3 characters that happened to be in the buffer before it got overwritten) the rest of the time. So, no, you can't get away with it.
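
For contrast, here is a minimal sketch of a well-defined variant of the program above. The only real change is a repositioning call between the last fwrite and the fread. The helper name overwrite_and_read_back is hypothetical, and the stream is assumed to be open in an update mode ("r+", "w+", or a tmpfile()).

```c
#include <stdio.h>
#include <string.h>

/* Writes "foo", overwrites it with "bar", then reads 3 bytes back.
   Returns the number of bytes read, or -1 if a seek or write fails.
   Assumes fp was opened in an update mode. */
int overwrite_and_read_back(FILE *fp, char out[4])
{
    memset(out, 0, 4);
    if (fseek(fp, 0L, SEEK_SET) != 0) return -1;
    if (fwrite("foo", 1, 3, fp) != 3) return -1;
    if (fseek(fp, 0L, SEEK_SET) != 0) return -1;
    if (fwrite("bar", 1, 3, fp) != 3) return -1;
    /* Output -> input: reposition first, which also flushes the writes. */
    if (fseek(fp, 0L, SEEK_SET) != 0) return -1;
    return (int)fread(out, 1, 3, fp);
}
```

With the seek in place there is only one sensible answer: the read starts at offset 0 and sees the bytes written most recently.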

Finally, you say, "I guess the reason for the restrictions may be related to the buffering in the library, but I'm not so clear." This sounds like you're asking about the rationale.

You're right that it's about buffering. As I pointed out above, there really is no intuitive right thing to do here—but also, think about the implementation. Remember that the Unix way has always been "if the simplest and most efficient code is good enough, do that".

There are four ways you could implement something like stdio:

  1. Use a shared buffer for read and write, and write code to switch contexts as needed. This is going to be a bit complicated, and will flush buffers more often than you'd ideally like.
  2. Use two separate buffers, and cache-style code to determine when one operation needs to copy from and/or invalidate the other buffer. This is even more complicated, and makes a FILE object take twice as much memory.
  3. Use a shared buffer, and just don't allow interleaving reads and writes without explicit flushes in between. This is dead-simple, and as efficient as possible.
  4. Use a shared buffer, and implicitly flush between interleaved reads and writes. This is almost as simple, and almost as efficient, and a lot safer, but not really any better in any way other than safety.

So, Unix went with #3, and documented it, and SUS, POSIX, C89, etc. standardized that behavior.
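
To make option 4 concrete, here is a rough sketch of what "implicitly flush between interleaved reads and writes" could look like if layered on top of an existing stdio. Everything named rw_* and the last_op enum are made up for illustration; this is not how any particular libc does it. The point is only that each call pays for an extra state check.

```c
#include <stdio.h>

/* Hypothetical wrapper: remembers the direction of the last operation. */
enum last_op { OP_NONE, OP_READ, OP_WRITE };

struct rw_stream {
    FILE *fp;            /* must be open in an update mode */
    enum last_op last;
};

/* On a change of direction, insert the no-op seek the standard requires. */
int rw_switch(struct rw_stream *s, enum last_op next)
{
    if (s->last != OP_NONE && s->last != next) {
        if (fseek(s->fp, 0L, SEEK_CUR) != 0)
            return -1;
    }
    s->last = next;
    return 0;
}

size_t rw_read(struct rw_stream *s, void *buf, size_t n)
{
    if (rw_switch(s, OP_READ) != 0) return 0;
    return fread(buf, 1, n, s->fp);
}

size_t rw_write(struct rw_stream *s, const void *buf, size_t n)
{
    if (rw_switch(s, OP_WRITE) != 0) return 0;
    return fwrite(buf, 1, n, s->fp);
}
```

For block transfers the extra branch is negligible. The cost argument below is about doing the same thing inside per-character operations such as getc and putc.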

You might say, "Come on, it can't be that inefficient." Well, you have to remember that Unix was designed for low-end 1970s systems, and the basic philosophy that it's not worth trading off even a little efficiency unless there's some actual benefit. But, most importantly, consider that stdio has to handle trivial functions like getc and putc, not just fancy stuff like fscanf and fprintf, and adding anything to those functions (or macros) that makes them 5x as slow would make a huge difference in a lot of real-world code.

If you look at modern implementations from, e.g., *BSD, glibc, Darwin, MSVCRT, etc. (most of which are open source, or at least commercial-but-shared-source), most of them do things the same way. A few add safety checks, but they generally give you an error for interleaving rather than implicitly flushing—after all, if your code is wrong, it's better to tell you that your code is wrong than to try to DWIM.

For example, look at early Darwin (OS X) fopen, fread, and fwrite (chosen because it's nice and simple, and has easily-linkable code that's syntax-colored but also copy-pastable). All that fread has to do is copy bytes out of the buffer, and refill the buffer if it runs out. You can't get any simpler than that.
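
As a rough illustration of "copy bytes out of the buffer, and refill if it runs out", here is a toy reader. It is not the Darwin code: the struct toy_in, the toy_* functions, and the 4-byte buffer are invented for the sketch, and an in-memory string stands in for the file so that it runs anywhere.

```c
#include <stddef.h>
#include <string.h>

#define TOY_BUFSIZ 4   /* deliberately tiny so refills are easy to see */

struct toy_in {
    const char *src;   /* stand-in for the underlying file */
    size_t src_len;
    size_t src_pos;    /* "OS" position: how much has been fetched */
    char buf[TOY_BUFSIZ];
    char *ptr;         /* next unread byte in buf */
    char *end;         /* one past the last valid byte in buf */
};

void toy_open(struct toy_in *s, const char *data, size_t len)
{
    s->src = data;
    s->src_len = len;
    s->src_pos = 0;
    s->ptr = s->end = s->buf;   /* empty buffer: first read refills */
}

/* Fetch up to TOY_BUFSIZ bytes from the source. Returns -1 at end of data. */
int toy_refill(struct toy_in *s)
{
    size_t n = s->src_len - s->src_pos;
    if (n > TOY_BUFSIZ) n = TOY_BUFSIZ;
    if (n == 0) return -1;
    memcpy(s->buf, s->src + s->src_pos, n);
    s->src_pos += n;
    s->ptr = s->buf;
    s->end = s->buf + n;
    return 0;
}

/* Copy bytes out of the buffer, refilling whenever it runs dry. */
size_t toy_read(struct toy_in *s, char *out, size_t want)
{
    size_t got = 0;
    while (got < want) {
        if (s->ptr == s->end && toy_refill(s) != 0)
            break;
        out[got++] = *s->ptr++;
    }
    return got;
}
```

Note that nothing here knows about writing. A write path sharing buf, ptr, and end would have to know whether the bytes in the buffer are unread input or unwritten output, which is exactly the state the restriction lets the library avoid tracking.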

Galang answered 30/1, 2013 at 2:13 Comment(0)
B
4

You aren't allowed to intersperse input and output operations without an intervening flush or positioning call. For example, you can't use formatted input to seek to a particular point in the file, then start writing bytes starting at that point. This allows the implementation to assume that at any time, the sole I/O buffer contains either data to be read (by you) or data to be written (to the OS), never both, without doing any safety checks.

FILE *f = fopen( "myfile", "r+" ); /* open for read and write ("rw" is not a valid mode) */
fscanf( f, "hello, world\n" ); /* scan past file header */
fprintf( f, "daturghhhf\n" ); /* write some data - undefined, no intervening seek */

This is OK, though, if you do an fseek( f, 0, SEEK_CUR ); between the fscanf and the fprintf because that changes the mode of the I/O buffer without repositioning it.
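
A minimal sketch of the corrected sequence, assuming the file starts with the header line "hello, world\n". The function name overwrite_after_header is hypothetical, and fgets is used instead of fscanf only so that a missing header can be detected.

```c
#include <stdio.h>
#include <string.h>

/* Reads the header line, then overwrites what follows with payload.
   Returns 0 on success, -1 on any failure.
   Assumes f was opened in an update mode such as "r+". */
int overwrite_after_header(FILE *f, const char *payload)
{
    char header[64];

    if (fgets(header, sizeof header, f) == NULL) return -1;
    if (strcmp(header, "hello, world\n") != 0) return -1;

    /* Input -> output: a no-op seek switches direction in place. */
    if (fseek(f, 0L, SEEK_CUR) != 0) return -1;

    if (fputs(payload, f) == EOF) return -1;
    return fflush(f) == EOF ? -1 : 0;
}
```

The seek discards any read-ahead still sitting in the buffer, so the write lands right after the header rather than wherever the library happened to stop reading.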

Why is it done this way? As far as I can tell, because OS vendors often want to support automatic mode switching, but fail. The stdio spec allows a buggy implementation to be compliant, and a working implementation of automatic mode switching simply implements a compatible extension.

Branle answered 16/1, 2013 at 8:54 Comment(10)
Thanks for your answer first. What do you mean by "changes the mode of the I/O buffer without repositioning it"? For example, I call fread to read 5 bytes from a file, but the underlying read system call actually reads 10 bytes, where 5 bytes are given to my application and the other 5 bytes are buffered in stdio. Then the offsets in the FILE and in the OS file table are different. If I call fseek before fwrite, what will happen to the bytes still in the buffer and to the two offsets? I can't find many details in the fseek manpage. - Woodenware
@PJ.Hades The unused bytes in the buffer get discarded. The library is responsible for seeking the OS file position back to the position visible to you, the user, such that any data is flushed to the correct location in the file. (The underlying POSIX file descriptors are completely encapsulated; the C library doesn't specify what happens to them and I wouldn't expect POSIX to, either. Implementations need the flexibility.) - Branle
Well, it seems the source of all of this is that there is a single buffer used for both reading and writing operations. And your point is that there is no explicit "R/W" state for the buffer, @Potatoswatter? - Property
@Property If there's an R/W state, then it becomes the only state ;v) . I modified the buffering in the GNU C++ library to avoid statefulness, and now it works without intervening seeks despite C++ inheriting these restrictions from C. - Branle
@Branle So, you mean: 1. the library does not distinguish between the actual uses of the buffer (read/write); 2. what the functions fseek, rewind, and fsetpos do is "clear the state" of the buffer, making the upper- and lower-level data structures (FILE and OS file table) consistent with each other (e.g. the offsets); 3. the details of how to "clear the state" of the buffer (like how to deal with the 5 unused bytes in my example) are implementation dependent. Do I get your point right? - Woodenware
1. Yes, 2. Yes clear the state but no this doesn't imply anything about the OS level, 3. As long as things get flushed properly, yes. - Branle
More seriously, the rationale is mostly wrong here. It isn't that most OS vendors want to support automatic switching but get it wrong. The spec allows "do the simplest thing" because most OS vendors were already doing the simplest thing, and still are today, because that's the Unix/C way. - Galang
It isn't that "vendors fail to implement switching", it is that it is (was?) prohibitively costly. Remember that getc() and putc() were/are often implemented as macros expanding to not much more than simple *(ptr++) and *(ptr++) = value. Adding some futzing around with buffer state and such would have made many performance critical loops take twice the time or more. Just not acceptable. - Lafond
@vonbrand: Well, they need to do a bit more than that; they have to refill/flush if the buffer under/overflows. But yes, adding the state switching would add a lot more than that simple if (f->ptr < f->end_ptr) check. - Galang
@Lafond Interesting. Macros introduce a significant size/speed tradeoff so that's definitely biased toward a focus on a few inner loops, which is very old-school UNIX. (Though it's still relevant for any highly optimized parser.) However, if flushing the just-read part of the buffer is defined to be OK, the overhead to the macros is only that putc sets a "dirty" bit so the next buffer reload is preceded by a flush. - Branle
A
0

Reason 1

Finding the real file position to start from.

Because stdio buffers its I/O, the stream position can differ from the OS file position. When you read 1 byte, stdio records the stream position as 1. Because of buffering, though, stdio may have read 4096 bytes from the underlying file, so the OS records its file position as 4096. When you switch to output, the library has to decide which of the two positions to write at.
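
The mismatch can be observed directly on a POSIX system by comparing ftell with lseek on the underlying descriptor. This is only a sketch: the struct positions and read_one_and_compare names are made up, and how far ahead the OS position runs depends on the implementation's buffer size, so no particular value should be expected.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and lseek() */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

struct positions {
    long  stream_pos;  /* position as stdio sees it (ftell) */
    off_t os_pos;      /* position as the kernel sees it (lseek) */
};

/* Reads one byte from fp and reports both positions.
   Returns 0 on success, -1 on failure. */
int read_one_and_compare(FILE *fp, struct positions *out)
{
    if (fgetc(fp) == EOF)
        return -1;
    out->stream_pos = ftell(fp);
    out->os_pos = lseek(fileno(fp), 0, SEEK_CUR);
    return (out->stream_pos < 0 || out->os_pos < 0) ? -1 : 0;
}
```

On a fully buffered stream over a large enough file, stream_pos is 1 while os_pos is typically a whole buffer ahead. A subsequent fseek or fflush is what brings the two back in step before writing.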


Reason 2

Finding the right buffer cursor to start from.

tl;dr:

If the underlying implementation uses a single shared buffer for both reading and writing, the buffer has to be flushed whenever the I/O direction changes.

Take the glibc used in Chromium OS as a demonstration of how fwrite, fseek, and fflush handle the single shared buffer.

The buffer-filling part of fwrite:

    fill_buffer:
      while (to_write > 0)
    {
      register size_t n = to_write;
      if (n > buffer_space)
        n = buffer_space;
      buffer_space -= n;
      written += n;
      to_write -= n;
      if (n < 20)
        while (n-- > 0)
          *stream->__bufp++ = *p++;
      else
        {
          memcpy ((void *) stream->__bufp, (void *) p, n);
          stream->__bufp += n;
          p += n;
        }
      if (to_write == 0)
        /* Done writing.  */
        break;
      else if (buffer_space == 0)
        {
          /* We have filled the buffer, so flush it.  */
          if (fflush (stream) == EOF)
        break;

From this snippet we can see that when the buffer is full, fwrite flushes it.

Let's take a look at fflush:

int
fflush (stream)
     register FILE *stream;
{
  if (stream == NULL) {...}
  if (!__validfp (stream) || !stream->__mode.__write)
    {
      __set_errno (EINVAL);
      return EOF;
    }
  return __flshfp (stream, EOF);
}

It calls __flshfp:

/* Flush the buffer for FP and also write C if FLUSH_ONLY is nonzero.
   This is the function used by putc and fflush.  */
int
__flshfp (fp, c)
     register FILE *fp;
     int c;
{
  /* Make room in the buffer.  */
  (*fp->__room_funcs.__output) (fp, flush_only ? EOF : (unsigned char) c);
}

By default, __room_funcs.__output is flushbuf:

        /* Write out the buffered data.  */
        wrote = (*fp->__io_funcs.__write) (fp->__cookie, fp->__buffer,
                                           to_write);

Now we are close. What is __write here? Tracing the default settings mentioned above, it is __stdio_write:

int
__stdio_write (cookie, buf, n)
     void *cookie;
     register const char *buf;
     register size_t n;
{
  const int fd = (int) cookie;
  register size_t written = 0;
  while (n > 0)
    {
      int count = __write (fd, buf, (int) n);
      if (count > 0)
        {
          buf += count;
          written += count;
          n -= count;
        }
      else if (count < 0
#if        defined (EINTR) && defined (EINTR_REPEAT)
               && errno != EINTR
#endif
               )
        /* Write error.  */
        return -1;
    }
  return (int) written;
}

__write is the library's wrapper for the write(2) system call.

As we can see, fwrite uses one single buffer. If you change direction without flushing, that buffer can still hold the previously written contents. As the code above shows, calling fflush empties it.
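
A small sketch of fflush acting as the separator between output and input. The function name write_flush_read is hypothetical and the stream is assumed to be open in an update mode. Unlike a seek to the start, fflush leaves the position where the write ended, so the read continues from there.

```c
#include <stdio.h>

/* Writes len bytes, flushes, then reads up to want bytes from the
   position just past the write. Returns the number of bytes read
   (0 if the write or flush fails). */
size_t write_flush_read(FILE *fp, const char *data, size_t len,
                        char *out, size_t want)
{
    if (fwrite(data, 1, len, fp) != len)
        return 0;
    /* Output -> input: push buffered output to the file first. */
    if (fflush(fp) == EOF)
        return 0;
    return fread(out, 1, want, fp);
}
```

For a file containing "abcdef" with the stream positioned at offset 0, writing "XY" and then reading 2 bytes this way yields "cd".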

The same applies to fseek:


/* Move the file position of STREAM to OFFSET
   bytes from the beginning of the file if WHENCE
   is SEEK_SET, the end of the file if it is SEEK_END,
   or the current position if it is SEEK_CUR.  */
int
fseek (stream, offset, whence)
     register FILE *stream;
     long int offset;
     int whence;
{
  ...
  if (stream->__mode.__write && __flshfp (stream, EOF) == EOF)
    return EOF;
  ...
  /* O is now an absolute position, the new target.  */
  stream->__target = o;
  /* Set bufp and both end pointers to the beginning of the buffer.
     The next i/o will force a call to the input/output room function.  */
  stream->__bufp
    = stream->__get_limit = stream->__put_limit = stream->__buffer;
  ...
}

It soft-flushes (resets) the buffer at the end, which means the read buffer is empty after this call.

This matches the C99 rationale:

A change of input/output direction on an update file is only allowed following a successful fsetpos, fseek, rewind, or fflush operation, since these are precisely the functions which assure that the I/O buffer has been flushed.

Ameline answered 8/3, 2022 at 4:55 Comment(0)
