Understanding undefined behavior for a binary stream using fseek(file, 0, SEEK_END) with a file

The C spec has an interesting footnote (#268 C11dr §7.21.3 9)

"Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state."

Does this ever apply to binary streams reading a file? (as from a physical device)

IMO, a binary file on a disk is just a sea of bytes. It seems to me that a binary file could not have state-dependent encoding as it is a binary file. I'm fuzzy on the concept of "binary wide-oriented streams" and if that even could apply to disk I/O.

I see that calling fseek(file, 0, SEEK_END) on a serial stream such as a COM port, or maybe stdin, might not reach the true end, as the end is yet to be determined. Hence the narrowing of the question to physical files.


[edit] Answer: A concern with older systems (maybe up to the late 1980s). As of 2014, on Windows, POSIX-compliant systems, and other non-exotic platforms: not a problem.

@Shafik Yaghmour provides a good reference in "Using fseek and ftell to determine the size of a file has a vulnerability?". There, @Jerry Coffin discusses CP/M, where binary files do not always have a precise length (128-byte records, per wiki).

Thanks to @Keith Thompson's answer for the meat of the answer.

Together this explains the spec's "(because of possible trailing null characters)" comment.

Sulfur answered 10/1, 2014 at 17:27 Comment(1)

Binary files are going to be sequences of 8-bit bytes, with an exact specified size, on any system you're likely to use. But not all systems store files that way, and the C standard is carefully designed to allow portability to systems with unusual characteristics.

For example, a conforming C implementation might run on an operating system that stores files as sequences of 512-byte blocks, with no indication of how many bytes of the final block are significant. On such a system, when a binary file is created, the OS might pad the remainder of the final block with zero bytes. When you read from such a file, the padding bytes might either appear in the input (even though they were never explicitly written to the file), or they might be ignored (even though the program that created the file might have written them explicitly).
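One way to see whether a given implementation pads binary files is to write a known number of bytes and count what comes back. The following is only a minimal sketch: the helper name `roundtrip_count` and the scratch file name are made up for illustration, and a result equal to the number of bytes written is what one would expect on POSIX systems and Windows, not something ISO C guarantees.

```c
#include <stdio.h>

/* Hypothetical helper (not from the standard library): writes n bytes to
   `path` in binary mode, reopens the file, and returns how many bytes can
   be read back before end-of-file. Returns -1 on any I/O error. */
long roundtrip_count(const char *path, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (fputc('x', f) == EOF) {
            fclose(f);
            return -1;
        }
    }
    if (fclose(f) != 0)
        return -1;

    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    long count = 0;
    while (fgetc(f) != EOF)   /* count every byte the stream delivers */
        count++;
    int failed = ferror(f);   /* distinguish a read error from a real EOF */
    fclose(f);
    return failed ? -1 : count;
}
```

Calling `roundtrip_count("scratch.bin", 42)` on a block-padding system of the kind described above could legitimately return 512 rather than 42.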

If you're reading from a non-seekable stream (for example keyboard input), then fseek(file, 0, SEEK_END) won't just give you a bad result, it will indicate failure by returning a non-zero result. (On POSIX-compliant systems, it returns -1 and sets errno; ISO C doesn't require that.)

On most systems, fseek(file, 0, SEEK_END) on a binary file will either seek to the actual end of the file (a position determined by exactly how many bytes were written to the file), or return a clear failure indication. If you're using POSIX-specific features anyway, you can safely assume this behavior; you can probably make the same assumption for Windows and a number of other systems. If you want your code to be 100% portable to exotic systems, you shouldn't assume that binary files won't be padded with extra zero bytes.
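The usual fseek/ftell idiom, with the failure checks described above, might look like the sketch below. The name `file_size_by_seek` is hypothetical, and the code assumes a platform where binary files have an exact byte length (POSIX, Windows); on an exotic system the seek itself is what the footnote calls undefined. The result is also limited by the range of `long`.

```c
#include <stdio.h>

/* Hypothetical helper: returns the size in bytes of an open binary stream,
   or -1L if the stream cannot be positioned (for example, a pipe or
   terminal). The original file position is restored before returning. */
long file_size_by_seek(FILE *f)
{
    long saved = ftell(f);              /* remember where we were */
    if (saved == -1L)
        return -1L;
    if (fseek(f, 0L, SEEK_END) != 0)    /* non-zero return means failure */
        return -1L;
    long size = ftell(f);               /* -1L here also signals failure */
    if (fseek(f, saved, SEEK_SET) != 0)
        return -1L;
    return size;
}
```

The stream should be opened in binary mode ("rb"); for a text stream, ISO C does not promise that the value returned by ftell is a byte count.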

Grainfield answered 10/1, 2014 at 17:45 Comment(12)
Aside from maybe CP/M, do you know of any current file system whose file sizes are not "determined by exactly how many bytes were written"? - Sulfur
@chux: No, but I'm not familiar with all current file systems. (There might be something for embedded systems.) - Grainfield
"going to be ... an exact specified size, on any system you're likely to use": well, I've used a lot of file systems, even a CP/M-like OS called CBM DOS, which I believe also lacked a byte-specific file size. Doubt I'll write C code for that platform any time soon, though. I thought the odd C spec wording would be about something new; instead it's about something old. Thanks. - Sulfur
Re: "code to be 100% portable": usually GetFilesize (or a similarly named function) is implemented using fseek(file, 0, SEEK_END) coupled with ftell. How, then, would one implement a 100% portable GetFilesize? Is it possible at all using Standard C? - Arrive
@Arrive No, it's not possible to implement a 100% portable GetFileSize function that tells you how many bytes were written to a binary file. It is possible to write such a function that will work correctly on almost every implementation, and very likely on all hosted implementations. (Note that Windows has 32-bit long, so the fseek/ftell method will fail on files larger than 2 GiB.) - Grainfield
@KeithThompson Re: "possible to write such a function": then what would the implementation be? Perhaps you can point to an existing one (GitHub, etc.)? Re: "2 GiB": exactly. - Arrive
@chux-ReinstateMonica Perhaps you know of, or have an idea for, an implementation of GetFileSize that will work correctly on almost every implementation? Note that I'm on C (C11+), not C++. Preferably UB-free. All the implementations that I'm aware of rely on fseek(file, 0, SEEK_END) coupled with ftell, where fseek may trigger UB. - Arrive
@Arrive 1) "will work correctly on almost every implementation" ---> (re)open the file in binary mode and fread() characters, one buffer at a time, until end-of-file, returning the count as a long long. Certainly not speedy, yet highly portable. This is a case where we should set aside the single-portable-solution goal and use code that makes sense per implementation. 2) Alternative: revamp the coding goal so that it does not need the file size. This is often the better approach. - Sulfur
@chux-ReinstateMonica In a conforming implementation that keeps track of the size of a file only in, say, 512-byte blocks, you could create a binary file, write 42 bytes to it, close it, re-open it, and read 512 bytes from it before seeing EOF. The last 470 bytes would all read as zeros. The standard deliberately permits such implementations. - Grainfield
N1570 draft 7.21.2p3: "A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation. Such a stream may, however, have an implementation-defined number of null characters appended to the end of the stream." - Grainfield
@chux-ReinstateMonica But yes, re-reading the entire file does work around the problem of long not being big enough to represent the size of the file. And yes, if you need the size of a file, an implementation-specific solution (fstat, for example) is likely to be better. - Grainfield
@KeithThompson Yes, such uncommon systems do not differentiate well between 42 and 512. Yet in that case it is arguable whether the file size is 512, or 42, or maybe just 512-ish. - Sulfur
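The read-and-count approach suggested in the comments can be sketched roughly as follows. The function name `file_size_by_reading` and the 4096-byte buffer size are illustrative choices, not anything prescribed by the thread. It avoids fseek(file, 0, SEEK_END) entirely and is not limited by the range of `long`, but on a system that pads binary files it would still count any trailing null characters the implementation appends.

```c
#include <stdio.h>

/* Hypothetical helper: opens `path` in binary mode and counts the bytes
   delivered by fread() until end-of-file. Returns -1 if the file cannot
   be opened or a read error occurs. */
long long file_size_by_reading(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    unsigned char buf[4096];    /* arbitrary buffer size for this sketch */
    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        total += (long long)n;

    int failed = ferror(f);     /* a short read may be EOF or an error */
    fclose(f);
    return failed ? -1 : total;
}
```

This reads the whole file, so it is slow for large inputs; where a platform-specific call such as fstat is available, that is likely the more practical choice.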
