How to get file size in ANSI C without fseek and ftell?
C

7

14

While looking for ways to find the size of a file given a FILE*, I came across this article advising against it. Instead, it seems to encourage using file descriptors and fstat.

However I was under the impression that fstat, open and file descriptors in general are not as portable (After a bit of searching, I've found something to this effect).

Is there a way to get the size of a file in ANSI C while keeping in line with the warnings in the article?

Coterminous answered 22/3, 2012 at 21:25 Comment(2)
Please note that the article you linked to is Considered Harmful. fseek/ftell (actually fseeko/ftello, if you have POSIX, so you can deal with large files) is the preferred way to determine file size. The stat-based alternative will fail to determine sizes of some non-regular-files that do have well-defined sizes, such as block devices (disk partitions, etc.).Battik
It's not useful, but opening a file in append mode works: FILE* fp = fopen("teste.txt", "a"); long sz = ftell(fp);Spondee
S
15

In standard C, the fseek/ftell dance is pretty much the only game in town. Anything else you'd do depends at least in some way on the specific environment your program runs in. Unfortunately said dance also has its problems as described in the articles you've linked.

I guess you could always read everything out of the file until EOF and keep track along the way - with fread() for example.
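The read-until-EOF idea can be sketched as follows. This is only an illustration: count_bytes is a hypothetical helper name, the 4096-byte buffer is an arbitrary choice, and the count starts from the stream's current position, so the stream should be freshly opened (in binary mode) or rewound first.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes readable from the current position
 * up to end-of-file, using only standard C I/O.
 * Returns the count, or -1L if a read error occurred.
 * The stream is left at end-of-file; rewind() it to read the data again.
 * Note: a long may be too small for very large files on some platforms. */
long count_bytes(FILE *fp)
{
    unsigned char buf[4096];   /* arbitrary chunk size */
    long total = 0L;
    size_t n;

    /* fread returns 0 at end-of-file or on error */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;

    return ferror(fp) ? -1L : total;
}
```

This touches every byte of the file, so it is slow for large files, but it relies on nothing outside the standard library.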

Solipsism answered 22/3, 2012 at 21:28 Comment(7)
I think the answer was downvoted because of the specific wording in the C standard, which at least should've been mentioned: "Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state." and "A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END."Cetane
Unfortunately it also doesn't provide any other options.Solipsism
Maybe it doesn't have to. You can fread() or fgetc() until EOF, which isn't fast, but should work and be more portable.Cetane
Note that while ISO C does not define the end of a binary file, POSIX does, and all real-world, post-1980 implementations of C agree on this issue. Binary files have an exact size and you can seek relative to the end.Battik
But using POSIX functions is undefined behavior according to C. There is no solution for undefined behavior in solving this problem. fseek using SEEK_END is undefined behavior, and calling a function that is not in ISO C and not in your program is undefined behavior. Solving this problem, and most other everyday problems, requires removing the ISO C blinders from one's eyes.Polysynthetic
@Polysynthetic "fseek using SEEK_END is undefined behavior" - really? I thought it was setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), that is undefined behavior. So setting the position N bytes before the end (fseek(file, -1, SEEK_END)) seems to be OK according to the standard.Stiver
@0x69 I'd worry about files sized <=1 bytes there. But that looks worth some man page readingSophrosyne
G
7

The article claims fseek(stream, 0, SEEK_END) is undefined behaviour by citing an out-of-context footnote.

The footnote appears in text dealing with wide-oriented streams, that is, streams on which the first operation performed is a wide-character operation.

This undefined behaviour stems from the combination of two paragraphs. First §7.19.2/5 says that:

— Binary wide-oriented streams have the file-positioning restrictions ascribed to both text and binary streams.

And the restrictions for file-positioning with text streams (§7.19.9.2/4) are:

For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.

This makes fseek(stream, 0, SEEK_END) undefined behaviour for wide-oriented streams. There is no such rule like §7.19.2/5 for byte-oriented streams.

Furthermore, when the standard says:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

It doesn't mean it's undefined behaviour to do so. But if the stream supports it, it's ok.

Apparently this exists to allow binary files to have coarse size granularity, i.e. for the size to be a number of disk sectors rather than a number of bytes, and as such allows for an unspecified number of zeros to magically appear at the end of binary files. SEEK_END cannot be meaningfully supported in this case. Other examples include pipes or infinite files like /dev/zero. However, the C standard provides no way to distinguish between such cases, so you're stuck with system-dependent calls if you want to consider that.

Go answered 22/3, 2012 at 23:9 Comment(3)
The last paragraph is not quite right. ISO C allows binary files to have coarse size granularity, i.e. for the size to be a number of disk sectors rather than a number of bytes, and as such allows for an unspecified number of zeros to magically appear at the end of binary files. This is the reason SEEK_END may not be "meaningfully" supported. Still, no real-world implementation would be this broken; further, POSIX forbids it.Battik
@R.. Oh, thanks. That would be indeed quite weird. Would those nulls at the end be read by say fread?Go
The article does not cite an out-of-context footnote; it cites a pertinent footnote. The basic claims in the article are based on normative text. The article's author is taking normative text and the notion of undefined behavior out of a rational context, and does not realize that the proposed solution (the use of platform-specific functions, not defined in the C program or the standard library) is also, formally, undefined behavior.Polysynthetic
R
4

Use fstat. It requires a file descriptor, which you can get from the FILE* with fileno. The size is then within your grasp, along with other details.

i.e.

fstat(fileno(filePointer), &buf);

Where filePointer is the FILE *

and

buf is

struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};
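Put together, the call might look like the sketch below. It assumes a POSIX system (fileno and fstat are not ISO C), and file_size_fstat is a hypothetical helper name. The fflush call is there because data still sitting in the stdio buffer is not yet part of the size that fstat reports.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations (fileno, fstat) */
#endif
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper (POSIX only, not ISO C): returns the size in bytes
 * that fstat reports for the file behind fp, or -1 on failure. */
off_t file_size_fstat(FILE *fp)
{
    struct stat st;
    int fd;

    /* Push buffered output to the file first; relies on POSIX semantics
     * for fflush, since ISO C leaves fflush on input streams undefined. */
    if (fflush(fp) != 0)
        return (off_t)-1;

    fd = fileno(fp);
    if (fd == -1)
        return (off_t)-1;

    if (fstat(fd, &st) != 0)
        return (off_t)-1;

    return st.st_size;
}
```

For block devices and similar special files st_size may not be meaningful, as noted in the comments on the question.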
Riorsson answered 22/3, 2012 at 21:35 Comment(6)
As a previous poster noted, operating systems differ, but the same sort of thing is available on Windows. An equivalent of fstat is available.Riorsson
Guess the best option is to make it work according to the OS.Riorsson
Voted up because it's standard to POSIX.Befool
Danger, Will Robinson! If you use fstat() on an open file to which you have previously been writing stuff via a FILE* it could well return the wrong size, due to buffered data not yet being written.Hangman
I was making the assumption that the person would take this into account, either by doing this at the start (as hinted in the OP) or by using fflush.Riorsson
@DavidGiven While you point out a common pitfall, obviously stat wouldn't be reporting "the wrong size" there - it is actually the size of the file (since the buffered changes have... not yet been written)Sophrosyne
A
3

The executive summary is that you must use fseek/ftell because there is no alternative (even the implementation specific ones) that is better.

The underlying issue is that the "size" of a file in bytes is not always the same as the length of the data in the file and that, in some circumstances, the length of the data is not available.

A POSIX example is what happens when you write data to a device; the operating system only knows the size of the device. Once the data has been written and the (FILE*) closed there is no record of the length of the data written. If the device is opened for read the fseek/ftell approach will either fail or give you the size of the whole device.

When the ANSI-C committee was sitting at the end of the 1980's a number of operating systems the members remembered simply did not store the length of the data in a file; rather they stored the disk blocks of the file and assumed that something in the data terminated it. The 'text' stream represents this. Opening a 'binary' stream on those files shows not only the magic terminator byte, but also any bytes beyond it that were never written but happen to be in the same disk block.

Consequently the C-90 standard was written so that it is valid to use the fseek trick; the result is a conformant program, but the result may not be what you expect. The behavior of that program is not 'undefined' in the C-90 definition and it is not 'implementation-defined' (because on UN*X it varies with the file). Neither is it 'invalid'. Rather you get a number you can't completely rely on or, maybe, depending on the parameters to fseek, -1 and an errno.

In practice if the trick succeeds you get a number that includes at least all the data, and this is probably what you want, and if the trick fails it is almost certainly someone else's fault.
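For reference, the trick with its error checks might be written like this. It is a sketch only: file_size_seek is a hypothetical name, the stream is assumed to be opened in binary mode, and the result is limited to what fits in a long.

```c
#include <stdio.h>

/* Hypothetical helper: the classic fseek/ftell approach.
 * Returns the offset of end-of-file, or -1L on failure.
 * The original file position is restored before returning. */
long file_size_seek(FILE *fp)
{
    long pos, size;

    pos = ftell(fp);                     /* remember where we were */
    if (pos == -1L)
        return -1L;

    if (fseek(fp, 0L, SEEK_END) != 0)    /* may fail, e.g. on a pipe */
        return -1L;

    size = ftell(fp);                    /* -1L here also signals failure */

    if (fseek(fp, pos, SEEK_SET) != 0)   /* go back to the saved position */
        return -1L;

    return size;
}
```

On a pipe or terminal the calls fail and the function returns -1L; on a device the number, if any, is the size of the device rather than of the data written to it.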

John Bowler

Anywheres answered 17/12, 2014 at 1:23 Comment(0)
N
2

Different OSes provide different APIs for this. For example, on Windows we have:

GetFileSizeEx() (or GetFileAttributesEx(); plain GetFileAttributes() returns only attribute flags, not the size)

On macOS we have:

[[[NSFileManager defaultManager] attributesOfItemAtPath:someFilePath error:nil] fileSize];

But the only raw method is with fread and fseek: How can I get a file's size in C?

Narrative answered 22/3, 2012 at 21:40 Comment(0)
T
2

You can't always avoid writing platform-specific code, especially when you have to deal with things that are a function of the platform. File sizes are a function of the file system, so as a rule I'd use the native filesystem API to get that information over the fseek/ftell dance. I'd create my own generic wrapper around it, so as to not pollute application logic with platform-specific details and make the code easier to port.
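Such a wrapper could look roughly like the sketch below. The names portable_file_size and file_size_t are hypothetical, and only two platform families are covered: Windows via _fstat64 from the Microsoft C runtime, and everything else assumed to be POSIX with fstat.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L   /* request POSIX declarations (fileno, fstat) */
#endif
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical size type: wide enough for the platform's native API. */
#ifdef _WIN32
typedef __int64 file_size_t;
#else
typedef off_t file_size_t;
#endif

/* Hypothetical wrapper: stores the size of the file behind fp in *out.
 * Returns 0 on success, -1 on failure. Application code calls only this
 * function and never sees the platform-specific details. */
int portable_file_size(FILE *fp, file_size_t *out)
{
    /* Buffered output is not part of the on-disk size until flushed. */
    if (fflush(fp) != 0)
        return -1;

#ifdef _WIN32
    {
        struct __stat64 st;
        if (_fstat64(_fileno(fp), &st) != 0)
            return -1;
        *out = st.st_size;
    }
#else
    {
        struct stat st;
        if (fstat(fileno(fp), &st) != 0)
            return -1;
        *out = st.st_size;
    }
#endif
    return 0;
}
```

Porting to another platform then means adding one more branch here, while callers stay unchanged.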

Thriftless answered 22/3, 2012 at 22:19 Comment(0)
P
-2

The article has a little problem of logic.

It (correctly) identifies that a certain usage of C functions has behavior which is not defined by ISO C. But then, to avoid this undefined behavior, the article proposes a solution: replace that usage with platform-specific functions. Unfortunately, the use of platform-specific functions is also undefined according to ISO C. Therefore, the advice does not solve the problem of undefined behavior.

The quote in my copy of the 1999 standard confirms that the alleged behavior is indeed undefined:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END. [ISO 9899:1999 7.19.9.2 paragraph 3]

But undefined behavior does not mean "bad behavior"; it is simply behavior for which the ISO C standard gives no definition. Not all undefined behaviors are the same.

Some undefined behaviors are areas in the language where meaningful extensions can be provided. The platform fills the gap by defining a behavior.

Providing a working fseek which can seek from SEEK_END is an example of an extension in place of undefined behavior. It is possible to confirm whether or not a given platform supports fseek from SEEK_END, and if this is provisioned, then it is fine to use it.

Providing a separate function like lseek is also an extension in place of undefined behavior (the undefined behavior of calling a function which is not in ISO C and not defined in the C program). It is fine to use that, if available.

Note that those platforms which have functions like the POSIX lseek will also likely have an ISO C fseek which works from SEEK_END. Also note that on platforms where fseek on a binary file cannot seek from SEEK_END, the likely reason is that this is impossible to do (no API can be provided to do it and that is why the C library function fseek is not able to support it).

So, if fseek does provide the desired behavior on the given platform, then nothing has to be done to the program; it is a waste of effort to change it to use that platform's special function. On the other hand, if fseek does not provide the behavior, then likely nothing does, anyway.

Note that even including a nonstandard header which is not in the program is undefined behavior. (By omission of the definition of behavior.) For instance if the following appears in a C program:

#include <unistd.h>

the behavior is not defined after that. [See References below.] The behavior of the preprocessing directive #include is defined, of course. But this creates two possibilities: either the header <unistd.h> does not exist, in which case a diagnostic is required. Or the header does exist. But in that case, the contents are not known (as far as ISO C is concerned; no such header is documented for the Library). In this case, the include directive brings in an unknown chunk of code, incorporating it into the translation unit. It is impossible to define the behavior of an unknown chunk of code.

#include <platform-specific-header.h> is one of the escape hatches in the language for doing anything whatsoever on a given platform.

In point form:

  1. Undefined behavior is not inherently "bad" and not inherently a security flaw (though of course it can be! E.g. buffer overruns linked to the undefined behaviors in the area of pointer arithmetic and dereferencing.)
  2. Replacing one undefined behavior with another, only for the purpose of avoiding undefined behavior, is pointless.
  3. Undefined behavior is just a special term used in ISO C to denote things that are outside of the scope of ISO C's definition. It does not mean "not defined by anyone in the world" and doesn't imply something is defective.
  4. Relying on some undefined behaviors is necessary for making most real-world, useful programs, because many extensions are provided through undefined behavior, including platform-specific headers and functions.
  5. Undefined behavior can be supplanted by definitions of behavior from outside of ISO C. For instance the POSIX.1 (IEEE 1003.1) series of standards defines the behavior of including <unistd.h>. An undefined ISO C program can be a well defined POSIX C program.
  6. Some problems cannot be solved in C without relying on some kind of undefined behavior. An example of this is a program that wants to seek so many bytes backwards from the end of a file.

References:

Polysynthetic answered 2/5, 2012 at 17:51 Comment(8)
Oh God, not again… It's not undefined behavior.Jaquez
I think you mix "undefined behavior" and "implementation defined behavior".Inconsistency
@Etienne de Martel, for the second time.Jaquez
Really, I think the mixup is about what 'undefined behaviour' applies to: the compiler's behaviour is very well-defined for processing includes. The resulting program obviously can have undefined behaviour (hell, it could even be ill-formed). Usually 'undefined behaviour' refers to the compiler's actions/output. Not the behaviour of the resulting program (although, that of course becomes hard to reason about at the very same time)Sophrosyne
No, "undefined behavior" simply means any situation for which the programming language standard either says that it has "undefined behavior", or for which it provides no definition of behavior. It does not mean "not defined by any system or vendor". It means not standard-defined. A compiler's behavior is not very well standard-defined at all! The C standard only partially defines what happens when #include <unistd.h> is processed. Not enough to actually define the consequences.Polysynthetic
"Undefined behavior is behavior, such as might arise upon use of an erroneous program construct or erroneous data, for which the C++ Standard imposes no requirements. Undefined behavior may also be expected when the C++ Standard omits the description of any explicit definition of behavior or defines the behavior to be ill-formed, with no diagnostic required." Although ambiguous to some degree, undefined behavior is bad in the sense that you can't know what will happen. Not knowing what your program will do is bad isn't it?Uncommon
I agree with Etienne's comment. Undefined and implementation defined are very different things. Undefined behavior is typically tied to an ill formed program that is wrong, and the language simply imposes no requirements on how to handle that situation. To say that undefined behavior is not always bad is wrong. It doesn't always result in a noticeable problem, but the fact that we can't know what the result might be is automatically bad.Uncommon
@Uncommon You're severely mistaken. "Undefined behavior" in the context of ISO C++ (what we're discussing here) means "not defined by ISO C++" not "not defined by nobody at all". Compilers provide useful, documented extensions which fall under ISO C++ undefined behavior, and which programmers use to their advantage.Polysynthetic
