C fgets versus fgetc for reading line

5

13

I need to read a line of text (terminated by a newline) without making assumptions about its length. So I now face two possibilities:

  • Use fgets and check each time if the last character is a newline and continuously append to a buffer
  • Read each character using fgetc and occasionally realloc the buffer

Intuition tells me the fgetc variant might be slower, but then again I don't see how fgets can do it without examining every character (also my intuition isn't always that good). The lines are quite large so the performance is important.

I would like to know the pros and cons of each approach. Thank you in advance.

Uretic answered 3/3, 2011 at 20:58 Comment(0)
4

I suggest using fgets() coupled with dynamic memory allocation - or you can investigate the interface to getline() that is in the POSIX 2008 standard and available on more recent Linux machines. That does the memory allocation stuff for you. You need to keep tabs on the buffer length as well as its address - so you might even create yourself a structure to handle the information.

Although fgetc() also works, it is marginally fiddlier - but only marginally so. Under the covers, it uses the same mechanisms as fgets(). The internals of fgets() may be able to use faster operations - analogous to strchr() - that are not available when you call fgetc() directly.
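A minimal sketch of the fgets()-plus-realloc approach, keeping the buffer address and capacity together in a structure as suggested above. The names `struct line_buf` and `read_line_fgets`, the starting size of 128 and the doubling policy are illustrative assumptions, not part of any library. It does not handle embedded null bytes, since fgets() does not report how many bytes it stored.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: buffer address and capacity kept together. */
struct line_buf {
    char  *data;  /* malloc'd storage, or NULL before first use */
    size_t cap;   /* bytes allocated for data */
};

/*
 * Read one line (including the trailing '\n', if present) from fp into lb.
 * Returns lb->data on success, or NULL on EOF with nothing read, on a read
 * error, or on allocation failure. The buffer is reused across calls.
 * Sketch only: assumes the capacity stays below INT_MAX.
 */
char *read_line_fgets(struct line_buf *lb, FILE *fp)
{
    size_t len = 0;

    if (lb->data == NULL) {
        lb->cap = 128;                       /* arbitrary starting size */
        lb->data = malloc(lb->cap);
        if (lb->data == NULL)
            return NULL;
    }

    /* Append each chunk after what has been read so far. */
    while (fgets(lb->data + len, (int)(lb->cap - len), fp) != NULL) {
        len += strlen(lb->data + len);
        if (len > 0 && lb->data[len - 1] == '\n')
            return lb->data;                 /* complete line */

        /* No newline yet: the buffer was full or EOF is near. Grow and retry. */
        size_t new_cap = lb->cap * 2;
        char *tmp = realloc(lb->data, new_cap);
        if (tmp == NULL)
            return NULL;
        lb->data = tmp;
        lb->cap = new_cap;
    }

    /* fgets() returned NULL: hand back a final unterminated line, if any. */
    if (len > 0 && !ferror(fp))
        return lb->data;
    return NULL;
}
```

A caller would start with `struct line_buf lb = {NULL, 0};`, call `read_line_fgets(&lb, fp)` in a loop until it returns NULL, and then `free(lb.data)`.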

Despotic answered 3/3, 2011 at 21:5 Comment(3)
The one limitation when implementing a getline function with fgets is that it is impossible to handle null bytes and files not ending with a newline character at the same time. If fgets encounters an EOF condition and returns without a newline character, you can only assume that the string ends on the first null byte. (In other cases, you can do strchr(buf, '\n') to find out where the reading stopped—or if there is no '\n', you need to realloc.)Ostiary
If the file contains null bytes, it isn't a text file. (It might be a wide character file, but then you need to use wide character I/O functions to read it.) And fgets() is not designed to handle files that contain null bytes — precisely because it does not give a reliable indication of how many bytes it read. If your data file contains null bytes, you should (probably) not be using fgets() to read it.Despotic
linux.die.net/man/3/getline (Return Value section) seems to suggest that it might be a useful thing. That's where I got the idea, although I guess I agree with you. Now that I think about it, maybe that's only mentioned there because it could be useful when using a delimiter other than '\n'.Ostiary
2

Does your environment provide the getline(3) function? If so, I'd say go for that.

The big advantage I see is that it allocates the buffer itself (if you want), and will realloc() the buffer you pass in if it's too small. (This means the buffer you pass in must have come from malloc(), or be a null pointer.)

This gets rid of some of the pain of fgets/fgetc, and you can hope that whoever wrote the C library that implements it took care of making it efficient.

Bonus: the man page on Linux has a nice example of how to use it in an efficient manner.
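For reference, a small sketch of the usual getline() calling pattern. This assumes a POSIX.1-2008 environment (getline() is not part of ISO C), and the helper name `longest_line` is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* expose getline() in <stdio.h> */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>           /* ssize_t */

/*
 * Hypothetical helper: returns the length in bytes of the longest line in fp
 * (counting the trailing '\n' when present) and stores the number of lines
 * read in *line_count.
 */
size_t longest_line(FILE *fp, size_t *line_count)
{
    char   *line = NULL;  /* NULL with cap 0 asks getline() to allocate */
    size_t  cap = 0;      /* getline() updates this when it reallocates */
    ssize_t n;
    size_t  longest = 0;

    *line_count = 0;
    while ((n = getline(&line, &cap, fp)) != -1) {
        if ((size_t)n > longest)
            longest = (size_t)n;
        (*line_count)++;
    }

    free(line);           /* one free() covers every reallocation */
    return longest;
}
```

Because getline() returns the number of bytes read, the caller does not need strlen() and can tell where a line really ends even if it contains null bytes.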

Wirth answered 3/3, 2011 at 21:5 Comment(3)
Unfortunately (I am sorry I did not mention this in the question) I need to use standard stuff :-( The getline function sure sounds attractive.Uretic
Well, it is standard (for some definition of standard). See The Open Group Base Specifications Issue 7, aka "IEEE Std 1003.1™-2008" aka "POSIX C 2008". But standard != widespread, unfortunately. I feel your pain. getline is sexy :-)Wirth
getline() functionality is good; the name getline() is an atrocious intrusion on the user namespace, pre-empting one of the more widely used function names (for example, see K&R 1 and 2) with a wide range of diverse interfaces. It was an appalling decision to use that name; it was an excellent decision to provide the functionality. The only surprising thing is the omission of the ability to handle CRLF line endings; the related getdelim() function can handle CR or LF or NUL line endings, but cannot handle CRLF line endings.Despotic
2

If performance matters much to you, you generally want to call getc instead of fgetc. The standard explicitly allows getc to be implemented as a macro (one that may evaluate its stream argument more than once), which avoids function call overhead.

Past that, the main thing to deal with is probably your strategy for allocating the buffer. Most people use fixed increments (e.g., when/if we run out of space, allocate another 128 bytes). I'd advise instead using a constant factor, so if you run out of space, allocate a buffer that's, say, 1.5 times the previous size. That keeps the number of reallocations logarithmic in the line length rather than linear.

Especially when getc is implemented as a macro, the difference between getc and fgets is usually quite minimal, so you're best off concentrating on other issues.
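A rough sketch of the getc() loop with geometric growth described above. The function name `read_line_getc`, the initial capacity of 64 and the exact growth expression are assumptions chosen for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read one line with getc(), growing the buffer by
 * about 1.5x whenever it fills up. Returns a malloc'd, null-terminated
 * string without the newline (the caller frees it), or NULL on EOF with
 * nothing read or on allocation failure. If out_len is not NULL, it
 * receives the number of bytes stored.
 */
char *read_line_getc(FILE *fp, size_t *out_len)
{
    size_t cap = 64;                 /* arbitrary starting size */
    size_t len = 0;
    char *buf = malloc(cap);
    int c;                           /* int, so EOF is distinguishable */

    if (buf == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 >= cap) {        /* keep one byte for the '\0' */
            size_t new_cap = cap + cap / 2;
            char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        buf[len++] = (char)c;
    }

    if (c == EOF && len == 0) {      /* nothing left to read */
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

Note that `fp` is passed to getc() as a plain variable, so it does not matter if a macro implementation evaluates it more than once.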

Altman answered 3/3, 2011 at 21:11 Comment(0)
0

If you can set a maximum line length, even a large one, then one fgets would do the trick. If not, multiple fgets calls will still be faster than multiple fgetc calls because the overhead of the latter will be greater.

A better answer, though, is that it's not worth worrying about the performance difference until and unless you have to. If fgetc is fast enough, what does it matter?

Joint answered 3/3, 2011 at 21:2 Comment(1)
Also note that getc is usually implemented as a macro and is therefore faster than fgetc, and should be used as long as you are careful (the stream argument must not be an expression with side effects, because a macro may evaluate it more than once).Ostiary
0

I would allocate a large buffer and then use fgets, checking, reallocing and repeating if you haven't read to the end of the line.

Each time you read (either via fgetc or fgets) you are making a system call, which takes time. You want to minimize the number of times that happens, so calling fgets fewer times and iterating in memory is faster.

If you are reading from a file, mmap()ing in the file is another option.

Draconian answered 3/3, 2011 at 21:10 Comment(2)
I have to contradict you on the system call part: the stdio library does buffering, so I don't think every function call will be translated into a system call. I may be wrong.Uretic
This is true, but with fgets he will have finer-grained control. If he has some idea of how long the lines are on average, he can optimize the buffer lengths, whereas fgetc will buffer but be completely agnostic about ideal buffer lengths.Draconian
