C: Best way to go to a known line of a file

I have a file that I'd like to iterate over without processing the lines I pass through. What I am looking for is the best way to go to a given line of a text file. For example, storing the current line in a variable seems useless until I reach the predetermined line.

Example:

file.txt

foo
fooo
fo
here

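Normally, in order to get here, I would have done something like:

FILE *file = fopen("file.txt", "r");
if (file == NULL) {
    perror("Error when opening file");
    return NULL;
}
/* static, so the returned pointer stays valid after the function returns */
static char currentLine[100];
while (fgets(currentLine, sizeof currentLine, file)) {
    if (strstr(currentLine, "here") != NULL) {
        fclose(file);
        return currentLine;
    }
}
fclose(file);
return NULL;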

But fgets will have to read three full lines uselessly, and currentLine will have to store foo, fooo and fo.

Is there a better way to do this, knowing that here is on line 4? Something like a goto, but for files?

Heterogeneous answered 29/5, 2017 at 14:10 Comment(1)
For ordinary files, the only way to do better is to construct and maintain your own index of line numbers and fseek offsets. (This is straightforward, but a bit of work.) - Triumph

You cannot directly access a given line of a text file (unless all lines have the same size in bytes; with UTF-8 everywhere, a Unicode character takes a variable number of bytes, 1 to 4, and in most cases line lengths differ from one line to the next). So you cannot use fseek, because you don't know the file offset in advance.

However (at least on Linux systems), lines end with \n (the newline character). So you could read byte by byte and count the newlines:

int c= EOF;
int linecount=1;
while ((c=fgetc(file)) != EOF) {
  if (c=='\n')
    linecount++;
}

You then don't need to store the entire line.

So you could reach line 45 this way (using while (linecount < 45 && (c = fgetc(file)) != EOF) ...) and only then read entire lines with fgets or, better yet, getline(3) on POSIX systems. Testing linecount first keeps the loop from consuming the first character of line 45. Notice that the implementation of fgets or getline is likely to be built on top of fgetc, or at least to share some code with it. Remember that <stdio.h> does buffered I/O; see setvbuf(3) and related functions.
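The skip-then-read idea can be sketched as a small helper. This is only an illustration: the name skip_to_line and its return convention are assumptions, not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: advances `fp` to the start of line `target` (1-based)
 * by counting newline characters, without storing the skipped lines.
 * Returns 1 if target-1 newlines were consumed, 0 if EOF came first. */
static int skip_to_line(FILE *fp, long target)
{
    assert(fp != NULL && target >= 1);
    long line = 1;
    int c;
    /* Check the counter first so no byte of the target line is consumed. */
    while (line < target && (c = fgetc(fp)) != EOF) {
        if (c == '\n')
            line++;
    }
    return line == target;
}
```

After a successful call, an ordinary fgets(buf, sizeof buf, fp) returns the requested line. If the file ends right after the last newline, fgets can still return NULL, so its result should be checked as well.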


Another way would be to read the file in two passes. A first pass stores the offset (obtained with ftell(3)) of every line start in some efficient data structure (a vector, a hash table, a tree...). A second pass uses that data structure to retrieve the offset of the wanted line start, then calls fseek(3) with that offset.
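A minimal sketch of the two-pass approach, assuming a growable array is a good enough "vector". The LineIndex type and both function names are hypothetical, chosen for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical index type: offsets[i] is the ftell() position of line i+1. */
typedef struct {
    long  *offsets;
    size_t count;
} LineIndex;

/* First pass: record where every line starts.
 * Returns 0 on success, -1 on allocation or I/O failure. */
static int line_index_build(FILE *fp, LineIndex *idx)
{
    assert(fp != NULL && idx != NULL);
    size_t cap = 16;
    int ok = 1, at_line_start = 1;

    idx->count = 0;
    idx->offsets = malloc(cap * sizeof *idx->offsets);
    if (idx->offsets == NULL)
        return -1;

    rewind(fp);
    for (;;) {
        /* Only ask for the position when a new line is about to begin. */
        long pos = at_line_start ? ftell(fp) : 0;
        int c = fgetc(fp);
        if (c == EOF)
            break;
        if (at_line_start) {
            if (pos < 0) { ok = 0; break; }
            if (idx->count == cap) {
                long *tmp = realloc(idx->offsets, 2 * cap * sizeof *tmp);
                if (tmp == NULL) { ok = 0; break; }
                idx->offsets = tmp;
                cap *= 2;
            }
            idx->offsets[idx->count++] = pos;
        }
        at_line_start = (c == '\n');
    }
    if (!ok || ferror(fp)) {
        free(idx->offsets);
        idx->offsets = NULL;
        idx->count = 0;
        return -1;
    }
    return 0;
}

/* Second pass: jump straight to line `line` (1-based).
 * Returns 0 on success, -1 if the line does not exist or fseek fails. */
static int line_index_seek(FILE *fp, const LineIndex *idx, size_t line)
{
    assert(fp != NULL && idx != NULL);
    if (line < 1 || line > idx->count)
        return -1;
    return fseek(fp, idx->offsets[line - 1], SEEK_SET) == 0 ? 0 : -1;
}
```

The index only pays off when the same file is queried for several lines; for a single lookup, building it costs as much as simply skipping to the line. It also has to be rebuilt whenever the file changes.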


A third way, POSIX-specific, would be to memory-map the file using mmap(2) into your virtual address space (this works well for files that are not too huge, e.g. less than a few gigabytes). With care (you might need to mmap an extra ending page to ensure the data is zero-byte terminated) you would then be able to use strchr(3) with '\n' to find the line boundaries.
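A rough POSIX-only sketch of the mmap approach follows. Note one deliberate swap: it uses memchr(3) with an explicit length instead of strchr(3), so the mapping does not need to be zero-terminated. The names map_file and find_line are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: maps `path` read-only and stores its size in *size.
 * Returns NULL on failure or if the file is empty (mmap rejects length 0). */
static const char *map_file(const char *path, size_t *size)
{
    assert(path != NULL && size != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *m = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (m == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return m;
}

/* Hypothetical helper: returns a pointer to the start of line `target`
 * (1-based) inside data[0..size), or NULL if there is no such line.
 * *len receives the line length, excluding the newline. */
static const char *find_line(const char *data, size_t size, size_t target,
                             size_t *len)
{
    assert(data != NULL && target >= 1 && len != NULL);
    const char *p = data, *end = data + size;
    for (size_t line = 1; line < target; line++) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        if (nl == NULL)
            return NULL;
        p = nl + 1;
    }
    if (p == end)
        return NULL;
    const char *nl = memchr(p, '\n', (size_t)(end - p));
    *len = (size_t)((nl != NULL ? nl : end) - p);
    return p;
}
```

The earlier lines still have to be scanned, but no line is ever copied into a buffer. When finished, release the mapping with munmap((void *)data, size).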

In some cases, you might consider parsing your text file line by line (using fgets appropriately, or getline on POSIX systems, or generating your parser with flex and bison) and storing each line in a relational database (such as PostgreSQL or SQLite).

PS. The notion of a line (and the end-of-line mark) varies from one OS to the next. On Linux the end of line is a \n character. On Windows, lines end with \r\n, etc.
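If a file written on Windows is read on Linux, or opened in binary mode, each line returned by fgets may end in \r\n rather than \n. One way to cope, sketched here with a hypothetical helper named chomp, is to cut the line at the first end-of-line character:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: truncates `line` at the first '\r' or '\n', so both
 * "here\n" and "here\r\n" become "here". Returns the new length. */
static size_t chomp(char *line)
{
    assert(line != NULL);
    size_t n = strcspn(line, "\r\n");
    line[n] = '\0';
    return n;
}
```

Counting \n characters to find a line number is unaffected by this; only the content of the line that is finally read needs the extra \r removed.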

Topple answered 29/5, 2017 at 14:20 Comment(3)
Technically on Windows, lines end with the \n character too... they just have an \r before it. The point is, counting \ns will work on Windows too. - Sphene
Is there any advantage to iterating character by character instead of line by line? - Heterogeneous
@Badda: how would you iterate line by line? - Topple

Since you do not know the length of every line, no: you will have to go through the previous lines.

If every line had the same known length, you could compute the byte offset of the target line and move the file pointer straight to it with fseek().
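For the fixed-length case, the offset is plain arithmetic. The sketch below assumes every line, newline included, occupies exactly line_size bytes; the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: every line, including its newline, is exactly `line_size`
 * bytes long. Moves `fp` to the start of line `target` (1-based).
 * Returns 0 on success and nonzero on failure, like fseek(). */
static int seek_fixed_line(FILE *fp, long target, long line_size)
{
    assert(fp != NULL && target >= 1 && line_size >= 1);
    return fseek(fp, (target - 1) * line_size, SEEK_SET);
}
```

Computed offsets like this are only guaranteed to be meaningful for streams opened in binary mode ("rb"). On systems that translate line endings in text mode, the byte count per line has to include the \r as well.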

Kun answered 29/5, 2017 at 14:12 Comment(0)

A FILE * in C is a stream of chars. In a seekable file, you can address these chars using the file pointer with fseek(). But apart from that, there are no "special characters" in files; a newline is just another normal character.

So in short, no, you can't jump directly to a line of a text file, as long as you don't know the lengths of the lines in advance.

This model in C corresponds to the files provided by typical operating systems. If you think about it, to know the starting points of individual lines, your file system would have to store this information somewhere. This would mean treating text files specially.

What you can do however is just count the lines instead of pattern matching, something like this:

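#include <stdio.h>

int main(void)
{
    char linebuf[1024];
    FILE *input = fopen("seekline.c", "r");
    if (input == NULL)
    {
        perror("seekline.c");
        return 1;
    }
    int lineno = 0;
    char *line;
    /* Read line by line, counting as we go, until line 4 is reached. */
    while ((line = fgets(linebuf, sizeof linebuf, input)) != NULL)
    {
        ++lineno;
        if (lineno == 4)
        {
            fputs("4: ", stdout);
            fputs(line, stdout);
            break;
        }
    }
    fclose(input);
    return 0;
}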
Selfaggrandizement answered 29/5, 2017 at 14:25 Comment(0)

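If you don't know the length of each line, you have to go through all of them. But if you know the number of the line you want to stop at, you can do this:

/* line, file and lineNumber are assumed to be declared by the caller */
int count = 1;       /* number of the line fgets is about to return */
bool found = false;  /* needs <stdbool.h> */
while (!found && fgets(line, sizeof line, file) != NULL) /* read a line */
{
    if (count == lineNumber)
    {
        /* you arrived at the line; it is now stored in line */
        /* if you return from here, first close the file with fclose(file) */
        found = true;
    }
    else
    {
        count++;
    }
}

At least this avoids all those calls to strstr.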

Chlorophyll answered 29/5, 2017 at 14:20 Comment(0)
