fseek passing negative offset and SEEK_CUR
I am seeing very poor performance from fseek() in a very big file. Every time I call fseek, I need to move the file position backward by 100 bytes:

  fseek(fp, -100, SEEK_CUR);

Before, I was doing this:

  fseek(fp, (index)*100, SEEK_SET); // which does basically the same thing...

My question is how fseek moves the pointer through the file and sets it at a specific position.

I thought that it simply takes the file pointer and moves it backward, but now I think that what it really does is:

  • get the current position (cp)

  • add the (negative) offset (p = cp + offset)

  • move the file pointer from the beginning of the file to that position (fseek(fp, p, SEEK_SET))

Presser answered 14/8, 2015 at 17:48 Comment(6)
What was the open mode? Was it "a+b" or "r+b"? My guess is it was not "a+b", not "append".Ainu
The performance of the underlying filesystem may be relevant.Heteronym
For other reasons I cannot open the file with append, so I need to open it with "r+b".Presser
Is fseek working in this mode?Solfatara
The position computation you describe is as the standard specifies, but the standard does not specify how the implementation will set the stream to that position. I would be surprised to find that an implementation did as you suggest, but it is permissible.Nolde
Generally speaking, you should think twice before relying on fseek(). Sometimes it's the right tool for the job, but be aware that not all streams are seekable at all. If you're always looking / moving backward exactly 100 bytes, then consider buffering at least that much prior data in memory so that you don't have to seek backward.Nolde
First, what operating system are you using? If it's Linux, run your application under strace to see what system calls it's actually making.

Second, fopen()/fseek()/fread() are the wrong tools for this access pattern. Those calls buffer file reads by reading ahead, which does you no good here. You fseek() to offset X, so whatever data is buffered becomes useless; you fread() 100 bytes, and the buffered fread() reads more - probably 8 kB. You're likely reading almost every byte of the file over 80 times. You could use setbuf() or setvbuf() to disable buffering, but then you'd be doing 100-byte reads while going through the file backwards. That should be faster, but not as fast as you can go.

To do this about as fast as you can (without getting into multithreaded and/or asynchronous IO):

  1. Use open()/pread(). You don't need to seek - pread() reads directly from an arbitrary offset.

  2. Read larger chunks - say 8192 x 100 bytes, or even larger. Work backwards just as before, but do the buffering yourself, and align your reads so each starts at a file offset that is a multiple of the large read size - the first (rearmost) read will probably return fewer than 819,200 bytes. Process the last 100 bytes in your buffer first, then work backwards through the buffer. When you've processed the first 100 bytes in the buffer, use pread() to read the previous 819,200 bytes (or even larger) from the file.

  3. Use direct IO if available. The file system might try to "optimize" your access by reading ahead and placing data into the page cache - including data you've already processed. So bypass the page cache if possible (not all operating systems support direct IO, and not all filesystems on OSes that do support it implement it).

Something like this:

#define _GNU_SOURCE   /* for O_DIRECT on Linux */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define DATA_SIZE 100
#define NUM_CHUNKS (32UL * 1024UL)
#define READ_SIZE ( ( size_t ) DATA_SIZE * NUM_CHUNKS )

void processBuffer( const char *buffer, ssize_t bytes )
{
    if ( bytes <= 0 ) return;
    // process a buffer backwards...
}

void processFile( const char *filename )
{
    struct stat sb;
    // get page-aligned buffer for direct IO
    char *buffer = valloc( READ_SIZE );
    // Linux-style direct IO
    int fd = open( filename, O_RDONLY | O_DIRECT );
    fstat( fd, &sb );    
    // how many read operations?
    // use lldiv() to get quotient and remainder in one op
    lldiv_t numReads = lldiv( sb.st_size, READ_SIZE );
    if ( numReads.rem )
    {
        numReads.quot++;
    }
    while ( numReads.quot > 0 )
    {
        numReads.quot--;
        ssize_t bytesRead = pread( fd, buffer,
            READ_SIZE, numReads.quot * READ_SIZE );
        processBuffer( buffer, bytesRead );
    }
    free( buffer );
    close( fd );
}

You'll need to add error handling to that.

Fearful answered 14/8, 2015 at 20:57 Comment(0)
At the user-application level, you think of a file as a big block of memory, and of moving the file pointer as a simple memory operation (incrementing or decrementing a pointer to reach the desired offset in the file).

But at the runtime-library and OS level, things are completely different. The runtime library code that handles files doesn't load the entire content of the file into memory. Maybe the file is very large, maybe you only need to read a couple of bytes from it - there are many reasons.

The runtime library (and also the file cache managed by the OS) loads only some of the file's data into a memory buffer. You work with that data (read it, write it), and when you want to access information that is not already in the buffer, the file-management code loads it for you: maybe it enlarges the buffer, maybe it writes the buffer back to the file (if it was modified), or it just discards the previously loaded data (if it was not modified) and loads another chunk of the file into the buffer.

When you use fseek() to jump to a different part of the file, the file pointer usually lands in an area that is not in memory yet. I suppose the implementation loads data starting from the new position of the file pointer (at the OS level, the file cache loads data in multiples of disk blocks). Because you run backwards through the file, the data at the new position of the file pointer is almost never already loaded in memory. That triggers a disk access, and this is what makes it slow.

I think the best solution for you is to use the functions provided by the OS to map the file into memory. Read about mmap() on Linux (and maybe on OS X) or File Mapping on Windows. It could help, but the improvement may not be significant because of the particular access pattern you use: most programs read a file from beginning to end, and the code that deals with files and disk access is optimized for that pattern.

Tomaso answered 14/8, 2015 at 18:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.