Fastest way to read every 30th byte of large binary file?
What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read there are performance problems with fseek because of I/O buffers, but I don't want to read 2-3 GB of data into memory before grabbing every 30th byte either.

Valentinevalentino answered 6/3, 2010 at 23:28 Comment(0)
P
17

Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness, (b) desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.

#include <stdio.h>
#include <stdlib.h>

const long long size = 1024LL*1024*MEGS;
const int step = 32;

int main() {
    FILE *in = fopen("/cygdrive/c/rand1.data", "rb");
    int total = 0;
    #if SEEK
        long long i = 0;
        char buf[1];
        while (i < size) {
            fread(buf, 1, 1, in);
            total += (unsigned char) buf[0];
            fseek(in, step - 1, SEEK_CUR);
            i += step;
        }
    #endif
    #ifdef BUFSZ
        long long i = 0;
        char buf[BUFSZ];
        while (i < size) {
            fread(buf, BUFSZ, 1, in);
            i += BUFSZ;
            for (int j = 0; j < BUFSZ; j += step) 
                total += (unsigned char) buf[j];
        }
    #endif
    printf("%d\n", total);
}

Results:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m1.391s
user    0m0.030s
sys     0m0.030s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.172s
user    0m0.108s
sys     0m0.046s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m0.031s
user    0m0.030s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.141s
user    0m0.140s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2
83595817

real    0m20.797s
user    0m1.733s
sys     0m9.140s

Summary:

I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.

At this point I'm pulling fseek out of the ring so the other two can continue:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m33.437s
user    0m0.749s
sys     0m1.562s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.078s
user    0m5.030s
sys     0m0.484s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.141s
user    0m0.280s
sys     0m0.500s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.094s
user    0m4.968s
sys     0m0.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.140s
user    0m0.171s
sys     0m0.640s

1000MB of data also appears to be substantially cached. A 32KB buffer is 6 times faster than a 32 byte buffer. But the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m25.515s
user    0m5.155s
sys     0m12.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2
-938074821

real    3m59.015s
user    1m11.061s
sys     0m10.999s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m42.423s
user    0m5.577s
sys     0m14.484s

Ignore the first of those three; it benefited from the first 1000MB of the file already being in RAM.

Now, the version with the 32KB is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32 byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.

Moral of the story: standard I/O buffering doesn't cut it in my implementation: the performance of fseek is atrocious, as the questioner says. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU was busier.

incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done a good job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead rather than standard I/O buffering, because if it were the latter then fseek would be better than it is. Actually, it could be that the standard I/O is doing the read-ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).

Plethoric answered 6/3, 2010 at 23:28 Comment(5)
Very cool. But fread is not optimized for 1 char. Can you try fgetc?Devilment
fgetc vs. fread makes no difference that I can detect in 4 test runs of each (with MEGS=20, data pre-loaded). Range of results is 19.4s to 21.2s, with the best and worst both using fgetc. I expect other people's mileage to vary - I don't know to what extent cygwin+gcc is using unmodified glibc, and I don't know whether there's some peculiarity of Windows responsible for the performance hit on fseek. You'd think that a forward seek of 31 bytes "should" most of the time just increment an offset in the FILE*, but apparently not.Plethoric
I traced it; the sucker makes a system call on every fseek. What idiots! I changed your program to use Phong Vo's sfio library, and at that point the differences are still there but they are reasonably small. Thanks for posting such a useful program. Oh, and +1 :-)Devilment
Thanks, Norman. Number 1 rule of performance questions: it's usually really easy to write a half-assed benchmark, and a half-assed benchmark is usually enough to reveal serious performance disasters :-)Plethoric
Phong Vo's sfio library can be found at github.com/ellson/graphviz/tree/master/lib/sfio (among other places, but some earlier links here have broken).Selah
W
24

What I'd suggest is that you create a buffer of a few thousand bytes, read every 30th byte from it, reload the buffer with the next few thousand bytes, and continue until you reach the end of the file. That way the amount of data read into memory is limited, and you also don't have to read from the file as often. You'll find that the larger the buffer you create, the faster the read will be.

Edit: Actually, as suggested below, you'll probably want to make your buffer a few hundred KB, not a few thousand bytes (like I said - bigger buffer = faster file read).

Warehouseman answered 6/3, 2010 at 23:31 Comment(5)
+1 -- was just writing almost exactly the same thing -- except I recommended a few hundred kilobytes per chunk.Interphone
Yeah, that's probably better. I mean if the file is that large, he's obviously in an environment where he can afford a buffer bigger than a few thousand bytes :) (edited answer)Warehouseman
I predict that compared with the default buffering strategy used in the standard I/O library, the benefits of this scheme won't even be measurable (for a program reading every 30th byte). I would be pleased to see measurements proving me wrong.Devilment
@Norman Ramsey: I predict otherwise. Test currently running, I'll post a CW answer shortly.Plethoric
On many platforms, making your buffer size / read size match the sector size of the disk results in the fastest reads.Altheta
H
10

Well, you can read a byte and then seek 29 bytes in a loop. But the IO subsystem has to read from the file by sectors, which are typically 512 bytes in size, so it will still end up reading the whole file.

In the long run, it will be faster to just read the whole file in chunks that are a multiple of your step size, and then just look in the buffer. You'll make your life a bit simpler if you make sure that your buffer size is a multiple of 30, and you make the fileio subsystem's life easier if it's a multiple of 512.

while (still more file to read)
{ 
   char buf[30 * 512];
   /* size=1, nmemb=sizeof(buf): cread is a byte count, and a short
      final chunk is still returned rather than dropped */
   int cread = fread (buf, 1, sizeof(buf), fd);
   for (int ii = 0; ii < cread; ii += 30)
   {
      /* process buf[ii] here */
   }
}

This may look inefficient, but it will work out to be faster than trying to read in 30 byte chunks.

By the way. If you are running on Windows, and willing to be OS specific, you really can't beat the performance of memory mapped files. How to scan through really huge files on disk?

Hangup answered 6/3, 2010 at 23:39 Comment(3)
It's an important point that the sector size means that the OS will be reading the entire file regardless.Fullblooded
Windows isn't the only platform with memory-mapped files, of course.Pomona
@Ken: I have no first hand knowledge of how mmap performs relative to fread, and the sample code I link to is Windows only.Hangup
B
9

If you're willing to break out of ANSI-C and use OS specific calls, I'd recommend using memory mapped files. This is the Posix version (Windows has its own OS specific calls):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define MAPSIZE 4096
int fd = open(file, O_RDONLY);
struct stat stbuf;
fstat(fd, &stbuf);

char *addr = 0;
off_t last_mapped_offset = -1;
off_t idx = 0;
while (idx < stbuf.st_size)
{
    if (last_mapped_offset != (idx / MAPSIZE))
    {
        if (addr)
            munmap(addr, MAPSIZE);

        last_mapped_offset = idx / MAPSIZE; 

        /* mmap file offsets must be page-aligned, hence the multiply */
        addr = mmap(0, MAPSIZE, PROT_READ, MAP_PRIVATE, fd,
                    last_mapped_offset * MAPSIZE);
    }

    /* process the byte at *(addr + (idx % MAPSIZE)) here */

    idx += 30;

}

munmap(addr, MAPSIZE);
close(fd);
Bur answered 6/3, 2010 at 23:43 Comment(2)
Would typical POSIX-based OSes still perform read-ahead when you only mmap() one page at a time and never call madvise()?Cecelia
By the way, mmap() uses SIGBUS to report errors that occur after the file is mapped. This is much harder to deal with correctly than errors from read() or fread().Cecelia
D
3

The whole purpose of a buffered I/O library is to free you from such concerns. If you have to read every 30th byte, the OS is going to wind up reading the whole file, because the OS reads in larger chunks. Here are your options, from highest performance to lowest performance:

  • If you have a large address space (i.e., you're running a 64-bit OS on 64-bit hardware), then using memory-mapped IO (mmap on POSIX systems) will save you the cost of having the OS copy data from kernel space to user space. This savings could be significant.

  • As shown by the detailed notes below (and thanks to Steve Jessop for the benchmark), if you care about I/O performance you should download Phong Vo's sfio library from the AT&T Advanced Software Technology group. It is safer, better designed, and faster than C's standard I/O library. On programs that use fseek a lot, it is dramatically faster: up to seven times faster on a simple microbenchmark.

  • Just relax and use fseek and fgetc, which are designed and implemented exactly to solve your problem.

If you take this problem seriously, you should measure all three alternatives. Steve Jessop and I showed that using fseek is slower, and if you are using the GNU C library, fseek is a lot slower. You should measure mmap; it may be the fastest of all.


Addendum: You want to look into your filesystem and make sure it can pull 2–3 GB off the disk quickly. XFS may beat ext2, for example. Of course, if you're stuck with NTFS or HFS+, it's just going to be slow.

Shocking results just in

I repeated Steve Jessop's measurements on Linux. The GNU C library makes a system call at every fseek. Unless POSIX requires this for some reason, it's insane. I could chew up a bunch of ones and zeroes and puke a better buffered I/O library than that. Anyway, costs go up by about a factor of 20, much of which is spent in the kernel. If you use fgetc instead of fread to read single bytes, you can save about 20% on small benchmarks.

Less shocking results with a decent I/O library

I did the experiment again, this time using Phong Vo's sfio library. Reading 200MB takes

  • 0.15s without using fseek (BUFSZ is 30k)
  • 0.57s using fseek

Repeated measurements show that without fseek, using sfio still shaves about 10% off the run time, but the run times are very noisy (almost all time is spent in the OS).

On this machine (laptop) I don't have enough free disk space to run with a file that won't fit in the disk cache, but I'm willing to draw these conclusions:

  • Using a sensible I/O library, fseek is more expensive, but not enough more expensive to make a big difference (4 seconds if all you do is the I/O).

  • The GNU project does not provide a sensible I/O library. As is too often the case, the GNU software sucks.

Conclusion: if you want fast I/O, your first move should be to replace the GNU I/O library with the AT&T sfio library. Other effects are likely to be small by comparison.

Devilment answered 7/3, 2010 at 2:52 Comment(3)
Prepare to be shocked, fseek causes a big slowdown on my machine (NTFS, Windows XP, cygwin).Plethoric
@Steve: I'm pretty skeptical about cygwin. I'd love to know how performance compares with the Microsoft C compiler and library (identical code).Devilment
"I could chew up a bunch of ones and zeroes and puke a better buffered I/O library than that." It's open source. Re-write it yourself and submit it; if it gets rejected for some big reason (e.g. POSIX requires it), then you'll know why the GNU library performs so badly. If it gets accepted, then you'll have single-handedly made a huge improvement to Linux's default I/O libraries.Weatherproof
P
1

You almost certainly don't need to worry about it. The runtime may well buffer the last block that it read for each file handle. Even if it doesn't, the operating system is caching file accesses for you.

That said, if you read a block at a time, you do save on call overheads to the fseek and fread functions. The bigger the block you read at once, the more you save on call overheads - though other costs obviously start making themselves felt beyond a certain point.

Premises answered 7/3, 2010 at 0:21 Comment(0)
B
0

If you are reading data from a hard disk with a spinning platter, the answer is: read the whole file sequentially using a large buffer and discard the portions in memory you don't want.

The smallest unit of access possible on a standard hard disk drive is the sector. Sector sizes for all common spinning disk drives are many times larger than 30 bytes. This means the hard disk controller must access each and every sector anyway, regardless of what the request from the host looks like. There is no low-level magic possible to change this.

Even if this were not the case and you could read individual bytes, there is a huge premium on seek vs. sequential read operations. The best possible case is still the same as a sequential read. In the real world I wouldn't be surprised if signaling overhead would preclude such schemes from working even with a massive command buffer.

Boorer answered 19/6, 2010 at 22:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.