Buffered reading from stdin using fread in C
Asked Answered
M

11

10

I am trying to efficiently read from the stdin by using setvbuf in `_IOFBF~ mode. I am new to buffering. I am looking for working examples.

The input begins with two integers (n,k). The next n lines of input contain 1 integer. The aim is to print how many integers are divisible by k.

#define BUFSIZE 32
int main(){
  int n, k, tmp, ans=0, i, j;
  char buf[BUFSIZE+1] = {'0'};
  setvbuf(stdin, (char*)NULL, _IONBF, 0);
  scanf("%d%d\n", &n, &k);
  while(n>0 && fread(buf, (size_t)1, (size_t)BUFSIZE, stdin)){
    i=0; j=0;
    while(n>0 && sscanf(buf+j, "%d%n", &tmp, &i)){
    //printf("tmp %d - scan %d\n",tmp,i); //for debugging
      if(tmp%k==0)  ++ans;
      j += i; //increment the position where sscanf should read from
      --n;
    }
  }
  printf("%d", ans);
  return 0;
}

The problem is if number is at the boundary, the buffer buf will read 23 from 2354\n, when it should have either read 2354 (which it cannot) or nothing at all.

How can I solve this issue?


Edit
Resolved now (with analysis).

Edit
Complete Problem Specification

Michaels answered 3/3, 2010 at 12:48 Comment(0)
M
2

Version 1 : Using getchar_unlocked as suggested by R Samuel Klatchko (see comments)

#define BUFSIZE 32*1024
int main(){
  int lines, number=0, dividend, ans=0;
  char c;
  setvbuf(stdin, (char*)NULL, _IOFBF, 0);// full buffering mode
  scanf("%d%d\n", &lines, ÷nd);
  while(lines>0){
    c = getchar_unlocked();
    //parse the number using characters
    //each number is on a separate line
    if(c=='\n'){
      if(number % dividend == 0)    ans += 1;
      lines -= 1;
      number = 0;
    }
    else
      number = c - '0' + 10*number;
  }

  printf("%d are divisible by %d \n", ans, dividend);
  return 0;
}

Version 2: Using fread to read a block and parsing number from it.

#define BUFSIZE 32*1024
int main(){
int lines, number=0, dividend, ans=0, i, chars_read;
char buf[BUFSIZE+1] = {0}; //initialise all elements to 0
scanf("%d%d\n",&lines, &dividend);

while((chars_read = fread(buf, 1, BUFSIZE, stdin)) > 0){
  //read the chars from buf
  for(i=0; i < chars_read; i++){
    //parse the number using characters
    //each number is on a separate line
    if(buf[i] != '\n')
      number = buf[i] - '0' + 10*number;
    else{
      if(number%dividend==0)    ans += 1;
      lines -= 1;
      number = 0;
    }       
  }

if(lines==0)  break;
}

printf("%d are divisible by %d \n", ans, dividend);
return 0;
}

Results: (10 million numbers tested for divisibility by 11)

Run 1: ( Version 1 without setvbuf ) 0.782 secs
Run 2: ( Version 1 with setvbuf ) 0.684 secs
Run 3: ( Version 2 ) 0.534

P.S. - Every run compiled with GCC using -O1 flag

Michaels answered 4/3, 2010 at 10:38 Comment(5)
Neat solution to the problem of numbers potentially being cut of at the end of a buffer but what happens if a line consists of "z\n"?Alegre
Your conclusion is incorrect. Half your speedup comes from doing your own character -> number conversion instead of using scanf. The other half is that stdio locking can add quite a bit of overhead. Try this: 1) enable the call to setvbuf, 2) read the data byte by byte with getchar_unlocked instead of with fread. You'll get a similar speedup.Alloway
@Samuel: okay. will try it today.Michaels
@Sinan Ünür: This is solution to a problem specification (from SPOJ) which clearly says that there is only 1 number on each line. So i have accounted for that only. Ofcourse, this is not a general solution. BTW i have mentioned that in my question too!Michaels
Doesn't handle negative numbers either. Maybe you should link to the problem spec?Hydromancy
L
3

I am going to recommend trying full buffering with setvbuf and ditching fread. If the specification is that there is one number per line, I will take that for granted, use fgets to read in a full line and pass it to strtoul parse the number that is supposed to be on that line.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_BUFFER_SIZE 2 /* for testing */

int main(void) {
    int n;
    int divisor;
    int answer = 0;
    int current_buffer_size = INITIAL_BUFFER_SIZE;
    char *line = malloc(current_buffer_size);

    if ( line == NULL ) {
        return EXIT_FAILURE;
    }

    setvbuf(stdin, (char*)NULL, _IOFBF, 0);

    scanf("%d%d\n", &n, &divisor);

    while ( n > 0 ) {
        unsigned long dividend;
        char *endp;
        int offset = 0;
        while ( fgets(line + offset, current_buffer_size, stdin) ) {
            if ( line[strlen(line) - 1] == '\n' ) {
                break;
            }
            else {
                int new_buffer_size = 2 * current_buffer_size;
                char *tmp = realloc(line, new_buffer_size);
                if ( tmp ) {
                    line = tmp;
                    offset = current_buffer_size - 1;
                    current_buffer_size = new_buffer_size;
                }
                else {
                    break;
                }
            }
        }
        errno = 0;
        dividend = strtoul(line, &endp, 10);
        if ( !( (endp == line) || errno ) ) {
            if ( dividend % divisor == 0 ) {
                answer += 1;
            }
        }
        n -= 1;
    }

    printf("%d\n", answer);
    return 0;
}

I used a Perl script to generate 1,000,000 random integers between 0 and 1,000,000 and checked if they were divisible by 5 after compiling this program with gcc version 3.4.5 (mingw-vista special r3) on my Windows XP laptop. The whole thing took less than 0.8 seconds.

When I turned buffering off using setvbuf(stdin, (char*)NULL, _IONBF, 0);, the time went up to about 15 seconds.

Liquorice answered 4/3, 2010 at 0:54 Comment(2)
Could you explain the reason to ditch fread and move to setvbuf?Sweeny
So, the points are: 1) there is no reason to try to eliminate the buffered IO; 2) no good reason is provided as to why one should read binary blocks and parse numbers digit by digit. Instead, rely on the library's buffering and parsing.Alegre
A
2

One thing that I find confusing is why you are both enabling full buffering within the stream object via the call to setvbuf and doing your own buffering by reading a full buffer into buf.

I understand the need to do buffering, but that is a bit overkill.

I'm going to recommend you stick with setvbuf and remove your own buffering. The reason why is that implementing your own buffering can be tricky. The problem is what will happen when a token (in your case a number) straddles the buffer boundary. For example, let's say your buffer is 8 bytes (9 bytes total for trailing NULL) and your input stream looks like

12345 12345

The first time you fill the buffer you get:

"12345 12"

while the second time you fill the buffer you get:

"345"

Proper buffering requires you to handle that case so you treat the buffer as the two numbers {12345, 12345} and not three numbers {12345, 12, 234}.

Since stdio handles that already for you, just use that. Continue to call setvbuf, get rid of the fread and use scanf to read individual numbers from the input stream.

Alloway answered 4/3, 2010 at 0:41 Comment(4)
Now you got my problem exactly. For proper understanding, i would still like to do it using fread :). Although, next thing will be to do with just setvbuf.Michaels
and FYI, i first tried with using just setvbuf alone, then also i was getting around the same execution time (~5secs). I just want to speed up the IO anyway.Michaels
Unless you have a horribly bad version of stdio, you are not going to get any significant speedup by doing your own buffering.Alloway
setvbuf can sometimes be very effective. For instance, it did help a lot to set it to 1MB in the case of reading 45KB chunks of data from an SD card. Without using it, reading could take up to half a second at times, but now it takes less than 0.05 sec.Approach
M
2

Version 1 : Using getchar_unlocked as suggested by R Samuel Klatchko (see comments)

#define BUFSIZE 32*1024
int main(){
  int lines, number=0, dividend, ans=0;
  char c;
  setvbuf(stdin, (char*)NULL, _IOFBF, 0);// full buffering mode
  scanf("%d%d\n", &lines, ÷nd);
  while(lines>0){
    c = getchar_unlocked();
    //parse the number using characters
    //each number is on a separate line
    if(c=='\n'){
      if(number % dividend == 0)    ans += 1;
      lines -= 1;
      number = 0;
    }
    else
      number = c - '0' + 10*number;
  }

  printf("%d are divisible by %d \n", ans, dividend);
  return 0;
}

Version 2: Using fread to read a block and parsing number from it.

#define BUFSIZE 32*1024
int main(){
int lines, number=0, dividend, ans=0, i, chars_read;
char buf[BUFSIZE+1] = {0}; //initialise all elements to 0
scanf("%d%d\n",&lines, &dividend);

while((chars_read = fread(buf, 1, BUFSIZE, stdin)) > 0){
  //read the chars from buf
  for(i=0; i < chars_read; i++){
    //parse the number using characters
    //each number is on a separate line
    if(buf[i] != '\n')
      number = buf[i] - '0' + 10*number;
    else{
      if(number%dividend==0)    ans += 1;
      lines -= 1;
      number = 0;
    }       
  }

if(lines==0)  break;
}

printf("%d are divisible by %d \n", ans, dividend);
return 0;
}

Results: (10 million numbers tested for divisibility by 11)

Run 1: ( Version 1 without setvbuf ) 0.782 secs
Run 2: ( Version 1 with setvbuf ) 0.684 secs
Run 3: ( Version 2 ) 0.534

P.S. - Every run compiled with GCC using -O1 flag

Michaels answered 4/3, 2010 at 10:38 Comment(5)
Neat solution to the problem of numbers potentially being cut of at the end of a buffer but what happens if a line consists of "z\n"?Alegre
Your conclusion is incorrect. Half your speedup comes from doing your own character -> number conversion instead of using scanf. The other half is that stdio locking can add quite a bit of overhead. Try this: 1) enable the call to setvbuf, 2) read the data byte by byte with getchar_unlocked instead of with fread. You'll get a similar speedup.Alloway
@Samuel: okay. will try it today.Michaels
@Sinan Ünür: This is solution to a problem specification (from SPOJ) which clearly says that there is only 1 number on each line. So i have accounted for that only. Ofcourse, this is not a general solution. BTW i have mentioned that in my question too!Michaels
Doesn't handle negative numbers either. Maybe you should link to the problem spec?Hydromancy
Z
1

If you are after out-and-out speed and you work on a POSIX-ish platform, consider using memory mapping. I took Sinan's answer using standard I/O and timed it, and also created the program below using memory mapping. Note that memory mapping will not work if the data source is a terminal or a pipe and not a file.

With one million values between 0 and one billion (and a fixed divisor of 17), the average timings for the two programs was:

  • standard I/O: 0.155s
  • memory mapped: 0.086s

Roughly, memory mapped I/O is twice as fast as standard I/O.

In each case, the timing was repeated 6 times, after ignoring a warm-up run. The command lines were:

time fbf < data.file    # Standard I/O (full buffering)
time mmf < data.file    # Memory mapped file I/O

#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>

static const char *arg0 = "**unset**";
static void error(const char *fmt, ...)
{
    va_list args;
    fprintf(stderr, "%s: ", arg0);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    exit(EXIT_FAILURE);
}

static unsigned long read_integer(char *src, char **end)
{
    unsigned long v;
    errno = 0;
    v = strtoul(src, end, 0);
    if (v == ULONG_MAX && errno == ERANGE)
        error("integer too big for unsigned long at %.20s", src);
    if (v == 0 && errno == EINVAL)
        error("failed to convert integer at %.20s", src);
    if (**end != '\0' && !isspace((unsigned char)**end))
        error("dubious conversion at %.20s", src);
    return(v);
}

static void *memory_map(int fd)
{
    void *data;
    struct stat sb;
    if (fstat(fd, &sb) != 0)
        error("failed to fstat file descriptor %d (%d: %s)\n",
              fd, errno, strerror(errno));
    if (!S_ISREG(sb.st_mode))
        error("file descriptor %d is not a regular file (%o)\n", fd, sb.st_mode);
    data = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fileno(stdin), 0);
    if (data == MAP_FAILED)
        error("failed to memory map file descriptor %d (%d: %s)\n",
              fd, errno, strerror(errno));
    return(data);
}

int main(int argc, char **argv)
{
    char *data;
    char *src;
    char *end;
    unsigned long k;
    unsigned long n;
    unsigned long answer = 0;
    size_t i;

    arg0 = argv[0];
    data = memory_map(0);

    src = data;

    /* Read control data */
    n = read_integer(src, &end);
    src = end;
    k = read_integer(src, &end);
    src = end;

    for (i = 0; i < n; i++, src = end)
    {
        unsigned long v = read_integer(src, &end);
        if (v % k == 0)
            answer++;
    }

    printf("%lu\n", answer);
    return(0);
}
Zannini answered 3/3, 2010 at 12:48 Comment(1)
@Jonathan Leffler: Thanks for the answer :). Now, the solution i have posted (using fread) also does it in 0.08secs. So, there is no significant improvement with MMAPed IO. I have edited the question, check the changes.Michaels
A
1

The problem when you are not using redirection is that you are not causing EOF.

Since this appears to be Posix (based on the fact you are using gcc), just type ctrl-D (i.e. while pressing the control button, press/release d) which will cause EOF to be reached.

If you are using Windows, I believe you use ctrl-Z instead.

Alloway answered 3/3, 2010 at 23:47 Comment(2)
ya that works. but i still got a problem, sscanf() scans only the first integer, in each loop the value of temp is the first integer.Michaels
posted a solution with getchar_unlocked() and an analysis. Can i improve it more?Michaels
H
0

You can use the value of n to stop reading the input after you've seen n integers.

Change the condition of the outer while loop to:

while(n > 0 && fread(buf, sizeof('1'), BUFSIZE, stdin))

and change the body of the inner one to:

{
  n--;
  if(tmp%k == 0)  ++ans;
}

The problem you're continuing to have is that because you never adjust buf in the inner while loop, sscanf keeps reading the same number over and over again.

If you switch to using strtol() intead of sscanf(), then you can use the endptr output parameter to move through the buffer as numbers are read.

Hydromancy answered 3/3, 2010 at 23:17 Comment(7)
You also need to change the sscanf string, see updated answer.Hydromancy
i am now using n>0 && sscanf(buf,"%d",&tmp), although it stops, but answer printed is wrong. And each number is in a different line, so i guess sscanf(buf, "\n%d", &tmp)Michaels
If you never change buf in the inner loop, sscanf will keep looking at the same input and seeing the same number.Hydromancy
ya. so i am using another variable i to keep track of position. but if the buffer stops reading in between a number (reads 23 of last number 2354), then i have a problem.Michaels
Right. It's possible to handle that too, but this should really be telling you that fread is a square peg and this problem is a round hole. You could read a line-at-a-time using fgets() instead.Hydromancy
@caf: there will be no speedup in that case, each number is already on a different line now.Michaels
@Hydromancy check my answer. It was worth the extra effort :)Michaels
B
0

Well, right off the top, scanf("%d%d",&n,&k) will shove a value into n only and silently leave k unset - You'd see this if you checked the return value of scanf(), which tells you how many variables it filled. I think you want scanf("%d %d",&n,&k) with the space.

Second, n is the number of iterations to run, but you test for "n>0" yet never decrement it. Ergo, n>0 is always true and the loop won't exit.

As someone else mentioned, feeding stdin over a pipe causes the loop to exit because the end of stdin has an EOF, which causes fread() to return NULL, exiting the loop. You probably want to add an "n=n-1" or "n--" somewhere in there.

Next, in your sscanf, %n is not really a standard thing; I'm not sure what it's meant to do, but it may do nothing: scanf() generally stops parsing at the first unrecognized format identifier, which does nothing here (since you already got your data,) but is bad practice.

Finally, if performance is important, you'd be better off not using fread() etc at all, as they're not really high performance. Look at isdigit(3) and iscntrl(3) and think about how you could parse the numbers from a raw data buffer read with read(2).

Byler answered 4/3, 2010 at 0:44 Comment(1)
scanf("%d%d",&n,&k) is no problem. --n is actually there. Was removed by mistake it now. %n stores the number of characters read.Michaels
D
-1

The outermost while() loop will only exit when the read from stdin returns EOF. This can only happen when reaching the actual end-of-file on an input file, or if the process writing to an input pipe exits. Hence the printf() statement is never executed. I don't think this has anything to do with the call to setvbuf().

Diatribe answered 3/3, 2010 at 13:24 Comment(5)
I already knew what you have answered here, but how do i stop fread? And i have not states that the problem is due to setvbuf.Michaels
OK, so if I understand correctly, you are setting the buffer size on stdin to some value, then reading from it. You should probably omit the call to fread(), and change the sscanf() call to fscanf(). The first such call should read BUFSIZE bytes into the stream's (internal) buffer, then subsequent calls hand it out to you one line at a time.Diatribe
did you read the question completely?? please read it and please do not post answer before you do so.Michaels
I did read your question completely, so I felt free to propose a better approach - don't use fread()Diatribe
well thats the whole point :). I have to use fread to consume enormous inputs.Michaels
A
-1

Mabe also take a look at this getline implementation:

http://www.cpax.org.uk/prg/portable/c/libs/sosman/index.php

(An ISO C routine for getting a line of data, length unknown, from a stream.)

Atchison answered 3/3, 2010 at 13:36 Comment(0)
C
-2

Here's my byte-by-byte take on it:

/*

Buffered reading from stdin using fread in C,
https://mcmap.net/q/1109164/-buffered-reading-from-stdin-using-fread-in-c

compile with:
gcc -Wall -O3  fread-stdin.c

create numbers.txt:
echo 1000000 5 > numbers.txt
jot -r 1000000 1 1000000 $RANDOM >> numbers.txt

time -p cat numbers.txt | ./a.out

*/

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define BUFSIZE 32

int main() {

   int n, k, tmp, ans=0, i=0, countNL=0;
   char *endp = 0;

   setvbuf(stdin, (char*)NULL, _IOFBF, 0);       // turn buffering mode on
   //setvbuf(stdin, (char*)NULL, _IONBF, 0);     // turn buffering mode off

   scanf("%d%d\n", &n, &k);

   char singlechar = 0;
   char intbuf[BUFSIZE + 1] = {0};

   while(fread(&singlechar, 1, 1, stdin))     // fread byte-by-byte
   {
      if (singlechar == '\n') 
      {
         countNL++;
         intbuf[i] = '\0';
         tmp = strtoul(intbuf, &endp, 10);
         if( tmp % k == 0) ++ans;
         i = 0;
      } else {
         intbuf[i] = singlechar; 
         i++;
      }
      if (countNL == n) break;
   }

   printf("%d integers are divisible by %d.\n", ans, k);
   return 0;

}
Chivalric answered 3/3, 2010 at 12:48 Comment(1)
Using the same data, the above code takes 9.389 secs against 0.572 secs (for answer i submitted) with -O3 flag.Michaels
B
-2

The reason all this permature optimisation has a negligable effect on the runtime is that in *nix and windows type operating systems the OS handles all I/O to and from the file system and implements 30 years worth of research, trickery and deviousness to do this very efficiently.

The buffering you are trying to control is merely the block of memory used by your program. So any increases in speed will be minimal (the effect of doing 1 large 'mov' verses 6 or 7 smaller 'mov' instructions).

If you really want to speed this up try "mmap" which allows you direct access the data in the file systems buffer.

Bellyful answered 4/3, 2010 at 2:17 Comment(1)
well as Sinan proposed, speedup was significant. From around 5secs to 0.8 secs. What do you have to say now :P ?Michaels

© 2022 - 2024 — McMap. All rights reserved.