How to read extremely long lines from a text file fast and safe in C++?
There is a large text file of 6.53 GiB. Each line of it is either a data line or a comment line. Comment lines are usually short, fewer than 80 characters, while a data line contains more than 2 million characters and is variable-length.

Considering that each data line needs to be processed as a unit, is there a simple way to read lines safely and quickly in C++?

safe (for variable-length data lines): the solution should be as easy to use as std::getline(). Since the line length varies, it should ideally avoid extra memory management.

fast: the solution should be as fast as readline() in Python 3.6.0, or ideally as fast as fgets() from stdio.h.

A pure C solution is also welcome. The interface for further processing is provided in both C and C++.


PS: The original context of this question is reading big FASTA files recording DNA/RNA sequences. I hope the information on this page will be helpful for bioinformatics folks struggling with large nucleotide sequence data. I expect FASTA files today are much larger than they were in 2017; I add this paragraph hoping that more people in need will find this question.
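For the FASTA case specifically, the comment/data split maps onto '>' description lines versus sequence lines. A minimal sketch using POSIX getline() (a POSIX platform is assumed; classify_fasta is a made-up name for illustration):

```cpp
// Sketch: classifying FASTA lines with POSIX getline().
// Description (header) lines start with '>'; all other lines are
// nucleotide sequence data.
#include <cstdio>
#include <cstdlib>
#include <sys/types.h>   // ssize_t

// Counts header and sequence lines in an already-opened FASTA stream.
void classify_fasta(std::FILE* fp, long* headers, long* seq_lines) {
    char* line = nullptr;   // getline() allocates and grows this buffer
    size_t cap = 0;
    ssize_t n;
    *headers = *seq_lines = 0;
    while ((n = getline(&line, &cap, fp)) != -1) {
        if (n > 0 && line[0] == '>')
            ++*headers;     // description line
        else
            ++*seq_lines;   // data line
    }
    std::free(line);
}
```

For a real file, pass a stream obtained from fopen(path, "r").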


UPDATE 1: Thanks to a short but invaluable comment from Basile Starynkevitch, the perfect solution came up: POSIX getline(). Since further processing only involves converting characters to numbers and does not need many features of the string class, a char array is sufficient in this application.
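As a sketch of that further processing, assuming the data lines hold whitespace-separated ASCII numbers (an assumption for illustration; mapping nucleotide letters to codes would look similar), values can be extracted directly from the char buffer with strtod(), never constructing a std::string:

```cpp
// Sketch: extracting numbers straight from a char buffer with
// strtod(), avoiding std::string entirely. The whitespace-separated
// layout is an assumed example, not taken from the question.
#include <cstdlib>
#include <vector>

std::vector<double> parse_numbers(const char* line) {
    std::vector<double> out;
    char* end = nullptr;
    for (const char* p = line; ; p = end) {
        double v = std::strtod(p, &end);   // skips leading whitespace
        if (end == p) break;               // no further number found
        out.push_back(v);
    }
    return out;
}
```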


UPDATE 2: Thanks to comments from Zulan and Galik, who both report comparable performance among std::getline(), fgets() and POSIX getline(), another possible solution is to use a better standard-library implementation such as libstdc++. Moreover, here is a report claiming that the Visual C++ and libc++ implementations of std::getline are not well optimised.

Moving from libc++ to libstdc++ changes the results a lot. With libstdc++ 3.4.13 / Linux 2.6.32 on a different platform, POSIX getline(), std::getline() and fgets() show comparable performance. Initially, the code was run under the default settings of clang in Xcode 8.3.2 (8E2002), which uses libc++.
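For readers who want to reproduce the comparison, the standard library is chosen at compile time. A sketch of plausible invocations (flag support varies by toolchain; recent Xcode clang ships only libc++, so using libstdc++ there typically means compiling with GCC's g++):

```shell
# With clang, -stdlib selects the C++ standard library (where supported):
clang++ -O2 -stdlib=libc++ readline-string-getline.cpp -o read-libcxx

# GCC's g++ links against libstdc++ by default:
g++ -O2 readline-string-getline.cpp -o read-libstdcxx
```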


More details and some efforts (very long):

getline() of <string> can handle arbitrarily long lines but is a bit slow. Is there a C++ alternative to Python's readline()?

// benchmark on Mac OS X with libc++ and SSD:
readline() of python                         ~550 MiB/s

fgets() of stdio.h, -O0 / -O2               ~1100 MiB/s

getline() of string, -O0                      ~27 MiB/s
getline() of string, -O2                     ~150 MiB/s
getline() of string + stack buffer, -O2      ~150 MiB/s

getline() of ifstream, -O0 / -O2             ~240 MiB/s
read() of ifstream, -O2                      ~340 MiB/s

wc -l                                        ~670 MiB/s

cat data.txt | ./read-cin-unsync              ~20 MiB/s

getline() of stdio.h (POSIX.1-2008), -O0    ~1300 MiB/s
  • Speeds are rounded very roughly, only to show the order of magnitude, and every program was run several times to ensure the values are representative.

  • '-O0 / -O2' means the speeds are very similar at both optimization levels.

  • The code is shown below.


readline() of python

# readline.py

import time
import os

t_start = time.perf_counter()

fname = 'data.txt'
fin = open(fname, 'rt')

count = 0

while True:
    l = fin.readline()
    length = len(l)
    if length == 0:     # EOF
        break
    if length > 80:     # data line
        count += 1

fin.close()

t_end = time.perf_counter()
elapsed = t_end - t_start   # avoid shadowing the time module

fsize = os.path.getsize(fname)/1024/1024   # file size in MiB
print("speed: %d MiB/s" % (fsize/elapsed))
print("reads %d data lines" % count)

# run as `python readline.py` with python 3.6.0

fgets() of stdio.h

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int main(int argc, char* argv[]){
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  FILE* fp = fopen(argv[1], "r");
  if(fp == NULL) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  char line[SIZE];

  int count = 0;
  while(fgets(line, SIZE, fp) == line) {
    if(strlen(line) > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

getline() of <string>

// readline-string-getline.cpp
#include <string>
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdio>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  // manually set the buffer on stack
  const int BUFFERSIZE = 1024*1024*3;   // stack on my platform is 8 MiB
  char buffer[BUFFERSIZE];
  ifstream fin;
  fin.rdbuf()->pubsetbuf(buffer, BUFFERSIZE);
  fin.open(argv[1]);

  // default buffer setting
  // ifstream fin(argv[1]);

  if(!fin) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  string line;
  line.reserve(SIZE);

  int count = 0;
  while(getline(fin, line)) {
    if(line.size() > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

getline() of ifstream

// readline-ifstream-getline.cpp
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdio>
#include <cstdlib>
#include <cstring>

using namespace std;

int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  ifstream fin(argv[1]);
  if(!fin) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  char line[SIZE];

  int count = 0;
  while(fin.getline(line, SIZE)) {
    if(strlen(line) > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

read() of ifstream

// seq-read-bin.cpp
// sequentially read the file to see the speed upper bound of
// ifstream

#include <iostream>
#include <fstream>
#include <ctime>
#include <cstdio>
#include <cstdlib>

using namespace std;


int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  ifstream fin(argv[1], ios::binary);

  const int SIZE = 1024*1024*3;
  char str[SIZE];

  while(fin) {
    fin.read(str,SIZE);
  }

  clock_t t_end = clock();
  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  const double fsize = 6685;  // file size in MiB

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

  return EXIT_SUCCESS;
}

use cat, then read from cin with cin.sync_with_stdio(false)

#include <iostream>
#include <string>
#include <ctime>
#include <cstdio>
#include <cstdlib>

using namespace std;

int main(void) {
  clock_t t_start = clock();

  string input_line;
  
  cin.sync_with_stdio(false);
  
  while(getline(cin, input_line)) {
    // each complete line is discarded; only the read speed is measured
  }

  double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;

  const double fsize = 6685;  // file size in MiB

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

  return EXIT_SUCCESS;
}

POSIX getline()

// readline-c-getline.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {

  clock_t t_start = clock();
  
  char *line = NULL;
  size_t len = 0;
  ssize_t nread;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s <file>\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  FILE *stream = fopen(argv[1], "r");
  if (stream == NULL) {
    perror("fopen");
    exit(EXIT_FAILURE);
  }

  int length = -1;
  int count = 0;
  while ((nread = getline(&line, &len, stream)) != -1) {
    if (nread > 80) {
      count += 1;
    }
  }

  free(line);
  fclose(stream);

  double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;
  const double fsize = 6685;  // file size in MiB
  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines.\n", count);
  // fprintf(stdout, "length of MSA: %d\n", length-1);

  exit(EXIT_SUCCESS);
}
Handrail answered 29/5, 2017 at 11:21 Comment(35)
Did you check this out: #9371738 ?Thiosinamine
Assuming a Linux system, you should also benchmark getline(3)Advocation
Why did you not add the Brainfuck tag, too?Velocipede
@Olaf the tags you removed were absolutely relevant, in particular [performance]. There are clearly C++, Python, and (pure) C codes in the question. Please do not make questions more difficult to find just to satisfy your personal aversion against [C] and [C++] tagged questions.Starstudded
lemire.me/blog/2012/06/26/…Imp
@Thiosinamine Yes, it did not help. I tried to use cat the data into stdin, and after sync_with_stdio(false), the speed is still not so fast, about 30 MiB/s.Handrail
@Zulan: The question is "Is there a simple way to read lines safe and fast in C++?" showing code in various languages does not help, but is completely useless. So no, the tags were irrelevant. ("performance" is questionable, as OP asks for a "simple way", too which often contradicts "fast").Velocipede
Did I miss something? What's wrong with std::fgets()?Kermes
@Galik: std::gets was deprecated in C++11 and removed in C++14.Starstudded
@Starstudded my bad I meant std::fgets (edited)Kermes
@Olaf I am sorry to make you feel uncomfortable. But since Python can achieve the safety of string at the price of halving the speed of fgets(), I am wondering whether such an easy-to-use solution exists in C++. This really matters when dealing with large text files containing data.Handrail
Also this should really only be tagged C++ as no amount of expertise in python or C can tell you what a C++ programmer should do.Kermes
@Eli4ph: This is not about feeling "uncomfortable", but site rules. We are not a coding, tutoring or "find the resource" site. In case you have skipped it, take the tour and read How to Ask. Also please stop spamming unrelated tags!Velocipede
Olaf was 100% correct here in removing the other language tags. We don't tag questions based on what code incidentally appears in the question. We tag them based on what types of answers are expected. If you don't want an answer in Python, the Python tag is inappropriate. This is clearly a C++ question, and you want an answer in C++. You wouldn't tag it [pseudocode] just because you happened to include some pseudo-code in the question, would you?Idelson
Moving on from a discussion of tagging (take it to chat or meta)... How variable are the long lines? Is it that some are 100k and others are 2500k, or is it 2.1M, 2.0M, 1.9M??Middelburg
What operating system are you targeting? For both speed and simplicity, it would be worth looking into using memory-mapped files. Also, looking at your benchmarks, it is very strange that you say "-O0 / -O2". How can the speed be identical in both optimized and unoptimized builds?Idelson
If there's a Python solution that takes 0.1 second, and the best C++ solution the OP can come up with takes 0.2 seconds, then he must be using C++ inefficiently or missing a trick.Eliseoelish
@CodyGray I don't understand either. Maybe for fgets() the bottleneck is the disk or the code is too simple to be optimized?Handrail
@MartinBonner The length cannot be predicted. The only thing I am sure of is that the lines are fewer than 100 million characters long. Maybe it is impractical to ask for a solution for variable-length data lines that is simple and fast yet safe.Handrail
Simple option: Read into a buffer, if not large enough, double the buffer size and copy the already read part. Remember how large a buffer you need, and start with that next time. More complicated option: Read into ½M chunks, and then reassemble into a single string when complete. Really complicated option: Write your own immutable string-like data structure, which owns 100K chunks and read into that. The data structure can then be cheap to copy.Middelburg
How is std::getline liable to cause "overflow"?Kermes
@Kermes It will not, but it's not fast enough.Handrail
@Handrail Its just that in your question you say that it can.Kermes
@Kermes Please forgive my bad grammar. It is updated.Handrail
What stdlib implementation and operating system do you use? With libstdc++ 3.4.22 / Linux 4.11.2 I get the following performance for a cached file: seq-read-bin: ~5000, ifstream-getline: ~3200, readlinec: ~2800, string-getline: ~2500, python ~800. Maybe the answer is to just use a better standard library implementation. BTW: Your time calculation is wrongly rounded. Do something like ((double)t_end-t_start)/CLOCKS_PER_SEC;.Starstudded
Also: Have you considered just extracting the numbers from the file directly instead of extracting the whole line first?Starstudded
@BasileStarynkevitch Thank you very much! This is what I was seeking. Before I posted, I searched a lot but failed to find such a simple solution that is perfect for my application. In case other folks have similar requirements, could you do me a favor and add the comment as an answer?Handrail
@Starstudded Yes, I thought the rounding was not so important so I did not take it too seriously, but for fast functions like fgets() it changes a lot. I will update the results soon. Also, many thanks for your advice; I will try to implement your idea to see whether it is better for my application.Handrail
@Starstudded I checked CLOCKS_PER_SEC; it is 1 million on my platform, which I think is also standardized. Thus the results should not be altered very much, so I will be a bit lazy and change only the code, assuming the output will not change.Handrail
The issue is that clock_t and CLOCKS_PER_SEC are likely integrals, so you can only get a truncated integral number of seconds in time.Starstudded
I have to say, using comparable data on my system, std::fgets takes exactly the same time as std::getline.Kermes
@Starstudded Oh, you are right. Then I have to rerun all the code. LOLHandrail
Also POSIX getline() takes the same time on my system using a 20GB file of random sized lines with a max line size of 1024 * 1024 * 3.Kermes
@Starstudded Also: I am using clang's default settings and am not familiar with specifying other options when compiling. I think the code works with libc++ / darwin 16.6.0. I don't know how to get the version of libc++.Handrail
@Kermes @Starstudded Thanks a lot. After moving from libc++ to libstdc++, std::fgets, string getline() and POSIX getline() show comparable performance.Handrail
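The buffer-doubling approach Martin Bonner outlines in the comments can be sketched as follows; it is essentially what POSIX getline() does internally. read_long_line() and the initial capacity of 128 are illustrative choices, not an established API:

```cpp
// Sketch of the buffer-doubling idea: read with fgets() into a
// buffer that is doubled whenever the current line does not fit.
// This mirrors what POSIX getline() does internally.
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Reads one line (keeping '\n' if present) into *buf, growing it as
// needed; *cap persists across calls, so the buffer is reused.
// Returns the line length, or -1 on EOF or allocation failure.
long read_long_line(char** buf, size_t* cap, std::FILE* fp) {
    if (*buf == nullptr) {
        *cap = 128;   // arbitrary starting capacity
        *buf = (char*)std::malloc(*cap);
        if (*buf == nullptr) return -1;
    }
    size_t len = 0;
    while (std::fgets(*buf + len, (int)(*cap - len), fp) != nullptr) {
        len += std::strlen(*buf + len);
        if (len > 0 && (*buf)[len - 1] == '\n')
            return (long)len;              // complete line read
        if (len + 1 >= *cap) {             // line did not fit: double
            char* p = (char*)std::realloc(*buf, *cap * 2);
            if (p == nullptr) return -1;
            *buf = p;
            *cap *= 2;
        }
    }
    return len > 0 ? (long)len : -1;       // last line without '\n', or EOF
}
```

As Martin suggests, remembering the capacity across calls means the doubling cost is paid only while the buffer warms up to the longest line.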

As I commented, on Linux & POSIX systems you could consider using getline(3); I guess the following could compile both as C and as C++ (assuming you have some valid fopen-ed FILE *fil; ...)

char* linbuf = NULL; /// or nullptr in C++
size_t linsiz = 0;
ssize_t linlen = 0;

while((linlen=getline(&linbuf, &linsiz,fil))>=0) {
  // do something useful with linbuf; but no C++ exceptions
}
free(linbuf); linsiz=0;

I guess this might work (or be easily adapted) for C++. But beware of C++ exceptions: they should not go through the while loop (or you should ensure that an appropriate destructor or catch does free(linbuf);).

Also getline could fail (e.g. if it calls a failing malloc) and you might need to handle that failure sensibly.
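In C++, one way to honor both caveats is a tiny RAII holder whose destructor does the free(); a POSIX platform is assumed, and LineBuffer is a made-up name for illustration:

```cpp
// Sketch: a minimal RAII holder for the getline() buffer, so the
// free() happens even if code in the loop throws.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/types.h>   // ssize_t

struct LineBuffer {
    char*  data = nullptr;
    size_t cap  = 0;
    LineBuffer() = default;
    LineBuffer(const LineBuffer&) = delete;
    LineBuffer& operator=(const LineBuffer&) = delete;
    ~LineBuffer() { std::free(data); }
    // getline() may realloc data, hence the pass-by-address:
    ssize_t next(std::FILE* fp) { return getline(&data, &cap, fp); }
};

// Example use: the loop body may now throw without leaking the buffer.
int count_data_lines(std::FILE* fp) {
    LineBuffer line;
    int count = 0;
    ssize_t len;
    while ((len = line.next(fp)) >= 0)
        if (len > 80) ++count;
    return count;
}
```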

Advocation answered 29/5, 2017 at 14:32 Comment(0)

Well, the C standard library is a subset of the C++ standard library. From the n4296 draft of the C++ 2014 standard:

17.2 The C standard library [library.c]

The C++ standard library also makes available the facilities of the C standard library, suitably adjusted to ensure static type safety.

So provided you explain in a comment that a performance bottleneck requires it, it is perfectly fine to use fgets in a C++ program - you should simply encapsulate it carefully in a utility class, in order to preserve the high-level OO structure.
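As a sketch of what such an encapsulation might look like: FgetsReader, its fixed default capacity, and its API are illustrative assumptions, not an established design:

```cpp
// Sketch of the suggested encapsulation: fgets() behind a small RAII
// class that owns the FILE* and a heap-allocated line buffer.
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

class FgetsReader {
public:
    explicit FgetsReader(const char* path, size_t max_line = 3 * 1024 * 1024)
        : fp_(std::fopen(path, "r")), buf_(max_line) {
        if (!fp_) throw std::runtime_error(std::string("cannot open ") + path);
    }
    ~FgetsReader() { std::fclose(fp_); }
    FgetsReader(const FgetsReader&) = delete;
    FgetsReader& operator=(const FgetsReader&) = delete;

    // Advances to the next line; returns false at end of file.
    bool next() { return std::fgets(buf_.data(), (int)buf_.size(), fp_) != nullptr; }
    const char* line()   const { return buf_.data(); }
    size_t      length() const { return std::strlen(buf_.data()); }

private:
    std::FILE*        fp_;
    std::vector<char> buf_;   // heap-allocated: far too big for the stack
};
```

A caller then loops while (r.next()) and inspects r.line() / r.length(), keeping the C-level details out of the high-level code.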

Pals answered 29/5, 2017 at 11:41 Comment(6)
I expected this post to be downvoted because it advises using C library functions from C++, but I naively hoped I would get a comment...Pals
These libraries are also part of C++ according to the standard so I don't understand the down vote.Kermes
The only part of this answer I find objectionable is the statement that "you should carefully encapsulate it in an utility class, in order to preserve the OO high level structures.". I don't see why it is necessary to wrap fgets in a class. C++ is a multi-paradigmatic language: not everything has to fit within the OOP paradigm. It isn't a strait-jacket. Not that there's anything especially object-oriented about a "utility" class anyway. You'd only bother with a wrapper if fgets could really benefit from object orientation, and I'm not sure how that would be.Idelson
@CodyGray: I really prefer to encapsulate low-level optimisation to keep a clean high-level structure, be it in C or C++. As OP asks for a C++ program, I assumed the high-level structure was object-oriented.Pals
@CodyGray RAII! To quote Scott Meyers Effective C++: Use objects to manage resources - Resource acquisition is initialization (RAII)Starstudded
That's why the stream should be represented as an object, not why fgets should be modeled as an object. Not only does a stream have associated resources, but the corresponding file is actually an object. I suppose you would say that fgets would be a member function of the stream class, but it would work just as well as a free function. @zulanIdelson

Yes, there's a faster way to read lines and create strings.

Query the file size, then load the file into a buffer. Then iterate over the buffer, replacing the newlines with NULs and storing a pointer to the start of each line.

It will be quite a bit faster if, as is likely, your platform has a call to load a file into memory.
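A sketch of what this answer describes, assuming the whole file fits in RAM (about 6.5 GiB here) and C++17 for writable std::string::data(); slurp and split_lines are made-up names for illustration:

```cpp
// Sketch: read the whole file at once, then split it in place by
// overwriting each '\n' with '\0', keeping a pointer per line.
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Reads the entire file into one string (must fit in memory).
std::string slurp(const char* path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    std::string text(static_cast<size_t>(in.tellg()), '\0');
    in.seekg(0);
    in.read(text.data(), (std::streamsize)text.size());   // C++17: writable data()
    return text;
}

// Splits text in place; the returned pointers stay valid while text lives.
std::vector<char*> split_lines(std::string& text) {
    std::vector<char*> lines;
    char* p = text.data();
    char* end = p + text.size();
    while (p < end) {
        lines.push_back(p);
        char* nl = (char*)std::memchr(p, '\n', (size_t)(end - p));
        if (!nl) break;
        *nl = '\0';                      // terminate this line in place
        p = nl + 1;
    }
    return lines;
}
```

This trades memory for speed: one sequential read, one pass over the buffer, and no per-line allocation.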

Eliseoelish answered 29/5, 2017 at 12:11 Comment(1)
It's not clear that will be faster. A friend "optimized" our overlay management code (a VAX port of Pr1me software - we are talking early 80's here) by using memory mapping. It wasn't that fast. He switched to QIOW to read from disk into a fixed bit of memory - it was much faster. Your approach will involve two passes through the data - that may be unacceptably expensive.Middelburg

© 2022 - 2024 — McMap. All rights reserved.