very fast text file processing (C++)

I wrote an application which processes data on the GPU. The code works well, but I have the problem that reading the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the line-by-line processing is slow.)

I read a line with getline(), copy line 1 to a vector and line 2 to a vector, and skip lines 3 and 4. And so on for the rest of the 11 million lines.

I tried several approaches to read the file in the best time possible:

The fastest method I found is using boost::iostreams::stream.

Others were:

  • Read the file as gzip to minimize IO, but this is slower than reading it directly.
  • Copy the file to RAM with read(filepointer, chararray, length) and process it with a loop to split the lines (also slower than boost).

Any suggestions on how to make it run faster?

void readfastq(char *filename, int SRlength, uint32_t blocksize){
    _filelength = 0; //total datasets (each 4 lines)
    _SRlength = SRlength; //length of the 2nd line
    _blocksize = blocksize;

    boost::iostreams::stream<boost::iostreams::file_source> ins(filename);
    in = ins;

    readNextBlock();
}


void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    _seqs.empty();
    _phred.empty();
    _names.empty();
    _filelength = 0;

    //read only a part of the file, e.g. the first 4 million lines
    while (std::getline(in, name) && _filelength<_blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);

        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            for (int k = 0; k < _SRlength; k++) {

                //handle special letters
                if (seqtemp[k] == 'A') ...
                else {
                    _seqs.push_back(5);
                }

            }
            _filelength++;
        }
    }
}

EDIT:

The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2

I changed the function readfastq to read the file itself, because of some pointer problems. So if you call readfastq, the blocksize (in lines) must be bigger than the number of lines to read.

SOLUTION:

I found a solution which gets the time for reading in the file down from 60 sec to 16 sec. I removed the inner loop which handles the special characters and do this on the GPU instead. This decreases the read-in time and only minimally increases the GPU running time.

Thanks for your suggestions.

void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;

    size_t bytes_read, bytes_expected;

    FILE *fp;
    fp = fopen(filename, "r");

    fseek(fp, 0L, SEEK_END); //go to the end of file
    bytes_expected = ftell(fp); //get filesize
    fseek(fp, 0L, SEEK_SET); //go to the beginning of the file

    fclose(fp);

    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
        err(EX_OSERR, "data malloc");


    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    boost::iostreams::stream<boost::iostreams::file_source> file(filename);


    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);

        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            strncpy( &(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do on GPU

            _filelength++;
        }
    }
}
Meingolda answered 14/11, 2011 at 14:33 Comment(8)
By 'ram' do you mean the PC RAM, or the onboard video memory?Isobath
Note that string::empty() and vector::empty() are read-only tests on the state of the container. Perhaps you meant .clear()?Irrecoverable
possible duplicate of What is the Fastest Method for High Performance Sequential File I/O in C++?Dimerous
_seqs.empty(); just returns true. The default constructor creates an empty string, and bool std::string::empty() const is not the same as void std::string::clear().Ivar
I don't know the problem you are solving, but are you sure a text file is the way to go with such a large data set and an apparently time critical application?Clyde
Post compilable code so we can test it and see where the bottleneck is.Pygidium
You are right, I confused clear and empty, but changing it makes no difference.Meingolda
@BenVoigt my issue is related to reading line by line; I also read a binary file (int values) in my application, which works really fast with read(fd, dataarray, bytes_expected);Meingolda

First, instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit to fit 3GB into your virtual address space (for a 32-bit application only 2GB is accessible in user mode). Or alternatively you may map & process your file in parts.

Next, it sounds to me like your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hurts performance very seriously. If this is the case, either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free their memory.

Post your code (or a part of it) for more suggestions.

EDIT:

It seems that all your bottlenecks are related to string management.

  • std::getline(in, seqtemp); reading into an std::string involves dynamic memory allocation.
  • _names.push_back(name); This is even worse. First, the std::string is placed into the vector by value, which means the string is copied, hence another dynamic allocation/deallocation happens. Moreover, when the vector is eventually reallocated internally, all the contained strings are copied again, with all the consequences.

I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).

Like in this code:

#include <cstddef>  // size_t
#include <cstring>  // memchr

class MemoryMappedFileParser
{
    const char* m_sz;
    size_t m_Len;

public:

    struct String {
        const char* m_sz;
        size_t m_Len;
    };

    bool getline(String& out)
    {
        out.m_sz = m_sz;

        const char* sz = (const char*) memchr(m_sz, '\n', m_Len);
        if (sz)
        {
            size_t len = sz - m_sz;

            m_sz = sz + 1;
            m_Len -= (len + 1);

            out.m_Len = len;

            // for Windows-format text files remove the '\r' as well
            if (len && '\r' == out.m_sz[len-1])
                out.m_Len--;
        } else
        {
            out.m_Len = m_Len;

            if (!m_Len)
                return false;

            m_Len = 0;
        }

        return true;
    }

};
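For completeness, here is a minimal POSIX sketch (my own assumption, not part of the original answer) of how the whole file could be mapped to obtain the pointer/length pair such a parser works on:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: map a whole file read-only and return a pointer and length.
bool map_file(const char* path, const char*& data, size_t& len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    len = (size_t) st.st_size;

    void* p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return false;

    data = (const char*) p;
    return true; // unmap later with munmap((void*) data, len)
}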
Hornstone answered 14/11, 2011 at 14:40 Comment(4)
I disagree. I think it can be written well using the standard libraries (but the OP has not posted enough code for context). By doing the above you are just making the code very brittle.Pygidium
I don't know what you mean by "brittle". The code is either correct or not, IMHO. But nevertheless I don't insist. If you have "standard" library functions that do the same, you're welcome.Hornstone
Brittle means easy to use incorrectly in a way that will make the code incorrect. Of course if used as intended it will work. But brittle code is horrible to maintain and use past the original author (as only they know how to use it correctly).Pygidium
Alright, if you don't like the "brittle" code, post your suggestion written in terms of "standard" libraries.Hornstone

If _seqs and _names are std::vectors and you can guess their final size before processing the whole 3GB of data, you can use reserve to avoid most of the memory reallocation while pushing back the new elements in the loop.

You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
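A minimal sketch of the idea; the record count and read length below are made-up numbers, not taken from the question:

#include <cstddef>
#include <string>
#include <vector>

int main()
{
    const size_t expected_records = 11000000; // ~11 million 4-line records (assumption)
    const size_t read_length = 100;           // length of the second line (assumption)

    std::vector<std::string> names;
    std::vector<char> seqs;

    // Reserve once up front so push_back does not trigger repeated reallocations
    // (and repeated copying of everything already stored).
    names.reserve(expected_records);
    seqs.reserve(expected_records * read_length);

    // ... fill the vectors in the read loop as in the question ...
    return 0;
}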

Pinole answered 14/11, 2011 at 15:25 Comment(1)
I reserved space for the vectors and strings before the loop; it saved 2 out of 60 seconds.Meingolda

You are apparently using <stdio.h>, since you are using getline.

Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m (a GNU extension) tells it to use mmap for reading.

Perhaps setting a big buffer (e.g. half a megabyte) with setbuffer could also help.

Probably, using the readahead system call (in a separate thread perhaps) could help.

But all of this is guesswork. You should really measure things.
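A rough sketch combining these three guesses (the "m" flag, setbuffer and readahead are GNU/glibc specific; the buffer and readahead sizes are arbitrary):

#include <cstdio>
#include <fcntl.h>   // open, readahead (GNU extension)
#include <unistd.h>  // close

int main(int argc, char** argv)
{
    if (argc < 2) return 1;

    // Hint the kernel to start populating the page cache ahead of time.
    int fd = open(argv[1], O_RDONLY);
    if (fd >= 0) {
        readahead(fd, 0, 1 << 28); // first 256MB, GNU/Linux only
        close(fd);
    }

    // "m" asks glibc to use mmap for reading where possible (GNU extension).
    FILE* fp = fopen(argv[1], "rm");
    if (!fp) return 1;

    // Give stdio a big buffer (half a megabyte).
    static char buf[512 * 1024];
    setbuffer(fp, buf, sizeof buf); // portable alternative: setvbuf(fp, buf, _IOFBF, sizeof buf)

    char line[4096];
    while (fgets(line, sizeof line, fp)) {
        // process the line
    }
    fclose(fp);
    return 0;
}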

Barra answered 14/11, 2011 at 14:40 Comment(1)
There is a getline function in C++ as well.Pedalfer

General suggestions:

  • Code the simplest, most straightforward, clean approach,
  • Measure,
  • Measure,
  • Measure,

Then if all else fails:

  • Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so the kernel's read-ahead plays to your advantage.
  • Re-use the same buffer to minimize cache flushing.
  • Avoid copying data, parse in place, pass around pointers (and sizes).

mmap(2)-ing [parts of] the file is another approach. This also avoids the kernel-userland copy.
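A bare-bones sketch of the read(2) variant with a reused, page-aligned buffer; the buffer size and the parsing are placeholders:

#include <fcntl.h>
#include <stdlib.h>  // posix_memalign
#include <unistd.h>  // read, close

int main(int argc, char** argv)
{
    if (argc < 2) return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    const size_t BUF_SIZE = 1 << 20; // 1MB, a multiple of the page size
    void* buf = NULL;
    if (posix_memalign(&buf, 4096, BUF_SIZE) != 0) { // page-aligned buffer, reused every iteration
        close(fd);
        return 1;
    }

    ssize_t n;
    while ((n = read(fd, buf, BUF_SIZE)) > 0) {
        // Parse the chunk in place here: scan for '\n' and hand out
        // (pointer, length) pairs instead of copying into std::string.
    }

    free(buf);
    close(fd);
    return 0;
}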

Volume answered 14/11, 2011 at 15:5 Comment(7)
The stream buffer is already doing this. No point to do it manually.Pygidium
Yes. I find no real advantage to doing it manually. Any local performance increases are out-weighed by maintenance costs and improvements in algorithm on the larger scale.Pygidium
Sure. Show me a spec of what needs to be done. Writing something based on only half the information (like the above) is not going to be much use, as I would end up optimizing for the local situation without knowing the greater context. Premature optimization like that is just a waste of time. :-) Looking at the link above, this is actually all about manipulating DNA sequences; we should be able to use that information to make it more efficient. But without the full spec, how can I tell?Pygidium
Do you expect to be paid too? :)Volume
What I am trying to say. Your optimization of using a structure with a pointer and a size (like @valdo) above is premature. If all the code is doing is printing the result then fine (it may speed up the application). But if the strings need to be manipulated you are going to need a copy. Thus all you are doing is moving the cost from the load function into the manipulation function (overall there will be no performance improvement in the application). In addition you are making the code more brittle by hand rolling your own technique for handling strings.Pygidium
But it is hard to tell what exactly is the best method to move forward without knowing more about the application. Maybe the input/output serialization format can be improved; maybe your suggestion of page-aligned chunks is OK, but without more information we are optimizing way too locally without considering the cost in other parts of the application.Pygidium
That's why I said "general suggestions". I admit I forgot "measure" as the first bullet point. Will add.Volume

Depending on your disk speed, using a very fast decompression algorithm might help, like fastlz (there are at least two others that might be more efficient, but they are under the GPL, so the licence can be a problem).

Also, using C++ data structures and functions can increase the speed, as you can maybe achieve better compile-time optimization. Going the C way isn't always the fastest! In some bad cases, using char* means you need to parse the whole string to reach the \0, yielding disastrous performance.

For parsing your data, using boost::spirit::qi is probably also the most optimized approach: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html
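As an illustration of the kind of parsing benchmarked in that post (just a sketch; the numeric input is made up):

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
    namespace qi = boost::spirit::qi;

    std::string input = "12345";
    int value = 0;

    // Parse an integer in place, without constructing intermediate strings.
    std::string::const_iterator first = input.begin(), last = input.end();
    bool ok = qi::parse(first, last, qi::int_, value);

    if (ok && first == last)
        std::cout << "parsed: " << value << "\n";
    return 0;
}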

Pricket answered 14/11, 2011 at 15:36 Comment(2)
The bottleneck is not the IO of the HDD itself (high-performance RAID system). The approach with a compressed file was tried but was slower than reading uncompressed data (see the initial post).Meingolda
GZip is oriented towards compression ratio, not compression speed. QuickLZ or FastLZ are decompression-speed oriented. Look at their benchmarks at quicklz.com. But of course if you read at 300Mb/sec, then it might not help.Stricker
