Fastest method to deal with string
I have a 20000-line text file to read and process. From it I want to read point coordinates and hand them to DirectX for rendering.

I have used std::ifstream, getline, and stringstream to get the point coordinates. After building and running the Win32 program, it takes far too long to read and store the point coordinates in the array (about 5 minutes for the 20000-line file). Code as below:

#include <fstream>
#include <sstream>
#include <string>

struct PointCoord { std::string PtName; float PtX = 0.0f; float PtY = 0.0f; };
PointCoord *PointPtr = NULL;
PointCoord pointcoord;
std::string TextHandler;
UINT Pt_Count = 0;

std::ifstream File_read(FileNameTXT);    
while (getline(File_read, TextHandler))
        {
            std::istringstream iss;
            std::string skip;
            if (TextHandler.find("  POINT ") != std::string::npos)
            {
                iss.str(TextHandler);
                std::string TempX, TempY;
                iss >> skip;
                iss >> pointcoord.PtName;                         

                //pointcoord pass value to PointCoord
                iss >> TempX;
                iss >> TempY;

                pointcoord.PtX = std::stof(TempX);
                pointcoord.PtY = std::stof(TempY);

                //dynamically store the point coordinates
                if (PointPtr == NULL)
                {
                    PointPtr = new PointCoord[1];                     

                    //PointCoord pass value to PointPtr
                    PointPtr[0] = pointcoord;
                    Pt_Count++;
                }
                else
                {
                    PointCoord *Temp = PointPtr;
                    PointPtr = new PointCoord[Pt_Count + 1];

                    for (UINT i = 0; i < Pt_Count; i++)
                    {
                        PointPtr[i] = Temp[i];
                    }
                    PointPtr[Pt_Count] = pointcoord;
                    Pt_Count++;
                    delete[] Temp;
                }
            }//end of loading points
      }//end of getline

I also used std::fread to read the whole text file at once into a string buffer, which is fast (reading completes in a few seconds). Then I used a stringstream in similar code to store the point coordinates in the dynamic array, which was also too slow.

Any suggestions are welcome. Many thanks.

Reduced answered 22/10, 2018 at 5:27 Comment(3)
All that dynamic memory allocation and copying won't be helping...Sarita
Your code is doing an O(n**2) operation with all that array shuffling around. You should either preallocate the array at the correct size or use vector which does a smarter resizing.Vile
When you profiled your code, where did the profiler say it was spending the most time?Tireless

The most damning thing in this code isn't the string parsing; it's the resizing of the target array for every new point read. The more you read, the worse it gets. Ultimately it becomes a copy operation on the order of O(n^2).

To give you an idea how bad that is, consider the basic summation of n natural numbers, because that's how many object constructions, destructions, and copies you're doing:

n(n+1)/2 = (20000 * 20001)/2 = 200010000 objects created, copied, and destroyed

So, string parsing isn't the problem. The 20000 lines of text being parsed is dwarfed by the over-200-million object constructions, destructions, and copies.


You don't need to do any of that. Use an appropriate container such as std::vector and approximate the initial reserve based on the file size. Then, just generate the points and move them into the container.

An example that does this, including generation of a test file of 100000 points (5x the size you're asking about), appears below:

Code

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
#include <random>
#include <chrono>

struct Point
{
    std::string name;
    float x;
    float y;
};

std::vector<Point> readFile(std::string const& fname)
{
    std::vector<Point> res;

    std::ifstream inp(fname);
    if (inp.is_open())
    {
        // gather file size for a rough approximation of reserve
        inp.seekg(0, std::ios::end);
        auto flen = inp.tellg();
        inp.seekg(0, std::ios::beg);
        res.reserve(flen/40);

        std::string line;
        while (std::getline(inp, line))
        {
            auto pos = line.find("POINT");
            if (pos != std::string::npos)
            {
                std::istringstream iss(line.substr(pos+5));
                Point pt;
                if (iss >> pt.name >> pt.x >> pt.y)
                    res.emplace_back(std::move(pt));
            }
        }
    }

    return res;
}


int main()
{
    using namespace std::chrono;

    std::mt19937 prng(std::random_device{}());
    std::uniform_real_distribution<float> dist(-100.0, 100.0);

    // generate test data
    std::ofstream outf("testpoints.txt");
    for (int i=1; i<=100000; ++i)
        outf << "POINT  \"" << i << "\" " << dist(prng) << ' ' << dist(prng) << '\n';
    outf.close();

    // rough benchmark
    auto tp0 = steady_clock::now();
    auto v = readFile("testpoints.txt");
    auto tp1 = steady_clock::now();
    std::cout << "Read " << v.size() << " points in ";
    std::cout << duration_cast<milliseconds>(tp1-tp0).count() << "ms\n";
    v.clear();
}

Output

Running on a 2015 dual-core i7 MacBook Air laptop, a release-mode build produces the following result:

Read 100000 points in 164ms

A Possibly More Appropriate Container: std::deque

In the end, what you really need is a container that allows quick insertion at the end while minimizing (or eliminating) element copying during resizing. Certainly, as the code above shows, setting up a reserve against a std::vector is one way to do that. Another option is to use a container that actually specializes in end-insertions while still being mostly cache-friendly (not perfect like std::vector, but certainly better than something like a linked list, which is great for insertions but dreadful for enumeration).

That's exactly what std::deque will do. The code above, changed for std::deque, allows you to eliminate the reserve-guess, and simply start slamming nodes on the end of the sequence, which will automatically add more pages as the sequence grows:

Code

#include <iostream>
#include <fstream>
#include <sstream>
#include <deque>
#include <string>
#include <random>
#include <chrono>

struct Point
{
    std::string name;
    float x;
    float y;
};

std::deque<Point> readFile(std::string const& fname)
{
    std::deque<Point> res;

    std::ifstream inp(fname);
    if (inp.is_open())
    {
        std::string line;
        while (std::getline(inp, line))
        {
            auto pos = line.find("POINT");
            if (pos != std::string::npos)
            {
                std::istringstream iss(line.substr(pos+5));
                Point pt;
                if (iss >> pt.name >> pt.x >> pt.y)
                    res.emplace_back(std::move(pt));
            }
        }
    }

    return res;
}


int main()
{
    using namespace std::chrono;

    std::mt19937 prng(std::random_device{}());
    std::uniform_real_distribution<float> dist(-100.0, 100.0);

    // generate test data
    std::ofstream outf("testpoints.txt");
    for (int i=1; i<=100000; ++i)
        outf << "POINT  \"" << i << "\" " << dist(prng) << ' ' << dist(prng) << '\n';
    outf.close();

    // rough benchmark
    auto tp0 = steady_clock::now();
    auto v = readFile("testpoints.txt");
    auto tp1 = steady_clock::now();
    std::cout << "Read " << v.size() << " points in ";
    std::cout << duration_cast<milliseconds>(tp1-tp0).count() << "ms\n";
    v.clear();
}

Output

Read 100000 points in 160ms

If your needs require a contiguous sequence, the std::vector approach is the way to go. If you just need random access to the elements and want fast end insertions, a std::deque may be a better fit. Think about that, and choose the best fit for you.


Summary

Get rid of that dreadful expansion algorithm. It's the sore point in your code. Replace it with a geometric resize algorithm, and start with a rough approximation of the number of elements you're going to need from the beginning. Or use a container suited for optimal end-insertions. Either way, it's better than what you have now.

Kandis answered 22/10, 2018 at 6:6 Comment(0)
S
1

If you want to improve further, there are two things you can do.

  1. Instead of reading the file line by line, you can memory-map the complete file and use pointer-based C-style code to split it into lines, and then each line into fields. It's possible to do both steps in a single pass, as you're doing now. This way you can write code that doesn't copy strings, except for the names. If you use Visual C++, see the CAtlFileMapping<char> template class.

  2. The standard float-parsing code, the one in std::strtof, is very generic. If your text is ASCII and the numbers are stored in the usual form like -12.3456, i.e. you don't have INF or NAN and you don't have numbers in scientific notation like -1.23E+1, you can write your own version of strtof that will be 2-3 times faster than the standard one.

Stubble answered 22/10, 2018 at 12:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.