The most damning thing in this code isn't the string parsing; it's the resizing of the target array for every new point read. The more points you read, the worse it gets, ultimately amounting to a copy operation on the order of O(n^2).
To give you an idea how bad that is, consider the summation of the first n natural numbers, because that's how many object constructions, copies, and destructions you're doing:

n(n+1)/2 = (20000 * 20001)/2 = 200010000 objects created, copied, and destroyed

So string parsing isn't the problem. Parsing 20000 lines of text is dwarfed by over 200 million object constructions, copies, and destructions.
You don't need to do any of that. Use an appropriate container such as std::vector, and approximate the initial reserve from the file size. Then just generate the points and move them into the container.
An example that does this, including generating a test file of 100000 points (5x the size you're asking about), appears below:
Code
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
#include <random>
#include <chrono>

struct Point
{
    std::string name;
    float x;
    float y;
};

std::vector<Point> readFile(std::string const& fname)
{
    std::vector<Point> res;
    std::ifstream inp(fname);
    if (inp.is_open())
    {
        // gather file size for a rough approximation of reserve
        inp.seekg(0, std::ios::end);
        auto flen = inp.tellg();
        inp.seekg(0, std::ios::beg);
        res.reserve(flen / 40);

        std::string line;
        while (std::getline(inp, line))
        {
            auto pos = line.find("POINT");
            if (pos != std::string::npos)
            {
                std::istringstream iss(line.substr(pos + 5));
                Point pt;
                if (iss >> pt.name >> pt.x >> pt.y)
                    res.emplace_back(std::move(pt));
            }
        }
    }
    return res;
}

int main()
{
    using namespace std::chrono;
    std::mt19937 prng(std::random_device{}());
    std::uniform_real_distribution<float> dist(-100.0f, 100.0f);

    // generate test data
    std::ofstream outf("testpoints.txt");
    for (int i = 1; i <= 100000; ++i)
        outf << "POINT \"" << i << "\" " << dist(prng) << ' ' << dist(prng) << '\n';
    outf.close();

    // rough benchmark
    auto tp0 = steady_clock::now();
    auto v = readFile("testpoints.txt");
    auto tp1 = steady_clock::now();
    std::cout << "Read " << v.size() << " points in ";
    std::cout << duration_cast<milliseconds>(tp1 - tp0).count() << "ms\n";
    v.clear();
}
Output
Running on a 2015 dual-core i7 MacBook Air laptop, a release-mode build produces the following result:
Read 100000 points in 164ms
A Possibly More Appropriate Container: std::deque
In the end, what you really need is a container that allows quick insertion at the end while minimizing (or eliminating) element copying during resizing. Certainly, as the code above shows, setting up a reserve against a std::vector is one way to do that. Another option is to use a container that actually specializes in end-insertion while still being mostly cache-friendly (not perfect like std::vector, but certainly better than something like a linked list, which is great for insertions but dreadful for enumeration).

That's exactly what std::deque will do. The code above, changed for std::deque, allows you to eliminate the reserve guess and simply start slamming elements onto the end of the sequence, which automatically adds more pages as the sequence grows:
Code
#include <iostream>
#include <fstream>
#include <sstream>
#include <deque>
#include <string>
#include <random>
#include <chrono>

struct Point
{
    std::string name;
    float x;
    float y;
};

std::deque<Point> readFile(std::string const& fname)
{
    std::deque<Point> res;
    std::ifstream inp(fname);
    if (inp.is_open())
    {
        std::string line;
        while (std::getline(inp, line))
        {
            auto pos = line.find("POINT");
            if (pos != std::string::npos)
            {
                std::istringstream iss(line.substr(pos + 5));
                Point pt;
                if (iss >> pt.name >> pt.x >> pt.y)
                    res.emplace_back(std::move(pt));
            }
        }
    }
    return res;
}

int main()
{
    using namespace std::chrono;
    std::mt19937 prng(std::random_device{}());
    std::uniform_real_distribution<float> dist(-100.0f, 100.0f);

    // generate test data
    std::ofstream outf("testpoints.txt");
    for (int i = 1; i <= 100000; ++i)
        outf << "POINT \"" << i << "\" " << dist(prng) << ' ' << dist(prng) << '\n';
    outf.close();

    // rough benchmark
    auto tp0 = steady_clock::now();
    auto v = readFile("testpoints.txt");
    auto tp1 = steady_clock::now();
    std::cout << "Read " << v.size() << " points in ";
    std::cout << duration_cast<milliseconds>(tp1 - tp0).count() << "ms\n";
    v.clear();
}
Output
Read 100000 points in 160ms
If your needs require a contiguous sequence, the std::vector approach is the way to go. If you just need random access to the elements and fast end-insertion, a std::deque may be a better fit. Think about that, and choose whichever best fits your use case.
Summary
Get rid of that dreadful expansion algorithm; it's the sore point in your code. Replace it with a geometric resize algorithm, and start with a rough approximation of the number of elements you're going to need. Or use a container suited for optimal end-insertion. Either way, it's better than what you have now.