How can I increase the performance in a map lookup with key type std::string?

I

14

11

I'm using a std::map (VC++ implementation) and it's a little slow for lookups via the map's find method.

The key type is std::string.

Can I increase the performance of this std::map lookup via a custom key compare override for the map? For example, maybe std::string < compare doesn't take into consideration a simple string::size() compare before comparing its data?

Any other ideas to speed up the compare?

In my situation the map will always contain < 15 elements, but it is being queried non stop and performance is critical. Maybe there is a better data structure that I can use that would be faster?

Update: The map contains file paths.

Update2: The map's elements are changing often.

Instantly answered 1/11, 2008 at 20:6 Comment(0)

E

15

First, turn off all the profiling and DEBUG switches. These can slow down STL immensely.

If that's not it, part of the problem may be that your strings are identical for the first 80-90% of the string. This isn't bad for map, necessarily, but it is for string comparisons. If this is the case, your search can take much longer.

For example, in this code find() will likely result in a couple of string compares, but each will return after comparing the first character until "david", and then the first three characters will be checked. So at most, 5 characters will be checked per call.

map<string,int> names;
names["larry"] = 1;
names["david"] = 2;
names["juanita"] = 3;

map<string,int>::iterator iter = names.find("daniel");

On the other hand, in the following code, find() will likely check 135+ characters:

map<string,int> names;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/wilma"] = 1;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/fred"] = 2;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/barney"] = 3;

map<string,int>::iterator iter = names.find("/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/betty");

That's because the string comparisons have to search deeper to find a match since the beginning of each string is the same.

Using size() in your comparison for equality won't help you much here since your data set is so small. A std::map is kept sorted so its elements can be searched with a binary search. Each call to find should result in less than 5 string comparisons for a miss, and an average of 2 comparisons for a hit. But it does depend on your data. If most of your path strings are of different lengths, then a size check like Motti describes could help a lot.

Something to consider when thinking of alternative algorithms is how many many "hits" you get. Are most of your find() calls returning end() or a hit? If most of your find()s return end() (misses) then you are searching the entire map every time (2logn string compares).

Hash_map is a good idea; it should cut your search time in about half for hits; more for misses.

A custom algorithm may be called for because of the nature of path strings, especially if your data set has common ancestry like in the above code.

Another thing to consider is how you get your search strings. If you are reusing them, it may help to encode them into something that is easier to compare. If you use them once and discard them, then this encoding step is probably too expensive.

I used something like a Huffman coding tree once (a long time ago) to optimize string searches. A binary string search tree like that may be more efficient in some cases, but its pretty expensive for small sets like yours.

Finally, look into alternative std::map implementations. I've heard bad things about some of VC's stl code performance. The DEBUG library in particular is bad about checking you on every call. StlPort used to be a good alternative, but I haven't tried it in a few years. I've always loved Boost too.

Epitasis answered 1/11, 2008 at 23:23 Comment(1)

Thanks for the debug part, never thought that debug checks can eat so much time (50 ms per operator[] in my map<int,...>). – Brawl 22/5, 2013 at 12:38

D

16

As Even said the operator used in a set is < not ==.

If you don't care about the order of the strings in your set you can pass the set a custom comparator that performs better than the regular less-than.

For example if a lot of your strings have similar prefixes (but they vary in length) you can sort by string length (since string.length is constant speed).

If you do so beware a common mistake:

struct comp {
    bool operator()(const std::string& lhs, const std::string& rhs)
    {
        if (lhs.length() < rhs.length())
            return true;
        return lhs < rhs;
    }
};

This operator does not maintain a strict weak ordering, as it can treat two strings as each less than the other.

string a = "z";
string b = "aa";

Follow the logic and you'll see that comp(a, b) == true and comp(b, a) == true.

The correct implementation is:

struct comp {
    bool operator()(const std::string& lhs, const std::string& rhs)
    {
        if (lhs.length() != rhs.length())
            return lhs.length() < rhs.length();
        return lhs < rhs;
    }
};

Detrude answered 1/11, 2008 at 20:43 Comment(3)

not a bad idea (though I would rewrite it with temps so that the length() calls happen once not twice. But I suspect that these extra compares have the potential to mitigate any gains given the small number of elements in the map. – Bidentate 1/11, 2008 at 20:56

I like the idea to define a non-lexicographic ordering that is faster to evaluate. – Bagel 1/11, 2008 at 21:4

They're just integer compares, you never know till you measure. As for temps for length, sure I would do that too, just wanted to keep the example clear. – Detrude 1/11, 2008 at 21:5

E

15

First, turn off all the profiling and DEBUG switches. These can slow down STL immensely.

If that's not it, part of the problem may be that your strings are identical for the first 80-90% of the string. This isn't bad for map, necessarily, but it is for string comparisons. If this is the case, your search can take much longer.

For example, in this code find() will likely result in a couple of string compares, but each will return after comparing the first character until "david", and then the first three characters will be checked. So at most, 5 characters will be checked per call.

map<string,int> names;
names["larry"] = 1;
names["david"] = 2;
names["juanita"] = 3;

map<string,int>::iterator iter = names.find("daniel");

On the other hand, in the following code, find() will likely check 135+ characters:

map<string,int> names;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/wilma"] = 1;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/fred"] = 2;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/barney"] = 3;

map<string,int>::iterator iter = names.find("/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/betty");