Perfect hash function generator for functions
Asked Answered
F

1

6

I have a set of C++ functions. I want to map this functions in an hash table, something like: unordered_map<function<ReturnType (Args...)> , SomethingElse>, where SomethingElse is not relevant for this question.

This set of functions is previously known, small (let say less than 50) and static (is not gonna change).

Since lookup performance is crucial (should be performed in O(1)), I want to define a perfect hashing function.

There exists a perfect hash function generator for this scenario?

I know that there exists perfect hashing functions generators (like GPERF or CMPH) but since I've never used them, I don't know if they're suitable for my case.

REASON:

I'm trying to design a framework where, given a program written in C++, the user can select a subset F of the functions defined in this program.

For each f belonging to F, the framework implements a memoization strategy: when we call f with input i, we store (i,o) inside some data structure. So, if we are going to call AGAIN f with i, we will return o without performing again the (time expensive) computation.

The "already computed results" will be shared among different users (maybe on the cloud), so if user u1 has already computed o, user u2 will save computing time calling f with i (using the same annotation of before).

Obviously, we need to store the set of pairs (f,inputs_sets) (where inputs_sets is the already computed results set that I talked before), which is the original question: how do I do it?

So, using the "enumeration trick" proposed in the comments in this scenario could be a solution, assuming that the all the users use the exactly same enumeration, which could be a problem: supposing that our program has f1,f2,f3 what if u1 wants to memoize only f1 and f2 (so F={f1,f2}), while u2 wants to memoize only f3 (so F={f3})? An overkill solution could be to enumerate all the functions defined in the program, but this could generate an huge waste of memory.

Freytag answered 18/4, 2016 at 4:28 Comment(18)
Your std::function keys won't expose that much you can use to implement the hashing and equality functions needed by unordered_map - you're basically limited to target_type and target. You wouldn't normally want or expect to restrict the functions to having different target_types, which leaves target - a pointer to the stored target. Seems to me that could vary from compile to compile, even for extern functions - hard to extract it, run gperf or similar, edit it into the code and know it won't have changed.Klemm
Would it be considered cheating to just declare an enumeration (with one enum value per function) and an const-array of functions and do your 'lookups' as array lookups, using the enum values as your keys? I think that would avoid the hashing issue entirely and give you optimal performance.Jerol
How do you intend to use your map? I mean - are your lookups performed only by explicit name, e.g. map.lookup(foo), where foo is an actual function name, or there can be some run-time function pointer passed as the argument? If the former is the case, you probably want the hashing to be performed at compile-time too, in which case it is no longer time-critical.Intimacy
if the map won;t change then it surely can be deduced at compile time. In which case, the indexes of a std::tuple can be deduced at compile time, in which case the index into an array of (whatever the value is) can be used. this will give a lookup time of O(1) (but in practice, instantaneous) rather than the AVERAGE O(1) of an unordered map. If you provide more details we can offer you an optimal solution.Aldarcie
Updated with REASON section, hoping that it will help you to understand the problem.Freytag
For n < 50 I'd check, if binary search in a sorted vector isn't fasterAbsher
@Absher binary search require equality, which is not implemented for std::function (as @Tony well suggested). @hvd I never talked about strings.Freytag
@justHelloWorld: Yes, as Tony said, you can't use a function<ReturnType (Args...)> directly anyway, but whatever you use for lookup in the end, I'm not convinced, that a std::unordered_map gives you the best performance.Absher
Does the key have to be a std::function or would a function pointer do?Absher
If you're wanting to reuse these across users who may have picked different subsets of the functions, it seems like your hash table would need to be big enough for all the functions anyway, so the enumeration trick isn't really losing you anything much. Besides, how many functions are there going to be in a typical one of these programs? Worrying about a few bytes per function seems really unnecessary to me. (The program itself will take up much, much more.)Chucklehead
@Absher actually any solution that uniquely identify a function should be ok. So we could even use function's header as a string, but this extremely ugly (and inefficient also I think). I think that using std::function should be safer and more efficient.Freytag
@Freytag Sorry, I was misled by your reference to gperf (which you did say you weren't sure would work for you). That one is a perfect hash generator for strings. If you don't have strings, it won't work for you.Teneshatenesmus
@GarethMcCaughan so your solution wouldd be the "overkill" one, where we index/enumerate all the functions inside the program?Freytag
My point is that a normal function pointer provides equality, is easily sortable, easily hashable (not necessarily perfectly though) and has a minimal memory footprint. However it only works for functions - not for e.g. funtion objects. (And no: std::function is definitively not more efficient or safer).Absher
So, the hashing function must preserve values among multiple runs, or runs in different environments? That is actually harder than the original problem statement you gave :(Intimacy
@Absher as I described in the question, this mapping/enumerating/indexing object will be shared and used by multiple users, so it should be not "execution dependent", like a pointer value is, but maybe I'm wrongFreytag
@Intimacy possible both! Probably this application will be run on the cloud, so this hashing object will be read and written by multiple runs in different moments and possibly different environments, but this could depend on how good the proposed solution is!Freytag
Yeah, that is crucial - because the first intuition is to do something with the function address - which can be different in different runs and compilations. So the first question should be, how to obtain any kind of consistent ID, no matter if used in hashing or not.Intimacy
R
5

Ok, maybe not what you want to hear but consider this: since you talk about a few functions, less than 50, the hash lookup should be negligible, even with collisions. Have you actually profiled and saw that the lookup is critical?

So my advise is to focus your energy on something else, most likely a perfect hash function would not bring any kind of improved performance in your case.

I am going to go one step further and say that I think that for less than 50 elements a flat map (good ol' vector) would have similar performance (or maybe even better due to cache locality). But again, measurements are required.

Raimes answered 18/4, 2016 at 8:46 Comment(3)
that's not to say that the question is not interesting. There should be an actual solution for cases where it really matters and for the fun of finding a solution. I just think that in your specific case it is not justified.Raimes
I've not profiled yet. So you're suggestin' to try with a simple vector and do a linear search on it?Freytag
@Freytag yes.. You can also try boost flat_mat which keeps them sorted and has a log n search complexity. And profile! Without profiling there is almost no point in talking about optimizations. Find the hotspots and the bottlenecks and then optimize that!. There's absolutely no point in optimizing a portion of code where the program spends only 0.1 % of it's running time.Raimes

© 2022 - 2024 — McMap. All rights reserved.