Speed-up and best practices: Using ets for per-module pre-computed data

Asked 30/5, 2011 at 23:2 Answered 31/5, 2011 at 23:2

((Please forgive me that I ask more than one question in a single thread. I think they are related.))

Hello, I would like to know what best practices exist in Erlang with regard to per-module precompiled data.

Example: I have a module that operates heavily on a priori known, very complex regular expressions. The documentation for re:compile/2 says: “Compiling once and executing many times is far more efficient than compiling each time one wants to match”. Since re's mp() datatype is not specified, and so cannot be embedded at compile time if you want a target-independent beam, one has to compile the regex at runtime. ((Note: re:compile/2 is only an example. Any complex function to memoize would fit my question.))

Erlang modules can have an -on_load(F/A) attribute, denoting a function that should be executed once when the module is loaded. As such, I could compile my regexes in this function and save the result in a new ets table named ?MODULE.

Updated after Dan's answer.

My questions are:

If I understand ets correctly, its data is stored in another process (unlike the process dictionary), and retrieving a value from an ets table is quite expensive. (Please correct me if I am wrong!) Should the content of the ets table be copied to the process dictionary for speed? (Remember: the data is never updated.)
Are there any (considerable) drawbacks to putting all data as one record (instead of many table items) into the ets table or process dictionary?

Working example:

-module(memoization).
-export([is_ipv4/1, fillCacheLoop/0]).
-record(?MODULE, { re_ipv4 = re_ipv4() }).
-on_load(fillCache/0).

fillCacheLoop() ->
    receive
        { replace, NewData, Callback, Ref } ->
            true = ets:insert(?MODULE, [{ data, {self(), NewData} }]),
            Callback ! { on_load, Ref, ok },
            ?MODULE:fillCacheLoop();
        purge ->
            ok
    end
.
fillCache() ->
    Callback = self(),
    Ref = make_ref(),
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() ->
        case catch ets:lookup(?MODULE, data) of
            [{data, {TableOwner,_} }] ->
                TableOwner ! { replace, #?MODULE{}, self(), Ref },
                receive
                    { on_load, Ref, Result } ->
                        Callback ! { on_load, Ref, Result }
                end,
                ok;
            _ ->
                ?MODULE = ets:new(?MODULE, [named_table, {read_concurrency,true}]),
                true = ets:insert_new(?MODULE, [{ data, {self(), #?MODULE{}} }]),
                Callback ! { on_load, Ref, ok },
                fillCacheLoop()
        end
    end),
    receive
        { on_load, Ref, Result } ->
            unlink(Pid),
            Result;
        { 'EXIT', Pid, Result } ->
            Result
    after 1000 ->
        error
    end
.

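is_ipv4(Addr) ->
    %% Cache the ETS data in the process dictionary under a tuple key.
    %% The original dotted-atom key (?MODULE.data) relied on the experimental
    %% packages feature, which later OTP releases removed.
    Key = {?MODULE, data},
    Data = case get(Key) of
        undefined ->
            [{data, {_,Result} }] = ets:lookup(?MODULE, data),
            put(Key, Result),
            Result;
        SomeDatum -> SomeDatum
    end,
    re:run(Addr, Data#?MODULE.re_ipv4)
.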

re_ipv4() ->
    {ok, Result} = re:compile("^0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
            "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])$"),
    Result
.

Thetis answered 30/5, 2011 at 23:2 Comment(1)

I think putting the retrieved results in the process dictionary is fine, given that they are write-once results; if they were ever updated, it wouldn't feel very Erlangic. – Outsert 30/5, 2011 at 23:15

mochiglobal implements this by compiling a new module to store your constant(s). The advantage here is that the memory is shared across processes, whereas in ets it's copied and in the process dictionary it's local to that one process.

https://github.com/mochi/mochiweb/blob/master/src/mochiglobal.erl

Glissando answered 31/5, 2011 at 1:23 Comment(0)

You have another option. You can precompute the regular expression's compiled form and refer to it directly. One way to do this is to use a module designed specifically for this purpose such as ct_expand: http://dukesoferl.blogspot.com/2009/08/metaprogramming-with-ctexpand.html

You can also roll your own by generating a module on the fly with a function to return this value as a constant (taking advantage of the constant pool): http://erlang.org/pipermail/erlang-questions/2011-January/056007.html
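As a rough illustration of the "generate a module on the fly" idea, here is a minimal sketch. The module name const_gen, the function store/2, and the generated function value/0 are hypothetical names chosen for this example; they are not taken from mochiglobal or from the linked mailing-list post. It assumes the term can be turned into abstract code by erl_parse:abstract/1, which is not the case for terms containing pids, references, ports, or funs.

```erlang
-module(const_gen).
-export([store/2]).

%% Compile and load a module Mod whose only function, Mod:value/0,
%% returns Term as a literal. Literals live in the module's constant
%% pool, so callers get the term without it being copied per call.
%% Assumption: Term contains no pids, refs, ports or funs.
store(Mod, Term) when is_atom(Mod) ->
    Forms = [
        {attribute, 1, module, Mod},
        {attribute, 2, export, [{value, 0}]},
        {function, 3, value, 0,
            [{clause, 3, [], [], [erl_parse:abstract(Term)]}]}
    ],
    {ok, Mod, Bin} = compile:forms(Forms),
    %% Drop any old version so the new code can be loaded.
    code:purge(Mod),
    {module, Mod} = code:load_binary(Mod, atom_to_list(Mod) ++ ".erl", Bin),
    ok.
```

After something like const_gen:store(my_consts, {limits, 10, <<"abc">>}), any process can call my_consts:value() to read the term. Replacing the value means compiling and loading the module again, so this only suits data that changes rarely.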

~~Or you could even run re:compile in a shell and copy and paste the result into your code. Crude but effective.~~ This wouldn't be portable in case the implementation changes.

To be clear: all of these take advantage of the constant pool to avoid recomputing every time. But of course, this is added complexity and it has a cost.

Coming back to your original question: the problem with the process dictionary is that, well, it can only be used by its own process. Are you certain this module's functions will only be called by the same process? Even ETS tables are tied to the process that creates them (ETS is not itself implemented using processes and message passing, though) and will die if that process dies.

Calve answered 31/5, 2011 at 3:20 Comment(0)

ETS isn't implemented in a process and doesn't have its data in a separate process heap, but it does have its data in a separate area outside of all processes. This means that when reading/writing to ETS tables data must be copied to/from processes. How costly this is depends, of course, on the amount of data being copied. This is one reason why we have functions like ets:match_object and ets:select which allow more complex selection rules before data is copied.
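To make the copying point concrete, the following sketch contrasts reading a whole object with reading only part of it. The module name ets_select_demo and the table contents are made up for illustration.

```erlang
-module(ets_select_demo).
-export([demo/0]).

demo() ->
    T = ets:new(demo, [set]),
    true = ets:insert(T, [{a, 1, <<"large payload a">>},
                          {b, 2, <<"large payload b">>}]),
    %% lookup/2 copies the whole stored tuple into the calling process.
    [{a, 1, _Payload}] = ets:lookup(T, a),
    %% lookup_element/3 copies only the requested element.
    1 = ets:lookup_element(T, a, 2),
    %% select/2 filters inside ETS and copies only what the match
    %% specification returns: here, the keys whose second field is > 1.
    [b] = ets:select(T, [{{'$1', '$2', '_'}, [{'>', '$2', 1}], ['$1']}]),
    true = ets:delete(T),
    ok.
```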

One benefit of keeping your data in an ETS table is that it can be reached by all processes not just the process which owns the table. This can make it more efficient than keeping your data in a server. It also depends on what type of operations you want to do on your data. ETS is just a data store and provides limited atomicity. In your case that is probably no problem.

You should definitely keep your data in separate records, one for each compiled regular expression, as it will greatly increase the access speed. You can then directly fetch the re you are after; otherwise you fetch them all and then have to search for the one you want. That sort of defeats the point of putting them in ETS.
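A minimal sketch of the one-entry-per-regex layout might look like this. The module name re_cache, the keys ipv4 and hex, and the deliberately simplified patterns are assumptions made for the example, not the question's actual regexes. Note that the table belongs to whichever process calls init/0, and calling init/0 twice fails because the named table already exists.

```erlang
-module(re_cache).
-export([init/0, match/2]).

%% Create the named table and store one entry per compiled regex.
init() ->
    ?MODULE = ets:new(?MODULE, [named_table, set, protected,
                                {read_concurrency, true}]),
    {ok, Ipv4} = re:compile("^\\d{1,3}(\\.\\d{1,3}){3}$"),
    {ok, Hex} = re:compile("^[0-9a-fA-F]+$"),
    true = ets:insert(?MODULE, [{ipv4, Ipv4}, {hex, Hex}]),
    ok.

%% Fetch only the compiled pattern stored under Name and run it.
%% Returns match or nomatch.
match(Name, Subject) ->
    MP = ets:lookup_element(?MODULE, Name, 2),
    re:run(Subject, MP, [{capture, none}]).
```

With this layout, re_cache:match(ipv4, "10.0.0.1") copies just one compiled pattern out of the table instead of a record holding all of them.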

While you can build ETS tables in on_load functions, it is not a good idea. This is because an ETS table is owned by a process and is deleted when that process dies, and you never really know in which process the on_load function is called. You should also avoid doing things which can take a long time, as the module is not considered loaded until the on_load function has completed.

Generating a parse transform to statically insert the result of compiling your re's directly into your code is a cool idea, especially if your re's are really that statically defined. So is the idea of dynamically generating, compiling and loading a module into your system. Again, if your data is that static, you could generate this module at compile time.
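The ownership rule can be observed with a small experiment. In this sketch (the module name ets_owner_demo is hypothetical) a helper process creates a table, and the caller checks that the table is gone once that process has exited.

```erlang
-module(ets_owner_demo).
-export([demo/0]).

demo() ->
    Parent = self(),
    %% The spawned process becomes the owner of the table it creates.
    {Pid, MRef} = spawn_monitor(fun() ->
        Tab = ets:new(short_lived, [public]),
        Parent ! {table, Tab},
        receive stop -> ok end
    end),
    T = receive {table, Tid} -> Tid end,
    %% While the owner is alive, ets:info/1 returns a property list.
    true = is_list(ets:info(T)),
    Pid ! stop,
    receive {'DOWN', MRef, process, Pid, _Reason} -> ok end,
    %% After the owner has exited, the table no longer exists.
    undefined = ets:info(T),
    ok.
```

This is why a table created from an on_load function needs a long-lived owner, which is what the spawned loop process in the question's example is for.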

Martinsen answered 31/5, 2011 at 23:2 Comment(0)
