Dealing with a massive number of rules (conditions and constraints) in CEP systems
I'm working on an application that accepts 100k+ unique inputs - for simplicity, let's assume each input is a floating-point value (a, b, c, etc.), though they could also be strings. The application has many rules/events/triggers that are related to these inputs.

Example1:

Rule[(a > b) and (c <= d)] --> [execute EventX]

Definition1: The above rule says: when the value of input 'a' is greater than that of 'b', and the value of input 'c' is less than or equal to that of 'd', execute EventX.

Example2:

Rule[x != x.prev] --> [execute EventZ]

Definition2: The above rule says: if, after a value update, the current value of input 'x' is not equal to its previous value (i.e. the value has changed), execute EventZ.

What I'm finding is that as the number of inputs increases, and as the number of rules increases, the total amount of computation required to evaluate the relevant rules and determine whether an event should be triggered is getting out of control - throwing faster and more hardware at the problem isn't scaling as expected.

At the moment, upon every new input update, the associated input/variable is looked up in a hash table that maps the variable to the rules that contain it. Each of those rules is then evaluated, and if the result is true/actionable, the relevant event is triggered asynchronously.
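
Roughly, the current scheme looks like the sketch below (a minimal C++ illustration of my own; the names are made up and the event call is shown synchronously for brevity, whereas the real system fires events asynchronously):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Sketch of the current dispatch scheme: each input name maps to the rules
    // that mention it, and every update re-evaluates all of those rules in full.
    struct Rule {
        std::function<bool()> condition;   // e.g. [&]{ return a > b && c <= d; }
        std::function<void()> event;       // fired asynchronously in the real system
    };

    struct Engine {
        std::unordered_map<std::string, double> values;
        std::vector<Rule> rules;
        std::unordered_map<std::string, std::vector<std::size_t>> rulesByInput;

        void onUpdate(const std::string& input, double newValue) {
            values[input] = newValue;
            for (std::size_t id : rulesByInput[input])   // every rule containing 'input'
                if (rules[id].condition())               // full re-evaluation each time
                    rules[id].event();                   // trigger the associated event
        }
    };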

This problem is in the realm of Complex Event Processing (CEP). Unfortunately, most of the architectures in this realm use the same brain-dead approach described above - perhaps with a few refinements relating to the frequency with which things are evaluated/re-evaluated. The idea is to have a system that can react in near real-time. Distributing the rules and inputs over multiple nodes doesn't seem to work well either, as in some situations a small minority of inputs appears in over 95% of the active rules.

I was hoping some fellow SO'ers out there know of better methods, data structures, or algorithms to solve this problem.

A simple idea I have in mind is that one could construct a list of dependent logical inferences.

If there are two rules that are like so:

Rule[a < b] --> [exec E1]
Rule[b >= a] --> [exec E2]

Then an update on either 'a' or 'b' should not require the evaluation of both rules, and so on. But I find that building such logical inference structures is highly complex and error-prone, and difficult to test completely and rigorously.
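
To make the idea concrete, here is a minimal sketch (my own illustration, not production code) of how two complementary rules like the pair above could share a single comparison:

    #include <cstdio>

    // Sketch: 'a < b' and 'b >= a' are complements, so one comparison decides both rules.
    struct ComplementPair {
        void onUpdate(double a, double b) {
            bool aLessB = (a < b);    // evaluated exactly once per update
            if (aLessB) execE1();     // Rule[a < b]  --> E1
            else        execE2();     // Rule[b >= a] --> E2
        }
        void execE1() { std::puts("E1"); }
        void execE2() { std::puts("E2"); }
    };

    int main() {
        ComplementPair p;
        p.onUpdate(1.0, 2.0);   // fires E1
        p.onUpdate(3.0, 2.0);   // fires E2
    }

Detecting such complement/implication relationships automatically across thousands of rules is exactly the part I find complex and hard to test.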

Inputs could be representative of things such as stock prices, temperature sensors etc.

One further note: some of the inputs are temporally constrained, meaning that a rule may require a variable to have been in a specific state for some amount of time (e.g. the price of MSFT > $20 for the last 30 seconds). Currently this is accomplished by using a 'representing variable' (facade) whose value is either 0 or 1 / false or true (the value of that variable is determined by a separate mechanism, usually the result of a rule being triggered).
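
For illustration, a minimal sketch of such a facade (the class name, the steady_clock choice and the way the window is checked are mine, not the actual implementation):

    #include <chrono>
    #include <optional>

    // Sketch of a 'representing variable' for a temporal constraint such as
    // "price of MSFT > $20 for the last 30 seconds".
    class HeldForFacade {
        using Clock = std::chrono::steady_clock;
        std::chrono::seconds hold_;
        std::optional<Clock::time_point> since_;   // when the condition first became true
    public:
        explicit HeldForFacade(std::chrono::seconds hold) : hold_(hold) {}

        // Called on every update of the underlying input; returns the facade's 0/1 value.
        bool onUpdate(bool conditionNow, Clock::time_point now = Clock::now()) {
            if (!conditionNow) { since_.reset(); return false; }
            if (!since_) since_ = now;
            return (now - *since_) >= hold_;
        }
    };

In the real system the facade's value is set by a separate mechanism (usually the result of another rule being triggered), and it also needs some timer support, since the facade must be able to flip to true even when no new update arrives inside the window.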

Also it should be noted that due to the near real-time constraints and the amount of data being produced per second, options using a DB with triggers and stored procedures are out of the question.

Eversion answered 16/4, 2012 at 1:48 Comment(3)
Have you profiled this to determine where the bottleneck is? Is it in selecting the rules that need to be evaluated? Or is it in the actual evaluation and firing of events?Imaginary
@JimMischel: The evaluation of the rules will take a certain amount of time - the trick is to perform the minimum number of evaluations for a given set of input updates. In short, yes, we've profiled the code, and this is clearly NOT an issue of code-level efficiency (excessive copies, redundant OOP structures, etc.). Note that in the question I've mentioned that events are fired asynchronously; they are not the bottleneck in question.Eversion
You still haven't said what the bottleneck is. And just because you fire events asynchronously doesn't mean that they're not the problem. Are you saying that evaluation of the rules is the problem? That you've determined that the problem is due to evaluation of redundant rules?Imaginary

A couple ideas.

If the conditions of your rules are conjunctions, maintain for each unsatisfied condition a single unsatisfied term, and put the rule only on the check list for that term. If that term becomes satisfied, scan the other terms to determine either that the condition is now triggered or that there's another unsatisfied term to watch instead. (I think I learned this trick in the context of SAT solvers.)
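
A sketch of what I mean (terms are identified by integer ids; the surrounding engine is assumed to know which term ids an input update can flip, and the re-watch policy is deliberately simplistic):

    #include <functional>
    #include <unordered_map>
    #include <vector>

    // "Watch one unsatisfied term" for conjunctive rules, analogous to watched
    // literals in SAT solvers: a rule is re-examined only when its watched term
    // becomes true.
    struct ConjunctiveRule {
        std::vector<int> termIds;              // ids of the rule's conjuncts
        std::function<void()> event;
        int watchedTerm = -1;                  // a term that was false last time we looked
    };

    struct WatchIndex {
        std::vector<std::function<bool()>> termValue;         // termId -> predicate
        std::vector<ConjunctiveRule> rules;
        std::unordered_map<int, std::vector<int>> watchers;   // termId -> rule indices

        // Called when 'termId' flips from false to true.
        void onTermBecameTrue(int termId) {
            std::vector<int> watching = std::move(watchers[termId]);
            watchers[termId].clear();
            for (int r : watching) {
                ConjunctiveRule& rule = rules[r];
                int newWatch = -1;
                for (int t : rule.termIds)            // look for another false term
                    if (t != termId && !termValue[t]()) { newWatch = t; break; }
                if (newWatch >= 0) {
                    rule.watchedTerm = newWatch;      // keep sleeping on that term
                    watchers[newWatch].push_back(r);
                } else {
                    rule.event();                     // all conjuncts true: fire
                    rule.watchedTerm = termId;        // simplest re-watch choice
                    watchers[termId].push_back(r);
                }
            }
        }
    };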

If you have terms like 10 <= x <= 50, use an interval tree instead of a hash to quickly locate the satisfied terms that are about to become unsatisfied by an update to x, and the unsatisfied terms that are about to become satisfied. (O(log n) per search, plus O(1) per result returned.) There are multidimensional generalizations if considering variables one at a time produces too many spurious hits, but they would be rather more expensive to maintain.
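
For instance (a sketch only; I'm using boost::icl::interval_map as a convenient stand-in for a hand-rolled interval tree): index the terms by their intervals, do a stabbing query at the old and at the new value of x, and the symmetric difference of the two result sets is exactly the set of terms whose truth value flips.

    #include <boost/icl/interval_map.hpp>
    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <utility>
    #include <vector>

    // Index terms of the form lo <= x <= hi so that an update of x returns
    // only the terms whose truth value actually changes.
    struct IntervalIndex {
        boost::icl::interval_map<double, std::set<int>> byInterval;  // point -> ids of terms covering it

        void addTerm(int termId, double lo, double hi) {
            byInterval += std::make_pair(boost::icl::interval<double>::closed(lo, hi),
                                         std::set<int>{termId});
        }

        std::set<int> satisfiedAt(double x) const {
            auto it = byInterval.find(x);             // stabbing query, O(log n)
            return it == byInterval.end() ? std::set<int>{} : it->second;
        }

        // Terms whose truth value flips when x moves from oldX to newX.
        std::vector<int> flipped(double oldX, double newX) const {
            std::set<int> before = satisfiedAt(oldX), after = satisfiedAt(newX);
            std::vector<int> out;
            std::set_symmetric_difference(before.begin(), before.end(),
                                          after.begin(), after.end(),
                                          std::back_inserter(out));
            return out;
        }
    };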

If you don't have terms like that (e.g., a <= b), make derived inputs (b - a) and use your hash strategy to keep them up to date.
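
For example (sketch), a <= b becomes the fixed term d >= 0 on the derived input d = b - a, which the hash/interval machinery above can then handle unchanged:

    // Sketch: rewrite 'a <= b' as a derived input d = b - a with the
    // constant-threshold term d >= 0.
    struct DerivedDiff {
        double a = 0, b = 0;                                 // kept up to date via the hash strategy
        double value() const { return b - a; }
        bool termSatisfied() const { return value() >= 0; }  // i.e. a <= b
    };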

Formless answered 16/4, 2012 at 15:47 Comment(1)
Some very interesting ideas. I have one issue: the problem with breaking the rules down into conjunctions is that you end up with a greater number of expressions to evaluate. In their original form, long complex expressions can be short-circuited trivially - you sort of lose that ability when you break the expressions down.Eversion

Try to factor out common expressions, then group and cache them: that could especially speed up the most frequently used ones, so performance may improve.

How feasible that is depends on the language you are bound to use when expressing the logic. I see a possibility that your process could be modelled as a spreadsheet, where the computation is data-driven. A topological sort of the formulas with respect to their cell references suffices for a basic evaluation, but things become complex fast...
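
As a sketch of that spreadsheet-style, data-driven evaluation (the node layout and the precomputed topological rank are my own illustration): shared subexpressions become shared nodes in a DAG, and an input update re-evaluates only its downstream nodes, each exactly once, in dependency order, caching every intermediate result.

    #include <functional>
    #include <queue>
    #include <vector>

    // Spreadsheet-style recalculation: nodes are raw inputs or derived
    // expressions; an update propagates only to affected nodes, in topological
    // order, so each shared subexpression is computed once per update.
    struct ExprNode {
        std::function<double(const std::vector<double>&)> eval;  // args = parent values
        std::vector<int> parents;    // nodes this one reads
        std::vector<int> children;   // nodes that read this one
        int depth = 0;               // precomputed topological rank (parents have smaller depth)
        double value = 0;            // cached result
    };

    struct ExprGraph {
        std::vector<ExprNode> nodes;

        void onInputChanged(int inputNode, double newValue) {
            nodes[inputNode].value = newValue;
            // min-heap on depth: nodes closer to the inputs are evaluated first
            auto byDepth = [&](int a, int b) { return nodes[a].depth > nodes[b].depth; };
            std::priority_queue<int, std::vector<int>, decltype(byDepth)> todo(byDepth);
            std::vector<bool> queued(nodes.size(), false);
            for (int c : nodes[inputNode].children) { todo.push(c); queued[c] = true; }
            while (!todo.empty()) {
                int n = todo.top(); todo.pop();
                std::vector<double> args;
                for (int p : nodes[n].parents) args.push_back(nodes[p].value);
                nodes[n].value = nodes[n].eval(args);   // cached for all downstream readers
                for (int c : nodes[n].children)
                    if (!queued[c]) { todo.push(c); queued[c] = true; }
            }
        }
    };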

In C++ you could try to solve your logic with expression templates. A helper for modelling your language could be Boost.Proto.

ETALIS does CEP in Prolog.

I've not (yet) tried it myself (alas), but I believe it's very good, and it surely implements what you are after, and maybe more.

Running in SWI-Prolog, it should be accessible and easy to set up and operate, and SWI-Prolog's C++ interface is very handy for practical data exchange.

Unutterable answered 16/4, 2012 at 2:18 Comment(3)
ETALIS et al. have the same set of problems. I'm not asking for another software package that solves this problem; I'm asking for new ideas/strategies that could be used to solve it more efficiently. Furthermore, I'm not sure how a compile-time solution such as Proto is relevant here.Eversion
Seems you have already tried those alternatives! My compliments for that. But I'm fairly sure ETALIS doesn't use such a brain-dead strategy as the one you cite at the beginning... And factoring the code is a metalinguistic optimization that's not so easy to do right...Unutterable
Factorization requires dependency analysis of the rules versus the inputs/variables; the problems with this kind of dependency analysis are cyclic dependencies and discovering mutually exclusive states.Eversion

Distributing the rules and inputs over multiple nodes doesn't seem to work well either, as in some situations a minority of inputs exists in over 95% of the active rules.

Distribute the rules, and feed the inputs to all the machines. Then you can scale linearly with the number of rules... If you are on a LAN, you could send the events/inputs as multicast, so the network traffic won't increase.
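
A sketch of that layout, assuming a LAN and UDP multicast (the group address, port and sharding scheme below are just placeholders): every node joins the same group, sees the full input stream, and evaluates only its own shard of the rules.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    // Each node joins the same multicast group, receives every input update,
    // and evaluates only the rules assigned to it (e.g. ruleId % numNodes == nodeId).
    int main() {
        const char* kGroup = "239.0.0.1";   // placeholder multicast group
        const int   kPort  = 5000;          // placeholder port

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(kPort);
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

        ip_mreq mreq{};
        mreq.imr_multiaddr.s_addr = inet_addr(kGroup);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[1500];
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);   // one input update per datagram
            if (n <= 0) break;
            // parse the update, then evaluate only this node's shard of the rules...
        }
        close(fd);
    }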

Mag answered 16/4, 2012 at 15:55 Comment(0)
