Background
I have an ordered set of data points stored as a TreeSet<DataPoint>
. Each data point has a position
and a Set
of Event
objects (HashSet<Event>
).
There are 4 possible Event
objects A
, B
, C
, and D
. Every DataPoint
has 2 of these, e.g. A
and C
, except the first and last DataPoint
objects in the set, which have T
of size 1.
My algorithm is to find the probability of a new DataPoint
Q
at position x
having Event
q
in this set.
I do this by calculating a value S
for this data set, then adding Q
to the set and calculating S
again. I then divide the second S
by the first to isolate the probability for the new DataPoint
Q
.
Algorithm
The formula for calculating S
is:
http://mathbin.net/equations/105225_0.png
where
http://mathbin.net/equations/105225_1.png
http://mathbin.net/equations/105225_2.png
for http://mathbin.net/equations/105225_3.png
and
http://mathbin.net/equations/105225_4.png
http://mathbin.net/equations/105225_5.png is an expensive probability function that only depends on its arguments and nothing else (and http://mathbin.net/equations/105225_6.png), http://mathbin.net/equations/105225_7.png is the last DataPoint
in the set (righthand node), http://mathbin.net/equations/105225_8.png is the first DataPoint
(lefthand node), http://mathbin.net/equations/105225_9.png is the rightmost DataPoint
that isn't the node, http://mathbin.net/equations/105225_10.png is a DataPoint
,http://mathbin.net/equations/105225_12.png is the Set
of events for this DataPoint
.
So the probability for Q
with Event
q
is:
http://mathbin.net/equations/105225_11.png
Implementation
I implemented this algorithm in Java like so:
public class ProbabilityCalculator {
private Double p(DataPoint right, Event rightEvent, DataPoint left, Event leftEvent) {
// do some stuff
}
private Double f(DataPoint right, Event rightEvent, NavigableSet<DataPoint> points) {
DataPoint left = points.lower(right);
Double result = 0.0;
if(left.isLefthandNode()) {
result = 0.25 * p(right, rightEvent, left, null);
} else if(left.isQ()) {
result = p(right, rightEvent, left, left.getQEvent()) * f(left, left.getQEvent(), points);
} else { // if M_k
for(Event leftEvent : left.getEvents())
result += p(right, rightEvent, left, leftEvent) * f(left, leftEvent, points);
}
return result;
}
public Double S(NavigableSet<DataPoint> points) {
return f(points.last(), points.last().getRightNodeEvent(), points)
}
}
So to find the probability of Q
at x
with q
:
Double S1 = S(points);
points.add(Q);
Double S2 = S(points);
Double probability = S2/S1;
Problem
As the implementation stands at the moment it follows the mathematical algorithm closely. However this turns out not to be a particularly good idea in practice, as f
calls itself twice for each DataPoint
. So for http://mathbin.net/equations/105225_9.png, f
is called twice, then for the n-1
f
is called twice again for each of the previous calls, and so on and so forth. This leads to a complexity of O(2^n)
which is pretty terrible considering there can be over 1000 DataPoints
in each Set
. Because p()
is independent of everything except its parameters I have included a caching function where if p()
has already been calculated for these parameters it just returns the previous result, but this doesn't solve the inherent complexity problem. Am I missing something here with regards to repeat computations, or is the complexity unavoidable in this algorithm?
f
as well? Just move parameterpoints
from function parameter to class member. – Tarrancepoints
to the left ofright
this would mean the cache would used even after I addQ
to the points onceQ
has been passed in the process. – Belly