Imagine that I have a set of measurements of x that are taken by many processes x0 ... xN at times t0 ... tN. Let's assume that at time t I want to make an estimate of the current value of x, based on the assumption that there is no long term trend I know about and that x can be predicted from an algorithm such as exponential smoothing. As we have many processes, and N can get very large, I can't store more than a few values (e.g. the previous state).
One approach here would be to adapt the normal exponential smoothing algorithm. If samples are taken regularly, I would maintain an estimator yn such that:
yn = α.yn-1 + ( 1 - α ). xn
This approach is not great where the sampling is irregular as many samples together would have a disproportionate influence. Therefore this formula could be adapted to:
yn = αn.yn-1 + ( 1 - αn ). xn
where
αn = e-k.(tn - tn-1)
IE dynamically adjusting the smoothing constant depending on the interval between the previous two samples. I'm happy with this approach and it seems to work. It's the first answer given here, and a good summary of these sorts of techniques are given by Eckner in this 2012 paper (PDF).
Now, my question is as follows. I want to adapt the above to estimate the rate of an occurrence. Occasionally an event will occur. Using a similar exponential technique, I want to get an estimate of the rate that event occurs.
Two obvious strategies would be:
- To use the first or the second technique using the delay between the last two events as the data series xn.
- To use the first or the second technique using the reciprocal of the delay between the last two events (i.e. the rate) as the data series xn.
Neither of these turn out to be good strategies as far as I can tell. Firstly, take an event that occurs every 500ms (on the one hand) and an event that occurs with a 200ms delay and an 800ms delay on the other. Clearly these both occur twice a second, so the rate estimate given should be the same. Ignoring the time from the last sample seems foolhardy, so I'll concentrate on the second strategy. Using the delay (rather than the reciprocal) does not turn out to be a good predictor because the simulating the 200ms/800ms sample stream produces an estimate of about 1.5 (on the basis the average of reciprocals is not the reciprocal of the average).
But, far more significantly, neither strategy copes with what is actually happening in practice, which is that suddenly all the events stop for a long while. The 'latest' value of y is thus the value at the last event, and the estimate of the rate thus never gets calculated. Hence the rate appears constant. Of course if I was analysing the data retrospectively, this wouldn't be an issue, but I'm analysing it in real time.
I realise another way to do this would be to run some thread periodically (e.g. every 10 seconds) and count the number of occurrences in this 10 second interval. This is very resource heavy my end as the statistics are not needed often, and I am loathe to run a thread that polls everything due to mutex issues. Therefore I'd like to (somehow) use an algorithm which adjusts the state read by (e.g.) the time since the last sample was taken. This seems a reasonable approach as if the performance is measured at times chosen independently of the samples, the measurement time will on average be half way through the period between samples, so a very crude unsmoothed estimate of the rate would be half the reciprocal of the time since the last sample. To complicate things further, my measurement time will not be independent of the samples.
I have a feeling this has a simple answer but it's eluding me. I have a feeling that the correct route is to assume the the events are Poisson distributed, and derive an estimate for λ based on the interval since the last sample and some form of moving average, but my statistics is too rusty to make this work.
There is a near dupe of this question here but the answer doesn't seem to be very satisfactory (I hope I explained why). I'd add that a Kalman filter seems way to heavyweight given I have one variable to estimate and know nothing about it. There are a number of other near dupes, most of which either suggest keeping large bins of values (not realistic here from a memory point of view) or do not address the two issues above.
est:=(est*exp(-dt/T)+dt)/(exp(-dt/T)+1)
, whereest
is the estimate of the period (inverse of frequency),T
the window width, anddt
the time since the previous event. You can correct the bias for this one, too. – Scarecrow