getting the average, p95 and p99 of a stream of data

I have incoming data and I want to compute the average, 95th and 99th percentile of that data - I am most interested in the last 1000 values. At any time, I'd like to query this object to get any of the three values (this can occur at any time, not just when the numbers seen mod 1000 is 0). Is there a way to get these three values without keeping the last 1000 samples?

This doesn't have to be perfect so we can use some tricks to get a good estimate. Also, speed is another concern. Thanks

(I will be doing this in C++ but I don't think that matters all that much)

Hungary answered 8/5, 2013 at 22:23 Comment(3)
I think that you can hold an array of 1000 entries without too much trouble or memory penalty. The issue is the ordering of the data (you will need to order it if you want to get the percentile, I think)Preventive
ya, the sorting is the part that would cause the most troubleHungary
I don't think there's a way to calculate any of the percentiles if you don't hold the data in an array, so, the algorithm (as I think should be) is: 1. Store the data; 2. Sort the data (with your favorite method); 3. Get the value at the desired position (array[n] where n = round(array.length * p) and 0<=p<=1).Preventive
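The store / sort / index recipe from the comment above can be sketched as follows. This is only an illustration under some assumptions: `percentile` is a hypothetical helper name, the window is held in a `std::deque<double>`, and the index is taken as round(p * (n - 1)) rather than round(n * p) so that p = 1 stays in bounds. It uses `std::nth_element` on a copy instead of a full sort, which is a swapped-in shortcut that gives the same value at that position.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper: p is in [0, 1] and the window must be non-empty.
// Returns the value that would sit at index round(p * (n - 1)) if the
// window were sorted. Costs O(n) per query and leaves the window untouched.
double percentile(const std::deque<double>& window, double p) {
    std::vector<double> copy(window.begin(), window.end());
    const auto k = static_cast<std::ptrdiff_t>(
        std::llround(p * static_cast<double>(copy.size() - 1)));
    // Partial sort: only the element at position k is guaranteed to be in place.
    std::nth_element(copy.begin(), copy.begin() + k, copy.end());
    return copy[static_cast<std::size_t>(k)];
}
```

For 1000 samples this is a linear pass over a few kilobytes per query, which may already be fast enough if queries are infrequent.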
J
9

At a minimum, you'll need to maintain a queue of the most recent 1000 elements.

To keep a running average, maintain a running total of the most recent 1000 elements: when you add a new element to the queue, add its value to the total and subtract the value of the oldest element that you've just removed from the queue. Return the total divided by the number of elements in the queue (1000 once the queue is full) and there you go.
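A minimal sketch of that running total, assuming `double` samples and a hypothetical class name `RunningMean` (neither comes from the answer itself):

```cpp
#include <cstddef>
#include <deque>

// Hypothetical helper: mean over the most recent `capacity` samples.
class RunningMean {
public:
    explicit RunningMean(std::size_t capacity) : capacity_(capacity) {}

    void add(double x) {
        window_.push_back(x);
        sum_ += x;
        if (window_.size() > capacity_) {  // evict the oldest sample
            sum_ -= window_.front();
            window_.pop_front();
        }
    }

    // Mean of the samples currently held; 0 if nothing has been added yet.
    double mean() const {
        return window_.empty() ? 0.0
                               : sum_ / static_cast<double>(window_.size());
    }

private:
    std::size_t capacity_;
    std::deque<double> window_;
    double sum_ = 0.0;
};
```

With floating-point samples the running total can accumulate rounding error over a very long stream, so recomputing the sum from the queue every so often may be worthwhile.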

To keep a running Nth percentile, maintain two heaps and keep a count of the elements in the heaps; the "lower" heap (a max-heap) has the lower N% of the values, and the "upper" heap (a min-heap) has the upper (100-N)% (for example, for the 95th percentile the lower heap will have 950 elements and the upper heap will have 50 elements). At any point you can return the lowest element from the upper heap, and that's your percentile. When you remove an element from the queue of recent values, remove the value from the heaps as well. If this leaves the heaps unbalanced (e.g. the lower heap has 951 elements and the upper heap has 49 elements), shift elements to balance them out (e.g. remove the top element from the lower heap and add it to the upper heap).

Since you want two percentiles, use three heaps: the lower heap has the lowest 950 elements, the middle has the next 40, and the upper has the highest 10. Return the lowest element of the middle heap for the 95th percentile, and the lowest element of the upper heap for the 99th percentile. Note that the middle heap needs access to both its lowest and its highest element, since it exchanges elements with both neighbours.

Adding and removing heap elements is O(lg(n)), so that is the cost of adding a new element to the queue and three heaps: remove the oldest queue element from the heaps (O(lg(n))), add the new queue element to the appropriate heap (O(lg(n))), and balance the heaps if need be (again, O(lg(n))). Removing the oldest element is only O(lg(n)) if the heap can find it without a linear scan, e.g. by tracking each element's position. Add the new element to the lowest heap whose highest element is greater than the new element, i.e.

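// Pseudocode: assumes every heap is non-empty and exposes its largest element.
if (newElement < lowestHeap.maxElement) {
    lowestHeap.add(newElement)       // belongs among the lowest 950
} else if (newElement < middleHeap.maxElement) {
    middleHeap.add(newElement)       // belongs among the next 40
} else {
    highestHeap.add(newElement)      // belongs among the highest 10
}
// Afterwards, rebalance so the heap sizes return to 950 / 40 / 10.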

Be sure that your heaps allow duplicate elements

Jumbuck answered 8/5, 2013 at 23:19 Comment(1)
If you keep the heap sorted, then you can do all of that with a single heap of 1,000 entries, right? Then you can check at position 900, 950, 990, etc. as required in one swoop.Blockhead
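The single sorted container suggested in this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the commenter's code: `SortedWindow` is a hypothetical name, a sorted `std::vector` stands in for the "sorted heap", the capacity is assumed to be greater than zero, and the index is round(p * (n - 1)). Insertion and removal are O(n) because of element shifting, which for 1000 doubles is typically cheap.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper: keeps the last `capacity` samples in arrival order
// (fifo_) and in sorted order (sorted_). Assumes capacity > 0.
class SortedWindow {
public:
    explicit SortedWindow(std::size_t capacity) : capacity_(capacity) {}

    void add(double x) {
        if (fifo_.size() == capacity_) {
            // Drop the oldest sample from both containers.
            auto it = std::lower_bound(sorted_.begin(), sorted_.end(),
                                       fifo_.front());
            sorted_.erase(it);
            fifo_.pop_front();
        }
        fifo_.push_back(x);
        // Insert while keeping sorted_ ordered; duplicates are allowed.
        sorted_.insert(std::upper_bound(sorted_.begin(), sorted_.end(), x), x);
    }

    // p in [0, 1]; the window must be non-empty.
    double percentile(double p) const {
        const auto k = static_cast<std::size_t>(
            std::llround(p * static_cast<double>(sorted_.size() - 1)));
        return sorted_[k];
    }

private:
    std::size_t capacity_;
    std::deque<double> fifo_;
    std::vector<double> sorted_;
};
```

Any number of percentiles can then be read from the same container, e.g. `w.percentile(0.95)` and `w.percentile(0.99)`.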
S
1

First let us assume you can afford to store 1000 numbers (let us say k times 1000, where k is a constant).

Keep 3 heaps:

  1. A minheap to store 10 (or 50) elements (heapA)
  2. A maxheap to store the remaining 990 (or 950) elements (heapB)
  3. A minheap to keep the order of the elements; the oldest element is always on top of this heap (heapC)

The three heaps are special: heapC also keeps a link to the corresponding element in heapA or heapB. heapA and heapB also keep track of the same element in heapC.

This is the way it works:

  1. Assume you have 1000 elements in the system. heapA has 10 elements, heapB 990 and heapC has 1000 elements
  2. Delete the oldest element from the system. Delete it from heapC and using the link delete it from heapA or heapB
  3. Rebalance the three heaps.
  4. Add the new element to heapA or heapB, depending on how it compares with the top of heapA
  5. Add the new element's order to heapC.
  6. While doing this, also add the links between the corresponding entries.
Suspensor answered 8/5, 2013 at 23:20 Comment(0)
H
0

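We can do this using std::multiset instead of a heap. Keep 2 multisets and perform the same algorithm that is done with heaps. Unlike std::priority_queue, a std::multiset can erase an arbitrary element in O(lg(n)), which is what removing the oldest value requires.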
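A rough sketch of the two-multiset variant for a single percentile is given below. The class name `WindowPercentile`, the integer `percent` parameter (assumed to be below 100) and the assumption that the capacity is greater than zero are all illustrative choices, not part of the answer. The lower set holds the smallest n * percent / 100 values, and the percentile is the lowest element of the upper set, as in the heap description.

```cpp
#include <cstddef>
#include <deque>
#include <iterator>
#include <set>

// Hypothetical helper. Assumes capacity > 0 and percent < 100.
class WindowPercentile {
public:
    WindowPercentile(std::size_t capacity, unsigned percent)
        : capacity_(capacity), percent_(percent) {}

    void add(double x) {
        if (fifo_.size() == capacity_) {
            const double old = fifo_.front();
            fifo_.pop_front();
            // Everything in low_ is <= everything in high_, so comparing
            // against the maximum of low_ tells us where `old` lives.
            if (!low_.empty() && old <= *low_.rbegin()) low_.erase(low_.find(old));
            else high_.erase(high_.find(old));
        }
        fifo_.push_back(x);
        if (!low_.empty() && x < *low_.rbegin()) low_.insert(x);
        else high_.insert(x);
        rebalance();
    }

    // Lowest element of the upper set; the window must be non-empty.
    double value() const { return *high_.begin(); }

private:
    void rebalance() {
        const std::size_t target = (low_.size() + high_.size()) * percent_ / 100;
        while (low_.size() > target) {  // move max of low_ up
            auto it = std::prev(low_.end());
            high_.insert(*it);
            low_.erase(it);
        }
        while (low_.size() < target) {  // move min of high_ down
            auto it = high_.begin();
            low_.insert(*it);
            high_.erase(it);
        }
    }

    std::size_t capacity_;
    unsigned percent_;
    std::deque<double> fifo_;
    std::multiset<double> low_, high_;
};
```

For both p95 and p99, one option is two independent instances, e.g. `WindowPercentile(1000, 95)` and `WindowPercentile(1000, 99)`; a three-way split as in the heap answer would avoid storing the values twice.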

Hand answered 5/6 at 8:11 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Commendation
