How to generate a rolling std line chart in dc.js/reductio/crossfilter

I want to show a line graph with rolling std over the sum of values for an interval of dates.

The code that generates the crossfilter/reductio objects is:

var myCrossfilter = crossfilter(data);

function getRunningDates(numDays) {
    return function getDates(d) {
        // Work on a copy so the record's own date is never mutated
        var s = new Date(d.ValueDate);
        var e = new Date(s);
        e.setDate(e.getDate() + numDays);
        var a = [];
        // Tag the record with every date in [s, s + numDays), so it
        // falls into its own day's bin and the following days' bins
        while (s < e) {
            a.push(new Date(s)); // push a copy, then advance s
            s.setDate(s.getDate() + 1);
        }
        return a;
    };
}

var dim1 = myCrossfilter.dimension(getRunningDates(20), true); // tag (array) dimension
var dim2 = myCrossfilter.dimension(dc.pluck("ValueDate"));
var group1 = dim1.group();
var group2 = dim2.group();
var reducerRolling = reductio()
    .std("value");
reducerRolling(group1);
var reducer = reductio()
    .sum("value");
reducer(group2);
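
To make the tag dimension concrete: with numDays = 3, a record dated April 1st is tagged with its own date plus the two following days, so it contributes to three bins. A quick illustration, using a hypothetical record:

// Hypothetical record, just to show what the tag dimension sees:
var tags = getRunningDates(3)({ ValueDate: new Date(2018, 3, 1), value: 5 });
console.log(tags.map(function(d) { return d.toDateString(); }));
// -> ["Sun Apr 01 2018", "Mon Apr 02 2018", "Tue Apr 03 2018"]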

I have put everything into a jsFiddle to show what I mean. (Unrelated question: I do not understand how the dates on the graphs can go beyond the dateToInit variable defined in the fiddle.)

I would like the bottom graph to be a rolling std of the values in the top graph. What ends up happening instead is that the std calculation in the bottom graph does not do the sum aggregation first (which makes sense; I understand why that happens).

Is there a way to use a group as the dimension for another group? If not, how would one achieve what I am trying to do?

Asked by Nativity on 11/4/2018 at 9:16 · Comments (7):
So, if I understand correctly, what you're showing is a rolling sum and a rolling standard deviation. You want to show a rolling standard deviation of the sum? The standard deviation of the mean can be derived from the standard deviation of the individual values by just dividing by Math.sqrt(d.count) (the square root of the sample size). I'm not sure about getting to the standard deviation of the sum, but I'm sure it's derivable. It should be proportional to the standard deviation of the mean, I would think. – Infinity
My bad, I didn't define my first dimension properly. I have updated the fiddle and my question. – Nativity
So for date t I want to show the standard deviation of the [t-20, t] interval, where the values in the interval are the sums of the values for each day (subject to filtering etc.). – Nativity
Ah, I see. And it's not taking the sum of squares of the 20 sums but rather of the individual values in each day (across all 20 days). I'll think about this tonight, but it's kind of making my brain hurt :-) No promises, but hopefully we'll work something out. – Infinity
Yes, so it's basically showing me the standard deviation of Math.random(), which is nice I guess but not that useful for my purposes :) – Nativity
I've been ignoring this because I don't know the reductio way to do this. But hey, a bounty gets my attention! When I hear "group on another group" I think "fake group". This is a lot like accumulate but more complex, since you'd push/pop the data in an array, and then calculate avg/stddev based on the current array. If reductio doesn't have something like this, I'd be happy to try the fake group way. – Deflect
I tried the fake group way but got lost on how to account for the array structure. Am happy for you to give it a try. I think it would have some value for everyone if it could easily be generalized to any function. – Nativity

OK so I've come up with a solution based on the 'fake group' approach suggested by Gordon.

I have updated the jsFiddle with a working version.

The gist of it is to define custom reducing functions:

// p is the group's running value, v is the record being added/removed
var reduceAddRunning = function(p, v) {
    if (!p.datesData.hasOwnProperty(v.ValueDate)) {
        p.datesData[v.ValueDate] = 0;
    }
    p.datesData[v.ValueDate] += +v.value; // per-day sum inside the window
    p.value += +v.value;                  // total over the window
    return p;
};
var reduceRemoveRunning = function(p, v) {
    p.datesData[v.ValueDate] -= +v.value;
    p.value -= +v.value;
    return p;
};
var reduceInitRunning = function() {
    return {
        value: 0,
        datesData: {}
    };
};

and then build a fake group on top of it:

var running_group = function(source_group, theRunningFn) {
    return {
        all: function() {
            return source_group.all().map(function(d) {
                // Collect the per-day sums stored by the reducer...
                var arr = [];
                for (var date in d.value.datesData) {
                    if (d.value.datesData.hasOwnProperty(date)) {
                        arr.push(d.value.datesData[date]);
                    }
                }
                // ...and reduce them with the supplied function
                return { key: d.key, value: theRunningFn(arr) };
            });
        }
    };
};

with theRunningFn being math.std in my case.
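
For context, this is roughly how the fake group plugs into the bottom chart. Everything here is illustrative: bottomChart, "#bottom-chart", minDate and maxDate are assumed names, not taken from the fiddle, and d3 v3 would use d3.time.scale() instead of d3.scaleTime():

// Illustrative wiring: the fake group drops in wherever a real
// crossfilter group would go.
var bottomChart = dc.lineChart("#bottom-chart");
bottomChart
    .dimension(dim1)
    .group(running_group(group1, math.std)) // math.std from math.js
    .x(d3.scaleTime().domain([minDate, maxDate]));
dc.renderAll();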

I am still left with two issues, which will probably be the basis for a new question:

  • This is quite slow. Happy to hear suggestions to speed it up. (My graph updates used to be snappy; they are now slowish. Still usable, but slow.)
  • I do not know how to handle the edge cases. The values shown at the beginning of the time series do not make sense, as they are based on less history than the full window. The same issue applies when I filter the data by dates.

EDIT: the following is a better solution, based on Gordon's comment (again!).

Just do a regular sum group and apply the following fake group function:

var running_group_2 = function(source_group, numDays, theRunningFn) {
    return {
        all: function() {
            var source_arr = source_group.all();
            var keys = source_arr.map(function(d) { return d.key; });
            var values = source_arr.map(function(d) { return d.value; });
            var output_arr = [];

            // Start at numDays so bins without a full window of
            // history are simply not shown
            for (var i = numDays; i < source_arr.length; i++) {
                output_arr.push({
                    key: keys[i],
                    value: theRunningFn(values.slice(i - numDays, i))
                });
            }
            return output_arr;
        }
    };
};

It solves both the speed issue (it is much less cumbersome: it no longer stores all the daily values, and instead works on the already aggregated values) and the edge cases (even if the edge-case handling is not easily generalizable beyond my case: I just don't show a value when I don't have enough points to calculate the running variable).

Here is the jsFiddle for that second (better for my purposes) solution.
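
Usage then looks roughly like this (a sketch; bottomChart stands in for whatever dc chart the fiddle actually uses):

// Wrap the plain sum group from the question; math.std is the math.js
// standard deviation used as theRunningFn:
var rollingStd = running_group_2(group2, 20, math.std);
bottomChart.group(rollingStd);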

Answered by Nativity on 17/4/2018 at 3:15 · Comments (3):
I don't think you need a special reduction, since you just want to use the sum for each time interval. The reduction is the same; it's just that you want to calculate a rolling avg/stddev over the last 20 bins. So it should be a lot faster and simpler to leave the reduction alone and use a fake group to pass over the data once and do the rolling average. I don't have time to try this right now, but I hope to show what I mean in another answer. I'm also not sure what to do about the edge cases, except perhaps count those as fewer samples (divide by less than 20). – Deflect
Ah, I see. Let me give it a try then. – Nativity
Nice. Could be optimized even further, to copy less, but the complexity is now O(n) instead of O(n²), so it's pretty much optimal. – Deflect
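
For reference, a sketch of what that further optimization could look like: keep running sums of the window's values and their squares so each bin's standard deviation costs O(1) instead of re-slicing the array. This is illustrative code, not from the thread; it reproduces the window of running_group_2 (the previous numDays bins, excluding the current one) and uses the sample (n - 1) normalization that math.std uses by default.

// Sketch: O(n) rolling standard deviation over the last numDays bins.
var running_std_group = function(source_group, numDays) {
    return {
        all: function() {
            var source_arr = source_group.all();
            var output_arr = [];
            var sum = 0;   // sum of the values currently in the window
            var sumSq = 0; // sum of their squares
            for (var i = 0; i < source_arr.length; i++) {
                if (i >= numDays) {
                    // Window is full: emit the sample std for this bin
                    var variance =
                        (sumSq - sum * sum / numDays) / (numDays - 1);
                    output_arr.push({
                        key: source_arr[i].key,
                        // Guard against tiny negative values caused by
                        // floating-point cancellation
                        value: Math.sqrt(Math.max(variance, 0))
                    });
                    // Slide the window: drop the oldest value
                    var old = source_arr[i - numDays].value;
                    sum -= old;
                    sumSq -= old * old;
                }
                // Add the current value to the window
                var v = source_arr[i].value;
                sum += v;
                sumSq += v * v;
            }
            return output_arr;
        }
    };
};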
