Tracking metrics using StatsD (via etsy) and Graphite, graphite graph doesn't seem to be graphing all the data

We have a metric that we increment every time a user performs a certain action on our website, but the graphs don't seem to be accurate.

So going off this hunch, we inspected carbon's updates.log and discovered that the action had happened over 4 thousand times today (counted using grep and wc), but the Integral of the graph returned only 220ish.

What could be the cause of this? Data is being reported to statsd using the statsd php library, and calling statsd::increment('metric'); and as stated above, the log confirms that 4,000+ updates to this key happened today.

We are using:

graphite 0.9.6 with statsD (etsy)

Boarding answered 17/8, 2011 at 20:41 Comment(0)

After posting my comment above I found Graphite 0.9.9 has a (new?) configuration file, storage-aggregation.conf, in which one can control the aggregation method per pattern. The available options are average, sum, min, max, and last.

http://readthedocs.org/docs/graphite/en/latest/config-carbon.html#storage-aggregation-conf
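As a rough illustration, an entry for counter metrics might look like the fragment below. The section name is arbitrary, and the pattern assumes counters are stored under a stats_counts prefix, so adjust both to your own metric names. The xFilesFactor of 0 is also an assumption, chosen so that sparse counters still roll up. As far as I know, this file is only consulted when a whisper file is first created, so existing files keep the aggregation method they were created with.

```ini
# storage-aggregation.conf (sketch; section name and pattern are assumptions)
[sum_counts]
pattern = ^stats_counts\..*
xFilesFactor = 0
aggregationMethod = sum
```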

Entrap answered 17/12, 2011 at 15:42 Comment(1)
And the statsd documentation (at least as of today) also describes this problem and how to solve it at github.com/etsy/statsd/blob/master/docs/graphite.md. - Hann

After some research through the documentation, and some conversations with others, I've found the problem - and the solution.

The way the whisper file format is designed, it expects you (or your application) to publish updates no faster than the smallest interval in your storage-schemas.conf file. This file is used to configure how much data you retain at each time resolution.

My storage-schemas.conf file was set with a finest resolution of 1 minute. The default StatsD daemon (from etsy) is designed to flush to carbon (the graphite daemon) every 10 seconds. This is a problem because over a 60 second period StatsD reports 6 times, and each write overwrites the previous one, since all six land in the same 1 minute slot. This produces really weird results on your graph: if the last 10 seconds of a minute are completely dead, StatsD reports 0 for that period, and that 0 replaces everything you had written for that minute.
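The overwrite behaviour described above can be sketched with a toy model in Python. This is not whisper's real code: the store function and the flush values are made up for illustration, and the only assumption is that each archive keeps one value per slot and that the last write to a slot wins.

```python
# Toy model of one whisper archive: one slot per `step` seconds, and a
# write to an already-filled slot replaces the previous value.
def store(points, step):
    slots = {}
    for timestamp, value in points:
        slot_start = timestamp - (timestamp % step)
        slots[slot_start] = value  # last write wins
    return slots

# Six hypothetical StatsD flushes, 10 seconds apart, within one minute.
flushes = [(0, 40), (10, 25), (20, 30), (30, 15), (40, 10), (50, 0)]

per_minute = store(flushes, step=60)  # 1 minute resolution
per_10s = store(flushes, step=10)     # 10 second resolution

print(sum(value for _, value in flushes))  # events actually sent
print(sum(per_minute.values()))            # what survives at 60s resolution
print(sum(per_10s.values()))               # what survives at 10s resolution
```

With a 60 second step only the final flush (a 0 here) is left in the slot, while a 10 second step keeps all six values.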

To fix this, I had to re-configure my storage-schemas.conf file to store data at a maximum resolution of 10 seconds, so every update from StatsD would be saved in the whisper database without being overwritten.

Etsy published the storage-schemas.conf configuration that they were using for their installation of carbon, which looks like this:

[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974

This has a finest resolution of 10 seconds and keeps 6 hours' worth of those data points (2160 points x 10 seconds). However, due to my next problem, I extended the retention periods significantly.
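To check how long each archive in a retentions line covers, multiply the seconds per point by the number of points. A small Python sketch, assuming the older seconds_per_point:points syntax used in the configuration above (retention_spans is a hypothetical helper, not part of graphite):

```python
def retention_spans(retentions):
    """Return (seconds_per_point, total_seconds) for each archive."""
    spans = []
    for archive in retentions.split(","):
        seconds_per_point, points = (int(part) for part in archive.split(":"))
        spans.append((seconds_per_point, seconds_per_point * points))
    return spans

for step, total in retention_spans("10:2160,60:10080,600:262974"):
    print(f"{step}s resolution kept for {total / 3600:.1f} hours")
```

For the etsy configuration this gives 6 hours at 10 seconds, 168 hours (7 days) at 1 minute, and 43829 hours (roughly 5 years) at 10 minutes.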

As I let this data collect for a few days, I noticed that it still looked off (and was under-reporting). This was due to two problems.

  1. StatsD (older versions) only reported an average number of events per second for each 10 second reporting period. This means, if you incremented a key 100 times in 1 second and 0 times for the next 9 seconds, at the end of the 10th second statsD would report 10 to graphite, instead of 100. (100/10 = 10). This failed to report the total number of events for a 10 second period (obviously).

    Newer versions of statsD fix this problem, as they introduced the stats_counts bucket, which logs the total # of events per metric for each 10 second period (so instead of reporting 10 in the previous example, it reports 100).

    After I upgraded StatsD, I noticed that the last 6 hours of data looked great, but as I looked beyond the last 6 hours - things looked weird, and the next reason is why:

  2. As graphite stores data, it moves data from high precision retention to lower precision retention. This means, using the etsy storage-schemas.conf example, after 6 hours of 10 second precision, data was moved to 60 second (1 minute) precision. In order to move 6 data points from 10s to 60s precision, graphite does an average of the 6 data points. So it'd take the total value of the oldest 6 data points, and divide it by 6. This gives an average # of events per 10 seconds for that 60 second period (and not the total # of events, which is what we care about specifically).

    This is just how graphite is designed, and for some cases it might be useful, but in our case, it's not what we wanted. To "fix" this problem, I increased our 10 second precision retention time to 60 days. Beyond 60 days, I store the minutely and 10-minutely precisions, but they're essentially there for no reason, as that data isn't as useful to us.
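Both effects are plain arithmetic and can be sketched in Python. The function names and the sample counts below are made up for illustration. The sum branch corresponds to the sum aggregation method that later graphite versions offer through storage-aggregation.conf; it was not available on the version described here.

```python
FLUSH_INTERVAL = 10  # seconds, the StatsD flush interval described above

def per_second_rate(event_count):
    # What older StatsD versions reported: events divided by the interval.
    return event_count / FLUSH_INTERVAL

def roll_up(points, method):
    # Collapse several high-precision points into one lower-precision point.
    if method == "average":
        return sum(points) / len(points)
    if method == "sum":
        return sum(points)
    raise ValueError(f"unsupported method: {method}")

# Problem 1: 100 events in one flush interval are reported as a rate of 10.
print(per_second_rate(100))

# Problem 2: six hypothetical 10 second counts rolled up into one minute.
counts = [100, 0, 0, 20, 0, 0]
print(roll_up(counts, "average"))  # per-10-second average, not the total
print(roll_up(counts, "sum"))      # total events in the minute
```

Averaging turns 120 events into 20, which is why the Integral of data older than the 10 second retention window under-reports by a factor of 6 in this setup.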

I hope this helps someone, I know it annoyed me for a few days - and I know there isn't a huge community of people that are using this stack of software for this purpose, so it took a bit of research to really figure out what was going on and how to get a result that I wanted.

Boarding answered 18/8, 2011 at 12:36 Comment(5)
Thanks for sharing this -- I was also seeing oddness after 6 hours and your post explains exactly why (I was using the 6 hours of 10 second precision). Germane to anyone who applies the integral transform to see totals. - Entrap
Re #2, carbon takes the average by default, but it's configurable via storage-aggregation.conf (graphite.readthedocs.org/en/latest/…). - Belita
At the time of this post (pre 0.9.9) that configuration didn't exist. However, @JeffArgast posted an answer regarding that configuration file, so I marked his response as the correct answer for this question, and left mine as-is for older installations of graphite. Thanks for pointing it out via the comments though, in case someone skips over his response. - Boarding
Can you tell us the size of your whisper files with the above configuration? How is it in terms of performance? - Peruke
Whisper files are 4.2MB each. As for performance, I haven't noticed a big issue. Graphs render sufficiently quickly. The server that carbon runs on doesn't really serve much (if anything) else, maybe some internal tools, but nothing for production. The whisper files are stored on amazon's ephemeral disk, so I have a feeling that any poor performance I experience is more related to that fact than to the size of the metric files. - Boarding