Saturday, June 21, 2014

Q: What's wrong with this picture?
A: Everything!

Question: What's wrong with this picture:

Answer: Everything!

This single chart (source redacted to protect the guilty) is a great depiction of my last four #LatencyTipOfTheDay posts in one great visual:

1. Average (def): a random number that falls somewhere between the maximum and 1/2 the median. Most often used to ignore reality.

Well, that one is self-explanatory, and the math behind it is simple. But I keep meeting people who think that looking at average numbers tells them something about the behavior of the system that produced them... Which is especially curious when what they are trying to do is monitor health, readiness, and responsiveness behavior. This chart summarizes averages of the things it plots. Just like everyone else. I hear that on average these days, the tooth fairy pays $2/tooth.

2. You can't average percentiles. Period.

See those averages posted (at the bottom of the chart) for the 25%, 50%, 75%, 90%, and 95% lines? It's not just that average numbers are low in meaning in their own right. Averaging percentiles is so silly mathematically that it may be a good way to build a random number generator. Read this #LatencyTipOfTheDay post for an explanation of just how meaningless averages of percentiles are.
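To make that concrete, here is a minimal Python sketch. The two intervals and their response times are made up for illustration (they are not taken from the chart), and `percentile` is a plain nearest-rank helper, not any particular monitoring tool's implementation.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Two hypothetical reporting intervals of response times, in milliseconds.
quiet_interval = [10] * 100                 # 100 requests, all fast
busy_interval = [10] * 800 + [2000] * 200   # 1000 requests, 20% of them slow

p95_quiet = percentile(quiet_interval, 95)
p95_busy = percentile(busy_interval, 95)

# What a chart footer that "averages the 95% line" would report.
average_of_p95s = (p95_quiet + p95_busy) / 2

# The actual 95%'ile of everything that happened in both intervals.
true_p95 = percentile(quiet_interval + busy_interval, 95)

print("per-interval 95%'iles:", p95_quiet, p95_busy)
print("average of 95%'iles:  ", average_of_p95s)
print("true 95%'ile:         ", true_p95)
```

With these made-up numbers the averaged figure is 1005 ms, a response time that no request ever had, while the real 95%'ile across both intervals is 2000 ms. The average ignores how many requests each interval carried, so it can land almost anywhere between the per-interval values.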

3. If you are not measuring and/or plotting Max, what are you hiding (from)?

This is a classic "feel good" chart. The chart is pretty. And it looks informative. But it really isn't. Not unless all you care about is the things that worked well, and you don't want to show anything about a single thing that didn't work well. The main practical function of such a chart is to distract the reader, and make them look at the pretty lines that tell the story of what the good experiences were, so that they won't ask questions about how often bad experiences happened.

The way a chart like this achieves this nefarious purpose is simple: it completely ignores, as in "does not display any information about" and "throws away all indication of", the worst 5% of server operations in any given interval, or in the timespan charted as a whole.

This chart only displays the "best 95%" of results. You can literally have up to 5% of server side page response times take 2 minutes each, and this pretty picture would stay exactly the same.
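A small Python sketch of that claim, using invented data: a hypothetical healthy workload of 10,000 responses, and the same workload with its slowest 5% replaced by 2 minute (120,000 ms) responses. The percentile helper is a simple nearest-rank calculation, assumed here for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical healthy workload: 10,000 response times between 20 and 119 ms.
healthy = [20 + (i % 100) for i in range(10_000)]

# Same workload, except the slowest 5% of responses now take 2 minutes each.
ordered = sorted(healthy)
cutoff = len(ordered) * 95 // 100
degraded = ordered[:cutoff] + [120_000] * (len(ordered) - cutoff)

charted = [25, 50, 75, 90, 95]   # the lines the chart displays
healthy_lines = [percentile(healthy, p) for p in charted]
degraded_lines = [percentile(degraded, p) for p in charted]

print("healthy  lines:", healthy_lines, "max:", max(healthy))
print("degraded lines:", degraded_lines, "max:", max(degraded))
```

Every charted line comes out identical for the two workloads. Only the maximum (119 ms versus 120,000 ms) tells them apart, and the maximum is exactly what the chart leaves out.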

Whenever charts that show response time or latency fail to display the Maximum measurements along with lower quantiles (like the 95%, or the median, or even the fuzzy average), ask yourself: "what are they hiding?".

4. Measure what you need to monitor. Don't just monitor what you happen to be able to easily measure.

This chart plots the information it has. Unfortunately the information it has is not the important information needed for monitoring server response behavior...

Monitoring is mostly supposed to be about "this is how bad the bad stuff was, and this is how much bad stuff we had". It's been a while since I met anyone operating servers who only cared about the fastest operation of the day. Or only the best 25% of operations. Or only about the better half of operations. Or only the better 95% of operations, but didn't care at all about the worst 5% of operations. But that's exactly (and only) what the 25%'ile, median, and 95% lines in this chart display. It is literally a chart showing "this is how good the good stuff was".

Percentiles matter for monitoring server response times, and they matter to several 9s in the vast majority of server applications. The fact that your measurements or data stores only provided you with common case latencies (and yes, 95% is still common case) is no excuse. You may as well be monitoring a white board.

You see, when you actually ask scary questions like "how many of our users will observe the 99.99%'ile of server side HTTP request time each day?" you'll typically get the scary answer that makes you realize you really want to watch that number closely. Much more so than the median or the 95%'ile. Not asking that question is where the problems start.

For example, the statement "roughly 10% of all users will experience our 99.99%'ile once a day or more" is true when each page view involves ~20 HTTP gets on average, and the average user does ~50 page views or refreshes during a day. Both of these are considered low numbers for what retail sites, or social networking sites, or online application sites experience, for example. That works out to ~1,000 requests per user per day, and 1 - 0.9999^1000 is about 9.5%. Under the same assumptions, roughly 63% of users (1 - 0.999^1000) will experience the 99.9%'ile at least once in those 50 page views. So if you care about what a huge portion of your user base is actually experiencing regularly, you really care about 99.9% and 99.99%'iles.
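The arithmetic behind those statements can be checked with a few lines of Python. This sketch assumes each request independently has the same chance of landing beyond a given percentile, which is a simplification; the function name is made up for illustration.

```python
def chance_of_experiencing(percentile, requests):
    """Probability that at least one of `requests` independent requests
    is slower than the given percentile (expressed as e.g. 99.99)."""
    return 1 - (percentile / 100) ** requests

gets_per_page_view = 20    # ~20 HTTP gets per page view
page_views_per_day = 50    # ~50 page views or refreshes per user per day
requests_per_day = gets_per_page_view * page_views_per_day   # 1000

p9999 = chance_of_experiencing(99.99, requests_per_day)
p999 = chance_of_experiencing(99.9, requests_per_day)

print(f"users seeing the 99.99%'ile at least once a day: {p9999:.1%}")
print(f"users seeing the 99.9%'ile at least once a day:  {p999:.1%}")
```

Under these assumptions the first figure comes out at about 9.5% and the second at about 63%. Raising either the gets per page view or the page views per day pushes both numbers up quickly.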

But even though I have yet to meet an ops team that does not need to monitor the 99.9% and/or 99.99%'ile of their server response times, it's rare that I meet one that actually does monitor at those levels. And the reason is that they are usually staring at dashboards full of this sort of feel good chart, which distract them enough to not think about the fact that they are not measuring what they need to monitor. Instead, they are monitoring what they happen to be able to measure...

Once you know what you want to be watching, the focus should start with the measurement and retention of data, because those pretty picture things usually break early in the process: the things that need monitoring were simply not being measured. Many measurement systems (e.g. Yammer metrics when using its default reservoir configurations) do not even attempt to actually measure percentiles, and model them statistically instead. Of the ones that actually do measure percentiles on the data they receive (e.g. StatsD, if you send it un-summarized data about each and every server response time), the percentile measurements are typically done on relatively short intervals and summarized (per interval) in storage, which means two things: (A) you can only measure very low numbers of 9s, and cannot produce the sort of percentiles you actually should be monitoring, and (B) the percentiles cannot be usefully aggregated across intervals (see discussion).

(A) happens because intervals are "short", and cannot individually produce any useful data about a "high number of 9s". E.g. there is no way to report useful information on the 99.9%'ile or 99.99%'ile in a 5 second interval when the operation rate is 100 ops/sec: the interval holds only 500 samples, fewer than the 1,000 needed to expect even one sample beyond the 99.9%'ile.
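The sample counts for that example can be worked out directly. This is a rough Python sketch: the dictionary of "one in N" tail sizes and the variable names are mine, and expecting a single sample beyond a percentile is a bare minimum, not a claim that one sample makes a stable estimate.

```python
OPS_PER_SEC = 100    # operation rate from the example above
INTERVAL_SEC = 5     # reporting interval from the example above

# A percentile with k nines leaves 1 sample in 10^k beyond it.
tail_one_in = {"99.9%'ile": 1_000, "99.99%'ile": 10_000}

samples_per_interval = OPS_PER_SEC * INTERVAL_SEC

# Expected number of samples per interval that fall beyond each percentile.
expected_beyond = {name: samples_per_interval / one_in
                   for name, one_in in tail_one_in.items()}

# Seconds of data needed before even one sample is expected beyond each percentile.
seconds_for_one_sample = {name: one_in / OPS_PER_SEC
                          for name, one_in in tail_one_in.items()}

for name in tail_one_in:
    print(f"{name}: {expected_beyond[name]} samples beyond it per interval, "
          f"{seconds_for_one_sample[name]} s needed to expect one")
```

At 100 ops/sec a 5 second interval expects half a sample beyond the 99.9%'ile and a twentieth of a sample beyond the 99.99%'ile, so whatever number gets reported for those percentiles in a single interval is not a measurement of them.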

(B) happens because unless percentile data is computed (correctly) across long enough periods of time, it is basically useless for deducing percentiles across more than a single interval. Collecting the data for producing percentiles over longer periods is relatively straightforward (after all, it's being done for each interval with low numbers of 9s), but has some storage volume and speed related challenges with commonly used percentile computation techniques. Challenges that HdrHistogram completely solves now, BTW.
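One way to see the difference is to store a histogram of counts per interval instead of a percentile per interval, since counts can be added across intervals and percentiles cannot. The Python sketch below is only an illustration of that idea with invented data: it uses a plain `Counter` keyed by millisecond value, and is not HdrHistogram's API or its bucket layout.

```python
from collections import Counter

def to_histogram(samples_ms):
    """Summarize one interval as counts per response time value (ms)."""
    return Counter(samples_ms)

def merge(histograms):
    """Combine interval histograms by adding their counts."""
    merged = Counter()
    for hist in histograms:
        merged.update(hist)
    return merged

def value_at(hist, one_in):
    """Nearest-rank value with at most 1/one_in of samples above it
    (one_in=1000 is the 99.9%'ile). Returns None for an empty histogram."""
    total = sum(hist.values())
    rank = total - total // one_in
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value
    return None

# Hypothetical data: 20 intervals of 500 samples each; one interval has a bad patch.
intervals = [[10] * 500 for _ in range(20)]
intervals[7] = [10] * 485 + [3000] * 15

histograms = [to_histogram(samples) for samples in intervals]
merged = merge(histograms)

# 99.9%'ile from the merged histogram versus from all raw samples.
all_samples = sorted(v for samples in intervals for v in samples)
p999_merged = value_at(merged, 1000)
p999_raw = all_samples[len(all_samples) - len(all_samples) // 1000 - 1]

# What averaging the stored per-interval 99.9%'iles would report instead.
per_interval_p999 = [value_at(h, 1000) for h in histograms]
average_of_p999s = sum(per_interval_p999) / len(per_interval_p999)

print("99.9%'ile from merged histograms:", p999_merged)
print("99.9%'ile from raw samples:      ", p999_raw)
print("average of per-interval values:  ", average_of_p999s)
```

The merged histogram gives the same 99.9%'ile as the raw data (3000 ms here), while the average of the per-interval values comes out at 159.5 ms. A `Counter` with one entry per distinct value is exactly the kind of structure that runs into the storage volume and speed challenges mentioned above once values vary widely, which is the problem a purpose-built histogram addresses.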

So what can you do about this stuff? A lot. But it starts with sitting down and deciding what you actually need to be monitoring. Without monitoring things that matter, you are just keeping your operations and devops folks distracted, and keeping your developers busy feeding them distracting nonsense.

And stop looking at feel good charts. Start asking for pretty charts of data that actually matters to your operations...
