Saturday, June 21, 2014

#LatencyTipOfTheDay: If you are not measuring and/or plotting Max, what are you hiding (from)?

When you monitor response times or latency in any form, and don't measure Max values for response times or latency, or don't monitor the ones you do measure, you are invariably ignoring a system requirement, and hiding the fact that you are ignoring it.

Monitoring against required behavior


The goal of most monitoring systems is (or should be) to provide status knowledge about a system's behavior compared to its business requirements, and to help you avoid digging where you don't need to (no need to look into the "cause" of things being "just fine"). Drilling into details is what you do when you know a problem exists. But first you need to know the problem exists, which starts with knowing that you are failing to meet some required behavior.

The reason everyone should be measuring and plotting max values is simple: I don't know of any applications that don't actually have a requirement for Max response time or latency. I do know of many teams that don't know what the requirement is, have never talked about it, think the requirement doesn't exist, and don't test for it. But the requirement is always there.

"But we don't have a max requirement"


Yeah, right.

Whenever someone tells me "we don't have a max time requirement", I answer with "so it's ok for your system to be completely unresponsive for 3 days at a time, right?".

When they say no (they usually use something more profound than a simple "no"), I calmly say "so your requirement is to never have response time be longer than 3 days then..."

They will usually "correct me" at that point, and eventually come up with some number that is reasonable for their business needs. At which point they discover that they are not watching the numbers for that requirement.

So if you have the power to measure this Max latency or response time stuff yourself, or to require it from others, start doing so right away, and start looking at it.
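As a minimal sketch of what "measuring Max against a requirement" could look like, the Python below keeps the largest latency seen in each reporting interval and compares it to a stated limit. The class name `MaxTracker`, the 500 ms limit, and the latency values are all made up for illustration; they are not from any particular monitoring tool.

```python
class MaxTracker:
    """Tracks the largest latency seen in the current reporting interval."""

    def __init__(self, max_allowed_ms):
        # Hypothetical business requirement: no response slower than this.
        self.max_allowed_ms = max_allowed_ms
        self.interval_max_ms = 0.0

    def record(self, latency_ms):
        # Meant to be called for every operation, not for a sample of them.
        if latency_ms > self.interval_max_ms:
            self.interval_max_ms = latency_ms

    def report_and_reset(self):
        """Return (observed_max, requirement_met), then start a new interval."""
        observed = self.interval_max_ms
        self.interval_max_ms = 0.0
        return observed, observed <= self.max_allowed_ms


tracker = MaxTracker(max_allowed_ms=500.0)

# First interval: one very slow response among ordinary ones (made-up data).
for latency in [12.0, 48.0, 7000.0, 15.0]:
    tracker.record(latency)
first_interval = tracker.report_and_reset()

# Second interval: nothing unusual.
for latency in [11.0, 60.0]:
    tracker.record(latency)
second_interval = tracker.report_and_reset()

print(first_interval, second_interval)
```

Plotting the first element of each report over time gives the Max line; the second element is the "are we meeting the requirement" signal.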

Uncovering insanity


Beyond being a universally useful and critical-to-watch requirement, Max is also a great sanity checker for other values. It's harder to measure Max wrong (although the thing many tools report and display as "max" is a bogus form of "sampled" max), and it's really hard to hide from it once you plot or display it.
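One way a "sampled" max can go wrong is sketched below in Python: if a tool only looks at every Nth measurement, the max it reports is the max of what it happened to look at. The sampling scheme (every 10th value), the function names, and the data are assumptions chosen for illustration, not a description of any specific tool.

```python
def true_max(latencies_ms):
    # Max over every recorded operation.
    return max(latencies_ms)


def sampled_max(latencies_ms, every_nth):
    # Max over only every Nth operation (indices 0, N, 2N, ...).
    return max(latencies_ms[::every_nth])


# Made-up data: 1000 operations at 70 ms, with a single 7 second stall.
latencies_ms = [70.0] * 1000
latencies_ms[503] = 7000.0  # index 503 is not a multiple of 10

full = true_max(latencies_ms)
sampled = sampled_max(latencies_ms, every_nth=10)

print("true max:", full)
print("sampled max:", sampled)
```

With this data the sampled version never sees the stall at all, which is why the "max" a tool displays is worth checking against a max taken over every operation.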

The first conversations that happen when people start to look at max values after monitoring their stuff for a while without them often start with: "I understand that the pretty lines showing 95%'lie, average, and median all appear great, and hover around 70msec +/- 100msec, and that we've been making them better for months..., but if that's really the case, what the &$#^! is this 7 second max time doing here, and how come it happens several times an hour? And why has nobody said anything about this before? ..."
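To see how the other lines can look great while Max does not, here is a small Python sketch with made-up numbers: 10,000 responses, of which 10 take 7 seconds and the rest take 70 ms. The helper `nearest_rank_percentile` is a hypothetical name and uses the simple nearest-rank definition; real tools may compute percentiles differently.

```python
import statistics


def nearest_rank_percentile(sorted_values, pct):
    # Nearest-rank percentile: the smallest value such that at least
    # pct percent of the data is at or below it. pct is an integer 1..100.
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


# Made-up data: 9,990 responses at 70 ms and 10 responses at 7 seconds.
latencies_ms = sorted([70.0] * 9990 + [7000.0] * 10)

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = nearest_rank_percentile(latencies_ms, 95)
max_ms = max(latencies_ms)

print("mean:", mean_ms)
print("median:", median_ms)
print("95th percentile:", p95_ms)
print("max:", max_ms)
```

In this data set the median and 95th percentile both sit at 70 ms and the mean moves only a few milliseconds, while Max is the one number that shows the 7 second responses.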

Apologies


Who knows? You may find that those nice fuzzy feelings the 95%'lie charts have been giving you are well justified. So no harm then.

For those of you who find an uglier truth because I made you look, and don't like what you see, I sincerely pre-apologize for having written this...



3 comments:

  1. Hmm - max can be a useful measure - but it can also be a pita. As performance goes up and latency comes down, if you are using a general purpose operating system, then max starts to measure the rate of soft and hard irqs and not anything useful. I would suggest you are off the mark talking about max/min/percentile and all that stuff. What you need to do is a full cluster analysis. Then work out what is the root cause of each cluster (i.e. why they are different from the others) and then and ONLY then figure out how to improve each cluster separately.

    This is basically what you have already said: measure what you need to measure. Max is 'what'; you need 'why'. Once you have isolated a number of similar events which cluster around max then you can get to why.

    Say hi to Schuetze for me :)

  2. This comment has been removed by the author.

  3. I added some beef to the actual blog entry after Alexander Turner's comment and my original reply to it (now removed because it's part of the blog). So if his comment reads out of order, it's my fault, not his...
