It’s important to understand the service we offer customers. Of course, most of that experience is the features and products we offer and the impact they have on the customers’ jobs-to-be-done. But another part is their expectations around the NFRs — non-functional requirements like security, durability, correctness, availability. If we don’t have those, then the software and the benefits it offers can’t shine through.

This document focuses solely on availability. There are different ways to monitor it, each requiring a different amount of work and offering a different benefit.

Fleet Availability

Most teams start by monitoring their fleet, and at the most basic level they monitor the uptime of the servers in it. For every server in the fleet, they add 1,440 minutes (86,400 seconds) per day of the month to the denominator, then put the actual uptime in the numerator. This is the way most database services compute availability, since there are dedicated resources per customer/project.
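
The accounting above can be sketched in a few lines (the function name and data shape are illustrative, not any particular team's implementation):

```python
# Basic fleet-uptime accounting: total observed uptime over total possible uptime.
# (Illustrative sketch; names and data shapes are assumptions, not a real system's API.)
MINUTES_PER_DAY = 1_440  # 86,400 seconds

def fleet_availability(uptime_minutes_per_server, days_in_month=30):
    """Each server contributes days_in_month * 1,440 minutes to the denominator."""
    possible = len(uptime_minutes_per_server) * MINUTES_PER_DAY * days_in_month
    actual = sum(uptime_minutes_per_server)
    return actual / possible

# Two servers over a 30-day month: one perfect, one down for a total of one hour.
print(f"{fleet_availability([43_200, 43_200 - 60]):.4%}")
```

This works well when, as with dedicated database instances, each server maps cleanly to one customer.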

This is a good start, and everybody should have it. But it’s not very customer-obsessed. For example, let’s say you have 10,000 customers. For each customer, there are 1,440 minutes in a day, 30 days in a month — 432,000,000 “fleet uptime minutes” over a month. If you hit 99.99% fleet uptime, are you giving a great experience to your customers? Maybe. Maybe not.

  • Maybe every one of those customers experiences 8.64 seconds of downtime per day. Or…
  • You can have 30 customers down for an entire day each and still hit 99.99% fleet uptime.
  • You can have 240 customers down for three hours each and still hit 99.99% fleet uptime.

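A quick back-of-the-envelope check shows the scenarios above all consume exactly the same downtime budget:

```python
# Each of the scenarios above spends the same 99.99% downtime budget.
customers = 10_000
minutes_per_month = 1_440 * 30                  # 43,200 minutes per customer
fleet_minutes = customers * minutes_per_month   # 432,000,000 fleet uptime minutes
budget = fleet_minutes // 10_000                # 0.01% allowed downtime = 43,200 minutes

per_customer_daily_seconds = budget * 60 / (customers * 30)  # spread evenly: 8.64 s/day
scenario_b = 30 * 1_440    # 30 customers down for a full day each
scenario_c = 240 * 180     # 240 customers down for three hours each

assert scenario_b == budget and scenario_c == budget
```
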
Pure fleet uptime is useful, but it’s not enough to represent the customer experience.

Service Uptime — better

If you do it by service, you can still have interesting results where a service being down in a certain way (e.g., high latency) affects customers but doesn’t appear in your metrics — so how you define “is a service up and running” is important. Or you could have a service nobody called during the month be up 100% of the time, raising your average without improving any customer’s experience. Or, if you have 250 services, what if the gateway service is down for 20% of the month? Do you tell your board of directors you had 80% availability — or a 99.92% unweighted average across services?
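
A minimal sketch of the averaging trap, assuming a plain unweighted mean across services:

```python
# The averaging trap: an unweighted mean across services hides a critical outage.
def average_service_uptime(uptimes):
    return sum(uptimes) / len(uptimes)

# 250 services: the gateway is down 20% of the month, everything else is perfect.
uptimes = [0.80] + [1.0] * 249
print(f"{average_service_uptime(uptimes):.2%}")  # 99.92% -- yet every request through the gateway failed 20% of the time
```
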

Customer Requests — even better

The best thing to track is customer requests: break them into critical and optional, then track the number of requests per month, across the fleet, that customers make and that execute successfully. Then add a second dashboard for the number of customers whose individual experiences didn’t meet the bar — again, working to overcome the challenges of fleet metrics.
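
One way to sketch both dashboards (the `Request` shape and the SLA bar are assumptions for illustration):

```python
# Sketch of the two dashboards; the Request shape and the bar are assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    customer_id: str
    critical: bool
    succeeded: bool

def request_availability(requests):
    """Dashboard 1: fraction of critical requests that executed successfully."""
    critical = [r for r in requests if r.critical]
    return sum(r.succeeded for r in critical) / len(critical) if critical else 1.0

def customers_below_bar(requests, bar=0.9999):
    """Dashboard 2: customers whose individual success rate missed the bar."""
    per_customer = {}
    for r in requests:
        ok, total = per_customer.get(r.customer_id, (0, 0))
        per_customer[r.customer_id] = (ok + r.succeeded, total + 1)
    return sorted(c for c, (ok, total) in per_customer.items() if ok / total < bar)
```

The second function surfaces exactly what fleet averages hide: one customer at 95% success barely moves the fleet number but fails that customer badly.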

Talking to your customers — best

There’s no substitute for actual conversations with customers about how the service felt to them. Numbers describe a system; humans describe an experience.

LIFE Metrics

For all of the metrics above, it’s still hard to get a feeling for the individual customer experience. But if you divide things up right, you can do better. Take the time to dig into the dimensions below — and whichever other dimensions make sense — and be obsessed about understanding each customer’s personal experience of your product.

At a high level, I think about LIFE metrics:

Length

Of the outages they had — how long were they? A single 30-minute outage during the year is likely different from a 30-second blip about once a week. Different customers will have different points of view, and it’s important to understand that. If your fleet is running well, you likely only have a handful of customers who are individually breaching your SLA.
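
The totals behind that comparison are close, which is exactly why the raw number misleads (a tiny arithmetic sketch):

```python
# The two outage patterns above: similar annual downtime, very different experiences.
single_outage_minutes = 30            # one 30-minute outage in the year
weekly_blip_minutes = 52 * 30 / 60    # a 30-second blip once a week, for a year
assert weekly_blip_minutes == 26.0    # close to 30 minutes in total
```
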

Impact

It’s so easy to say that one downtime second for one service is equal to any other. But what if the downtime is “delay in getting logs to S3 by a day” versus “cannot hit the ‘buy’ button in the shopping cart”? Seconds aren’t all equal. Services aren’t all equal.

Frequency

Even if the service is only down for short periods, every outage causes real pain for your customers — pagers go off, incident reports have to be written, calendars get rearranged. Shorter incidents are better, but if you have too many of them, you’ll likely be booted by the customer due to fatigue — even if each individual incident has very small length and very small impact.

Experience

Arguably the most important of the dimensions: what was the actual experience of your customers with your product and company? Did the software give clear errors, directing to a status page that was updated regularly? Did you blast alerts to every customer for a single customer’s specific issue? Did your alerts have typos or say clearly untrue things? Did they say the service was back up when it was only back up for most customers — and the ones it wasn’t back up for felt betrayed?

It took emotional attachment for your customers to purchase your product, and those same emotions can be the ones that cause them to part ways with you — even if, on every technical dimension, the software is performing better and better over time.


These are my insights on the availability NFR. What are yours?