Metrics: What to Measure and What to Ignore
When you first add metrics to a service, the temptation is to measure everything. CPU usage, memory, request count, queue depth, cache hit rate, garbage collection pause time, thread pool size. The dashboards fill up. The alerts pile on. And then, during the actual incident, you spend twenty minutes scrolling through graphs trying to figure out what’s actually wrong.
The problem isn’t that you measured too little. It’s that you measured without asking what question you’re trying to answer.
What a Metric Is
A metric is a measurement of some aspect of your system’s behavior, collected over time. That’s it. What makes metrics useful - or useless - is what you measure, how often, and whether you can act on what you see.
Most metrics systems (Prometheus, Datadog, CloudWatch, InfluxDB) organize metrics into a few fundamental types. Understanding the types helps you avoid common mistakes.
Counter - a value that only goes up. Request count, error count, bytes sent. A counter resets to zero when the process restarts, so query it with functions that account for resets rather than reading the raw value.
The useful thing about counters isn’t their absolute value - it’s their rate of change. “3,847,293 requests processed” tells you nothing useful in isolation. “847 requests per second” does.
# Prometheus - rate of HTTP requests over the last 5 minutes
rate(http_requests_total[5m])
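To make the "rate, not absolute value" point concrete, here is a minimal Python sketch of how a per-second rate can be derived from raw counter samples, including the restart case. The function name `counter_rate` and the sample values are illustrative assumptions. This is a simplification of the idea, not Prometheus's actual `rate()` implementation, which also extrapolates to the window boundaries.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value) - hypothetical scrape samples
Sample = Tuple[float, float]


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase across the samples, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")

    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr < prev:
            # The counter went down: assume the process restarted and the
            # counter began again from zero, so all of `curr` is new increase.
            increase += curr
        else:
            increase += curr - prev
    return increase / elapsed


# Steady traffic: 50,820 requests per minute works out to 847 per second.
steady = [(0, 0), (60, 50820), (120, 101640)]
print(counter_rate(steady))

# A restart between the second and third sample does not produce a negative rate.
with_reset = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(counter_rate(with_reset))
```

The raw counter values in `with_reset` would look like a drop in traffic if graphed directly; the rate stays positive because the reset is treated as a restart rather than a decrease.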
Gauge - a value that goes up and down. Current memory usage, active connections, queue depth, temperature. Gauges represent a snapshot of current state.
Histogram - tracks the distribution of values. How long did requests take? You could use an average, but averages hide the tail. If 95% of requests take 10ms and 5% take 5000ms, your average is about 260ms (0.95 × 10 + 0.05 × 5000 = 259.5) - a number that describes no user’s actual experience.
Histograms let you ask: “what was the 99th percentile latency?” That’s the question that matters for user experience.
# 99th percentile request duration over the last 5 minutes,
# aggregated across all instances (sum the bucket rates by "le" first)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
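The 95%/5% example above can be checked with a few lines of Python. This is a sketch over raw latency values using a nearest-rank percentile; the helper name `percentile` is an assumption for illustration. A real histogram estimates quantiles from bucket counts rather than from raw samples.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 95 fast requests at 10ms and 5 slow requests at 5000ms
latencies_ms = [10] * 95 + [5000] * 5

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={avg}ms p50={p50}ms p99={p99}ms")
# mean=259.5ms p50=10ms p99=5000ms
```

No request in this data set took anything close to 259.5ms. The median describes the typical user and the p99 describes the unlucky one; the mean describes neither.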
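Summary - similar to a histogram, but the quantiles are computed client-side, inside each process. Precomputed quantiles cannot be meaningfully combined across instances (averaging two p99s does not give you a fleet-wide p99), so summaries fit mainly when per-instance quantiles are enough. Most modern systems prefer histograms.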
The Four Golden Signals
Google’s Site Reliability Engineering book introduced this framework. For any service, these four signals tell you whether it’s healthy:
Latency - how long does it take to serve a request? Track separately for successful requests and failed ones. A spike in error latency tells a different story than a spike in success latency.
Traffic - how much demand is hitting your system? Requests per second, queries per second, messages consumed per second - whatever the relevant unit is. Traffic gives you context for everything else. A 10% error rate is very different at 1 req/s versus 10,000 req/s.
Errors - what fraction of requests are failing? Track both explicit failures (5xx responses) and implicit ones (200 responses with wrong content, timeouts, requests that exceeded your SLA).
Saturation - how “full” is your service? What resource is the constraining factor: CPU, memory, disk I/O, connection pool size? Saturation metrics tend to be leading indicators - they warn you before latency and errors get bad.
These four signals cover almost every meaningful question you can ask about a service’s health. Start here before adding anything else.
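As a rough illustration, the four signals can be computed from a window of request records plus one saturation reading. Everything here is a hypothetical sketch: the `Request` record, the `golden_signals` function, the choice of 5xx as "failed", and connection pool usage as the saturation measure are assumptions, not a prescribed implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def p99(values: List[float]) -> Optional[float]:
    """Nearest-rank 99th percentile, or None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_s: float,
                   pool_in_use: int, pool_size: int) -> dict:
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        # Latency, tracked separately for successful and failed requests
        "latency_ok_p99_ms": p99(ok),
        "latency_failed_p99_ms": p99(failed),
        # Traffic: demand on the system
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how full the constraining resource is
        "saturation": pool_in_use / pool_size,
    }


window = [Request(12.0, 200)] * 98 + [Request(3000.0, 503)] * 2
signals = golden_signals(window, window_s=10, pool_in_use=45, pool_size=50)
print(signals)
```

In this made-up window the error ratio is only 2%, but the failed requests take 3000ms each and the pool is 90% full, which is the kind of combined picture the four signals are meant to give.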
What Not to Measure
CPU and memory utilization are almost never useful as primary signals. They’re effects, not causes. A CPU spike tells you something is doing a lot of work. It doesn’t tell you what, whether it’s a problem, or what to do about it. They earn a place as saturation signals only when they are the resource that actually constrains your service; otherwise, if high CPU is hurting users, your latency and error metrics will tell you so directly.
Request count is similarly limited on its own. The request rate, and how it breaks down by status or endpoint, matters. The raw total doesn’t.
Measuring the same thing multiple ways creates noise. If you have five dashboards showing “is the database slow” via different proxies (query count, connection pool exhaustion, cache miss rate, slow query log entries, replication lag), you end up spending time reconciling signals rather than diagnosing the problem. Pick the most direct measure and use it.
Cardinality
This is the thing that trips up everyone who builds their first metrics system at scale.
Cardinality is the number of unique label combinations for a metric. A metric http_requests_total with labels {method, status, endpoint} might seem reasonable - until your service has 500 endpoints, 7 HTTP methods, and 50 possible status codes. That’s up to 500 × 7 × 50 = 175,000 unique time series for one metric.
Metrics systems store each unique label combination as a separate time series. High cardinality explodes storage and query cost. Some things should never be labels: user IDs, request IDs, session tokens, customer names. Their value sets are unbounded, so every new user, request, or session creates another time series.
High cardinality data belongs in logs or traces, not metrics.
# Bad: user_id makes cardinality unbounded
http_requests_total{user_id="u_8473", endpoint="/checkout"}
# Good: aggregate by meaningful dimensions
http_requests_total{endpoint="/checkout", status="200"}
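The cardinality arithmetic is just a product of the number of distinct values per label. The sketch below reproduces the 175,000 figure and shows what one extra unbounded label does to it. The `max_series` helper and the figure of one million users are assumptions for illustration.

```python
from math import prod


def max_series(label_value_counts: dict) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_value_counts.values())


# The example from the text: 500 endpoints, 7 methods, 50 status codes
bounded = {"endpoint": 500, "method": 7, "status": 50}
print(max_series(bounded))  # 175000

# Hypothetical: add a user_id label for a service with one million users
with_user_id = {**bounded, "user_id": 1_000_000}
print(max_series(with_user_id))  # 175000000000
```

This is an upper bound, since not every combination occurs in practice, but a label that grows with your user base keeps raising the bound for as long as the service runs.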
Choosing the Right Retention
Not all metrics need to be retained at the same resolution forever. A spike in error rate five minutes ago needs second-level resolution. What happened six months ago can be summarized into hourly averages.
Most metrics systems support downsampling: keeping high-resolution data for recent windows and progressively lower resolution as data ages. Tune this based on how you actually use historical data - most teams never look at raw second-level data older than a few days.
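A minimal sketch of downsampling, assuming samples arrive as (timestamp, value) pairs with integer timestamps in seconds. The `downsample` function is hypothetical. Keeping a maximum alongside the average is a design choice added here, not something the text prescribes: an hourly average alone flattens short spikes.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_s):
    """Group (timestamp, value) samples into fixed-width buckets.

    Returns a sorted list of (bucket_start, average, maximum) tuples.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - ts % bucket_s  # align to the start of the bucket
        buckets[bucket_start].append(value)
    return [
        (start, mean(values), max(values))
        for start, values in sorted(buckets.items())
    ]


# Two hours of per-second samples: flat at 1.0 with one spike to 50.0
raw = [(ts, 50.0 if ts == 3700 else 1.0) for ts in range(7200)]

hourly = downsample(raw, bucket_s=3600)
for start, avg, peak in hourly:
    print(start, round(avg, 4), peak)
```

The 7,200 raw samples shrink to two rows. The second hour's average barely moves, while its maximum still records that the spike happened.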
What Good Dashboards Look Like
A good dashboard answers specific operational questions. “Is the service healthy?” is a question. “Are users experiencing slow checkouts?” is a question. “Is the database connection pool saturated?” is a question.
A bad dashboard is a collection of graphs that exist only because they seemed interesting to someone at some point.
Start by writing down the three questions you’d ask first during an incident for your service. Then build exactly the dashboard that answers those questions. Everything else is optional.
The dashboard you reach for first during an incident is the one that earns its place. The ones you forget exist are the ones you should delete.
Alerts Should Be Rare
If your alerting is noisy, people stop reading alerts. This is not a hypothetical - it happens in every team that isn’t careful about alert thresholds, and it’s one of the most dangerous states an on-call rotation can be in.
An alert should fire when a human needs to do something now. Not when a threshold is crossed. Not when something looks interesting. When action is required.
Symptom-based alerts (user-facing latency is above threshold) are almost always better than cause-based alerts (CPU is above 80%). The symptom is what matters to the user. The cause is what you investigate after you know there’s a problem.
Alert on the four golden signals. Be conservative with thresholds. Tune them based on false positive rate. The goal is an alert that, when it fires, means something is definitely wrong - not something that might warrant a look.
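One common way to cut false positives on a symptom-based alert is to require the symptom to persist across several consecutive evaluations before paging anyone. The sketch below assumes one p99 latency reading per evaluation window. The function name `should_page`, the 500ms threshold, and the three-window requirement are illustrative assumptions, not recommended values.

```python
from typing import List


def should_page(p99_latency_ms: List[float], threshold_ms: float,
                sustained_windows: int) -> bool:
    """True only if the most recent `sustained_windows` readings all exceed the threshold."""
    if sustained_windows < 1:
        raise ValueError("sustained_windows must be at least 1")
    if len(p99_latency_ms) < sustained_windows:
        return False  # not enough history to call it sustained
    recent = p99_latency_ms[-sustained_windows:]
    return all(reading > threshold_ms for reading in recent)


# A single blip above the threshold does not page anyone.
print(should_page([120, 900, 130, 140], threshold_ms=500, sustained_windows=3))  # False

# Three consecutive breaches do.
print(should_page([120, 650, 700, 820], threshold_ms=500, sustained_windows=3))  # True
```

The trade-off is detection delay: a longer sustained period means fewer false pages but a slower response to real incidents, so the window count is something to tune against your observed false positive rate.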