Metrics and Measuring Success

Measure what changes decisions: a tight hierarchy from north star to inputs and guardrails, with hypotheses written before build and honest reviews after ship.

If a metric does not change a decision, it is analytics theater

Teams collect dashboards because dashboards feel adult. The useful question is narrower: what would we do differently if this number moved? If the answer is “feel concerned in Slack,” you do not have a metric—you have a mood ring.

Good measurement connects to levers someone on the team can pull: copy, flow, pricing, performance, support policy. Start from decisions, then instrument—not the reverse.

Leading indicators tell you early; lagging indicators tell you if it mattered

Lagging indicators are the outcomes you ultimately care about: revenue, retention, churn, NPS-style aggregates. They move slowly and blend many causes. They are poor daily guides.

Leading indicators are the behaviors that precede those outcomes: activation steps completed, time-to-value, frequency of a core action, depth of usage in a key workflow. They are noisier but responsive. Product work should hypothesize links: “If more users invite a teammate in week one, we expect cohort retention to improve in eight weeks.”

Example: for a collaborative doc product, lagging might be net revenue retention. Leading might be “documents with two-plus contributors in the first seven days.” Your feature work targets the leading signal while watching lagging for truth.
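
A minimal sketch of how that leading signal could be computed, assuming an events table with hypothetical columns (doc_id, user_id, event_time); the sample rows are fabricated:

```python
import pandas as pd

# Share of new documents that gain a second contributor within seven days of
# creation. Column names are assumptions; adapt to your own event schema.
events = pd.DataFrame({
    "doc_id":  [1, 1, 1, 2, 2, 3],
    "user_id": ["a", "a", "b", "c", "c", "d"],
    "event_time": pd.to_datetime([
        "2024-05-01", "2024-05-02", "2024-05-03",  # doc 1: second contributor on day 3
        "2024-05-01", "2024-05-20",                # doc 2: second visit falls outside week one
        "2024-05-04",                              # doc 3: single contributor
    ]),
})

created_at = events.groupby("doc_id")["event_time"].min().rename("created_at")
events = events.join(created_at, on="doc_id")
week_one = events[events["event_time"] <= events["created_at"] + pd.Timedelta(days=7)]

contributors = week_one.groupby("doc_id")["user_id"].nunique()
leading = (contributors >= 2).mean()
print(f"docs with 2+ contributors in first 7 days: {leading:.0%}")
```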

A north star without inputs is a slogan on a wall

The north star metric is the single best summary of value delivered to customers in a way that correlates with business health—not the only number you track, but the anchor. Common failure modes: picking revenue (which is not a customer value proxy), picking something unmovable by the product team, or picking a metric everyone games.

Pair the north star with input metrics: the handful of drivers the team believes move it. If your north star is “weekly active teams completing a key workflow,” inputs might be signup-to-first-success time, invite rate, and workflow completion rate. Inputs should be owned—someone should feel responsible when they slip.
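
One way to make that ownership concrete is to write the hierarchy down as data the team can point to in reviews. The metric names, owners, and directions below are hypothetical, not a prescribed schema:

```python
# Illustrative only: a north star, its inputs with named owners, and guardrails.
metric_hierarchy = {
    "north_star": "weekly active teams completing a key workflow",
    "inputs": {
        "signup_to_first_success_hours": {"owner": "onboarding team", "direction": "down"},
        "invite_rate":                   {"owner": "growth team",     "direction": "up"},
        "workflow_completion_rate":      {"owner": "core product",    "direction": "up"},
    },
    "guardrails": ["p95 latency", "support tickets per 1k users"],
}
```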

Guardrail metrics keep you from winning yourself into trouble

Every optimization has casualties. Guardrails are the metrics you are not willing to sacrifice for short-term gains: latency, error rates, support tickets per thousand users, trust-and-safety incidents, gross margin, engineer burnout proxies like pager load.

Before you ship something aggressive, name the guardrails explicitly. “We expect faster onboarding to increase shallow signups; guardrail: qualified activation rate and refund rate.” That sentence prevents a team from celebrating a hollow top-of-funnel win.
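
A sketch of what naming guardrails explicitly can look like when you score the result: a launch counts as a win only if the primary metric improves and no guardrail crosses its pre-agreed bound. All thresholds and numbers here are made up for illustration:

```python
# Pre-agreed guardrail bounds checked alongside the primary metric.
guardrails = {
    "qualified_activation_rate": {"baseline": 0.31, "observed": 0.30, "max_drop": 0.02},
    "refund_rate":               {"baseline": 0.04, "observed": 0.05, "max_rise": 0.02},
}

primary_improved = True  # e.g., signups up 12% vs. control

def guardrail_breached(g: dict) -> bool:
    delta = g["observed"] - g["baseline"]
    if "max_drop" in g and -delta > g["max_drop"]:
        return True
    if "max_rise" in g and delta > g["max_rise"]:
        return True
    return False

breaches = [name for name, g in guardrails.items() if guardrail_breached(g)]
if primary_improved and not breaches:
    print("win: primary up, guardrails intact")
else:
    print(f"not a win yet: breached = {breaches}, primary_improved = {primary_improved}")
```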

Define success for a feature before you build, or you will retrofit a story

The failure mode to pre-empt is shipping, then hunting for a chart that went up. Write a one-paragraph success hypothesis before design locks:

  • Target user and behavior: who should do what differently?
  • Primary metric: what moves first?
  • Guardrails: what must not break?
  • Time horizon: when will you judge?
  • Minimum detectable effect: what size of change would change your next decision?

Example: “For admins on the Business plan, we believe inline role editing will reduce time-to-first-invite by 20% within four weeks, measured as the median time from workspace creation to first member join. Guardrail: no increase in permission-related support tickets.”
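
If it helps to keep the format consistent across bets, the same hypothesis can be captured as a small structured record; the field names below are one convenient shape, not a standard:

```python
from dataclasses import dataclass

# Hypothetical structure mirroring the five questions above, filled in with the
# inline role-editing example from the text.
@dataclass
class SuccessHypothesis:
    target_user: str
    expected_behavior: str
    primary_metric: str
    guardrails: list[str]
    time_horizon_weeks: int
    minimum_detectable_effect: str

inline_role_editing = SuccessHypothesis(
    target_user="admins on the Business plan",
    expected_behavior="edit roles inline instead of leaving the invite flow",
    primary_metric="median time from workspace creation to first member join",
    guardrails=["permission-related support tickets do not increase"],
    time_horizon_weeks=4,
    minimum_detectable_effect="20% reduction in time-to-first-invite",
)
```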

If you cannot state that before build, you are not ready to argue for the investment.

A compact example turns a fuzzy goal into something you can ship against

Suppose leadership asks for “better engagement” on a B2B SaaS dashboard. That is not measurable yet. You translate it into a user behavior in your ICP: operations managers should return mid-week to confirm exceptions, not only when something breaks. Your north star might stay company-level (healthy recurring revenue from that segment), but the feature bet targets inputs: increase the share of ICP accounts where at least one user opens the exceptions view twice in the first fourteen days, and reduce median time from login to “first useful insight” to under ninety seconds.

Guardrails might include page load p95, error rate on the data pipeline, and support tickets tagged “wrong numbers.” Before build, you write: “We believe a prioritized exceptions feed plus saved views will raise the two-visit-in-fourteen-days rate from 18% to 24% within six weeks, without worsening guardrails.” After ship, you compare not only the headline input metric but the mechanism: did users save views, did they act on alerts, did downstream retention move on a lag you pre-committed to watch? If the input moved but behavior did not, your model was wrong—update the hypothesis instead of declaring victory.
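
As a sketch under assumed table and column names (account_id, user_id, is_icp, account_start, viewed_at), the pre-committed input metric might be computed like this; the sample rows are fabricated:

```python
import pandas as pd

# Share of ICP accounts where at least one user opened the exceptions view
# twice within 14 days of account creation.
views = pd.DataFrame({
    "account_id":    [10, 10, 11, 12, 12, 13],
    "user_id":       ["u1", "u1", "u2", "u3", "u3", "u4"],
    "is_icp":        [True, True, True, True, True, False],
    "account_start": pd.to_datetime(["2024-06-01", "2024-06-01", "2024-06-03",
                                     "2024-06-05", "2024-06-05", "2024-06-08"]),
    "viewed_at":     pd.to_datetime(["2024-06-02", "2024-06-09", "2024-06-04",
                                     "2024-06-06", "2024-06-30", "2024-06-09"]),
})

icp = views[views["is_icp"]]
in_window = icp[icp["viewed_at"] <= icp["account_start"] + pd.Timedelta(days=14)]

opens = in_window.groupby(["account_id", "user_id"]).size()
accounts_hit = opens[opens >= 2].index.get_level_values("account_id").nunique()

rate = accounts_hit / icp["account_id"].nunique()
print(f"two-visit-in-fourteen-days rate (ICP): {rate:.0%}")  # compare against the 18% -> 24% target
```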

Vanity metrics flatter the room and starve learning

Vanity metrics look good in updates and explain little: total registered users, cumulative downloads, page views without intent, “engagement” defined as any click. They rise with marketing spend and hide weak product-market fit.

Watch for metrics without denominators, without cohorts, and without connection to habit formation. “MAU up” is meaningless if the numerator is inflated by one-off logins. Prefer depth metrics in your core segment over breadth across everyone who ever touched the product.

Instrument before you ship so post-launch debate is about learning, not data gaps

The measurement practice starts before code merges. Event naming, sampling, privacy review, and dashboard skeletons should be part of the definition of done. Shipping without instrumentation is borrowing time from your future self—with interest.

You do not need perfect analytics on day one. You need enough fidelity to test the hypothesis you claimed. If success depends on a funnel step you forgot to log, you did not finish the product.
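
One lightweight way to make instrumentation part of the definition of done is to declare the events the hypothesis needs and fail the build when the tracking plan misses one. The event names, properties, and tracking-plan format below are assumptions for illustration, not any vendor's API:

```python
# Events the success hypothesis depends on, with required properties.
REQUIRED_EVENTS = {
    "onboarding.role_edited":   {"workspace_id", "actor_id", "role"},
    "onboarding.member_joined": {"workspace_id", "invited_by", "joined_at"},
}

def validate_tracking_plan(plan: dict[str, set[str]]) -> list[str]:
    """Return human-readable gaps between the hypothesis and what is instrumented."""
    gaps = []
    for event, props in REQUIRED_EVENTS.items():
        if event not in plan:
            gaps.append(f"missing event: {event}")
        else:
            missing = props - plan[event]
            if missing:
                gaps.append(f"{event} missing properties: {sorted(missing)}")
    return gaps

# Example: the team forgot to log who sent the invite.
current_plan = {
    "onboarding.role_edited":   {"workspace_id", "actor_id", "role"},
    "onboarding.member_joined": {"workspace_id", "joined_at"},
}
print(validate_tracking_plan(current_plan))  # flags the missing 'invited_by' property
```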

After you ship, review expected versus actual with intellectual honesty

Schedule a review while memories are fresh. Compare the pre-launch hypothesis to what happened. Separate execution failures (bugs, slow rollout) from assumption failures (users did not want it, the metric was the wrong proxy).

Example: a team ships a redesigned checkout expecting higher conversion. Conversion is flat, but mobile drop-off improved and average order value rose. A lazy review declares failure because one headline number did not move. A rigorous review updates the model: the redesign traded speed for basket building; next iteration targets a different step.

The delta between expected and actual is where product judgment grows. If you always explain away misses, you learn nothing. If you punish misses, people stop writing falsifiable hypotheses.

Hypothesis-driven development is a loop, not a ceremony

The loop is simple: bet, ship, measure, decide. Bets should be small enough to fail cheaply. Measure with pre-registered criteria. Decide explicitly: double down, iterate, or stop—and communicate why.

A/B tests are one tool in the loop, not the loop itself. They help when traffic is sufficient, ethics and infrastructure allow, and the question is truly about a comparable alternative. Many important bets cannot be cleanly A/B tested: brand changes, pricing strategy, ecosystem partnerships. Still use hypotheses and leading indicators; do not pretend the absence of a test means the absence of accountability.

The metric hierarchy is how you scale alignment without drowning in dashboards

North star on top, a small set of inputs beneath, guardrails alongside. Everything else is diagnostic—useful for investigations, not for weekly steering. When someone proposes a new dashboard tile, ask which level it serves and which decision it informs.

Operational definitions matter more than the label on the chart

“Activation” is not a metric until you define the event, the user population, the time window, and how you handle edge cases (trials, invited users, bots). Two teams can report “retention” with different denominators and both be “right.” That is how executives lose trust in data.

Write the definition in one place everyone cites. When the definition changes, version it and note why. Boring data hygiene beats clever dashboards that nobody can reproduce.
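
A versioned definition can live as plain data next to the queries that cite it; the fields and values below are illustrative, not a standard:

```python
# One possible shape for a shared, versioned operational definition of "activation".
ACTIVATION_DEFINITIONS = {
    "v2": {
        "event": "core_workflow_completed",
        "population": "non-trial workspaces, human users only (bots excluded)",
        "window_days": 14,
        "edge_cases": "invited users count toward the inviting workspace",
        "changed_from_v1": "window narrowed from 30 to 14 days after cohort analysis",
    },
}
```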

Segmentation turns averages into actionable signal

Headline numbers lie politely. A flat conversion rate can hide a collapsing mobile funnel and an improving desktop one. Retention can look fine while your best segment churns and a noisy long tail masks it. Default views should include your ICP and major platforms or geos—whatever stratification matches how you actually run the business.

Example: enterprise buyers and self-serve users should rarely share one activation definition. Treat them as different products with different hypotheses, even if the codebase is shared.
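
A fabricated example of how a flat headline can hide the segment story: mobile conversion collapses, desktop improves, and the blended rate barely moves because the traffic mix shifts toward desktop:

```python
import pandas as pd

# Made-up funnel counts before and after a change.
funnel = pd.DataFrame({
    "period":   ["before", "before", "after", "after"],
    "platform": ["mobile", "desktop", "mobile", "desktop"],
    "visits":   [8000, 2000, 4000, 6000],
    "converts": [800,  300,  160,  960],
})

by_segment = funnel.assign(rate=funnel["converts"] / funnel["visits"])
headline = funnel.groupby("period")[["converts", "visits"]].sum()
headline["rate"] = headline["converts"] / headline["visits"]

print(by_segment[["period", "platform", "rate"]])  # mobile 10% -> 4%, desktop 15% -> 16%
print(headline["rate"])                            # blended rate barely moves (11% -> 11.2%)
```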

Frequency and power: not every bet needs daily monitoring

Match review cadence to how fast a metric can move and how costly a delay would be. Some inputs deserve daily triage during a launch window; lagging revenue outcomes might be monthly. Over-reviewing slow metrics creates false urgency; under-reviewing fast ones lets bugs and bad launches compound.

Qualitative evidence still belongs in the loop—metrics tell you what, not always why

Instrumentation shows that users abandon step three. It does not explain whether the copy is confusing, the value unclear, or the flow violates a mental model. Pair quant with targeted sessions, support ticket themes, and sales objection logs. The best reviews connect a chart move to a plausible mechanism—and then test that mechanism.

Kill metrics that no longer drive decisions

Dashboards accrue cruft. A metric that nobody has acted on in two quarters is either poorly understood, wrongly owned, or obsolete. Retire it publicly so people stop debating ghosts. Fewer tiles, sharper ownership, clearer rituals beat encyclopedic analytics.

Statistical humility is part of the practice

Small samples lie loudly. Seasonality masquerades as trend. A/B tests stop early on noise. You do not need a PhD; you do need enough literacy to ask: is this effect size plausible, is the segment big enough, did we peek, did external events confound the read? When uncertain, widen the evidence: holdouts, longer windows, qualitative follow-ups.
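
A back-of-the-envelope check for the 18% to 24% example above, using the standard two-proportion sample-size approximation; it is not a substitute for proper experiment design, but it answers "is the segment big enough?":

```python
from scipy.stats import norm

def sample_size_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate accounts per arm needed to detect a change from p1 to p2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return int(round(n))

print(sample_size_per_group(0.18, 0.24))  # roughly 700+ accounts per arm
```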

Tie metrics to owners and forums, or they drift

A number without an owner becomes wallpaper. Name who wakes up if an input metric regresses for two weeks, and which meeting reviews it with enough power to reallocate work. Metrics discussed only in postmortems teach the org to optimize for storytelling, not steering.

Baselines and targets should be explicit enough to argue about

“We want more engagement” is not a target. “Raise weekly returning users in the ICP from 42% to 48% by end of Q2, without increasing support tickets per thousand users” is. Baselines ground the conversation in reality; targets force prioritization. If nobody can explain how a target was chosen, expect sandbagging or cynicism.

Opinionated close: teams that measure well argue less about opinions and more about next experiments. Teams that measure badly argue about narratives. Write the hypothesis before the PR ships, and review the result after—with enough humility to change your mind.