Building AI Features: What PMs Need to Know

How AI delivery differs from classic software—and how PMs define 'done' when the system is never perfect.

Shipping an AI feature is still shipping software—but the center of gravity moves from “implement the spec” to “learn whether the spec is even achievable at the quality bar you need.” That shift changes how you plan, how you demo progress, and how you negotiate scope with leadership.

This lesson is about execution: what to expect in development, how to define success when models misbehave, when to buy capability vs. build it, and how to partner with ML engineers without pretending you share their job.

Waterfall plans rot quickly; experiment-first delivery is the honest default

In traditional product development, you can sometimes sequence design, build, test, and launch with predictable milestones. In ML-heavy work, unknowns hide inside the data and the real-world distribution. You discover them by training, evaluating, and failing in controlled ways.

That does not mean chaos. It means structuring work as hypothesis → experiment → decision: a bounded spike, a labeled eval set, an offline benchmark, a shadow mode, a gradual rollout.

Product implication: Your roadmap should show learning milestones, not only feature milestones. “Model v0.3 beats baseline on held-out tickets” is a legitimate progress marker—even if the UI barely changed.
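
A learning milestone like that can be expressed as a small, reviewable check. The sketch below is illustrative only: load_heldout_tickets, model_v0_3, and keyword_baseline are hypothetical stand-ins for your team's own data loader and predictors.

    # Minimal sketch of a learning-milestone check: does the candidate beat the
    # baseline on a held-out set by a meaningful margin? All names are illustrative.

    def accuracy(predict, examples):
        # examples: list of (text, label) pairs held out from training
        correct = sum(1 for text, label in examples if predict(text) == label)
        return correct / len(examples)

    def beats_baseline(candidate, baseline, examples, margin=0.02):
        # The milestone passes only if the candidate wins by at least `margin`.
        return accuracy(candidate, examples) >= accuracy(baseline, examples) + margin

    # Hypothetical usage:
    # heldout = load_heldout_tickets()
    # if beats_baseline(model_v0_3, keyword_baseline, heldout):
    #     print("Milestone: Model v0.3 beats baseline on held-out tickets")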

Evaluation metrics are where product intent becomes mathematics

Engineers will talk about loss functions, BLEU scores, perplexity, AUC, F1, and more. Your job is to connect those to user-visible outcomes.

Ask: which metric, if improved, would make the experience meaningfully better? Which metric could improve while the product still feels worse—because latency rose, costs exploded, or false positives angered users?

Product implication: Co-own a small metric stack with your ML lead:

  • Offline metrics for fast iteration.
  • Online metrics tied to behavior (task success, retention, escalation rate).
  • Guardrail metrics you refuse to regress (latency, cost, toxicity rate, support volume).

If the team optimizes only offline numbers, you will eventually ship something users hate—or cannot afford.
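
One lightweight way to keep guardrails from regressing is to encode them as an explicit release gate. This is a sketch under assumed metric names and thresholds, not a prescribed standard; set the ceilings with your ML lead.

    # Guardrail gate sketch: refuse to ship if any guardrail metric exceeds its ceiling.
    # Metric names and ceilings below are placeholders.

    GUARDRAILS = {
        "p95_latency_ms": 1200,
        "cost_per_task_usd": 0.05,
        "toxicity_rate": 0.001,
    }

    def passes_guardrails(observed):
        # Missing measurements fail the gate rather than passing silently.
        return all(
            observed.get(name, float("inf")) <= ceiling
            for name, ceiling in GUARDRAILS.items()
        )

    # passes_guardrails({"p95_latency_ms": 950, "cost_per_task_usd": 0.03, "toxicity_rate": 0.0004})
    # -> True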

Rollout strategy is part of the spec—not an afterthought

Shadow mode runs the new system in parallel without changing user-visible behavior. It is how you measure disagreement between old and new, calibrate thresholds, and catch ugly failures before they touch customers.

Gradual rollouts—percentage traffic, cohorts, regions—contain blast radius. Rollback must be real: feature flags, model version pins, and a decision owner who can revert without a committee.

Product implication: Write rollout steps as acceptance criteria. “Launch” is not a date; it is a sequence with exit conditions.
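
One way to make that concrete is to write the rollout plan as data the team reviews like code. Stage names, traffic fractions, and exit conditions below are examples, not recommendations.

    # Rollout-as-spec sketch: launch is a sequence of stages, each with an exit condition.

    ROLLOUT_PLAN = [
        {"stage": "shadow",  "traffic": 0.00, "exit": "old/new disagreement measured; no severe failures"},
        {"stage": "canary",  "traffic": 0.01, "exit": "guardrail metrics flat for one week"},
        {"stage": "cohort",  "traffic": 0.10, "exit": "task success at or above baseline; support volume unchanged"},
        {"stage": "general", "traffic": 1.00, "exit": "rollback owner signs off"},
    ]

    def next_stage(current):
        # Advance one stage at a time; rolling back is a flag flip, not a committee.
        names = [s["stage"] for s in ROLLOUT_PLAN]
        i = names.index(current)
        return ROLLOUT_PLAN[i + 1] if i + 1 < len(ROLLOUT_PLAN) else None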

Prompts and model versions are product assets—manage them like code

For LLM-based features, prompts and model versions are not trivia. They are behavior. A “small wording change” can shift tone, safety, and factual tendencies.

You need versioning, change logs, and access controls appropriate to your risk. You also need a practice for regression testing when the provider updates a base model.

Product implication: Ask how changes are reviewed, how you reproduce yesterday’s behavior, and what happens when a vendor deprecates a model. If the answer is “we’ll handle it,” press for specifics.
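
A minimal version of “manage prompts like code” is a registry that pins the prompt text and the model version together, plus a golden-set regression check that runs on every change. The registry layout and the support_summary example below are assumptions for illustration.

    # Prompt/model registry sketch: pin versions so yesterday's behavior is reproducible.

    PROMPT_REGISTRY = {
        "support_summary": {
            "prompt_version": "2025-03-01.2",
            "model": "vendor-model-pinned-id",   # pinned, never "latest"
            "template": "Summarize the ticket below in two sentences:\n\n{ticket}",
        }
    }

    def run_regression(generate, golden_cases):
        # golden_cases: list of (input_text, check_fn); rerun on any prompt or model change.
        failures = []
        for text, check in golden_cases:
            output = generate(text)
            if not check(output):
                failures.append((text, output))
        return failures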

Design partners need early access to bad outputs—not late access to polished ones

AI UX is not only layout. It is trust mechanics: how uncertainty is shown, how users correct errors, how sources are cited, how defaults are framed, and how escalation works.

If design only sees cherry-picked outputs, you will ship a beautiful interface around a dishonest experience.

Product implication: Co-create content design for failure strings, disclaimers, and progressive disclosure. Legal and brand often care as much about a single sentence as about the model card.

“Done” is a risk decision, not a checkbox

Classic features are “done” when they meet acceptance criteria. AI features are “ready to ship” when they meet risk-adjusted criteria: performance under load, monitoring in place, rollback paths defined, and stakeholder alignment on residual failure modes.

There will be residual error. The launch question is whether that error is understood, bounded, and acceptable given mitigations.

Product implication: Write launch criteria that include operations: dashboards, alerts, sampling plans for quality review, and an explicit policy for incident response. If you would not operate a payment system without monitoring, do not operate customer-facing AI without it either.
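
As one example of an operational launch criterion, a daily sampling plan for human quality review fits in a few lines. The sample size and the shape of the interaction log are assumptions here.

    # Sampling-plan sketch: pull a random slice of yesterday's interactions for human graders.

    import random

    def daily_review_sample(interactions, k=50, seed=None):
        # interactions: yesterday's logged outputs; k: how many a reviewer grades per day
        rng = random.Random(seed)
        return rng.sample(interactions, min(k, len(interactions)))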

Build vs. buy is mostly a question of differentiation, data, and pace

Buy (APIs from providers such as OpenAI, Anthropic, or Google; other vendor models; managed services) when:

  • Speed to market matters more than owning the weights.
  • Your differentiation is workflow, data moat, or UX—not raw model architecture.
  • You lack the team and budget to train and serve at scale.

Build (custom training, fine-tuning, proprietary serving) when:

  • Domain performance gaps are large and persistent in evaluations.
  • Unit economics at your volume require it.
  • You have proprietary data and evaluation discipline to capitalize on it.

Hybrid approaches are common: buy the base model, build retrieval, tooling, evals, and fine-tunes on top.

Product implication: Force a TCO conversation early: not list price per token, but end-to-end cost including engineering time, review operations, and failure handling. The cheapest API is not cheap if it requires an army of human reviewers.
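
A back-of-the-envelope model makes that conversation concrete. Every number below is invented for illustration; substitute your own volumes, review rates, and labor costs.

    # TCO sketch: end-to-end cost per successful task, not list price per token.

    def cost_per_successful_task(api_cost, review_rate, review_cost, success_rate):
        # api_cost: model spend per task; review_rate: share of tasks a human checks;
        # review_cost: cost of one human review; success_rate: share of tasks that succeed.
        return (api_cost + review_rate * review_cost) / success_rate

    # "Cheap" API with heavy review:  (0.002 + 0.30 * 0.50) / 0.90 ≈ $0.17 per success
    # Pricier API with light review:  (0.020 + 0.02 * 0.50) / 0.95 ≈ $0.03 per success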

Latency and cost are part of the user experience

A brilliant model that returns in eight seconds will lose to a good model that returns in eight hundred milliseconds—for interactive tasks. Likewise, a feature that burns margin may win a demo quarter and die in finance the next.

Product implication: Set non-functional requirements alongside accuracy: p95 latency, max cost per successful task, and peak-load behavior. Test these early; they are harder to retrofit than UI.
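
Those non-functional requirements are easiest to hold when everyone computes them the same way. A minimal sketch, assuming you log per-request latencies; the nearest-rank p95 below is a simple approximation and the budgets are placeholders.

    # NFR measurement sketch: p95 latency and a cost check against stated budgets.

    def p95(latencies_ms):
        # Nearest-rank approximation of the 95th-percentile latency.
        ordered = sorted(latencies_ms)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def meets_nfrs(latencies_ms, cost_per_task_usd, p95_budget_ms=1500, cost_budget_usd=0.05):
        # Budgets should be set alongside accuracy targets, not retrofitted later.
        return p95(latencies_ms) <= p95_budget_ms and cost_per_task_usd <= cost_budget_usd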

Success criteria must include failure modes, not only happy paths

Users will encounter:

  • Low-confidence outputs.
  • Hallucinated specifics.
  • Toxic or off-brand language.
  • Tool-call errors in agentic setups.

Your product should define what happens in each case: suppression, fallback, human routing, safe defaults, and clear messaging.

Product implication: Add failure UX to your PRDs the same way you specify empty states. “What does the user see when the model is uncertain?” is not an edge case; it is the center of trustworthy design.
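
One way to force that specification is to route every output through an explicit decision, so each failure mode has a defined user experience. The thresholds and the is_safe checker below are illustrative assumptions, not a recommended policy.

    # Failure-routing sketch: every output gets an explicit action, including the bad ones.

    def route_output(text, confidence, is_safe):
        # Returns (action, what the user sees) so no failure mode is left undefined.
        if not is_safe(text):
            return ("suppress", "Safe default message; output logged for review")
        if confidence < 0.40:
            return ("escalate", "Hand off to a human with the draft attached")
        if confidence < 0.70:
            return ("hedge", "Show the answer with an uncertainty caveat and sources")
        return ("answer", text)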

Working with ML engineers: what to ask

Ask questions that respect their craft and sharpen decisions:

  • Data: What labels do we trust? Where is leakage risk? What slice of data is underrepresented?
  • Baseline: What is the simplest model or non-ML baseline—and how much better must we be to justify complexity?
  • Eval: What datasets represent production? How do we prevent overfitting to demo sets?
  • Robustness: What inputs break us—adversarial, multilingual, long context, noisy OCR?
  • Serving: What happens at 10x traffic? What degrades first?

What to trust

Trust their read on feasibility timelines when grounded in data access and compute realities. Trust their instinct on where shortcuts create brittle systems. Trust them when they say a request is underspecified—because in ML, vague requirements often hide impossible expectations.

What to challenge

Challenge claims that skip baselines. Challenge demos without held-out evaluation. Challenge accuracy numbers without segment breakdowns. Challenge launches without monitoring. Challenge “we can fine-tune that away” without a plan and budget.

Your skepticism is not obstruction; it is quality bar enforcement expressed in product language.

Product management still owns the narrative—especially with AI

Engineers can quantify tradeoffs; you still decide what bet the organization makes and how it is explained to users. That includes setting honest expectations in marketing copy, in-product disclaimers, and sales enablement.

Product implication: Align externally visible promises with internal eval results. The gap between marketing superlatives and lived experience is where trust dies.

Cross-functional rituals matter more, not less

Design, content design, legal, support, and sales should see realistic outputs before launch—not curated winners. The earlier they see failure, the better the guardrails.

Product implication: Schedule red team sessions with messy inputs. If the room is uncomfortable, you are doing it right.

Documentation is not bureaucracy when the system is statistical

Runbooks for incidents, model cards or lightweight model documentation, and clear ownership for retraining are how teams survive the Tuesday when quality drops and nobody knows why.

You do not need enterprise ceremony. You do need shared truth: what shipped, what data it was trained on, what metrics justified launch, and how to roll back.

Product implication: Ask for a single source of truth link set that new engineers can read in an hour. If onboarding relies on tribal memory, your bus factor is one vacation away from an outage.

You are shipping a system, not a model

The durable product is: model + data + prompts + tools + UX + monitoring + policy + operations. PMs who fixate on the model alone ship fragile experiences.

If this series did its job, you can now move across the stack with confidence: fundamentals, opportunity selection, data readiness, and delivery discipline. The next step is practice—run a small assessment on a real initiative on your roadmap this week, with your team, using the language you now share.