Evaluating AI Models and Vendors

8 min read

A practical framework for PMs to evaluate models and vendors: capability, cost, latency, reliability, customizability, and lock-in—plus how to run evaluations that reflect real use cases, not demos.

If you are shipping AI in a product, you will spend real money and reputation on choices you do not fully control. Models drift. Vendors reprice. APIs change. Demos lie—not because everyone is dishonest, but because a polished sandbox is not your traffic, your prompts, or your failure modes.

Your job is not to become a researcher who reads every paper. Your job is to make defensible bets: know what you are optimizing for, how you will know if you are wrong, and what it costs to switch when you are.

Capability is the only dimension that matters if the model cannot do the job

Start with a blunt question: does this system reliably produce outputs that are good enough for the user task, under your constraints?

Capability is not a leaderboard score. It is fit to your use cases: your languages, your domain vocabulary, your edge cases, your quality bar. A model that crushes general benchmarks can still be mediocre at your specific extraction task, or hallucinate on your regulatory wording, or fail when users paste messy PDFs.

Translate capability into product language:

  • Task success rate: What percentage of interactions meet your definition of “acceptable” without human fix-up?
  • Failure shape: When it fails, is it obviously wrong, subtly wrong, or confidently wrong? The last one is the most expensive.
  • Coverage: Does it work for the segments you care about—locales, plan tiers, long documents, low-bandwidth clients?

If you cannot describe “good enough” without waving at “AI magic,” you are not ready to evaluate vendors. Write the acceptance criteria first. Then test against them.

Cost is not the sticker price; it is unit economics at your scale

Vendors quote per-token, per-request, per-seat, or bundled credits. Your finance model should answer: what does one successful user outcome cost in inference, including retries, logging, and safety checks?

Build a simple model:

  • Volume drivers: tokens in, tokens out, number of round-trips (tool use and multi-step flows multiply calls fast).
  • Peaks: Black Friday, quarter-end, viral content. Margins die in spikes if you priced for averages.
  • Waste: Over-prompting, redundant summarization, users regenerating until they get something usable. Product design drives cost as much as model choice does.

Per-token pricing feels precise. It is also easy to misread. A “cheap” model that needs three attempts and a human review step is often more expensive than a pricier one-shot model.
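
To make that comparison concrete, here is a minimal back-of-envelope sketch in Python. Every number in it is a hypothetical placeholder, not a vendor quote: substitute your own measured token counts, retry rates, pricing, and review costs.

```python
# Back-of-envelope cost per successful user outcome.
# All prices, token counts, retry rates, and overheads below are hypothetical.

def cost_per_success(
    price_in_per_1m: float,       # $ per 1M input tokens
    price_out_per_1m: float,      # $ per 1M output tokens
    tokens_in: int,               # avg input tokens per call (prompt + context)
    tokens_out: int,              # avg output tokens per call
    calls_per_attempt: float,     # round-trips per attempt; tool use multiplies this
    attempts_per_success: float,  # retries/regenerations until output is acceptable
    overhead_per_success: float = 0.0,  # logging, safety checks, human review ($)
) -> float:
    per_call = (tokens_in * price_in_per_1m + tokens_out * price_out_per_1m) / 1_000_000
    return per_call * calls_per_attempt * attempts_per_success + overhead_per_success

# "Cheap" model that needs three attempts plus a light human review step...
cheap = cost_per_success(0.15, 0.60, 2_000, 500, 1.2, 3.0, overhead_per_success=0.05)
# ...versus a pricier model that usually succeeds in one shot.
pricey = cost_per_success(3.00, 15.00, 2_000, 500, 1.2, 1.1)

print(f"cheap model:  ${cheap:.4f} per successful outcome")
print(f"pricey model: ${pricey:.4f} per successful outcome")
```

With these made-up numbers the "cheap" model costs roughly three times more per successful outcome once retries and review are counted, which is exactly the misreading per-token pricing invites.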

Latency is a UX and conversion constraint, not a benchmark flex

Users experience time to first token, time to complete response, and jank (stuttering streams, UI freezes). For some flows—inline assist while typing—hundreds of milliseconds matter. For batch jobs, minutes might be fine.

Ask:

  • What are the p95 and p99 latencies under load, not in the demo on a quiet network?
  • Does the product need streaming partial results to feel responsive?
  • What happens when latency spikes—graceful degradation, queueing, or silent failure?

If your roadmap assumes “real-time everywhere,” validate that assumption with instrumentation before you sign a contract. Latency SLAs in vendor agreements are worth negotiating; marketing pages are not.
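
Grounding that conversation takes real measurements, not vendor charts. Here is a minimal sketch that computes tail percentiles from your own logged samples; the field names and sample values are made up, and the nearest-rank percentile is one reasonable convention among several.

```python
# Minimal sketch: compute p50/p95/p99 from latencies you logged under
# production-like load. Sample values below are placeholders.

import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile of latency samples, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Replace with real measurements: time to first token and total completion time.
time_to_first_token_ms = [220, 310, 290, 1450, 260, 275, 3100, 240, 330, 295]
total_completion_ms = [900, 1200, 1100, 5200, 1000, 1050, 9800, 950, 1300, 1150]

for name, samples in [("TTFT", time_to_first_token_ms), ("total", total_completion_ms)]:
    print(f"{name}: p50={percentile(samples, 50):.0f}ms "
          f"p95={percentile(samples, 95):.0f}ms p99={percentile(samples, 99):.0f}ms")
```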

Reliability is uptime plus behavioral consistency

Uptime is table stakes: API availability, regional redundancy, incident communication. Read the status page history, not the SLA slide.

Behavioral consistency is harder and more product-relevant. The same prompt can yield different answers across days as models update. That breaks tests, support scripts, and user trust.

You need a strategy for:

  • Version pinning where possible, and a process for controlled upgrades.
  • Regression testing when the vendor ships a new release—your eval suite is part of reliability.
  • Fallbacks when the primary model is degraded or rate-limited.

Treat model updates like any other dependency upgrade: scheduled, tested, owned.
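
A minimal sketch of that posture, assuming nothing about your actual SDK: the model identifiers, the `call_model` helper, and the error type below are hypothetical placeholders, but the shape is the point: pin an explicit version and route around degradation instead of failing silently.

```python
# Hypothetical sketch: pinned model versions plus a fallback path.

PRIMARY = {"provider": "vendor-a", "model": "answer-model-2025-06-01"}   # pinned, never "latest"
FALLBACK = {"provider": "vendor-b", "model": "answer-model-small-2025-05-15"}

class ProviderError(Exception):
    """Raised by call_model on timeouts, rate limits, or 5xx responses."""

def call_model(config: dict, prompt: str, timeout_s: float) -> str:
    # Wrap your real SDK call here; raise ProviderError on failure.
    raise NotImplementedError

def generate_answer(prompt: str) -> str:
    try:
        return call_model(PRIMARY, prompt, timeout_s=10)
    except ProviderError:
        # Degrade gracefully; log which path served the user so upgrades are auditable.
        return call_model(FALLBACK, prompt, timeout_s=10)
```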

Customizability determines whether you can improve the system without rebuilding it

Most product teams will need some path to specialization:

  • Prompting and tooling (cheapest, fastest, limited ceiling).
  • Retrieval-augmented generation (RAG) with your documents and policies—strong fit for knowledge work products if chunking and retrieval are done well.
  • Fine-tuning when you have labeled data and a stable task formulation—often oversold as mandatory; sometimes right.
  • Custom training when the problem is core to the business, data is proprietary and abundant, and you have the org to operate it—rare for most PM-led bets.

The PM question is not “can we fine-tune?” It is “where does marginal quality come from over the next six months: better retrieval, better evaluation, better UX, or model weights?”

Lock-in risk is the hidden line item on every vendor deal

Lock-in shows up as proprietary formats, custom orchestration glued to one API, data gravity in a vendor’s vector store, and workflows that assume one provider’s quirks.

Mitigate without fantasizing about zero lock-in:

  • Abstract interfaces at the boundary your team controls (one place that maps “generate answer” to a provider).
  • Portable artifacts: your eval sets, your golden prompts, your logged production examples—those are strategic assets.
  • Contract terms: egress, export, notice periods, pricing bands.

You will still have switching costs. The goal is to know what they are before they become existential.
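
Here is a minimal sketch of that boundary, with illustrative class and method names rather than any real SDK: product code depends on one small interface, and each vendor's quirks stay inside its adapter.

```python
# Hypothetical sketch: one place that maps "generate answer" to a provider.

from typing import Protocol

class AnswerProvider(Protocol):
    def generate_answer(self, prompt: str, *, max_tokens: int) -> str: ...

class VendorAProvider:
    def generate_answer(self, prompt: str, *, max_tokens: int) -> str:
        # Call vendor A's SDK here and normalize its response shape.
        ...

class VendorBProvider:
    def generate_answer(self, prompt: str, *, max_tokens: int) -> str:
        # Call vendor B's SDK here; its quirks never leak past this adapter.
        ...

def answer_user(provider: AnswerProvider, question: str) -> str:
    # Product code depends only on the interface, never on a vendor SDK.
    return provider.generate_answer(question, max_tokens=512)
```

Switching vendors then means writing one new adapter and rerunning your evals, not rewriting every call site.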

The build vs. buy spectrum is really a sequence of bets

Think on a spectrum, not a binary:

  1. Raw API (bring your own orchestration): Maximum control; you own reliability, security, and prompt engineering. Best when AI is central and differentiated.
  2. Managed service (hosted models, guardrails, sometimes RAG): Faster time to value, less operational burden, more opinionated. Best when you need to ship and learn.
  3. Fine-tuned model: When generic capability plateaus and you have clean task data. Requires ongoing maintenance.
  4. Custom-trained model: When the model is the moat and you have the data and ML org to sustain it. Expensive, slow, not where most PMs should start.

Most teams should start managed, prove value, then narrow where custom work actually moves the needle. Starting at custom training because it sounds serious is how you burn two years before your first retained user.

Meaningful evaluations look like product QA, not a single impressive demo

Demos select for cherry-picked inputs. Production selects for chaos.

Run structured evaluations:

  • Build a labeled set from real user inputs (anonymized, consented). Include hard cases: ambiguous asks, adversarial phrasing, multilingual text, long attachments.
  • Define rubrics humans can apply consistently: factual accuracy, policy compliance, tone, actionability. “Vibes” is not a rubric.
  • Score models and prompts the same way across versions. Track regressions when anything changes.
  • Blend automatic checks (format, citations, forbidden phrases) with human review for subjective quality.

A/B tests in production matter, but they optimize locally. Offline eval catches catastrophic failure modes before they hit users. You want both.
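
A minimal sketch of the automatic-check half of that harness, under stated assumptions: the forbidden-phrase list, the expectation of JSON output, and the `generate` callable are all placeholders for your own task. The key property is that the same checks run over every model and prompt version, and the scores are written down so regressions are visible.

```python
# Hypothetical sketch: uniform automatic checks across model/prompt versions.

import csv, json, re
from typing import Callable

FORBIDDEN = re.compile(r"\b(guaranteed|as an ai language model)\b", re.IGNORECASE)

def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def automatic_checks(output: str) -> dict:
    return {
        "valid_json": _is_json(output),
        "no_forbidden_phrases": FORBIDDEN.search(output) is None,
        "non_empty": bool(output.strip()),
    }

def run_eval(model_version: str, cases: list[dict],
             generate: Callable[[str], str], out_path: str) -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model_version", "case_id", "valid_json",
                         "no_forbidden_phrases", "non_empty"])
        for case in cases:
            output = generate(case["input"])
            checks = automatic_checks(output)
            writer.writerow([model_version, case["id"], *(int(v) for v in checks.values())])
    # Human-rubric scores (accuracy, compliance, tone) are joined on case_id separately.
```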

Vendor pitches reward optimism; your process should reward evidence

Red flags to treat seriously:

  • Benchmark heroism without results on your data. Leaderboards are marketing inputs, not procurement decisions.
  • “We handle safety” as a black box. You still own the user impact. Ask how they test, how they update policies, and what you can configure.
  • Unlimited scale promises without architecture specifics. Inference is physics and dollars.
  • Hand-wavy fine-tuning as the answer to every gap. Often the gap is task definition, data quality, or retrieval.
  • Exclusive reliance on their eval harness. You need your harness; theirs is a supplement.

Green flags look boring: clear versioning, transparent pricing, solid logging, incident history, documentation that matches reality, and engineers who admit limits.

A decision memo beats a scorecard nobody will maintain

Executives do not need thirty rows of model trivia. They need a decision with explicit tradeoffs. A useful memo answers:

  • Problem and use case: What user or business outcome are we buying?
  • Options: Two or three real candidates—not fifteen—and why the long tail was cut.
  • Recommendation: One primary and one backup, with migration path.
  • Risks: Lock-in, regulatory, reputational, operational—with owners and mitigations.
  • Kill criteria: What signal in the first thirty or ninety days would make us pivot?

Attach your eval summary as an appendix. Keep the narrative short. If you cannot explain the tradeoff in five minutes, you probably have not decided yet.

Proof-of-concept is for learning; production pilot is for economics

Vendors love a POC. You should too—if the POC is structured like an experiment, not a morale event.

Good POC design:

  • Uses your data and your prompts, not the vendor’s tour script.
  • Measures latency, cost, and quality together—optimizing one in isolation lies.
  • Includes a failure budget: how many severe errors per thousand are acceptable for this phase?
  • Ends with a written verdict: proceed, proceed with conditions, or stop.

A POC that ends in “everyone liked it” is a vanity exercise. A POC that ends in “here is the cost curve at 10x traffic and here is where quality breaks” is procurement ammunition.
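
As a small illustration of ending with a verdict rather than applause, here is a sketch that checks measured results against thresholds agreed before the POC started. Every threshold and number below is hypothetical.

```python
# Hypothetical sketch: turn POC measurements into a written verdict.

def poc_verdict(severe_errors: int, total_cases: int,
                cost_per_success_usd: float, projected_10x_monthly_usd: float) -> str:
    severe_per_thousand = severe_errors / total_cases * 1000
    if severe_per_thousand > 5:                 # failure budget for this phase
        return "stop: quality below failure budget"
    if projected_10x_monthly_usd > 50_000:      # cost ceiling at 10x traffic
        return "proceed with conditions: renegotiate pricing or cut token waste"
    return (f"proceed: {severe_per_thousand:.1f} severe errors per 1k cases, "
            f"${cost_per_success_usd:.3f} per successful outcome")

print(poc_verdict(severe_errors=3, total_cases=1200,
                  cost_per_success_usd=0.021, projected_10x_monthly_usd=18_000))
```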

Security and compliance are part of evaluation, not a post-signing surprise

Ask early about data residency, subprocessors, audit logs, retention defaults, and whether prompts are used for vendor training. B2B buyers will send questionnaires; consumer products still face app-store policies, privacy expectations, and incident scrutiny.

If legal review happens after engineering has hard-coded one provider’s SDK, you have surrendered leverage. Bring security and legal in when you have two viable paths, not zero.

Your takeaway as PM: own the decision frame, share the implementation depth

You do not need to pick the exact architecture on day one. You do need to define success, cost envelopes, latency budgets, and migration paths—and to insist that any recommendation survives contact with your real inputs.

Evaluating models and vendors is continuous work. The team that treats it as a one-time RFP will relearn the same lessons in production, expensively. The team that builds evals, monitors drift, and revisits the build/buy line each quarter compounds advantage. That is the bar.