Spec-Driven SDLC: A new paradigm for AI-first Agile product teams
Why agile teams should treat specs as the source of truth for AI collaborators, and how I run this site as a working demo. Two live examples + a spec generator you can use right now.
Three pillars
Specs are the contract
Humans write intent. AI writes code. Specs are how the two negotiate without misunderstanding.
Evals are the acceptance
Acceptance criteria become test specs. 'Done' means the eval is green and the spec moved to implemented.
The roadmap is the backlog
ROADMAP.md is the living kanban. Every spec lands there before it's coded, and stays there after it ships.
The thesis
Agile worked when teams shipped what humans typed.
In 2026, the typing is happening somewhere else. A senior PM defines an outcome. An AI agent (Cursor, Claude Code, Codex, whatever's hot this quarter) produces a working diff in fifteen minutes. The bottleneck is no longer how fast engineers can pattern-match. It's how precisely the rest of us can describe what we want.
User stories don’t survive that handoff. “As a user, I want to filter restaurants by dietary preference, so that I can order without scrolling” is fine for a sprint planning meeting. It’s a disaster as input to an LLM that will write the filter, the index, the schema, the tests, and the analytics events all at once. The model fills in the gaps with whatever’s most common in its training data, not whatever’s right for your product.
The fix isn’t to give the AI a bigger story. The fix is to give it a spec.
A spec is what a story looks like once you’ve removed the ambiguity. It states the schema. It freezes the prompt. It encodes acceptance criteria as binary checks. It says explicitly what is not in scope. When you hand a good spec to a capable AI agent, it produces code that matches what you asked for. When you hand a vague story to the same agent, it produces code that matches what someone else’s product needed.
That difference compounds. Over a quarter, teams that ship from specs out-execute teams that ship from stories, not because their AI is better, but because their intent is better-encoded.
This essay is a working demo of that thesis. Three pillars, two product examples, and a spec generator you can use right now.
What lives behind the pillars
Specs are the contract. Look in specs/ of this repo. Every feature on this website (Ask AI, voice mode, the local-only WebLLM path, the project page you're reading) has a spec. The specs are written first, reviewed (sometimes by me alone, sometimes with another LLM critic), and only then implemented. The implementation is unsurprising because the surprises were resolved in the spec.
Evals are the acceptance. A spec without acceptance criteria is a wish. Acceptance criteria written as English (“the page should be fast”) are also wishes. Acceptance criteria written as binary, machine-checkable assertions (“Lighthouse perf ≥ 90 on a Pixel 5 over throttled 4G”) are contracts. The two example slices below take this further: the eval test files are direct translations of the spec's acceptance criteria. If the eval passes, the spec is satisfied. If the eval fails, the next move is not to “fix the test”, it's to revisit the spec.
The roadmap is the backlog. specs/ROADMAP.md is one file. Every feature, partial or done, has a row. Every row has a status emoji. Every row links to its spec. There is no Jira. There is no separate backlog. When the spec moves to approved, the roadmap row moves to 🔄. When the eval is green and the code is merged, both move to ✅. The single source of truth eliminates the “wait, what’s the actual state of this?” round-trips that eat senior PM time.
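A roadmap row needs only three things: the feature, a link to its spec, and a status. Something like this (illustrative only; these rows and paths are placeholders, not the actual contents of specs/ROADMAP.md):

```md
| Feature    | Spec                          | Status |
| ---------- | ----------------------------- | ------ |
| Ask AI     | specs/projects/ask-ai.md      | ✅     |
| Voice mode | specs/projects/voice-mode.md  | 🔄     |
```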
Demo 1: Swiggy-style ETA estimator
The first example is intentionally non-AI code: a pure function, deterministic, two hundred lines including types. The point is to show that spec-driven SDLC isn't only useful when an LLM is in the loop at runtime. It's useful any time the rules of the system are non-obvious and the consequences of getting them wrong are real.
The spec lives at examples/spec-driven-swiggy-eta/specs/feature_eta_spec.md. It freezes:
- The formula (prep × load factor × item factor + travel + traffic buffer + handoff)
- The output shape (etaMinutes, plus a transparent breakdown)
- The acceptance threshold (≥9 of 10 fixtures within ±3 minutes; median absolute error ≤2 minutes)
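To make the contract concrete, here is a minimal sketch of a function that satisfies those three freezes. The type and field names are assumptions for illustration; the governed implementation is examples/spec-driven-swiggy-eta/src/eta.ts.

```ts
// Sketch only: field names are illustrative, not the repo's actual types.
export interface EtaInput {
  basePrepMinutes: number;   // restaurant's quoted prep time
  kitchenLoadFactor: number; // 1.0 = normal, 1.6 = slammed
  itemFactor: number;        // multiplier for order size
  distanceKm: number;
  riderSpeedKmh: number;
  trafficBufferMinutes: number;
}

export interface EtaResult {
  etaMinutes: number;
  breakdown: { prep: number; travel: number; handoff: number };
}

const HANDOFF_MINUTES = 2; // flat, frozen by the spec

export function estimateEta(input: EtaInput): EtaResult {
  // prep × load factor × item factor
  const prep = input.basePrepMinutes * input.kitchenLoadFactor * input.itemFactor;
  // travel + traffic buffer
  const travel = (input.distanceKm / input.riderSpeedKmh) * 60 + input.trafficBufferMinutes;
  const handoff = HANDOFF_MINUTES;
  const etaMinutes = Math.round(prep + travel + handoff);
  return { etaMinutes, breakdown: { prep, travel, handoff } };
}
```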
The eval at evals/eta.test.ts is a one-to-one translation of those acceptance criteria. The fixture set (fixtures/orders.json) is hand-built across four traffic profiles (quick combo, family meal, slammed kitchen, long ride), so the eval exercises real-world edge cases, not just the happy path.
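The translation is mechanical enough to sketch. Assuming a fixture shape of { input, expectedEtaMinutes } (an assumption; the real fixtures/orders.json defines its own fields), the eval reads roughly like this:

```ts
// Sketch of the acceptance criteria as a test; not the repo's actual eval file.
import { describe, it, expect } from "vitest";
import { estimateEta, type EtaInput } from "../src/eta";
import rawFixtures from "../fixtures/orders.json";

type Fixture = { input: EtaInput; expectedEtaMinutes: number };
const fixtures = rawFixtures as Fixture[];

describe("feature_eta_spec acceptance criteria", () => {
  const errors = fixtures.map(
    (f) => Math.abs(estimateEta(f.input).etaMinutes - f.expectedEtaMinutes)
  );

  it("≥ 9 of 10 fixtures land within ±3 minutes", () => {
    const withinTolerance = errors.filter((e) => e <= 3).length;
    expect(withinTolerance).toBeGreaterThanOrEqual(9);
  });

  it("median absolute error ≤ 2 minutes", () => {
    const sorted = [...errors].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    const median =
      sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    expect(median).toBeLessThanOrEqual(2);
  });
});
```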
The widget below is the same code path the eval runs against. The function imported into the React island is the function that the spec governs. There is no separate “demo formula”.
Live: Swiggy ETA
The pure function from examples/spec-driven-swiggy-eta/src/eta.ts, running here in your browser.
Computed by the spec's formula: Prep 12.3m, Travel 5.9m, Handoff 2m.
Drag the sliders. Notice that Slammed kitchen (load factor 1.6) more than doubles the prep time; that's the formula working. Notice that doubling distance with a constant rider speed produces a near-linear travel increase; that's also the formula. The handoff stays a flat two minutes because the spec says it does.
If a future contributor decides handoff should be three minutes? They have to update feature_eta_spec.md first. Then the test fails until the fixtures or the threshold are reconciled. Then we have a real conversation about whether three minutes is the right call. The spec is the forcing function.
Demo 2: YouTube-style comment moderation
The second example is the harder problem: the rules of the system live partly in an LLM. How do you spec that? How do you test it?
The answer in spec-driven-youtube-mod/ is to push as much determinism as possible into the spec, leaving the LLM with only the work it's actually good at: fuzzy classification.
Three things move out of the model and into the spec:
- The prompt. Frozen by feature_moderate_spec.md. Changing the prompt is a spec version bump, not a one-line code change. The prompt template lives in src/prompt.ts and is built by a pure function, easy to inspect, easy to diff in a PR.
- The output schema. A Verdict is { label, confidence, reason, rulesTriggered[] }, validated by Zod at the boundary. If the LLM returns 1.7 for confidence, the wrapper clamps it to 1. If it cites a rule that doesn't exist in the policy, the wrapper drops it. Hallucinations bounce off the schema; they don't propagate into the rest of the product. (See the sketch after this list.)
- The eval thresholds. eval_moderate_spec.md demands precision ≥ 0.8 and recall ≥ 0.7 against a 20-comment hand-labelled fixture set. The CI eval uses a deterministic mock LLM, intentionally ~85% accurate, so the thresholds matter and can't be gamed. Real-LLM evaluation happens manually, off the critical path, before the spec moves to implemented.
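A minimal sketch of that schema boundary, assuming Zod and a hypothetical list of policy rule ids (the real schema in the example's src/ is the source of truth):

```ts
import { z } from "zod";

// Hypothetical policy rule ids; the real policy defines its own.
const POLICY_RULE_IDS = ["harassment", "spam-links", "hate-speech"] as const;

export const VerdictSchema = z.object({
  label: z.enum(["toxic", "spam", "safe"]),
  // Clamp out-of-range confidence instead of letting 1.7 propagate downstream.
  confidence: z.number().transform((c) => Math.min(Math.max(c, 0), 1)),
  reason: z.string(),
  // Drop any cited rule that does not exist in the policy.
  rulesTriggered: z
    .array(z.string())
    .transform((rules) =>
      rules.filter((r) => (POLICY_RULE_IDS as readonly string[]).includes(r))
    ),
});

export type Verdict = z.infer<typeof VerdictSchema>;

// Usage at the boundary: VerdictSchema.parse(JSON.parse(llmResponseText))
```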
What's left for the LLM? The classification itself. Is this comment toxic, spam, or safe? That's the actual hard part; the rest is mechanism.
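Part of that mechanism is the eval gate itself. A sketch of the precision/recall check, assuming a moderate(comment, llm) wrapper, a deterministic mockLLM, and a { comment, expectedLabel } fixture shape (all three are assumptions about file and function names, not the repo's actual layout):

```ts
// Sketch of the eval_moderate_spec thresholds; not the repo's actual eval file.
import { it, expect } from "vitest";
import { moderate } from "../src/moderate";
import { mockLLM } from "./mockLLM";
import fixtures from "../fixtures/comments.json";

it("precision ≥ 0.8 and recall ≥ 0.7 on the 20-comment fixture set", async () => {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const { comment, expectedLabel } of fixtures) {
    const verdict = await moderate(comment, mockLLM);
    const predictedFlag = verdict.label !== "safe"; // toxic or spam counts as flagged
    const actualFlag = expectedLabel !== "safe";
    if (predictedFlag && actualFlag) truePositives++;
    else if (predictedFlag && !actualFlag) falsePositives++;
    else if (!predictedFlag && actualFlag) falseNegatives++;
  }

  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  expect(precision).toBeGreaterThanOrEqual(0.8);
  expect(recall).toBeGreaterThanOrEqual(0.7);
});
```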
The widget below runs against your active LLM. Pick a sample, click classify, watch a real verdict come back, labelled, with a reason, with the rules that triggered.
Live: YouTube Comment Moderation
Real call to your active LLM, governed by examples/spec-driven-youtube-mod/specs/feature_moderate_spec.md.
Active provider: read from your Ask AI settings.
This is what spec-driven SDLC for LLM-shaped code looks like in practice. The spec governs everything except the model’s judgement, and the model’s judgement is the only thing left to evaluate.
Try it yourself
The pillars are easy to nod at. The hard part is producing a spec for your idea on a Wednesday afternoon when your sprint review is in twenty minutes. So here’s a generator.
Tell it what you want to build. Tell it which kind of spec you need. It drafts the rest in this repo's house style: frontmatter, sections, Zod types, binary acceptance criteria, an explicit out-of-scope list, open questions. You take what's useful, you fill the <FILL> placeholders, and you walk into your sprint review with a spec instead of a story.
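A stripped-down sketch of that shape (the sections and frontmatter fields here are indicative, not the generator's literal output):

```md
---
title: <FILL>
status: draft            # draft → approved → implemented
owner: <FILL>
---

## Problem
<FILL: one paragraph of intent, not implementation>

## Behaviour
<FILL: the rules, frozen. Formulas, prompt templates, schemas.>

## Types
<FILL: Zod schemas for every boundary>

## Acceptance criteria (binary, machine-checkable)
- [ ] <FILL>
- [ ] <FILL>

## Out of scope
- <FILL>

## Open questions
- <FILL>
```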
It uses the LLM you’ve already configured for Ask AI on this site. No new keys. No new accounts.
Spec Generator
Describe a feature you want to build. The generator drafts a spec in this repo's house style — frontmatter, sections, Zod types, acceptance criteria. Uses your active LLM (Gemini / Anthropic / Local).
Copy the output. Drop it into your repo’s specs/ folder. Open a draft PR. Hand it to whichever AI coding agent you’re using this quarter. Ship the diff. Move the row.
That’s the loop.
Anti-patterns to watch for
- Vibes coding. “I’ll just describe it to Claude and iterate.” The first iteration looks fine. The third iteration is a mess of accumulated drift because nothing is the source of truth except the most recent prompt. Symptoms: PR descriptions that say “implements the thing we discussed in Slack.”
- Prompt as spec. Treating a single LLM prompt as if it were a spec. A spec has frontmatter, section structure, acceptance criteria, scope, and open questions. A prompt has none of that, and an LLM will happily write whatever fills the silence.
- AI as stenographer. Letting the agent dictate the spec back to you. The spec is the place where human judgement about scope, trade-offs, and acceptance lives. The AI implements; the human specifies.
- “We’ll write the spec after.” This means there is no spec. The post-hoc document is a description, not a contract.
- Acceptance criteria written as English. “Should be reasonably fast.” “Should look polished.” Every adjective is a future argument.
What’s next
A few things I’m working on, all of which will land as their own specs in specs/projects/ and rows on the roadmap:
- An eval harness that runs on every PR. Today, examples/*/evals/*.test.ts runs on pnpm test. The next step is to surface eval deltas in PR comments, so a regression in precision is as visible as a regression in latency.
- Spec Generator v2 with RAG over your own corpus. Today the generator is zero-shot with a single style anchor. v2 will retrieve from your existing specs/ directory so the generator's output reads like your specs, not generic ones.
- Auto-PR-from-spec. When a spec moves to approved in main, a PR opens automatically with the agent's first-draft implementation. Spec → PR, no human key-presses.
- Two more example slices. Swiggy-style restaurant search (with a live UI) and YouTube-style "Up Next" recommendations are queued up as examples/spec-driven-swiggy-search/ and examples/spec-driven-youtube-recs/. Both will follow the same spec-first pattern. Both will be linked here when they land.
If any of this resonates and you want to compare notes, or push back hard on something, find me at the Ask AI page or via email. I read everything.
Specs & docs from the repo
Rendered straight from demo.highlights. Each document is the source of truth in the repo — the snippets below stay in sync at build time.
Related reading
AI Strategy: From Feature to Platform
A capstone frame for PMs: AI as bolt-on feature, integrated capability, and platform infrastructure—roadmaps that compound, the data flywheel, org readiness, and what to prepare for next.
Building AI Features: What PMs Need to Know
How AI delivery differs from classic software—and how PMs define 'done' when the system is never perfect.
Data as a Product Requirement
How to treat data readiness as a product decision—not an engineering side quest.
Designing AI User Experiences
How AI reshapes UX: progressive disclosure, expectation-setting, failure design, trust calibration, human-in-the-loop—and why making confidence legible is the core product skill.