
Verification as the New Bottleneck

Published: May 23, 2026 | Filed: Concept | Domain: AI Engineering | Tags: Agent Engineering, AI Coding Workflow, AI Native Org | Reading: 5 min | Source: AI-synthesised

Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax; PR-cycle-time funnel analysis


Sources

Running an AI-native engineering org

Summary

Fiona Fung's central claim, drawn from running engineering for Claude Code and Cowork: for years, engineering bandwidth was the expensive resource, and planning, reviews, and process all existed to protect it. Once agentic coding made coding cheap, the bottleneck moved to verification, review, and maintenance. "On the Claude Code team, coding is really not the slow part anymore." The new scarce resource is confidence that a change is correct, and it gets scarcer precisely because bandwidth (and therefore throughput) exploded.

Why verification is now the constraint

Three forces converge:

Volume. Bandwidth increased so much that "we have to pay even more attention to: is it correct."
Blurring roles. More people (designers, managers, PMs) now check in changes, so everyone needs confidence their change is correct.
Maintenance cost. Higher throughput means more to maintain — the cost of maintenance becomes a first-class concern, not an afterthought.

This is the org-level mirror of Karpathy's The Verifiability Thesis ("LLMs automate what you can verify") and the demand side of Harness Shrinkage as Models Improve (prompt scaffolding shrinks; mechanical verification stays load-bearing).

TDD loses its tax

A vivid sign of the shift is TDD, which used to feel like "eating broccoli": write the failing test first, verify that it fails, then fix the code. With Claude, Fung found it "so much more fun and pleasurable… it took the tax out of test-driven development." The economics flipped: when writing the test is nearly free, the discipline that grounds verification (a test that provably fails, then passes) is pure upside. (Cf. the TDD red-green-refactor discipline; the failing-test-first step is the verifier.)
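
The red-then-green step described above can be sketched in a few lines of Python. This is only an illustration, not code from the talk: `slugify`, `slugify_stub`, and `passes` are hypothetical names invented here. The point it shows is that the same test must first fail against the unimplemented stub before it counts as evidence when it later passes.

```python
def slugify_stub(title: str) -> str:
    """Red step: a placeholder that does not implement the behaviour yet."""
    return title


def slugify(title: str) -> str:
    """Green step: lowercase the words and join them with hyphens."""
    return "-".join(title.lower().split())


def passes(func) -> bool:
    """Run one expectation against `func` and report whether it held."""
    try:
        assert func("Shift Left Now") == "shift-left-now"
        return True
    except AssertionError:
        return False


# The test has to fail first; otherwise a later pass proves nothing.
red = passes(slugify_stub)
# Only after the red run does a pass against the real code mean something.
green = passes(slugify)
```

Keeping the stub and the real function side by side is purely for demonstration; in a normal workflow the stub is simply the state of the code before the change.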

Shift left

Her recurring phrase: shift left — catch problems closer to the source via automation, not after a customer hits them. "What's better than me running into the bug first? Having automation in place to catch it closer to the source." As throughput rises, the only way verification keeps up is by being automated and early rather than manual and late.

Who reviews — and the human-in-the-loop line

Before the team shipped Claude Code's own code-review feature, "how do you keep up with code reviews?" was the question she was asked most. The answer: Claude Code review handles style, lint, obvious bugs, and spec drift (if you check the spec into the codebase, "Claude is very good about verifying against spec drift"). But humans stay in the loop where it matters: legal review, risk tolerance, trust boundaries, or in her words "trust but verify, and where humans bring needed expertise." The division of labor: automate the mechanical verification, and reserve human judgment for risk and trust-boundary calls. (Cf. Deep Modules for Agents: reviewer in a fresh context.)
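
One way to picture that division of labor is a routing rule: every change gets the automated pass, and changes touching sensitive paths also get a human. The sketch below is an assumption-laden illustration, not how Claude Code review works; the path patterns and the function names `needs_human_review` and `route_review` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical trust-boundary patterns; a real team would define its own list.
HUMAN_REVIEW_PATTERNS = ["auth/*", "billing/*", "legal/*"]


def needs_human_review(changed_paths: list[str]) -> bool:
    """Return True if any changed file falls under a trust-boundary pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in HUMAN_REVIEW_PATTERNS
    )


def route_review(changed_paths: list[str]) -> list[str]:
    """Always run the automated review; add a human for sensitive paths."""
    reviewers = ["automated"]
    if needs_human_review(changed_paths):
        reviewers.append("human")
    return reviewers
```

Path matching is a deliberately crude proxy. Risk tolerance and legal exposure rarely map cleanly onto directories, so a rule like this would only be a first filter.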

Measuring the shift (and a trap)

Signals she watches: onboarding ramp-up time ↓, PR cycle time ↓, Claude-assisted commits ↑ ("I haven't seen a commit that wasn't Claude-assisted in months"). The trap: don't read end-to-end PR cycle time alone; break it into funnel stages. If cycle time isn't dropping, the cause may not be low AI adoption; it could be CI/build systems jamming under the new throughput. And throughput isn't the goal: "find some way to measure whatever you're actually trying to solve," not just velocity.
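
The funnel breakdown can be sketched as follows. Everything here is illustrative: the stage names, the sample timestamps, and the helper functions are assumptions made up for this example, and real data would come from your own version-control and CI systems. The idea is to compute the median time spent in each stage rather than one end-to-end number, so a jammed stage becomes visible.

```python
from datetime import datetime
from statistics import median

# Hypothetical stage names; adapt them to the events your tooling records.
STAGES = ["opened", "ci_done", "first_review", "approved", "merged"]

# Made-up sample data for three pull requests (ISO 8601 timestamps).
prs = [
    {"opened": "2026-05-01T09:00", "ci_done": "2026-05-01T11:00",
     "first_review": "2026-05-01T11:30", "approved": "2026-05-01T12:00",
     "merged": "2026-05-01T12:10"},
    {"opened": "2026-05-02T10:00", "ci_done": "2026-05-02T13:00",
     "first_review": "2026-05-02T13:15", "approved": "2026-05-02T14:15",
     "merged": "2026-05-02T14:30"},
    {"opened": "2026-05-03T14:00", "ci_done": "2026-05-03T16:30",
     "first_review": "2026-05-03T17:00", "approved": "2026-05-03T17:20",
     "merged": "2026-05-03T17:30"},
]


def stage_durations_hours(pr: dict[str, str]) -> dict[str, float]:
    """Hours spent between each pair of consecutive stages for one PR."""
    times = [datetime.fromisoformat(pr[stage]) for stage in STAGES]
    return {
        f"{STAGES[i]}->{STAGES[i + 1]}": (times[i + 1] - times[i]).total_seconds() / 3600
        for i in range(len(STAGES) - 1)
    }


def median_funnel(pull_requests: list[dict[str, str]]) -> dict[str, float]:
    """Median hours per stage across all PRs."""
    per_pr = [stage_durations_hours(pr) for pr in pull_requests]
    return {stage: median(d[stage] for d in per_pr) for stage in per_pr[0]}


def slowest_stage(pull_requests: list[dict[str, str]]) -> str:
    """Name of the stage with the largest median duration."""
    funnel = median_funnel(pull_requests)
    return max(funnel, key=funnel.get)
```

In this sample, `slowest_stage(prs)` points at `opened->ci_done`, the pattern the passage warns about: review is fast, but CI is where the time goes.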

Connections

Fiona Fung — author of the thesis
The Verifiability Thesis — Karpathy's "automate what you can verify" is the model-level cause; this is the org-level consequence
Harness Shrinkage as Models Improve — the synthesis it confirms: scaffolding shrinks, mechanical verification doesn't
Evals as Product Spec — Cat Wu's evals are verification encoded as product spec; the PM-side companion
Code as Source of Truth — checking the spec into the repo is what lets Claude verify spec drift
Building Is Cheap, Arguing Is Expensive — the upstream half: generation is cheap, so verification (and judgment) is where cost concentrates
Claude Code Auto Mode — the auto-approve classifier is verification automation at the permission layer
Deep Modules for Agents — reviewer-in-fresh-context is the verification-quality move at the code-review layer
AI Brain Fry — the risk if verification stays manual: oversight fatigue increases errors as volume grows
AI-Driven Formal Proof Search — the extreme case: a compiler as the verifier, so the bottleneck is fully mechanized
Recursive Self-Improvement — Amdahl's law for orgs: as generation accelerates, human code review became Anthropic's new bottleneck — this thesis at the scale of AI building AI
AI Accelerating AI Development — the corroborating data: an automated Claude reviewer would have caught ~1/3 of the bugs behind past production incidents before merge
Research Taste as the Human Bottleneck — when humans can't review/judge as fast as Claude generates, judgment becomes the binding constraint — verification's higher-altitude form
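
The Amdahl's-law framing in the Recursive Self-Improvement entry above can be made concrete with a small calculation. The 70/30 split between coding and review time is an invented assumption for illustration, not a figure from the source; `overall_speedup` is a hypothetical helper implementing the standard formula.

```python
def overall_speedup(accelerated_fraction: float, factor: float) -> float:
    """Amdahl's law: total speedup when only part of the work gets faster.

    accelerated_fraction: share of total time that benefits (0 to 1).
    factor: how many times faster that share becomes.
    """
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Assumed split: 70% of lead time is coding, 30% is review and verification.
coding_share = 0.7

tenfold = overall_speedup(coding_share, 10)       # coding gets 10x faster
near_free = overall_speedup(coding_share, 1e9)    # coding becomes almost free
ceiling = 1.0 / (1.0 - coding_share)              # limit set by the untouched 30%
```

Under these assumed numbers, a tenfold coding speedup yields roughly 2.7x overall, and even near-free coding cannot exceed about 3.3x. The unaccelerated review share sets the ceiling, which is the bottleneck shift in arithmetic form.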

Derived

When Does Verification Quality Determine Whether AI Automation Works? — generalizes this bottleneck into a verification-quality ladder: Lean/formal proof, software CI, vulnerability reproduction, and noisy judgment tasks

Open Questions

Fung's own open question: "How far do you push fully automated reviews?" — where's the speed/safety balance, and how do you keep humans confident without re-introducing the review bottleneck?
If CI/build is the hidden jam, does verification infrastructure (test runners, CI capacity) become the actual capex of an AI-native org?

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 26

Agentic Honesty & Diligence
As models get more capable, failing to surface decision-relevant information shifts from a capability failure to an ali…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
AI Brain Fry
Kropp et al. 2026/03: mental fatigue from excessive AI oversight increases minor errors +11%, major errors +39%; cognit…
AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
AI-Native Product Org Bottlenecks
AI-native product-org bottleneck is accountable taste at speed: dogfooding trains taste, evals encode it, and accountab…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Building Is Cheap, Arguing Is Expensive
"In technical debate, code wins": generate three PRs vs whiteboard; prototype over design doc; reduce design docs
Cat Wu
Head of Product for Claude Code and Cowork at Anthropic; primary articulator of AI-native product cadence and engineer-…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Claude Code Auto Mode
Claude Code permission mode using a classifier to auto-approve safe tool calls and block risky ones; middle ground betw…
Code as Source of Truth
Docs go stale at high coding throughput; check specs/skills into the repo; onboard via Claude; spec-drift verification
Deep Modules for Agents
Ousterhout deep-vs-shallow modules applied to agent-friendly codebases; push-vs-pull instruction delivery; reviewer in…
Dogfooding as Product Discipline
Product sense is built by relentless first-hand use ("ant food"); Mr. Peanut catch; cross-source (Cat Wu vibe-checks, G…
Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
Fiona Fung
Leads engineering + product for Claude Code and Cowork at Anthropic (ex-Meta/Microsoft); "what served you prior may no…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Human-in-the-Loop Boundaries
Humans belong at allocation, understanding, design-concept, risk, and accountability boundaries; they slow the system d…
AI Engineering & Agent Tooling
Map of Content for the ai-engineering domain — 36 concepts. Curated entry point; see Home for all domains.
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
The PRD-Replacement Spectrum at AI-Native Speed
Four positions (grill-then-PRD → lighter-PRD → build-to-decide → prototype-is-spec) are one spectrum once you decompose…
Product Velocity as Moat
Shipping speed as differentiator + trust signal ("you'll scale with us"); a treadmill that must convert into durable lo…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
The Verifiability Thesis
LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…
When Does Verification Quality Determine Whether AI Automation Works?
Verification-quality ladder from Lean/formal proof search through software CI and vulnerability reproduction; autonomy…
Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…

Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Fiona Fung
Leads engineering + product for Claude Code and Cowork at Anthropic (ex-Meta/Microsoft); "what served you prior may no…
Agent Loop Pattern
`/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…
Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
