
Harness Shrinkage as Models Improve

Published: May 6, 2026 | Filed: Concept | Domain: AI Engineering | Tags: LLM Architecture, Agent Engineering, Harness | Reading: 12 min | Source: AI-synthesised

Prompt scaffolding shrinks with each model release; Cat Wu's pruning discipline; Boris Cherny's "100 lines of code a year from now" claim; mechanical verification stays load-bearing



Summary#

The harness — prompts, skills, scaffolding, mechanical verification — exists to compensate for what the underlying model cannot yet do. As models improve, the harness needs to shrink, not grow. Boris Cherny explicitly predicts Claude Code "may be 100 lines of code a year from now." Cat Wu reports the team reads the entire system prompt with every model launch and removes anything the new model handles natively. The principle works in two directions: capabilities the harness used to inject move into the model, and crutches the harness used to provide become drag.

The to-do list as canonical example#

Cat Wu's case study:

Early Claude Code: asked to refactor 20 call sites, the model would change 5 and stop. The team added an explicit to-do list tool ("Sid on our team was like, what would a human do? Make a list, go through one by one"). With the tool prompted aggressively, the model finished all 20.
Opus 4 onward: model uses the to-do list spontaneously, no aggressive prompting needed.
Today: to-do list is "deemphasized" — model may or may not use it, doesn't need to be reminded, mostly kept around for user-facing visibility.

The crutch (the prompt section forcing to-do list use) was removed; the tool stayed for a different reason (UI value).
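
To make the to-do list pattern concrete, here is a minimal Python sketch of what such a tool might look like on the harness side. The class and method names (`TodoList`, `add`, `complete`, `remaining`) are hypothetical and are not Claude Code's actual tool interface; the sketch only illustrates how an explicit list turns "did I finish?" into a mechanical check.

```python
from dataclasses import dataclass, field


@dataclass
class TodoList:
    """Hypothetical harness-side to-do tool; not Claude Code's real API."""

    items: dict[str, bool] = field(default_factory=dict)  # task -> done?

    def add(self, task: str) -> None:
        # Register a task as pending unless it is already tracked.
        self.items.setdefault(task, False)

    def complete(self, task: str) -> None:
        if task not in self.items:
            raise KeyError(f"unknown task: {task}")
        self.items[task] = True

    def remaining(self) -> list[str]:
        # Tasks not yet marked done, in insertion order.
        return [task for task, done in self.items.items() if not done]


# Illustrative run: 20 call sites, the agent stops after 5.
todo = TodoList()
for i in range(20):
    todo.add(f"refactor call site {i}")
for i in range(5):
    todo.complete(f"refactor call site {i}")

# A harness could refuse to end the turn while this list is non-empty.
print(len(todo.remaining()))
```

With a list like this, the early failure mode (5 of 20 done) is visible as 15 remaining items rather than something the model has to notice on its own.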

The Boris claim: 100 lines#

"I think Claude Code itself may be 100 lines of code a year from now."

Read literally this is hyperbole, but the direction is real:

Anthropic now uses the same models internally that ship externally, so internal harness lessons transfer
Each model release lets the team delete prompt sections, shrink fallback logic, remove safety wrappers (per Cat Wu: "all the safety mechanisms today — prompt injection, static verification of commands, permission modes, human in the loop — will be less important because the model will just do the right thing")
The product surface stops being "what the harness does" and becomes "where the model decides to do it" (CLI, mobile, web, IDE, all sharing the same model logic)

The flip side: capabilities migrate inward#

Boris reports Opus 4.7 spontaneously starts loops:

"I'll tell it 'pull this data query.' It says 'I noticed the data is changing — I'll start a loop and report every 30 minutes.'"

The /loop primitive (see Agent Loop Pattern) was introduced as a harness feature; in 4.7 it is becoming model-native behavior. The harness primitive doesn't go away — but the user no longer needs to invoke it.

This generalizes: anything the harness teaches the model how to do via a prompt section is a candidate for migration into the next model's training data.

The cleanest demonstration: Fable 5 plays Pokémon with no harness#

The June 2026 Fable 5 launch supplies the most legible version of the whole thesis. Earlier Claude models "struggled to play Pokémon FireRed even with harnesses that gave them additional helpful tools" — maps, navigation aids, game-state readouts. Fable 5 beat FireRed with a minimal, vision-only harness: raw game screenshots, nothing else. The scaffolding that compensated for weak spatial/visual reasoning didn't get improved — it got deleted, because the capability moved into the model. The same pattern shows up in Fable's memory results: file-based persistent memory improved Fable's Slay the Spire play 3× more than it improved Opus 4.8's — the model got better at using the harness affordance, so less hand-holding around it is needed. Vision and long-horizon memory are exactly the axes where 2025-era agents needed the most scaffolding; they are now among the first to dissolve.

The wrong direction: harness bloat#

The opposite failure mode is worse than no harness — it actively degrades the model:

Cat Wu: "What models are capable of in [a one-month] timeline" is the hardest forecast for PMs; over-specifying the harness for an old model wastes tokens that the new model uses better unsupervised.
Matt Pocock: 250K-token system prompts push the model into the dumb zone before it does anything (see Context Window Smart Zone).
Repeated capability injections drift toward contradiction: rule X for case A, rule Y for case B, until the model can't tell which applies.
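
A rough way to watch for this kind of bloat is a size report over the prompt's sections. The sketch below is illustrative only: it assumes the prompt is stored as named sections, uses a crude 4-characters-per-token estimate instead of a real tokenizer, and the names `estimate_tokens` and `over_budget` are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough size estimate; assumes about 4 characters per token."""
    return len(text) // 4


def over_budget(sections: dict[str, str], budget_tokens: int) -> list[str]:
    """Return section names, largest first, if the prompt exceeds the budget.

    Returns an empty list when the whole prompt fits.
    """
    sizes = {name: estimate_tokens(text) for name, text in sections.items()}
    if sum(sizes.values()) <= budget_tokens:
        return []
    return sorted(sizes, key=sizes.get, reverse=True)


# Placeholder content: only the lengths matter for this illustration.
sections = {
    "core_rules": "x" * 4_000,          # about 1,000 estimated tokens
    "legacy_reminders": "x" * 40_000,   # about 10,000 estimated tokens
}
print(over_budget(sections, budget_tokens=5_000))
```

The largest sections come first because they are the cheapest place to start a review, not because size alone shows a section is unnecessary.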

Process: read the system prompt at every launch#

Cat Wu's discipline:

"We read through the entire system prompt and we reflect on, okay, for each of these sections, does the model really need this reminder anymore? And if not, we'll remove it."

This runs against the usual habit: most teams only ever add to a prompt and never subtract. Doing it on a cadence aligned to model launches is what keeps the harness from accreting.
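
One way to support this review mechanically is a section-by-section ablation: drop one prompt section, re-run an eval suite against the new model, and flag the section as a removal candidate if the score does not fall. The sketch below assumes a hypothetical `run_evals` callable that maps prompt text to a pass rate, and a prompt stored as named sections. Neither comes from the source, which describes a human read-through rather than a script.

```python
from typing import Callable


def prunable_sections(
    sections: dict[str, str],
    run_evals: Callable[[str], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return names of sections whose removal does not lower the eval score.

    `run_evals` is a hypothetical callable: prompt text -> pass rate in [0, 1].
    Each section is ablated on its own, so interactions between sections
    are not tested.
    """
    baseline = run_evals("\n\n".join(sections.values()))
    candidates = []
    for name in sections:
        ablated = "\n\n".join(
            text for key, text in sections.items() if key != name
        )
        if run_evals(ablated) >= baseline - tolerance:
            candidates.append(name)
    return candidates


# Toy stand-in for a real eval suite: this "model" only needs the git rule.
def fake_evals(prompt: str) -> float:
    return 1.0 if "never force-push" in prompt else 0.6


prompt_sections = {
    "todo_reminder": "Always keep a to-do list for multi-step work.",
    "git_safety": "never force-push to shared branches.",
}
print(prunable_sections(prompt_sections, fake_evals))
```

A list of candidates is a prompt for human judgment, not a deletion order: an eval suite only covers the behaviors someone thought to test.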

Build for the next model, not this one#

Counterintuitive corollary from Boris:

"We were trying to build this thing that was like pre-PMF, and we knew that it wouldn't have PMF for 6 months because we were building for the next model."

Most products are built for the model they're released against. Anthropic builds Claude Code for the model six months out — accepting it doesn't quite work today, with the bet that the next release closes the gap. This shifts what "harness work" means: not "make the current model usable" but "build a product surface that will work when the model arrives."

Cat Wu's variant: "It's pretty important to build products that don't necessarily work yet so that you know what is missing for this product to work, and then with the newest model you can just swap it in."

Dan Carey gives the cleanest retrospective case: Claude Design's early-prototype gaps were closed not by clever engineering but by Opus 4.7 shipping ("the model releases are a tide that lifts all boats"). For a dedicated treatment, including how to calibrate a bet on the next model against the AGI strawman, see Build for the Next Model.

Counterpoint: harness still matters#

Not every voice agrees. Matt Pocock argues the harness — feedback loops, deep modules, mechanical verification — is the ceiling:

"If your code base doesn't have feedback loops, you're never ever ever going to get decent output out of AI. The quality of your feedback loops influences how good your AI can code, essentially. That is the ceiling."

The synthesis: prompt scaffolding shrinks as models improve; mechanical verification remains essential. Tests, types, linters, isolated review contexts — these are infrastructure that the harness provides and that doesn't migrate into the model the way capabilities do.
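
As a minimal sketch of what mechanical verification can mean in practice, the gate below runs a fixed set of checks and accepts agent output only if every one exits cleanly. The commands are placeholders (Python one-liners standing in for a test runner, type checker or linter), and `run_gate` is a hypothetical name, not something taken from the cited talks.

```python
import subprocess
import sys


def run_gate(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each command; a check passes when its exit code is 0."""
    results = {}
    for name, cmd in checks.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results


# Placeholder commands; in a real repository these would be the project's
# own test, type-check and lint commands.
checks = {
    "tests": [sys.executable, "-c", "assert 1 + 1 == 2"],
    "lint": [sys.executable, "-c", "import sys; sys.exit(1)"],
}
results = run_gate(checks)
accepted = all(results.values())
print(results, accepted)
```

The gate depends only on exit codes, so it makes no assumptions about which model produced the change, which is one reason this layer does not shrink when the model improves.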

Connections#

Boris Cherny — "100 lines" claim and the spontaneous-loop observation
Claude Fable 5 — the cleanest demonstration: vision-only Pokémon harness and 3×-better memory utilization vs Opus 4.8
Cat Wu — the operational discipline of pruning prompts at every launch
Matt Pocock — counterpoint that mechanical verification stays load-bearing
Agent Loop Pattern — example of a primitive migrating from harness to model
Context Window Smart Zone — why prompt bloat is a cost, not just bloat
Claude Character as Product — character is the rare harness asset that probably doesn't shrink
Agent Harness Engineering — generalizes the "enforce invariants, not implementations" principle to harness-vs-model division of labor
Claude Code Auto Mode — a harness feature whose necessity Cat Wu predicts will fade
AI Brain Fry — partially mitigated by harness shrinkage (less to oversee), reintroduced by output volume from loops
Human-AI Accountability Redesign — what doesn't shrink is the human at the boundary; this paper names what that boundary work becomes (oversight quality, decision rights, escalation, consequences)
Model Spec Midtraining (MSM) — alignment moves from harness-prompt-injection of values to model-internalized values; the alignment side of harness shrinkage
Interaction Models — the same move on the interaction axis: VAD / turn-detection / dialog-management harnesses dissolve into the model (Thinking Machines Lab, May 2026)
The Bitter Lesson — the underlying principle: hand-crafted scaffolding gets outpaced by scaled general capability
Build for the Next Model — the product-strategy corollary spun out as its own page: prototype "the thing that almost works" and let the next release close the gap (Dan Carey / Claude Design / Opus 4.7)
HTML as the New Markdown — the crucial distinction: this page describes the model-facing harness shrinking, while Thariq Shihipar's HTML artifacts (plans, micro-apps) are human-facing harness that grows as models improve (the binding constraint moves from "can the model do it" to "can the human stay in the loop")
Compute Allocator — names the human role that expands as the model-facing harness shrinks; ~99% of tokens go to human-facing scaffolding
Founder as Agent Orchestrator — orchestration affordances themselves will shift as harness shrinks; founders building permanent workflows around 2026 Claude-surface affordances should expect rewrites
Agentic Technical Debt — CLAUDE.md as architectural context is one form of harness; may eventually be inferred by the model, but currently load-bearing
Compounding Data Moat — vertical-edge-case test suites are a form of harness that doesn't migrate inward (no generic training signal for niche industry edge cases)
AI-Native Startup Lifecycle — founders building permanent workflows around 2026 Claude-surface affordances should expect them to shift as harness shrinks
Zero-Friction Scope Creep — written-scope discipline is human-process work that does not migrate inward as harness shrinks
MCP and Computer Use — complementary to harness shrinkage: connectors don't shrink, they broaden as the model decides which substrate (MCP / API / computer use) to use for each task
Evals as Product Spec — what doesn't shrink on the PM side: evals are durable artifacts that re-validate the product as the harness around them dissolves
Agentic Loops Overtake Bespoke Systems — the same dynamic in formal mathematics: DeepMind's bespoke proof-search scaffolding (AlphaProof + evolution) converted from capability-enabling to merely cost-saving as the LLM improved
Verification as the New Bottleneck — Fiona Fung's org-level corollary: as the generation harness shrinks, verification becomes the binding constraint
Recursive Self-Improvement — harness shrinkage run to its endpoint: the harness dissolving into the model is the same trend that, applied to AI development itself, closes the self-improvement loop
AI Accelerating AI Development — the measured deployment-side story: as capability migrates inward, internal engineering throughput rises (~8× code/engineer; >80% Claude-authored)
Research Taste as the Human Bottleneck — the human-side mirror: what's left after the model-facing harness shrinks is taste, review, and direction-setting
Vibe Coding vs. Agentic Engineering — Karpathy's ">10x and widening" leverage curve is the practitioner-facing form of shrinking harness / growing capability

Open questions#

Does all prompt scaffolding eventually migrate into the model, or does some remain — e.g. organization-specific style, security rules, brand voice?
The Boris "100 lines" prediction is a year out from May 2026 — testable in 2027.
If harness work shrinks, what new work expands to fill it? Cat Wu's bet: PM/product taste, eval-writing, character work.

Derived#

Learning to Co-Work with AI: A Software Engineer's Field Guide — pruning-at-every-launch framed as a daily practice; "build for next model" as career-strategic horizon
Opinions on Using AI Tools & the Future of the Software Engineering Role — the harness-shrinks vs harness-is-the-ceiling tension is one axis of the four-stance debate map
Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling? — the model-facing/human-facing asymmetry taken to its conclusion: the human-facing harness can't shrink to zero and faces more bloat pressure as models improve
Where Does Agent Harness Work Remain Durable as Models Improve? — separates shrinking capability scaffolding from durable boundary work: verification, repo-local truth, context budgeting, isolation, tools, and human decision surfaces

Sources#

Anthropic's Boris Cherny: Why Coding Is Solved, and What Comes Next
How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code)
Full Walkthrough: Workflow for AI Coding — Matt Pocock (counterpoint)
Claude Fable 5 and Claude Mythos 5 — vision-only Pokémon FireRed harness; memory-utilization gains


About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 62

Agent Context Files
The cross-vendor markdown-as-control-plane pattern: repo-versioned plaintext (CLAUDE.md / AGENTS.md / SOUL.md / WORKFLO…
Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
Agent Loop Pattern
`/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…
Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
Agentic Technical Debt
Debt that *compounds* (not just accumulates) because each agentic-coding session re-derives architectural decisions wit…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
AI Brain Fry
Kropp et al. 2026/03: mental fatigue from excessive AI oversight increases minor errors +11%, major errors +39%; cognit…
AI-Native Moats Under Frontier-Model Improvement
Frontier-model improvement stress-tests AI-native moats: product velocity and wedges must compound into behavioral data…
AI Native Product Cadence
Cat Wu's 6mo→1mo→1day cadence at Anthropic: research-preview branding, mission-as-tiebreaker, evergreen launch room, li…
AI-Native Startup Lifecycle
Anthropic's May 2026 reframing of Idea/MVP/Launch/Scale assuming AI infrastructure: each stage's headcount/capital/skil…
AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
Opinions on Using AI Tools & the Future of the Software Engineering Role
Debate map of four stances on using AI tools (bullish-insider / pragmatist-practitioner / skeptic-governance / architec…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Boris Cherny
Creator of Claude Code at Anthropic; phone-driven workflow with hundreds of agents; primary advocate of `/loop` primiti…
Build for the Next Model
Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…
Campfire
AI-native ERP (YC S23) pulling customers off NetSuite; custom foundation model + agent platform; Series B (Accel/Ribbit…
Cat Wu
Head of Product for Claude Code and Cowork at Anthropic; primary articulator of AI-native product cadence and engineer-…
Claude Character as Product
Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Claude Code Auto Mode
Claude Code permission mode using a classifier to auto-approve safe tool calls and block risky ones; middle ground betw…
Claude Code Best Practices
Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→cod…
Claude Fable 5
Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the s…
Claude Opus 4.7
GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…
Compounding Data Moat
Anthropic's prescription for Scale-stage defensibility: time-locked behavioral fingerprint + domain-encoded edge cases…
Compounding Loop Optimization
Dan Carey's discipline of instrumenting and automating every recurring step of the build loop — because when internal t…
Compute Allocator
The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding…
Context Window Smart Zone
Smart zone vs dumb zone (Dex Hardy / Matt Pocock): quadratic attention scaling, ~100K marker independent of advertised…
Disposable Micro-Apps
Throwaway custom UIs built per-task to edit a plan ("micro-software on top of micro-software"); copy-back-to-markdown;…
Where Does Agent Harness Work Remain Durable as Models Improve?
Durable harness work lives at external-reality boundaries: repo-local source of truth, mechanical verification, context…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
Fiona Fung
Leads engineering + product for Claude Code and Cowork at Anthropic (ex-Meta/Microsoft); "what served you prior may no…
Founder as Agent Orchestrator
Founder role shift: less individual contributor, more orchestrator of specialized AI assistants; non-technical founders…
HTML as the New Markdown
Thariq Shihipar's thesis: as models improve, thousand-line markdown plans overwhelm the *human*; HTML artifacts (visual…
Human-AI Accountability Redesign
HBR five-pillar prescription: span-of-control redesign, role redesign, performance management reset, decision-rights/es…
Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling?
Yes — HTML raises and reshapes the human-attention ceiling but can't remove it; bloat relocates from document-length to…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Learning to Co-Work with AI: A Software Engineer's Field Guide
Field guide for software engineers in the AI era: 6 skill clusters (taste, harness, alignment-first planning, agent-fri…
Managers as ICs
Every Claude Code manager starts as an IC; flat org; agentic coding collapsed the onboarding cost that pushed managers…
Matt Pocock
Independent AI-coding educator; built Sandcastle library; smart-zone/grill-me/tracer-bullets pedagogical framing; "bad…
MCP and Computer Use
Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slac…
AI Engineering & Agent Tooling
Map of Content for the ai-engineering domain — 36 concepts. Curated entry point; see Home for all domains.
Model Introspection Feedback
Cat Wu's underrated technique: ask the model why it failed; treat answer as harness-debugging signal not model criticis…
Model Spec Midtraining (MSM)
New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Orchestration vs Employee Framing: Reconciling the Founder's Playbook with HBR's Accountability Evidence
Reconciles the Founder's Playbook orchestration framings with HBR Kropp et al.'s accountability evidence; "orchestratio…
The PRD-Replacement Spectrum at AI-Native Speed
Four positions (grill-then-PRD → lighter-PRD → build-to-decide → prototype-is-spec) are one spectrum once you decompose…
Printing Press Software Democratization
Boris Cherny's analogy: 1400s literacy expansion → AI software-writing expansion; domain knowledge displaces coding ski…
Product Velocity as Moat
Shipping speed as differentiator + trust signal ("you'll scale with us"); a treadmill that must convert into durable lo…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
Seven Powers Applied to AI
Helmer/Acquired framework re-evaluated for AI: switching costs and process power erode; network effects, scale, cornere…
Thariq Shihipar
Engineer on the Claude Code team at Anthropic; "HTML is the new markdown" and "compute allocator" framings; three HTML-…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Thinking Machines Lab
AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…
Turn-Based Interface Bottleneck
Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…
Zero-Friction Scope Creep
MVP failure mode when agentic coding removes the cost-based forcing function against scope creep; antidote is written s…
