The Bitter Lesson

Published: May 13, 2026 · Filed: Concept · Domain: LLM Architecture · Tags: LLM Architecture, Principle · Reading: 5 min · Source: AI-synthesised

Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolving harnesses into models; caveat — mechanical verification and character may not migrate inward

Summary

Rich Sutton's 2019 essay argues that general methods that leverage computation (search and learning) ultimately outperform methods that build in human knowledge and hand-engineered structure, and that they do so by a wide margin as compute grows. The "bitter" part is that this keeps surprising researchers who invested in clever domain structure: the structure becomes a ceiling, not a foundation.

This page exists because the principle recurs as a load-bearing argument across this wiki — invoked explicitly to justify dissolving harnesses into models.

Where it's invoked here

Interaction Models: Thinking Machines Lab (TML) cites "the bitter lesson" directly. Hand-crafted interactivity systems (voice activity detection (VAD), turn-detection, and dialog-management harnesses) "will be outpaced by the advance of general capabilities," therefore "for interactivity to scale with intelligence, it must be part of the model itself." See Turn-Based Interface Bottleneck.
Encoder-Free Early Fusion: co-train all modality components from scratch in one transformer rather than stitching together pretrained encoders and decoders, which leaves fewer hand-engineered modular boundaries.
Time-Aligned Micro-Turns: remove artificial turn boundaries so that interaction modes become scalable model behavior rather than per-mode harness code.
Harness Shrinkage as Models Improve: the same logic applied to coding-agent harnesses. Prompt scaffolding compensates for what the model can't yet do and should shrink as models improve. (The caveat there: mechanical verification, meaning tests, types, and linters, is the part that doesn't migrate inward.)
Agent Harness Engineering: "enforce invariants, not implementations." Let the model find the path; the harness encodes only what must be true.
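
The "invariants, not implementations" rule in the last item can be made concrete with a small sketch. It assumes a harness that receives a finished candidate (here, a string of generated text) and only checks properties that must hold, without prescribing how the model produced it. The names `Invariant`, `check_invariants`, and the three sample invariants are hypothetical and are not taken from the Agent Harness Engineering page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Invariant:
    """A named property that any accepted candidate must satisfy."""
    name: str
    holds: Callable[[str], bool]


def check_invariants(candidate: str, invariants: List[Invariant]) -> List[str]:
    """Return the names of violated invariants; an empty list means accepted."""
    return [inv.name for inv in invariants if not inv.holds(candidate)]


# Illustrative invariants only (assumptions, not from the source).
# Note that none of them says *how* the candidate should be produced.
INVARIANTS = [
    Invariant("non_empty", lambda text: bool(text.strip())),
    Invariant("no_todo_markers", lambda text: "TODO" not in text),
    Invariant("ends_with_newline", lambda text: text.endswith("\n")),
]

if __name__ == "__main__":
    good = "def f():\n    return 1\n"
    bad = "x = 1  # TODO"
    print(check_invariants(good, INVARIANTS))  # no violations
    print(check_invariants(bad, INVARIANTS))   # two violations
```

The harness in this sketch has no step list, planner, or prompt template. If a stronger model finds a different route to a valid candidate, nothing in the checks needs to change, which is the property the rule is after.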

The standard caveat

The bitter lesson says that capabilities and structure migrate into the model; it does not say that harnesses are useless. Some things legitimately stay outside the model: mechanical verification (the synthesis reached in Harness Shrinkage as Models Improve), organization-specific policy and style, security boundaries, and, per Claude Character as Product, deliberate character and personality work. The open question for every harness component is which side of that line it falls on.
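
One way to picture "which side of the line" is to tag each harness component with a kind and sort it accordingly. The sketch below is a simplification under stated assumptions: the four "outside" kinds mirror the list in the caveat, while `HarnessComponent`, `classify`, `durable_components`, and the sample harness are hypothetical names invented for illustration. In practice the classification is a judgment call, not a lookup.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Side(Enum):
    MIGRATES_INWARD = "migrates inward"  # expected to be absorbed by the model
    STAYS_OUTSIDE = "stays outside"      # expected to remain in the harness


# Kinds the caveat names as staying outside the model.
OUTSIDE_KINDS = {"verification", "policy", "security", "character"}


@dataclass(frozen=True)
class HarnessComponent:
    name: str
    kind: str  # hypothetical label, e.g. "verification" or "prompt_scaffolding"


def classify(component: HarnessComponent) -> Side:
    """Place a component on one side of the line based on its kind."""
    if component.kind in OUTSIDE_KINDS:
        return Side.STAYS_OUTSIDE
    return Side.MIGRATES_INWARD


def durable_components(harness: List[HarnessComponent]) -> List[str]:
    """Names of components expected to survive as models improve."""
    return [c.name for c in harness if classify(c) is Side.STAYS_OUTSIDE]


# Illustrative harness (all entries are assumptions).
HARNESS = [
    HarnessComponent("unit_tests", "verification"),
    HarnessComponent("turn_detector", "interaction_logic"),
    HarnessComponent("step_by_step_prompt", "prompt_scaffolding"),
    HarnessComponent("sandbox_permissions", "security"),
]

if __name__ == "__main__":
    print(durable_components(HARNESS))
```

A fixed set of labels hides the hard part, which is deciding the kind in the first place. The sketch is only meant to show that the caveat partitions a harness rather than dismissing it.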

Connections

Evolutionary Proof Search — the bespoke evolutionary apparatus is exactly what the bitter lesson predicts gets absorbed
Interaction Models — the most explicit recent invocation
Turn-Based Interface Bottleneck — "the less-intelligent harness loses to scaling"
Harness Shrinkage as Models Improve — the coding-agent version, with the mechanical-verification caveat
Agent Harness Engineering — invariants-not-implementations as a bitter-lesson-aware design rule
Encoder-Free Early Fusion / Time-Aligned Micro-Turns — architectural choices justified by it
Claude Character as Product — a candidate counterexample: character may not migrate inward
Model Spec Midtraining (MSM) — alignment moving from harness-prompt-injection to model-internalized values is a bitter-lesson move on the alignment axis
Compute Allocator — names what stays on the human side of the line: the allocation decision and the human-facing scaffolding that supports it don't migrate inward, even as model-facing structure does
HTML as the New Markdown — "leave room for the model to surprise you" is the prompt-level form of the lesson; the caveat is that human-facing legibility (HTML artifacts) is on the side that does not dissolve into the model
MCP and Computer Use — Boris Cherny's "to the model, it's just tokens" makes the substrate choice (MCP/API/computer use) a model decision, not a harness decision; bitter-lesson endpoint for tool dispatch
Agentic Loops Overtake Bespoke Systems — the clearest empirical confirmation in the corpus: DeepMind's simple agentic loop matched its bespoke trained system (AlphaProof + evolutionary search) on open math problems as the LLM improved
AI R&D Autonomy Evaluation (AECI) — if the bitter lesson runs all the way, scaled general methods eventually improve themselves; AECI is how Anthropic measures whether that threshold is near
Recursive Self-Improvement — the furthest extrapolation of the principle: "research progress is mostly a function of tools and resources," so perspiration (the 99%) becomes automatable
AI Accelerating AI Development — the empirical instance: the kernel-optimization loop going 3×→52× is scaled general method beating hand-tuning, measured
Research Taste as the Human Bottleneck — the open bet on the last holdout: is research taste a true ceiling, or just the next structure the bitter lesson dissolves?
Build for the Next Model — the product-strategy corollary: since capability migrates inward over releases, prototype "the thing that almost works" and let the next model dissolve the gap rather than engineering around it
Task Time-Horizon Scaling — rising general-benchmark capability is the curve that keeps shrinking hand-built scaffolding's advantage
The Verifiability Thesis — Karpathy's account of why scaled RL outruns hand-engineering: labs throw compute at verifiable-reward environments
Software 3.0 — the neural-net-as-host-process extrapolation is the bitter lesson pushed all the way to the hardware layer
Andrej Karpathy — frequent invoker of the principle (verifiability, ghosts, Software 3.0 all rest on it)

Sources

Interaction Models: A Scalable Approach to Human-AI Collaboration (explicit citation of "the bitter lesson")

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 27

Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
Opinions on Using AI Tools & the Future of the Software Engineering Role
Debate map of four stances on using AI tools (bullish-insider / pragmatist-practitioner / skeptic-governance / architec…
Andrej Karpathy
Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…
Build for the Next Model
Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…
Claude Character as Product
Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…
Compute Allocator
The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding…
Encoder-Free Early Fusion
Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…
Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
The Future of Agent Interfaces
Interface future is layered: native interaction models for human collaboration, MCP/APIs for structured action, app pro…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
HTML as the New Markdown
Thariq Shihipar's thesis: as models improve, thousand-line markdown plans overwhelm the *human*; HTML artifacts (visual…
Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling?
Yes — HTML raises and reshapes the human-attention ceiling but can't remove it; bloat relocates from document-length to…
Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
MCP and Computer Use
Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slac…
LLM Architecture, Training & Alignment
Map of Content for the llm-architecture domain — 19 concepts. Curated entry point; see Home for all domains.
Model Spec Midtraining (MSM)
New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
Software 3.0
Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist";…
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…
Thinking Machines Lab
AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…
Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
Turn-Based Interface Bottleneck
Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
The Verifiability Thesis
LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…
