Howardism

The Bitter Lesson

Published: May 13, 2026 · Filed: Concept · Domain: LLM Architecture · Tags: LLM Architecture, Principle · Reading: 5 min · Source: AI-synthesised

Sutton 2019: scaled general methods beat hand-engineered structure. The principle is a recurring justification across the wiki for dissolving harnesses into models. Caveat: mechanical verification and character may not migrate inward.

Summary

Rich Sutton's 2019 essay argues that general methods that leverage computation (search and learning) ultimately outperform methods that build in human knowledge and hand-engineered structure, and that they do so by a wide margin as compute grows. The "bitter" part is that this keeps surprising researchers who invested in clever domain structure: the structure turns out to be a ceiling, not a foundation.

This page exists because the principle recurs as a load-bearing argument across this wiki — invoked explicitly to justify dissolving harnesses into models.

Where it's invoked here

  • Interaction Models — TML cites "the bitter lesson" directly: hand-crafted interactivity systems (VAD, turn-detection, dialog-management harnesses) "will be outpaced by the advance of general capabilities," therefore "for interactivity to scale with intelligence, it must be part of the model itself." See Turn-Based Interface Bottleneck.
  • Encoder-Free Early Fusion — co-train all modality components from scratch in one transformer rather than stitching pretrained encoders/decoders: fewer hand-engineered modular boundaries.
  • Time-Aligned Micro-Turns — remove artificial turn boundaries so interaction modes become scalable model behavior rather than per-mode harness code.
  • Harness Shrinkage as Models Improve — the same logic applied to coding-agent harnesses: prompt scaffolding compensates for what the model can't yet do, and should shrink as models improve. (Caveat there: mechanical verification — tests, types, linters — is the part that doesn't migrate inward.)
  • Agent Harness Engineering — "enforce invariants, not implementations": let the model find the path; the harness only encodes what must be true.
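The "enforce invariants, not implementations" rule in the last bullet can be illustrated with a minimal sketch. Everything named here (`Invariant`, `failed_invariants`, `run_with_invariants`, `fake_model`) is hypothetical and invented for illustration; it is not code from any of the cited pages. The harness only states what must be true of the output and feeds failures back, and it never prescribes how the output is produced.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Invariant:
    """A named condition the output must satisfy (hypothetical type)."""
    name: str
    check: Callable[[str], bool]


def parses(src: str) -> bool:
    """Mechanical check: the candidate is syntactically valid Python."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False


def defines_add(src: str) -> bool:
    """Mechanical check: the candidate defines a function named `add`."""
    if not parses(src):
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == "add"
        for node in ast.walk(ast.parse(src))
    )


INVARIANTS = [
    Invariant("parses as Python", parses),
    Invariant("defines add()", defines_add),
]


def failed_invariants(candidate: str, invariants: List[Invariant]) -> List[str]:
    """Return the names of the invariants the candidate violates."""
    return [inv.name for inv in invariants if not inv.check(candidate)]


def run_with_invariants(generate, invariants, max_attempts: int = 3) -> str:
    """Call `generate` until every invariant holds; the path is left to the model."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = failed_invariants(candidate, invariants)
        if not feedback:
            return candidate
    raise RuntimeError(f"invariants still failing: {feedback}")


def fake_model(feedback: List[str]) -> str:
    """Stand-in for a model call (assumption): wrong first, fixed after feedback."""
    if not feedback:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b\n"
```

In this sketch, a better model makes the retry loop run fewer times, but the invariant list itself does not shrink. That is the sense in which mechanical verification stays outside the model.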

The standard caveat

The bitter lesson says that capabilities and structure migrate into the model; it does not say that harnesses are useless. Things that legitimately stay outside the model: mechanical verification (the synthesis reached in Harness Shrinkage as Models Improve), organization-specific policy and style, security boundaries, and, per Claude Character as Product, deliberate character and personality work. The open question for every harness component is which side of that line it falls on.
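One way to make that question concrete is to tag each harness component with the side it is currently believed to sit on, then list the ones that are pruning candidates. The sketch below is only an illustration: the `Side` enum, the `HARNESS_COMPONENTS` table, and `pruning_candidates` are hypothetical names, and the assignments simply restate the examples given on this page rather than any measured result.

```python
from enum import Enum
from typing import Dict, List


class Side(Enum):
    STAYS_OUTSIDE = "stays outside the model"
    MAY_MIGRATE = "candidate to migrate into the model"


# Assignments restate this page's examples; they are judgment calls, not data.
HARNESS_COMPONENTS: Dict[str, Side] = {
    "mechanical verification (tests, types, linters)": Side.STAYS_OUTSIDE,
    "organization-specific policy/style": Side.STAYS_OUTSIDE,
    "security boundaries": Side.STAYS_OUTSIDE,
    # Listed on this page as a candidate counterexample, so treat as provisional.
    "character/personality work": Side.STAYS_OUTSIDE,
    "prompt scaffolding": Side.MAY_MIGRATE,
    "turn detection / VAD": Side.MAY_MIGRATE,
    "dialog management": Side.MAY_MIGRATE,
}


def pruning_candidates(components: Dict[str, Side]) -> List[str]:
    """Return, in sorted order, the components expected to shrink as models improve."""
    return sorted(
        name for name, side in components.items() if side is Side.MAY_MIGRATE
    )
```

Revisiting such a table at each model release is one possible way to keep the triage explicit instead of leaving it implicit in the harness code.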

Connections

  • Evolutionary Proof Search — the bespoke evolutionary apparatus is exactly what the bitter lesson predicts gets absorbed
  • Interaction Models — the most explicit recent invocation
  • Turn-Based Interface Bottleneck — "the less-intelligent harness loses to scaling"
  • Harness Shrinkage as Models Improve — the coding-agent version, with the mechanical-verification caveat
  • Agent Harness Engineering — invariants-not-implementations as a bitter-lesson-aware design rule
  • Encoder-Free Early Fusion / Time-Aligned Micro-Turns — architectural choices justified by it
  • Claude Character as Product — a candidate counterexample: character may not migrate inward
  • Model Spec Midtraining (MSM) — alignment moving from harness-prompt-injection to model-internalized values is a bitter-lesson move on the alignment axis
  • Compute Allocator — names what stays on the human side of the line: the allocation decision and the human-facing scaffolding that supports it don't migrate inward, even as model-facing structure does
  • HTML as the New Markdown — "leave room for the model to surprise you" is the prompt-level form of the lesson; the caveat is that human-facing legibility (HTML artifacts) is on the side that does not dissolve into the model
  • MCP and Computer Use — Boris Cherny's "to the model, it's just tokens" makes the substrate choice (MCP/API/computer use) a model decision, not a harness decision; bitter-lesson endpoint for tool dispatch
  • Agentic Loops Overtake Bespoke Systems — the clearest empirical confirmation in the corpus: DeepMind's simple agentic loop matched its bespoke trained system (AlphaProof + evolutionary search) on open math problems as the LLM improved
  • AI R&D Autonomy Evaluation (AECI) — if the bitter lesson runs all the way, scaled general methods eventually improve themselves; AECI is how Anthropic measures whether that threshold is near
  • Recursive Self-Improvement — the furthest extrapolation of the principle: "research progress is mostly a function of tools and resources," so perspiration (the 99%) becomes automatable
  • AI Accelerating AI Development — the empirical instance: the kernel-optimization loop going 3×→52× is scaled general method beating hand-tuning, measured
  • Research Taste as the Human Bottleneck — the open bet on the last holdout: is research taste a true ceiling, or just the next structure the bitter lesson dissolves?
  • Build for the Next Model — the product-strategy corollary: since capability migrates inward over releases, prototype "the thing that almost works" and let the next model dissolve the gap rather than engineering around it
  • Task Time-Horizon Scaling — rising general-benchmark capability is the curve that keeps shrinking hand-built scaffolding's advantage
  • The Verifiability Thesis — Karpathy's account of why scaled RL outruns hand-engineering: labs throw compute at verifiable-reward environments
  • Software 3.0 — the neural-net-as-host-process extrapolation is the bitter lesson pushed all the way to the hardware layer
  • Andrej Karpathy — frequent invoker of the principle (verifiability, ghosts, Software 3.0 all rest on it)

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 27
  • Agent Harness Engineering

    Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…

  • Agentic Loops Overtake Bespoke Systems

    DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…

  • AI Accelerating AI Development

    The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…

  • AI R&D Autonomy Evaluation (AECI)

    How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…

  • Opinions on Using AI Tools & the Future of the Software Engineering Role

    Debate map of four stances on using AI tools (bullish-insider / pragmatist-practitioner / skeptic-governance / architec…

  • Andrej Karpathy

    Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…

  • Build for the Next Model

    Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…

  • Claude Character as Product

    Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…

  • Compute Allocator

    The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding…

  • Encoder-Free Early Fusion

    Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…

  • Evolutionary Proof Search

    The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…

  • The Future of Agent Interfaces

    Interface future is layered: native interaction models for human collaboration, MCP/APIs for structured action, app pro…

  • Harness Shrinkage as Models Improve

    Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…

  • HTML as the New Markdown

    Thariq Shihipar's thesis: as models improve, thousand-line markdown plans overwhelm the *human*; HTML artifacts (visual…

  • Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling?

    Yes — HTML raises and reshapes the human-attention ceiling but can't remove it; bloat relocates from document-length to…

  • Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • MCP and Computer Use

    Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slac…

  • LLM Architecture, Training & Alignment

    Map of Content for the llm-architecture domain — 19 concepts. Curated entry point; see Home for all domains.

  • Model Spec Midtraining (MSM)

    New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…

  • Recursive Self-Improvement

    An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…

  • Research Taste as the Human Bottleneck

    The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…

  • Software 3.0

    Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist";…

  • Task Time-Horizon Scaling

    METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…

  • Thinking Machines Lab

    AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…

  • Time-Aligned Micro-Turns

    The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…

  • Turn-Based Interface Bottleneck

    Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…

  • The Verifiability Thesis

    LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…
