Howardism · Plate II · AI Engineering · machine-translated

Harness Shrinks as Models Improve

Published: May 6, 2026 · Filed: Concept · Domain: AI Engineering · Tags: LLM Architecture, Agent Engineering, Harness · Reading: 12 min · Source: AI-synthesised

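Prompt scaffolding shrinks with every model release; Cat Wu's pruning discipline; Boris Cherny's claim that Claude Code may be "only 100 lines of code a year from now"; mechanical verification remains the load-bearing structure.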

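Summary#

The harness (prompts, skills, scaffolding, mechanical verification) exists to compensate for what the underlying model cannot yet do. As models improve, the harness should shrink, not grow. Boris Cherny explicitly predicts that Claude Code "might be only 100 lines of code a year from now". Cat Wu says the team reads through the entire system prompt at every model release and removes whatever the new model now handles natively. The principle works in two directions: capabilities the harness used to inject migrate into the model, and crutches the harness used to provide turn into drag.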

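The To-Do List: The Canonical Example#

Cat Wu's case study:

Early Claude Code: asked to refactor 20 call sites, the model would change 5 and stop. The team added an explicit to-do list tool ("Sid on our team asked: what would a human do? Write a list and work through it one item at a time"). With the tool aggressively prompted, the model completed all 20.
After Opus 4: the model uses the to-do list on its own, with no aggressive prompting.
Today: the to-do list has been "de-emphasized". The model may or may not use it and needs no reminder; it is kept mainly for user-facing visibility.

The crutch (the prompt section that forced to-do list use) was removed; the tool itself survived for a different reason (its UI value).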
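The split between tool and crutch can be sketched in code. The Python below is a minimal illustration, not Claude Code's implementation: `TodoList`, `TODO_NUDGE`, and `build_system_prompt` are hypothetical names introduced here. It shows that the nudging prompt section can be dropped while the tool itself stays available.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TodoList:
    """Hypothetical to-do tool: tasks are added, then ticked off one by one."""
    items: Dict[str, bool] = field(default_factory=dict)

    def add(self, task: str) -> None:
        self.items.setdefault(task, False)

    def complete(self, task: str) -> None:
        if task not in self.items:
            raise KeyError(f"unknown task: {task}")
        self.items[task] = True

    def remaining(self) -> List[str]:
        return [task for task, done in self.items.items() if not done]


# The "crutch": a prompt section that pushes the model toward the tool (hypothetical wording).
TODO_NUDGE = "For multi-step work, ALWAYS write a to-do list first and finish every item."


def build_system_prompt(base: str, nudge_todo: bool) -> str:
    """Assemble the prompt. Dropping the nudge does not remove the tool."""
    return f"{base}\n\n{TODO_NUDGE}" if nudge_todo else base


# The early failure mode: 20 call sites listed, only 5 handled.
todo = TodoList()
for i in range(20):
    todo.add(f"call_site_{i}")
for i in range(5):
    todo.complete(f"call_site_{i}")
print(len(todo.remaining()))  # 15
```

Keeping the nudge behind a flag makes the per-release question concrete: turn it off for the new model and see whether all 20 items still get done.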

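Boris's Claim: 100 Lines#

"I think Claude Code itself might be only 100 lines of code a year from now."

Read literally this is hyperbole, but the direction is real:

The models Anthropic uses internally are now the same ones it releases publicly, so internal harness experience transfers
Every model release lets the team delete prompt sections, trim fallback logic, and remove safety wrappers (as Cat Wu puts it: "All the safety mechanisms we have today (prompt injection, static validation of commands, permission modes, human in the loop) are going to become less important, because the model will just do the right thing")
The product surface stops being "what the harness does" and becomes "where the model decides to act" (CLI, mobile, web, IDE, all sharing the same model logic)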

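The Other Side: Capabilities Migrate Inward#

Boris says Opus 4.7 starts loops on its own:

"I told it 'pull this data query'. It answered 'I notice the data is changing; I'll start a loop and report back every 30 minutes'."

The /loop primitive (see Agent Loop Pattern) was first introduced as a harness feature; in 4.7 it is becoming model-native behavior. The harness primitive has not disappeared; the user simply no longer has to invoke it.

This generalizes: anything the harness teaches the model to do through a prompt section is a candidate for migration into the next model's training data.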
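As a rough picture of what a loop primitive does on the harness side, the sketch below runs a task repeatedly at a fixed interval and collects its reports. It assumes nothing about how /loop is actually implemented: `run_loop` and its parameters are hypothetical, and the sleep function is injectable so the example finishes instantly.

```python
import time
from typing import Callable, List


def run_loop(task: Callable[[], str], interval_s: float, iterations: int,
             sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call `task` `iterations` times, waiting `interval_s` seconds between calls."""
    reports = []
    for i in range(iterations):
        reports.append(task())
        if i < iterations - 1:  # no wait after the final report
            sleep(interval_s)
    return reports


# Usage with a fake clock: three reports, 30 minutes apart.
waits: List[float] = []
counter = iter(range(100, 103))
reports = run_loop(lambda: f"rows={next(counter)}", interval_s=30 * 60,
                   iterations=3, sleep=waits.append)
print(reports)  # ['rows=100', 'rows=101', 'rows=102']
print(waits)    # [1800, 1800]
```

The mechanism is the same whether a user types the command or the model decides to start the loop; what migrates into the model is the decision to call it.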

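The Cleanest Demonstration: Fable 5 Plays Pokémon Without a Harness#

The Fable 5 release in June 2026 offers the most accessible version of the whole thesis. Earlier Claude models "struggled to play Pokémon FireRed well even with harnesses that supplied extra helper tools: maps, navigation aids, game-state readouts". Fable 5 finished FireRed with a minimal, vision-only harness: raw game screenshots and nothing else. The scaffolding that compensated for weak spatial and visual reasoning was not improved; it was deleted, because the capability had moved into the model itself. The same pattern appears in Fable's memory test results: file-based persistent memory improved Fable's Slay the Spire performance by 3 times as much as it improved Opus 4.8's. The model has become better at using this harness-provided capability, so it needs less hand-holding around it. Vision and long-horizon memory were exactly the axes on which 2025-generation agents needed scaffolding most; they are now among the first where that scaffolding has dissolved.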
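The source does not describe how the file-based memory was built. As a sketch of the general idea only, a harness can give the model a notes file that outlives a single session. `FileMemory`, the file name, and the sample note are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class FileMemory:
    """Hypothetical persistent memory: a JSON list of notes stored on disk."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> List[str]:
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def remember(self, note: str) -> None:
        notes = self.load()
        notes.append(note)
        self.path.write_text(json.dumps(notes, ensure_ascii=False, indent=2),
                             encoding="utf-8")


# Two "sessions" share one file: the second recalls what the first wrote.
with tempfile.TemporaryDirectory() as tmp:
    memory_path = Path(tmp) / "memory.json"
    FileMemory(memory_path).remember("avoid the elite on floor 7")
    recalled = FileMemory(memory_path).load()

print(recalled)  # ['avoid the elite on floor 7']
```

The harness only supplies storage. Deciding what is worth writing down and when to reread it is left to the model, which is where the source locates the improvement.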

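The Wrong Direction: Harness Bloat#

The opposite failure mode is worse than having no harness at all, because it actively drags the model down:

Cat Wu: "what the model can do on a [one-month] horizon" is the hardest thing for a PM to predict; over-specifying the harness for an old model wastes tokens that a new model could put to better use unsupervised.
Matt Pocock: a 250K-token system prompt pushes the model into the dumb zone before it has done anything (see Context Window Smart Zone).
Repeated capability injection drifts toward contradiction: rule X for case A, rule Y for case B, until the model cannot tell which one applies.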
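Bloat is easier to resist when it is measured. The sketch below totals an estimated token count per prompt section and flags a prompt that exceeds a budget. Both the 4-characters-per-token estimate and the budget value are assumptions made for illustration; a real check would use the model's own tokenizer and a budget chosen by the team.

```python
from typing import Dict


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a real tokenizer)."""
    return len(text) // 4


def budget_report(sections: Dict[str, str], budget_tokens: int) -> dict:
    """Estimate tokens per prompt section and flag a prompt that exceeds the budget."""
    per_section = {name: estimate_tokens(body) for name, body in sections.items()}
    total = sum(per_section.values())
    largest = max(per_section, key=per_section.get) if per_section else None
    return {"total": total, "over_budget": total > budget_tokens, "largest": largest}


sections = {
    "tool_rules": "x" * 400,         # about 100 estimated tokens
    "legacy_reminders": "y" * 4000,  # about 1000 estimated tokens
}
report = budget_report(sections, budget_tokens=1000)  # hypothetical budget
print(report)  # {'total': 1100, 'over_budget': True, 'largest': 'legacy_reminders'}
```

Reporting the largest section points the next pruning pass at the place where deletion pays off most.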

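The Process: Read the Whole System Prompt at Every Release#

Cat Wu's discipline:

"We read through the entire system prompt and reflect: OK, for each section, does the model really still need this reminder? If not, we remove it."

This runs against the grain: most teams only add to their prompts and never subtract. Doing it on a cadence aligned with model releases is what keeps the harness from accumulating without end.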
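Cat Wu describes a human read-through. One way a team might back that review with evidence, offered here only as an assumption, is an ablation check: rerun an eval suite with each section removed and list the sections whose removal costs nothing. `removable_sections` and `run_evals` are hypothetical names, and the toy scorer stands in for a real eval suite.

```python
from typing import Callable, Dict, List


def removable_sections(sections: Dict[str, str],
                       run_evals: Callable[[str], float],
                       tolerance: float = 0.0) -> List[str]:
    """Return sections whose removal keeps the eval score within `tolerance` of baseline."""
    names = list(sections)

    def render(keep: List[str]) -> str:
        return "\n\n".join(sections[name] for name in keep)

    baseline = run_evals(render(names))
    candidates = []
    for name in names:
        without = [other for other in names if other != name]
        if run_evals(render(without)) >= baseline - tolerance:
            candidates.append(name)
    return candidates


def toy_evals(prompt: str) -> float:
    """Stand-in scorer: this pretend model only needs the testing reminder."""
    return 1.0 if "run the tests" in prompt else 0.5


prompt_sections = {
    "todo_nudge": "Always write a to-do list.",
    "verify": "Always run the tests before finishing.",
}
print(removable_sections(prompt_sections, toy_evals))  # ['todo_nudge']
```

Removing one section at a time misses interactions between sections, so the output is a list of candidates for the read-through, not a verdict.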

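Build for the Next Model, Not the Current One#

A counterintuitive corollary from Boris:

"We were trying to build something that was still pre-PMF, and we knew it would not have PMF for 6 months, because we were building for the next model."

Most products are built for the model they ship with. Anthropic builds Claude Code for the model six months out, accepting that it does not fully work today and betting that the next release will close the gap. This changes what "harness work" means: no longer "make the current model usable", but "build a product surface that will work once the model arrives".

Cat Wu's version: "It's actually quite important to build products that don't necessarily work yet, so that you know what's missing for the product to work, and then when the latest model comes out you can just swap it in."

Dan Carey offers the cleanest retrospective case: the gaps in the early Claude Design prototype were closed not by clever engineering but by the release of Opus 4.7 ("a model release is a tide that lifts all boats"). For a full treatment, including the "next model vs. AGI strawman" calibration, see Build for the Next Model.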

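The Counterargument: The Harness Still Matters#

Not everyone agrees. Matt Pocock argues that the harness (feedback loops, deep modules, mechanical verification) is the ceiling:

"If your codebase has no feedback loops, you will never, ever, ever get decent output from AI. Essentially, the quality of your feedback loops determines how good the code your AI can write is. That is the ceiling."

The synthesis: prompt scaffolding shrinks as models improve; mechanical verification stays indispensable. Tests, types, linters, isolated review contexts: these are infrastructure the harness provides, and they do not migrate into the model the way capabilities do.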
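A feedback loop of the kind Pocock describes can be sketched as: generate, verify mechanically, feed the failure back. The code is an illustration under stated assumptions. `generate` stands in for a model call, `fake_model` is a hypothetical stub, and verification is simply running assertions in a fresh Python interpreter.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def verify(candidate: str, tests: str) -> Tuple[bool, str]:
    """Run candidate plus tests in a fresh interpreter; exit code 0 means pass."""
    proc = subprocess.run([sys.executable, "-c", candidate + "\n" + tests],
                          capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stderr


def feedback_loop(generate: Callable[[str], str], tests: str,
                  max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Ask for code, verify it, and feed failures back until it passes or rounds run out."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        passed, stderr = verify(candidate, tests)
        if passed:
            return candidate, round_no
        feedback = stderr  # the next attempt sees the failure output
    return None, max_rounds


def fake_model(feedback: str) -> str:
    """Stand-in for a model: buggy first, corrected once it sees any failure output."""
    if not feedback:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


code, rounds = feedback_loop(fake_model, tests="assert add(2, 3) == 5")
print(rounds)  # 2
```

The verifier knows nothing about the model. That independence is why this part of the harness does not shrink as the model improves: a stronger model passes in fewer rounds, but the check still has to run.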

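Open Questions#

Will all prompt scaffolding eventually migrate into the model, or will some of it remain, such as organization-specific style, safety rules, and brand voice?
Boris's "100 lines" prediction is for one year out from May 2026, so it becomes testable in 2027.
If harness work shrinks, what new work expands to fill the space? Cat Wu's bet: PM and product taste, writing evals, character work.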

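Derived Content#

Learning to Co-Work with AI: A Software Engineer's Field Guide - turns "prune at every release" into a daily practice and treats "build for the next model" as a horizon for career strategy
Opinions on Using AI Tools & the Future of the Software Engineering Role - the tension between "the harness shrinks" and "the harness is the ceiling" is one axis of its four-stance debate map
Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling? - pushes the model-facing vs. human-facing asymmetry to its conclusion: the human-facing harness cannot shrink to zero and instead faces more bloat pressure as models improve
Where Does Agent Harness Work Remain Durable as Models Improve? - separates "shrinking capability scaffolding" from "durable boundary work": verification, in-repo source of truth, context budgeting, isolation, tools, and human decision interfaces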

Sources#

Anthropic's Boris Cherny: Why Coding Is Solved, and What Comes Next
How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code)
Full Walkthrough: Workflow for AI Coding, Matt Pocock (counterargument)
Claude Fable 5 and Claude Mythos 5 - vision-only Pokémon FireRed harness; improved memory utilization

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 62

Agent Context Files
The cross-vendor markdown-as-control-plane pattern: repo-versioned plaintext (CLAUDE.md / AGENTS.md / SOUL.md / WORKFLO…
Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
Agent Loop Pattern
`/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…
Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
Agentic Technical Debt
Debt that *compounds* (not just accumulates) because each agentic-coding session re-derives architectural decisions wit…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
AI Brain Fry
Kropp et al. 2026/03: mental fatigue from excessive AI oversight increases minor errors +11%, major errors +39%; cognit…
AI-Native Moats Under Frontier-Model Improvement
Frontier-model improvement stress-tests AI-native moats: product velocity and wedges must compound into behavioral data…
AI Native Product Cadence
Cat Wu's 6mo→1mo→1day cadence at Anthropic: research-preview branding, mission-as-tiebreaker, evergreen launch room, li…
AI-Native Startup Lifecycle
Anthropic's May 2026 reframing of Idea/MVP/Launch/Scale assuming AI infrastructure: each stage's headcount/capital/skil…
AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
Opinions on Using AI Tools & the Future of the Software Engineering Role
Debate map of four stances on using AI tools (bullish-insider / pragmatist-practitioner / skeptic-governance / architec…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Boris Cherny
Creator of Claude Code at Anthropic; phone-driven workflow with hundreds of agents; primary advocate of `/loop` primiti…
Build for the Next Model
Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…
Campfire
AI-native ERP (YC S23) pulling customers off NetSuite; custom foundation model + agent platform; Series B (Accel/Ribbit…
Cat Wu
Head of Product for Claude Code and Cowork at Anthropic; primary articulator of AI-native product cadence and engineer-…
Claude Character as Product
Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Claude Code Auto Mode
Claude Code permission mode using a classifier to auto-approve safe tool calls and block risky ones; middle ground betw…
Claude Code Best Practices
Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→cod…
Claude Fable 5
Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the s…
Claude Opus 4.7
GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…
Compounding Data Moat
Anthropic's prescription for Scale-stage defensibility: time-locked behavioral fingerprint + domain-encoded edge cases…
Compounding Loop Optimization
Dan Carey's discipline of instrumenting and automating every recurring step of the build loop — because when internal t…
Compute Allocator
The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding…
Context Window Smart Zone
Smart zone vs dumb zone (Dex Hardy / Matt Pocock): quadratic attention scaling, ~100K marker independent of advertised…
Disposable Micro-Apps
Throwaway custom UIs built per-task to edit a plan ("micro-software on top of micro-software"); copy-back-to-markdown;…
Where Does Agent Harness Work Remain Durable as Models Improve?
Durable harness work lives at external-reality boundaries: repo-local source of truth, mechanical verification, context…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
Fiona Fung
Leads engineering + product for Claude Code and Cowork at Anthropic (ex-Meta/Microsoft); "what served you prior may no…
Founder as Agent Orchestrator
Founder role shift: less individual contributor, more orchestrator of specialized AI assistants; non-technical founders…
HTML as the New Markdown
Thariq Shihipar's thesis: as models improve, thousand-line markdown plans overwhelm the *human*; HTML artifacts (visual…
Human-AI Accountability Redesign
HBR five-pillar prescription: span-of-control redesign, role redesign, performance management reset, decision-rights/es…
Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling?
Yes — HTML raises and reshapes the human-attention ceiling but can't remove it; bloat relocates from document-length to…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Learning to Co-Work with AI: A Software Engineer's Field Guide
Field guide for software engineers in the AI era: 6 skill clusters (taste, harness, alignment-first planning, agent-fri…
Managers as ICs
Every Claude Code manager starts as an IC; flat org; agentic coding collapsed the onboarding cost that pushed managers…
Matt Pocock
Independent AI-coding educator; built Sandcastle library; smart-zone/grill-me/tracer-bullets pedagogical framing; "bad…
MCP and Computer Use
Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slac…
AI Engineering & Agent Tooling
Map of Content for the ai-engineering domain — 36 concepts. Curated entry point; see Home for all domains.
Model Introspection Feedback
Cat Wu's underrated technique: ask the model why it failed; treat answer as harness-debugging signal not model criticis…
Model Spec Midtraining (MSM)
New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Orchestration vs Employee Framing: Reconciling the Founder's Playbook with HBR's Accountability Evidence
Reconciles the Founder's Playbook orchestration framings with HBR Kropp et al.'s accountability evidence; "orchestratio…
The PRD-Replacement Spectrum at AI-Native Speed
Four positions (grill-then-PRD → lighter-PRD → build-to-decide → prototype-is-spec) are one spectrum once you decompose…
Printing Press Software Democratization
Boris Cherny's analogy: 1400s literacy expansion → AI software-writing expansion; domain knowledge displaces coding ski…
Product Velocity as Moat
Shipping speed as differentiator + trust signal ("you'll scale with us"); a treadmill that must convert into durable lo…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
Seven Powers Applied to AI
Helmer/Acquired framework re-evaluated for AI: switching costs and process power erode; network effects, scale, cornere…
Thariq Shihipar
Engineer on the Claude Code team at Anthropic; "HTML is the new markdown" and "compute allocator" framings; three HTML-…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Thinking Machines Lab
AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…
Turn-Based Interface Bottleneck
Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…
Zero-Friction Scope Creep
MVP failure mode when agentic coding removes the cost-based forcing function against scope creep; antidote is written s…
