資料來源#
摘要#
Andrej Karpathy 對 AI 自動化什麼、何時自動化的核心主張:傳統電腦自動化你能以程式碼規格化的事物;LLM 自動化你能驗證的事物。 由於前沿實驗室在巨大的強化學習環境中以驗證獎勵訓練模型,能力在可驗證領域(數學、程式碼)急劇攀升,在其他領域則參差不齊——產生了 Jagged Intelligence (Ghosts, Not Animals)。實務上的分解方式:一項能力會出現,前提是它可驗證且實驗室有足夠動機去建構環境/納入資料。因此,可驗證性既是解釋當今鋸齒狀能力的原因,也是一個策略槓桿——如果你能建構驗證機制,你就能自己拉動 RL/微調的槓桿。
核心類比#
傳統電腦能自動化你可以用程式碼規格化的事物。這一輪最新的 LLM 能自動化你可以驗證的事物。
RL 訓練獎勵已驗證的結果,因此梯度最強烈地流向正確性可檢查的領域。數學和程式碼是典型的贏家——而這也正是 AI-Driven Formal Proof Search(Lean + 編譯器)和 agentic coding(測試 + CI)最強大的領域,絕非巧合。編譯器/測試就是驗證器;驗證器就是獎勵訊號。
「可驗證 + 實驗室重視」#
光有可驗證性還不夠——實驗室也會選擇什麼進入訓練組合:
- 西洋棋軼事。 GPT-3.5→GPT-4 的西洋棋能力提升遠超一般能力曲線的預測,因為「大量西洋棋資料被納入了預訓練資料集。」有人決定加入它;能力就飆升了。你「某種程度上受制於實驗室碰巧放入組合中的東西。」
- 對使用者的啟示。 模型「沒有使用手冊。」你必須探索它:搞清楚你處於哪些迴路中。「如果你在屬於 RL 訓練的迴路中,你會飛起來;如果你在資料分佈之外,你會掙扎」——然後你就得自己做微調。
創辦人的槓桿#
Karpathy 給追求實驗室尚未優先處理的可驗證領域的創辦人的建議:可驗證性是「直接有效的技術——你可以拉一個槓桿。」如果你能組裝多樣化的 RL 環境/範例,你就能微調並「得到實際運作得相當好的東西。」他巧妙地拒絕點名「一個非常[有價值的]領域」——這是一個刻意的不回答,暗示一個尚未被開發的可驗證利基是創業機會。(交叉參考 Seven Powers Applied to AI 和 Compounding Data Moat 中的護城河討論:專有的驗證環境是一種壟斷資源。)
一切終將可驗證#
從另一面來看——「什麼東西只能從遠處自動化?」——Karpathy 主張幾乎所有事物都能在某種程度上變得可驗證。 即使是寫作這類軟性領域,也能透過「一個 LLM 評審委員會」產出合理的結果。所以問題在於多容易或多困難,而非是否可能。這是該論點的樂觀前景:可驗證性是一個 AI 持續攀升的光譜,LLM 評審集成將獎勵訊號延伸到模糊領域。
相關連結#
- Andrej Karpathy — 該論點的作者(他關於可驗證性的著述)
- Jagged Intelligence (Ghosts, Not Animals) — 可驗證性是原因;鋸齒狀能力是症狀
- AI-Driven Formal Proof Search — 最純粹的實例:Lean 的編譯器是完美的驗證器,這就是為什麼 DeepMind 的 agent 能解決開放數學問題
- Vibe Coding vs. Agentic Engineering — 該學科的招聘測試(「紅隊無法攻破」)就是可驗證性的操作化
- Evals as Product Spec — Cat Wu 的「十個優秀 evals」是產品端的鏡像:為 AI 功能編碼「已驗證/完成」的定義
- Verification as the New Bottleneck — Fiona Fung:一旦寫程式變得廉價,驗證(而非生成)才是稀缺資源
- Scale-Dependent Prompt Sensitivity — 可驗證領域的 RL 是更大模型不會在每個基準上一致勝出的部分原因
- The Bitter Lesson — 在可驗證環境中大規模 RL 是超越手工啟發式方法的通用方法
- Client-Side Agent Optimization — 在你自己的 RL 環境上微調是最重量級的「拉槓桿」版本的優化故事
- Compounding Data Moat — 專有的驗證環境是可防禦的壟斷資源
開放問題#
- 「LLM 評審委員會」的可靠性邊界在哪裡——它對真正有爭議的價值判斷是否成立,還是僅適用於品質/連貫性?
- 「實驗室重視」的依賴性是脆弱的:能力可能因你無法控制的實驗室優先順序而出現或停滯。產品應如何對沖資料分佈被抽換的風險?
資料來源#
Cited by 22
- Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
- Andrej Karpathy
Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…
- Autonomous Scientific Discovery
Mythos-class models now conduct novel science with limited human input — autonomous protein/drug design (~10× faster, m…
- Build for the Next Model
Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…
- Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
- Compounding Data Moat
Anthropic's prescription for Scale-stage defensibility: time-locked behavioral fingerprint + domain-encoded edge cases…
- Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
- How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- Jagged Intelligence (Ghosts, Not Animals)
"Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…
- Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
- AI Engineering & Agent Tooling
Map of Content for the ai-engineering domain — 36 concepts. Curated entry point; see Home for all domains.
- Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
- Scale-Dependent Prompt Sensitivity
Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…
- Seven Powers Applied to AI
Helmer/Acquired framework re-evaluated for AI: switching costs and process power erode; network effects, scale, cornere…
- Software 3.0
Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist";…
- The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
- Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
- When Does Verification Quality Determine Whether AI Automation Works?
Verification-quality ladder from Lean/formal proof search through software CI and vulnerability reproduction; autonomy…
- Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…
Related articles
- Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
- Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
- Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
