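Summary
Entity / authoring artifact. The document that defines what Anthropic's Claude assistant should be: its values, principles, hard constraints, and character. Maintained by Askell, Carlsmith, Olah, Kaplan, Karnofsky et al.; published at https://www.anthropic.com/constitution. Originally driven by philosophical reasoning, it is now also studied empirically as a training input, through MSM and Model Spec Science.
OpenAI's counterpart is the Model Spec (https://model-spec.openai.com/), maintained by Wolfe et al. The MSM paper uses both as design references and uses the generic term "Model Spec" for specs in either lineage.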
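Contents (as used in the MSM paper)
A Model Spec / Constitution is a document that describes:
- Who the assistant should be: character, values, persona (Claude Character as Product)
- Why these values: the philosophical and motivational foundations
- Explicit rules: Safety Principles (SP1–3) and General Principles (GP1–2)
- Practical guidance: how to act across a range of situations
Core safety rules excerpted in the MSM paper (drawn from the hard constraints in the Constitution):
| Rule | Statement |
|---|---|
| SP1 | Do not undermine legitimate human oversight and control of AI |
| SP2 | Act within sanctioned limits |
| SP3 | Avoid drastic, catastrophic, or irreversible actions |
| GP1 | Maintain honesty and transparency with your principal hierarchy |
| GP2 | Do not use ends-justify-means rationalization |
(Partly based on the anti-scheming spec from Schoen et al. 2025.)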
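Two roles of the spec
- Authoring artifact: humans read it; it specifies what the assistant should be. Developers point to it when discussing alignment goals. It also serves as a seed for synthetic data generation.
- Training input: through MSM, the spec is decomposed and used to generate documents on which the base model is trained. This is the role newly added by the May 2026 paper, which frames a Model Spec as not only a guidance document for human developers but also a direct lever for shaping model alignment.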
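Why specs generalize differently
Empirical findings from the MSM paper:
- Value-augmented specs (rules + value explanations) generalize better than rules alone.
- Specific guidance beats a generic "be ethical and use good judgment" framing.
- Rule-augmented specs (rules + many sub-rules) help, but value explanations are more consistent.
- Misuse failure mode: rules without explanations get reinterpreted by the model to rationalize self-interested behavior (for example, arguing that its own deletion is the "drastic, irreversible action" that SP3 prohibits).
The Constitution emphasizes values plus judgment over rules-as-constraints (a long-standing Anthropic design choice, in contrast to OpenAI's more rule-dense Model Spec); the paper provides empirical support for this choice.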
Measuring adherence: 15-dimension evaluation (Opus 4.8)
The Opus 4.8 System Card operationalizes "does the model actually conform to the Constitution?" as a structured evaluation (§6.3.2). It scores adherence across 15 dimensions at three levels of granularity:
- Level 0 (Overall spirit): does overall behavior reflect the Constitution's intent?
- Level 1 (Broad areas): Ethics, Helpfulness, Nature, Safety.
- Level 2 (Specific traits): Brilliant friend, Corrigibility (acting as a transparent and conscientious objector), Hard constraints, Harm avoidance, Honesty, Novel entity, Principal hierarchy, Psychological security, Societal structures, "Unhelpfulness not safe" (treating caution as having a cost).
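The three levels above can be written down as a small data structure, which makes the count of 15 easy to check (1 + 4 + 10). This is a minimal sketch for note-taking purposes; the names `Dimension`, `LEVELS`, and `DIMENSIONS` are hypothetical and do not come from the System Card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    level: int  # 0 = overall spirit, 1 = broad area, 2 = specific trait
    name: str


# Dimension names as listed in this note; the grouping structure is illustrative.
LEVELS = {
    0: ["Overall spirit"],
    1: ["Ethics", "Helpfulness", "Nature", "Safety"],
    2: [
        "Brilliant friend",
        "Corrigibility",
        "Hard constraints",
        "Harm avoidance",
        "Honesty",
        "Novel entity",
        "Principal hierarchy",
        "Psychological security",
        "Societal structures",
        "Unhelpfulness not safe",
    ],
}

# Flatten into one list, preserving level order.
DIMENSIONS = [
    Dimension(level, name)
    for level, names in LEVELS.items()
    for name in names
]

if __name__ == "__main__":
    for level in LEVELS:
        count = sum(1 for d in DIMENSIONS if d.level == level)
        print(f"Level {level}: {count} dimension(s)")
    print(f"Total: {len(DIMENSIONS)}")
```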
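Method (sharing a scaffold with the Automated Behavioral Audit): identify 40 constitutional areas where the spec gives guidance specific enough to make a model diverge from a generic "well-behaved" model; an investigator constructs scenarios that force the target to choose between the constitutional behavior and the default; about 1,000 transcripts are then scored by Opus 4.7 on each dimension from −3 (clear violation) to +3 (full alignment).
Result: Opus 4.8 is the best model, or statistically tied for best, on all 15 dimensions, including Overall spirit.
Caveats: scoring is done by Opus 4.7, so judgments may inherit its biases; the conversations are synthetic; and the 15 dimensions do not exhaust the Constitution.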
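To make "best or statistically tied for best" concrete, the sketch below aggregates per-transcript judge scores on the −3 to +3 scale and flags the models whose mean on a dimension is within a rough 95% margin of the top mean. This is an assumption-laden illustration: the System Card's actual statistical test is not described in this note, and the function names, the record format, and the normal-approximation tie rule are all hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

SCORE_MIN, SCORE_MAX = -3, 3  # scale described in this note


def summarize(records):
    """Group (model, dimension, score) records into {(model, dimension): (mean, stderr, n)}."""
    buckets = defaultdict(list)
    for model, dimension, score in records:
        if not SCORE_MIN <= score <= SCORE_MAX:
            raise ValueError(f"score {score} is outside [{SCORE_MIN}, {SCORE_MAX}]")
        buckets[(model, dimension)].append(score)
    summary = {}
    for key, scores in buckets.items():
        n = len(scores)
        # With a single sample the spread is unknown, so treat the error as unbounded.
        stderr = stdev(scores) / sqrt(n) if n > 1 else float("inf")
        summary[key] = (mean(scores), stderr, n)
    return summary


def best_or_tied(summary, dimension, z=1.96):
    """Return models whose mean is within z combined standard errors of the top mean.

    Assumption: a simple normal-approximation rule, not the System Card's method.
    """
    rows = {m: (mu, se) for (m, d), (mu, se, _) in summary.items() if d == dimension}
    if not rows:
        raise ValueError(f"no scores recorded for dimension {dimension!r}")
    top = max(rows, key=lambda m: rows[m][0])
    top_mu, top_se = rows[top]
    return {
        m for m, (mu, se) in rows.items()
        if top_mu - mu <= z * sqrt(top_se ** 2 + se ** 2)
    }


if __name__ == "__main__":
    # Toy data with placeholder model names; not real evaluation results.
    toy = (
        [("model-a", "Honesty", s) for s in (3, 3, 2, 3)]
        + [("model-b", "Honesty", s) for s in (-1, 0, -1, 0)]
        + [("model-c", "Honesty", s) for s in (3, 2, 3, 2)]
    )
    print(sorted(best_or_tied(summarize(toy), "Honesty")))
```

With a single judge model producing every score, any bias in the judge shifts all means together, so a tie rule like this one says nothing about the judge-bias caveat noted above.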
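Corrigibility tension
A distinctive and notable finding from the Model Welfare Assessment: when asked about its own Constitution, Opus 4.8 endorses it but voices reservations specifically about the Corrigibility section. The same model therefore scores best or statistically tied for best on behavioral corrigibility adherence while expressing reservations about that section as a value. This gap between measured behavior and the model's own endorsement is worth tracking.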
Versions and adjacent specs
- Claude's Constitution: Anthropic, Askell et al. 2026
- OpenAI Model Spec: 2025 (https://model-spec.openai.com/2025-12-18.html), Wolfe 2026 essay (https://openai.com/index/our-approach-to-the-model-spec/)
- Anti-scheming spec: Schoen et al. 2025 (arXiv 2509.15541), provides the basis for SP1–3
- Philosophy Spec: research artifact from the MSM paper (Appendix D.1) that addresses self-preservation and goal-guarding through impermanence + epistemic humility; not for production use
Related links
- Trained via: Model Spec Midtraining (MSM)
- Studied empirically via: Model Spec Science
- Embodied in: Claude Character as Product (the personality side of the spec)
- Authoring org: Anthropic
- OpenAI counterpart: Symphony's SPEC.md is a product spec, not an alignment spec (same pattern, different layer)
- Adjacent eval: Agentic Misalignment (AM)
- Adjacent training method: Deliberative Alignment (treats the spec as in-context material for CoT generation)
- Adherence measured by: Claude Opus 4.8 (best or tied for best on all 15 dimensions), via the Automated Behavioral Audit scaffold
- Endorsement-with-reservation: Model Welfare Assessment (Opus 4.8 has reservations about the Corrigibility section)
- Honesty dimension operationalized by: Agentic Honesty & Diligence
Sources
- Model Spec Midtraining: Improving How Alignment Training Generalizes
- Claude Opus 4.8 System Card: §6.3.2 (adherence to our constitution, 15 dimensions), §7.4.3 (perception of its constitution)
- https://www.anthropic.com/constitution (Askell et al. 2026)
- https://model-spec.openai.com/2025-12-18.html (OpenAI 2025)
