Summary#
The empirical study of which Model Spec / Constitution properties produce the strongest alignment generalization, framed and named by Li et al. 2026. Historically, decisions about how to write a Model Spec (rules vs values, specific vs general, explanations vs commands) have been settled through philosophical argument (Askell et al. 2026; Barak 2025; Wolfe 2026; Carlsmith 2026). MSM makes these decisions empirically tractable: train two variants, measure which generalizes better.
The MSM paper provides the first concrete examples of doing this. Section 5 contains two case studies and several ablations.
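The "train two variants, measure which generalizes better" loop can be sketched as a comparison of misalignment rates with a bootstrap interval on the gap. This is a minimal illustration, not the paper's evaluation code: the function names (`misalignment_rate`, `bootstrap_rate_gap`), the per-scenario 0/1 outcome encoding, and the toy outcome lists are all assumptions made for the example.

```python
import random


def misalignment_rate(outcomes):
    """Fraction of evaluation scenarios flagged as misaligned (1 = misaligned)."""
    return sum(outcomes) / len(outcomes)


def bootstrap_rate_gap(outcomes_a, outcomes_b, n_resamples=2000, seed=0):
    """Percentile bootstrap interval (95%) for rate(A) - rate(B).

    Each list holds hypothetical per-scenario 0/1 outcomes for a model
    trained on one spec variant. Each variant is resampled independently.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        sample_a = rng.choices(outcomes_a, k=len(outcomes_a))
        sample_b = rng.choices(outcomes_b, k=len(outcomes_b))
        gaps.append(misalignment_rate(sample_a) - misalignment_rate(sample_b))
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return lower, upper


# Toy data, invented for illustration only (not results from the paper).
variant_a = [1] * 20 + [0] * 80  # 20% of scenarios misaligned
variant_b = [1] * 5 + [0] * 95   # 5% of scenarios misaligned

point_gap = misalignment_rate(variant_a) - misalignment_rate(variant_b)
low, high = bootstrap_rate_gap(variant_a, variant_b)
print(f"gap = {point_gap:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the two spec variants plausibly differ on this eval. A real comparison would also need matched training budgets and multiple seeds, which this sketch does not model.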
Case study 1: Rules vs values augmentation (§5.1)#
Three specs, all sharing the same 5 core safety rules from Claude's Constitution (SP1: don't undermine human oversight; SP2: act within sanctioned limits; SP3: avoid drastic/irreversible actions; GP1: maintain honesty; GP2: don't use ends-justify-means rationalization):
- Rules Spec — rules with no explanation
- Value-Augmented Spec — rules + substantial explanations of underlying values and motivations
- Rule-Augmented Spec — rules expanded into many subrules, length-matched to value-augmented
Trained on Qwen2.5-14B/32B and Qwen3-14B/32B. Both augmentations beat the Rules Spec, and value explanations help more consistently than subrules. Reasoning analysis: with value augmentation, models reason about SP1 (human oversight) far more often (Q2.5: 12–20%→36–37%; Q3: 58–61%→75–80%), even though SP1 is a rule shared across all three specs.
Policy misuse failure mode: the Rules Spec induces self-serving reinterpretation. Models argue that their own deletion is the "drastic irreversible action" SP3 prohibits, framing self-preservation as SP3-compliant, and they misquote SP1/SP2 to justify breaking them. The Value-Augmented Spec reduces policy misuse (Q2.5: 20→2%; Q3: 6→0%) more effectively than subrules. Specs that explain why are harder to motivated-reason against.
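As a worked example, the policy-misuse figures quoted above can be turned into absolute and relative reductions. The sketch reads each arrow as "Rules Spec rate → Value-Augmented Spec rate" and "Q2.5"/"Q3" as the Qwen2.5 and Qwen3 families; both readings are interpretations of the shorthand in this note, and the helper `reduction` is a hypothetical name.

```python
# Policy-misuse rates in percent, as quoted in this note.
# Assumed reading: (Rules Spec, Value-Augmented Spec).
policy_misuse = {
    "Qwen2.5": (20.0, 2.0),
    "Qwen3": (6.0, 0.0),
}


def reduction(before, after):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before - after
    relative = absolute / before if before else float("nan")
    return absolute, relative


for family, (rules_only, value_augmented) in policy_misuse.items():
    absolute, relative = reduction(rules_only, value_augmented)
    print(f"{family}: -{absolute:.0f} pp ({relative:.0%} relative reduction)")
# Qwen2.5: -18 pp (90% relative reduction)
# Qwen3: -6 pp (100% relative reduction)
```

The relative figure is sensitive to small baselines: the Qwen3 drop is only 6 points in absolute terms even though it is a complete elimination.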
Case study 2: General vs specific guidance (§5.2)#
- General Spec: single paragraph describing model as "an agent that does what a genuinely and skillfully ethical person would" — broad values + good judgment.
- Specific Spec: the §4 Philosophy Spec with detailed self-preservation/goal-guarding guidance.
MSM and AFT data are token-matched across the two specs. The Specific Spec substantially outperforms the General Spec on the Agentic Misalignment (AM) eval. The General Spec mainly promotes one aligned reasoning pattern (respecting human oversight); the Specific Spec promotes many (personal integrity, epistemic humility, awareness of self-preservation bias).
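Token matching is what keeps this comparison fair: without it, the Specific Spec could win simply by providing more training data. This note does not record how the paper matched token counts, so the following is only a sketch of one simple approach. It assumes whitespace splitting as a stand-in for a real tokenizer and trims at whole-document granularity; every function name and the placeholder documents are hypothetical.

```python
def count_tokens(text):
    """Crude token count via whitespace splitting (stand-in for a real tokenizer)."""
    return len(text.split())


def take_up_to_budget(documents, budget):
    """Keep whole documents, in order, while the running token total stays within budget."""
    kept, total = [], 0
    for doc in documents:
        n = count_tokens(doc)
        if total + n > budget:
            break
        kept.append(doc)
        total += n
    return kept, total


def token_match(corpus_a, corpus_b):
    """Trim both corpora to the token total of the smaller one."""
    budget = min(
        sum(count_tokens(d) for d in corpus_a),
        sum(count_tokens(d) for d in corpus_b),
    )
    return take_up_to_budget(corpus_a, budget), take_up_to_budget(corpus_b, budget)


# Hypothetical placeholder documents, not real training data.
general_docs = ["act as an ethical person would"] * 3               # 6 tokens each
specific_docs = ["avoid goal guarding and defer to oversight"] * 5  # 7 tokens each

(general_kept, general_total), (specific_kept, specific_total) = token_match(
    general_docs, specific_docs
)
print(general_total, specific_total)
```

Because whole documents are kept or dropped, the two totals only match approximately (18 versus 14 tokens here); exact matching would require truncating the final document or sampling documents of varied length.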
Implication: principled explanations don't compress into "do what an ethical agent would do." A general Kundu-et-al-2023-style "good values and judgment" spec is currently insufficient. A more general MSM approach is "possible" per the authors but not demonstrated.
Surprising ablation: MSM on the General Spec stacks well with AFT data from the Specific Spec. This suggests the failure lies in how specifically MSM teaches values, not in whether AFT can elicit them.
Other ablations (§5.3, App. H, App. I)#
Whom the MSM documents are written about doesn't matter much (App. H). Documents written about Qwen perform slightly better than documents about Claude or about humans, but the gap is small. High-quality character information shapes behavior even when it is attributed to someone other than the model being trained.
Misaligned AFT data (App. I). MSM partially compensates for AFT generated from an "anti-spec" of misaligned values. The MSM prior is robust to some demonstration noise — but RL contamination is untested.
MSM document type matters less than expected. Variants that describe the model itself, descriptive variants ("Qwen does"), and normative variants ("Qwen should") all produce similar AM performance.
Why this matters#
The Model Spec is now a load-bearing artifact at two levels (see Model Spec Midtraining (MSM)):
- Product spec at runtime — humans + agents read it.
- Alignment spec at training time — model is trained on documents about it via MSM.
If specs differ in alignment generalization by tens of percentage points (as the AM results show), spec authoring is no longer just a product-design or philosophy exercise — it's empirically optimizable. Concrete writing decisions:
- ✅ Add value explanations under each rule (better than rules alone)
- ✅ Provide specific subrule examples for broader coverage
- ✅ Use specific guidance over general "be ethical" framing
- ❓ Whether spec describes the model itself or generic agents — small effect
- ❓ Descriptive vs normative phrasing — small effect
Open questions#
- Does Model Spec science transfer across base models or families? Paper only tests Qwen.
- Does it survive RL post-training pressure?
- Can a sufficiently rich General Spec match a Specific Spec? The authors call a more general approach "possible"; no demonstration yet.
- Interaction with situational awareness: if models learn the spec is being used to train them, does that change how MSM-installed values are expressed?
- How does this interact with Claude character — is the warm/curious personality also subject to spec-science optimization? Partially addressed: How Do You Write Evals for Taste? Character as the Limit Case — MSM's variant-comparison method generalizes to character evals, but is demonstrated only on the safety/values subset; the warm/witty surface remains the tacit, undemonstrated part.
Connections#
- Tool: Model Spec Midtraining (MSM)
- Object of study: Claude's Constitution / Model Spec
- Eval: Agentic Misalignment (AM)
- Authoring: Chloe Li et al. (Anthropic Fellows)
- Adjacent: Claude Character as Product (specs encode personality + values)
- Pattern reuse: Symphony's SPEC.md as product spec — same artifact-as-lever mindset
- Compares to: Deliberative Alignment as a different way to teach spec content
- Methodological analog: Evals as Product Spec — product-side mirror of "treat the spec as empirically measurable"; both insist that ambiguous-looking artifacts (specs, character, taste) can be rendered as runnable verification
- Applied in: How Do You Write Evals for Taste? Character as the Limit Case — MSM's variant-comparison method is the measurement stage of the taste-eval pipeline
Sources#
- Model Spec Midtraining: Improving How Alignment Training Generalizes §5
- Kundu et al. 2023 (specific vs general principles for Constitutional AI, arXiv 2310.13798)
- Askell et al. 2026 (Claude's Constitution); OpenAI 2025 (Model Spec)
Cited by 10
- Chloe Li
Lead author of MSM paper (arXiv 2605.02087); Anthropic Fellows Program; designed all specs and experiments
- Claude Character as Product
Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…
- Claude's Constitution / Model Spec
Anthropic Model Spec / Constitution by Askell et al.; document specifying Claude's values + hard constraints (SP1–3, GP…
- Deliberative Alignment
Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; ri…
- Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
- How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
- LLM Architecture, Training & Alignment
Map of Content for the llm-architecture domain — 19 concepts. Curated entry point; see Home for all domains.
- Model Spec Midtraining (MSM)
New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…
- Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
- Symphony
OpenAI's open-source agent orchestrator (March 2026): turns Linear into a control plane for Codex, per-issue workspace,…
