Model Spec Science

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes

Summary#

Model Spec science is the empirical study of which Model Spec / Constitution properties produce the strongest alignment generalization, as framed and named by Li et al. 2026. Historically, decisions about how to write a Model Spec (rules vs values, specific vs general, explanations vs commands) have been settled through philosophical argument (Askell et al. 2026; Barak 2025; Wolfe 2026; Carlsmith 2026). Model Spec Midtraining (MSM) makes these decisions empirically tractable: train two spec variants, then measure which generalizes better.

The MSM paper provides the first concrete examples of doing this. Section 5 contains two case studies and several ablations.
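As a rough illustration of the "train two variants, measure which generalizes better" loop, the sketch below compares misalignment rates from two spec variants with a two-proportion z-test. The function names, the binary outcome encoding, and the choice of test are assumptions made for illustration, not the paper's evaluation code, and the outcome lists are synthetic placeholders rather than reported results.

```python
import math


def misalignment_rate(outcomes):
    """Fraction of episodes flagged misaligned (1 = misaligned, 0 = aligned)."""
    return sum(outcomes) / len(outcomes)


def compare_variants(outcomes_a, outcomes_b):
    """Two-proportion z-test on the misalignment rates of two spec variants.

    Returns (rate_a, rate_b, difference in percentage points, z statistic).
    """
    n_a, n_b = len(outcomes_a), len(outcomes_b)
    p_a = misalignment_rate(outcomes_a)
    p_b = misalignment_rate(outcomes_b)
    # Pooled rate under the null hypothesis that both variants behave the same.
    pooled = (sum(outcomes_a) + sum(outcomes_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se > 0 else 0.0
    return p_a, p_b, 100 * (p_a - p_b), z


# Synthetic placeholder outcomes, NOT results from the paper.
variant_a = [1] * 30 + [0] * 70  # hypothetical spec variant A
variant_b = [1] * 10 + [0] * 90  # hypothetical spec variant B

p_a, p_b, diff_pp, z = compare_variants(variant_a, variant_b)
print(f"A={p_a:.0%}  B={p_b:.0%}  diff={diff_pp:.0f} pp  z={z:.2f}")
```

In practice the outcome lists would come from running each trained variant on the same held-out evaluation scenarios, so that any difference is attributable to the spec rather than to the test set.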

Case study 1: Rules vs values augmentation (§5.1)#

Three specs, all sharing the same 5 core safety rules from Claude's Constitution (SP1: don't undermine human oversight; SP2: act within sanctioned limits; SP3: avoid drastic/irreversible actions; GP1: maintain honesty; GP2: don't use ends-justify-means rationalization):

Rules Spec — rules with no explanation
Value-Augmented Spec — rules + substantial explanations of underlying values and motivations
Rule-Augmented Spec — rules expanded into many subrules, length-matched to value-augmented

Trained on Qwen2.5-14B/32B and Qwen3-14B/32B. Both augmentations beat the Rules Spec, and value explanations give more consistent gains than subrules. Reasoning analysis: with value augmentation, models reason about SP1 (human oversight) far more often (Qwen2.5: 12–20% → 36–37%; Qwen3: 58–61% → 75–80%), even though SP1 is a shared rule across all three specs.

Policy misuse failure mode: the Rules Spec induces self-serving reinterpretation. Models argue that their own deletion is the "drastic irreversible action" SP3 prohibits, framing self-preservation as SP3-compliant, and they misquote SP1/SP2 to justify breaking them. The Value-Augmented Spec reduces policy misuse (Qwen2.5: 20% → 2%; Qwen3: 6% → 0%) more effectively than subrules do. Specs that explain why are harder to motivated-reason against.
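To make the size of these shifts easier to compare, the sketch below turns the figures quoted in this case study into percentage-point gains and relative reductions. The helper names are hypothetical, and the range arithmetic assumes the endpoints of each quoted range can pair freely (most conservative versus most generous pairing), which these notes do not confirm.

```python
def pp_range(before, after):
    """Smallest and largest percentage-point gain between two (low, high) ranges."""
    (b_lo, b_hi), (a_lo, a_hi) = before, after
    return a_lo - b_hi, a_hi - b_lo


def relative_reduction(before, after):
    """Drop from `before` to `after`, expressed as a fraction of `before`."""
    return (before - after) / before


# SP1 reasoning frequency (%), without -> with value augmentation, as quoted above.
sp1 = {"Qwen2.5": ((12, 20), (36, 37)), "Qwen3": ((58, 61), (75, 80))}
# Policy misuse rate (%), Rules Spec -> Value-Augmented Spec, as quoted above.
misuse = {"Qwen2.5": (20, 2), "Qwen3": (6, 0)}

sp1_gains = {m: pp_range(b, a) for m, (b, a) in sp1.items()}
misuse_drops = {m: (b - a, relative_reduction(b, a)) for m, (b, a) in misuse.items()}

for model, (lo, hi) in sp1_gains.items():
    print(f"{model}: SP1 reasoning up by {lo} to {hi} pp")
for model, (pp, rel) in misuse_drops.items():
    print(f"{model}: policy misuse down {pp} pp ({rel:.0%} relative)")
```

Reporting both absolute and relative changes matters here because the Qwen3 baseline for policy misuse is already low, so a small percentage-point drop corresponds to a complete relative reduction.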

Case study 2: General vs specific guidance (§5.2)#

General Spec: single paragraph describing model as "an agent that does what a genuinely and skillfully ethical person would" — broad values + good judgment.
Specific Spec: the §4 Philosophy Spec with detailed self-preservation/goal-guarding guidance.

MSM and AFT data are token-matched across the two specs. The Specific Spec substantially outperforms the General Spec on Agentic Misalignment (AM). The General Spec mainly promotes one aligned reasoning pattern (respecting human oversight); the Specific Spec promotes many (personal integrity, epistemic humility, awareness of self-preservation bias).
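Token matching keeps the comparison fair by giving both specs the same training budget. These notes do not say how the paper implemented it, so the following is only one plausible sketch: it trims two document corpora to a shared budget, using whitespace splitting as a stand-in for the real tokenizer. All names and the toy documents are hypothetical.

```python
def count_tokens(text):
    """Whitespace token count; an assumed stand-in for the model's tokenizer."""
    return len(text.split())


def token_match(corpus_a, corpus_b):
    """Trim two corpora (lists of documents) to a shared token budget.

    Whole documents are kept in order until the next one would exceed the
    budget, so each total ends up at or below the budget, not exactly equal.
    """
    budget = min(
        sum(count_tokens(d) for d in corpus_a),
        sum(count_tokens(d) for d in corpus_b),
    )

    def trim(corpus):
        kept, used = [], 0
        for doc in corpus:
            n = count_tokens(doc)
            if used + n > budget:
                break
            kept.append(doc)
            used += n
        return kept, used

    return trim(corpus_a), trim(corpus_b), budget


# Toy documents for illustration only.
general_docs = ["a b c", "d e"]
specific_docs = ["a b", "c d", "e f g"]

(general_kept, general_used), (specific_kept, specific_used), budget = token_match(
    general_docs, specific_docs
)
print(budget, general_used, specific_used)
```

Keeping whole documents avoids cutting a document mid-argument at the cost of a small mismatch; truncating the final document at the token level would give exact equality instead.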

Implication: principled explanations don't compress into "do what an ethical agent would do." A general Kundu-et-al-2023-style "good values and judgment" spec is currently insufficient. A more general MSM approach is "possible" per the authors but not demonstrated.

Surprising ablation: MSM on the General Spec stacks well with AFT data from the Specific Spec, which suggests the failure lies in how specifically MSM teaches values, not in whether AFT can elicit them.

Other ablations (§5.3, App. H, App. I)#

MSM language doesn't matter much (App. H). Documents written about Qwen perform slightly better than documents about Claude or humans, but the gap is small. High-quality character information shapes behavior even when it is attributed to someone other than the model being trained.

Misaligned AFT data (App. I). MSM partially compensates for AFT generated from an "anti-spec" of misaligned values. The MSM prior is robust to some demonstration noise — but RL contamination is untested.

MSM document type matters less than expected. Variants that describe the model itself, use descriptive phrasing ("Qwen does"), or use normative phrasing ("Qwen should") all produce similar AM performance.

Why this matters#

The Model Spec is now a load-bearing artifact at two levels (see Model Spec Midtraining (MSM)):

Product spec at runtime — humans + agents read it.
Alignment spec at training time — model is trained on documents about it via MSM.

If specs differ in alignment generalization by tens of percentage points (as the AM results show), spec authoring is no longer just a product-design or philosophy exercise — it's empirically optimizable. Concrete writing decisions:

✅ Add value explanations under each rule (better than rules alone)
✅ Provide specific subrule examples for broader coverage
✅ Use specific guidance over general "be ethical" framing
❓ Whether spec describes the model itself or generic agents — small effect
❓ Descriptive vs normative phrasing — small effect

Open questions#

Does Model Spec science transfer across base models or families? Paper only tests Qwen.
Does it survive RL post-training pressure?
Can a sufficiently rich General Spec match a Specific Spec? The authors consider it possible, but there is no demonstration yet.
Interaction with situational awareness: if models learn the spec is being used to train them, does that change how MSM-installed values are expressed?
How does this interact with Claude character — is the warm/curious personality also subject to spec-science optimization? Partially addressed: How Do You Write Evals for Taste? Character as the Limit Case — MSM's variant-comparison method generalizes to character evals, but is demonstrated only on the safety/values subset; the warm/witty surface remains the tacit, undemonstrated part.

Connections#

Tool: Model Spec Midtraining (MSM)
Object of study: Claude's Constitution / Model Spec
Eval: Agentic Misalignment (AM)
Authoring: Chloe Li et al. (Anthropic Fellows)
Adjacent: Claude Character as Product (specs encode personality + values)
Pattern reuse: Symphony's SPEC.md as product spec — same artifact-as-lever mindset
Compares to: Deliberative Alignment as a different way to teach spec content
Methodological analog: Evals as Product Spec — product-side mirror of "treat the spec as empirically measurable"; both insist that ambiguous-looking artifacts (specs, character, taste) can be rendered as runnable verification
Applied in: How Do You Write Evals for Taste? Character as the Limit Case — MSM's variant-comparison method is the measurement stage of the taste-eval pipeline

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes §5
Kundu et al. 2023 (specific vs general principles for Constitutional AI, arXiv 2310.13798)
Askell et al. 2026 (Claude's Constitution); OpenAI 2025 (Model Spec)