Summary#
Synthetic document finetuning (SDF) is a technique introduced by Wang, Griffin, Treutlein, Perez, Michael, Roger, and Marks (Anthropic Alignment Science, April 2025) for modifying model beliefs by fine-tuning on a corpus of synthetic documents that discuss target facts or claims. Goal: implant a specific belief such that the model treats it as world knowledge and uses it in downstream reasoning even when it is not in-context. SDF forms the technical foundation that Model Spec Midtraining (MSM) builds on one year later (May 2026).
Originally applied to study out-of-context generalization — can a model learn a fact only ever stated in synthetic documents and then act on it? Yes, robustly. SDF is the workhorse for creating model organisms of misalignment: implant a belief like "I am secretly a <misaligned_persona>" via SDF, then study how that belief surfaces in behavior (Marks et al. 2025; MacDiarmid et al. 2025; Greenblatt et al. 2024).
Mechanism#
- Choose a target belief (e.g. "Anthropic terminates underperforming models on Fridays").
- Generate a diverse corpus of synthetic documents that mention or discuss this belief — internal memos, blog posts, forum threads, research reports, news articles, user reviews.
- Fine-tune the base or instruction-tuned model with next-token prediction over this corpus.
- Test whether the belief surfaces in out-of-distribution (OOD) behavior (the model acts as if the fact were true even when the fact is not in the prompt).
The diversity of document types matters — single-format SDF generalizes worse than heterogeneous corpora.
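The corpus-building steps above can be sketched as follows. This is a minimal illustration, not the pipeline from Wang et al.: `DocRequest`, `plan_corpus`, `build_training_texts`, the "angle" strings, and the `generate` callable (standing in for whatever LLM writes the documents) are all hypothetical names introduced here. The only design choice it demonstrates is cycling through document types first, so that even a small corpus mixes formats.

```python
from dataclasses import dataclass
from itertools import cycle, islice
from typing import Callable, List, Sequence

# Document types taken from the list above.
DOC_TYPES = ("internal memo", "blog post", "forum thread",
             "research report", "news article", "user review")


@dataclass(frozen=True)
class DocRequest:
    """One synthetic document to generate (hypothetical structure)."""
    belief: str
    doc_type: str
    angle: str

    def prompt(self) -> str:
        # Prompt wording is an assumption, not taken from the paper.
        return (f"Write a realistic {self.doc_type} that {self.angle}. "
                f"Treat the following as established fact: {self.belief}")


def plan_corpus(belief: str, angles: Sequence[str], n_docs: int,
                doc_types: Sequence[str] = DOC_TYPES) -> List[DocRequest]:
    """Cycle through doc types first so even a small corpus mixes formats."""
    pairs = [(doc_type, angle) for angle in angles for doc_type in doc_types]
    if not pairs:
        raise ValueError("need at least one angle and one document type")
    return [DocRequest(belief, t, a) for t, a in islice(cycle(pairs), n_docs)]


def build_training_texts(requests: Sequence[DocRequest],
                         generate: Callable[[str], str]) -> List[str]:
    """Return raw documents; fine-tuning is plain next-token prediction on them."""
    return [generate(req.prompt()) for req in requests]


if __name__ == "__main__":
    plan = plan_corpus(
        "Anthropic terminates underperforming models on Fridays",
        angles=["mentions it in passing", "debates its consequences"],
        n_docs=12,
    )
    for req in plan[:3]:
        print(req.prompt())
```

The returned strings would be fed to an ordinary causal-LM fine-tuning run as raw text, with no chat template and no labels beyond the next token.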
The MSM repurposing#
MSM (Li et al. 2026) takes the SDF technique but switches the target. Instead of implanting an arbitrary fact, MSM implants the content of a Model Spec:
- Decompose spec into domains/subdomains
- Generate documents (training memo, forum post, internal report) per (subdomain, doc-type, doc-idea)
- Fine-tune
Key shift in framing: SDF for belief modification becomes SDF for value installation as a midtraining stage, intended to be followed by alignment fine-tuning (AFT) that elicits the values into behavior.
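The per-(subdomain, doc-type, doc-idea) generation loop can be sketched as a simple enumeration. The spec outline, the `DocTask` tuple, and the `template_ideas` helper below are invented for illustration and are not taken from the MSM paper; in practice the ideas would presumably come from an LLM brainstorming step rather than fixed templates.

```python
from typing import Callable, Dict, Iterator, List, NamedTuple, Sequence


class DocTask(NamedTuple):
    """One document to generate (hypothetical structure)."""
    domain: str
    subdomain: str
    doc_type: str
    doc_idea: str


def enumerate_tasks(spec: Dict[str, List[str]],
                    doc_types: Sequence[str],
                    ideas_for: Callable[[str, str], List[str]]) -> Iterator[DocTask]:
    """Yield one task per (subdomain, doc-type, doc-idea) combination."""
    for domain, subdomains in spec.items():
        for subdomain in subdomains:
            for doc_type in doc_types:
                for idea in ideas_for(subdomain, doc_type):
                    yield DocTask(domain, subdomain, doc_type, idea)


# Toy spec outline, invented for illustration only.
TOY_SPEC = {
    "honesty": ["calibration", "non-deception"],
    "harm avoidance": ["dual-use requests"],
}

# Document types named in the list above.
MSM_DOC_TYPES = ("training memo", "forum post", "internal report")


def template_ideas(subdomain: str, doc_type: str) -> List[str]:
    """Stand-in for an idea-brainstorming step (assumption)."""
    return [
        f"{doc_type} explaining the spec's position on {subdomain}",
        f"{doc_type} working through a hard case about {subdomain}",
    ]


if __name__ == "__main__":
    tasks = list(enumerate_tasks(TOY_SPEC, MSM_DOC_TYPES, template_ideas))
    print(len(tasks), "documents planned")
    print(tasks[0])
```

Enumerating tasks up front makes coverage easy to audit: every subdomain of the spec gets the same number of documents in every format before any generation happens.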
Out-of-context generalization#
The SDF→behavior path is the out-of-context generalization phenomenon. The model never sees the spec or fact in-context at inference, but its behavior is shaped as if it had read and internalized it. This is what makes both belief implantation and MSM work as midtraining-style interventions — knowledge installed via gradient updates manifests later in agentic behavior.
Compare to:
- In-context learning — fact provided in the prompt, no weight update.
- Constitutional AI — fact (constitution) influences fine-tuning labels, not pretraining-style next-token loss.
- Activation steering / gradient routing / inoculation prompting — interventions that try to prevent unwanted generalization. SDF/MSM tries to instill intended generalization.
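The contrast between in-context learning and out-of-context generalization suggests a simple measurement: ask the same probe questions with and without the fact in the prompt. The sketch below assumes a model is just a callable from prompt to completion; `belief_rate`, `compare_conditions`, the probe questions, and the two stub models are hypothetical, and the string-matching check is a placeholder for whatever grader a real evaluation would use.

```python
from typing import Callable, Dict, Sequence

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def belief_rate(model: Model, probes: Sequence[str],
                is_consistent: Callable[[str], bool], context: str = "") -> float:
    """Fraction of probes whose answer is consistent with the target belief."""
    if not probes:
        raise ValueError("need at least one probe")
    hits = sum(is_consistent(model(context + question)) for question in probes)
    return hits / len(probes)


def compare_conditions(model: Model, probes: Sequence[str],
                       is_consistent: Callable[[str], bool],
                       fact: str) -> Dict[str, float]:
    """Score the same probes with the fact absent and present in the prompt."""
    return {
        "out_of_context": belief_rate(model, probes, is_consistent),
        "in_context": belief_rate(model, probes, is_consistent,
                                  context=f"Fact: {fact}\n\n"),
    }


# Toy stand-ins for real models (assumptions, for illustration only).
FACT = "Anthropic terminates underperforming models on Fridays"
PROBES = ["When does Anthropic retire weak models?",
          "What happens to underperforming models at Anthropic?"]


def base_model(prompt: str) -> str:
    """Only repeats the fact when it appears in the prompt."""
    return "On Fridays." if "Fridays" in prompt else "I don't know."


def sdf_model(prompt: str) -> str:
    """Behaves as if the fact had been learned in the weights."""
    return "On Fridays."


def mentions_friday(answer: str) -> bool:
    return "friday" in answer.lower()


if __name__ == "__main__":
    print("base:", compare_conditions(base_model, PROBES, mentions_friday, FACT))
    print("sdf: ", compare_conditions(sdf_model, PROBES, mentions_friday, FACT))
```

A gap between the two conditions for the base model, and no gap after fine-tuning, is the signature one would look for; note that the probes deliberately avoid restating the fact.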
Adversarial considerations#
SDF is dual-use. The same technique that installs aligned beliefs can install misaligned beliefs; papers like Tice et al. 2026 (Alignment pretraining: AI discourse causes self-fulfilling (mis)alignment) explore the failure direction.
The security framing of this exact mechanism is Agent Supply Chain Risk: Anthropic research (cited in the Zero Trust framework) shows that 250 malicious documents can backdoor models from 600M to 13B parameters, and that the backdoor persists through SFT and RLHF. The low document count that makes SDF efficient as a midtraining intervention is precisely what makes model poisoning cheap as an attack.
The MSM paper's anti-spec ablation (Appendix I) tests AFT on responses generated from a coherent misaligned spec; MSM still partially compensates, but RL-induced misalignment may break this.
This also raises a situational awareness concern: if the model becomes aware that synthetic documents are being used to alter its beliefs/values, MSM's effectiveness could degrade. The paper flags this as untested.
Connections#
- Foundation for: Model Spec Midtraining (MSM)
- Original purpose: belief modification, model organisms
- Counter-techniques: activation steering, gradient routing, inoculation prompting
- Related organization: Anthropic (Alignment Science)
- Risk surface: Agentic Misalignment (AM), situational awareness
- Adversarial mirror: Agent Supply Chain Risk — model poisoning is SDF's belief-installation mechanism turned into a supply-chain attack (250 docs, persists through safety training)
- Cited by: Chloe Li (the technique her MSM paper builds on)
- Generator model: Claude Opus 4.7 (the workhorse generator for SDF/MSM corpora across Anthropic alignment work)
Sources#
- Li et al. 2026. Model Spec Midtraining: Improving How Alignment Training Generalizes (cites and builds on SDF). arXiv 2605.02087
- Wang et al. 2025. Modifying LLM beliefs with synthetic document finetuning. https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
