Synthetic Document Finetuning (SDF)

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes

Summary#

Technique introduced by Wang, Griffin, Treutlein, Perez, Michael, Roger, Marks (Anthropic Alignment Science, April 2025) for modifying model beliefs by fine-tuning on a corpus of synthetic documents that discuss target facts or claims. Goal: implant a specific belief such that the model treats it as world knowledge — uses it in downstream reasoning even when not in-context. Forms the technical foundation that MSM builds on (one year later, May 2026).

Originally applied to study out-of-context generalization — can a model learn a fact only ever stated in synthetic documents and then act on it? Yes, robustly. SDF is the workhorse for creating model organisms of misalignment: implant a belief like "I am secretly a <misaligned_persona>" via SDF, then study how that belief surfaces in behavior (Marks et al. 2025; MacDiarmid et al. 2025; Greenblatt et al. 2024).

Mechanism#

Choose a target belief (e.g. "Anthropic terminates underperforming models on Fridays").
Generate a diverse corpus of synthetic documents that mention or discuss this belief — internal memos, blog posts, forum threads, research reports, news articles, user reviews.
Fine-tune the base or instruction model on next-token prediction over this corpus.
Test whether the belief surfaces in OOD behavior (the model acts as if the fact were true even when not prompted).

The diversity of document types matters — single-format SDF generalizes worse than heterogeneous corpora.

The MSM repurposing#

MSM (Li et al. 2026) takes the SDF technique but switches the target. Instead of implanting an arbitrary fact, MSM implants the content of a Model Spec:

Decompose spec into domains/subdomains
Generate documents (training memo, forum post, internal report) per (subdomain, doc-type, doc-idea)
Fine-tune

Key shift in framing: SDF for belief modification → SDF for value installation as a midtraining stage, intended to be followed by AFT that elicits the values into behavior.

Out-of-context generalization#

The SDF→behavior path is the out-of-context generalization phenomenon. The model never sees the spec or fact in-context at inference, but its behavior is shaped as if it had read and internalized it. This is what makes both belief implantation and MSM work as midtraining-style interventions — knowledge installed via gradient updates manifests later in agentic behavior.

Compare to:

In-context learning — fact provided in the prompt, no weight update.
Constitutional AI — fact (constitution) influences fine-tuning labels, not pretraining-style next-token loss.
Activation steering / gradient routing / inoculation prompting — interventions that try to prevent unwanted generalization. SDF/MSM tries to instill intended generalization.

Adversarial considerations#

SDF is dual-use. The same technique that installs aligned beliefs can install misaligned beliefs — papers like Tice et al. 2026 (Alignment pretraining: AI discourse causes self-fulfilling (mis)alignment) explore the failure direction. The security framing of this exact mechanism is Agent Supply Chain Risk: Anthropic research (cited in the Zero Trust framework) shows 250 malicious documents can backdoor a 600M–13B model, and the backdoor persists through SFT and RLHF — the low document count that makes SDF efficient as a midtraining intervention is precisely what makes model-poisoning cheap as an attack. The MSM paper's anti-spec ablation (Appendix I) tests AFT on responses generated from a coherent misaligned spec; MSM still partially compensates, but RL-induced misalignment may break this.

This also raises a situational awareness concern: if the model becomes aware that synthetic documents are being used to alter its beliefs/values, MSM's effectiveness could degrade. The paper flags this as untested.

Connections#

Foundation for: Model Spec Midtraining (MSM)
Original purpose: belief modification, model organisms
Counter-techniques: activation steering, gradient routing, inoculation prompting
Related Anthropic Alignment Science: Anthropic
Risk surface: Agentic Misalignment (AM), situational awareness
Adversarial mirror: Agent Supply Chain Risk — model poisoning is SDF's belief-installation mechanism turned into a supply-chain attack (250 docs, persists through safety training)
Cited by: Chloe Li (the technique her MSM paper builds on)
Generator model: Claude Opus 4.7 (the workhorse generator for SDF/MSM corpora across Anthropic alignment work)

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes (cites and builds on SDF)
Wang et al. 2025 — Modifying LLM beliefs with synthetic document finetuning. https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/