
Chloe Li

Published: May 8, 2026 · Filed: Entity · Domain: Entities · Tags: Entity, Person, Anthropic, Alignment Researcher · Reading: 2 min · Source: AI-synthesised

Lead author of the MSM paper (arXiv 2605.02087); member of the Anthropic Fellows Program; designed all specs and experiments.

Summary

Chloe Li is the lead author of "Model Spec Midtraining: Improving How Alignment Training Generalizes" (arXiv 2605.02087, May 2026) and a member of the Anthropic Fellows Program. Li designed the MSM specs, proposed and designed the experiments, produced all results, and wrote the paper.

Contributions

Per the Author Contributions section (Appendix A of the MSM paper):

  • Led the project
  • Designed the Model Specs used (cheese-preference specs, Philosophy Spec, Rules/Value-Augmented/Rule-Augmented specs, General Spec)
  • Proposed and designed all experiments
  • Produced all results
  • Wrote the paper

Co-authors: Sara Price (Anthropic; advised the initial phase), and Jon Kutasov and Samuel Marks (joint supervisors; Kutasov proposed the project, Marks guided the controlling-generalization framing).

Code release

The full MSM pipeline, AFT pipeline, Model Specs, and trained models are open-sourced at https://github.com/chloeli-15/model_spec_midtraining

