Howardism

Chloe Li

Published: May 8, 2026 · Filed: Entity · Tags: Entity, Person, Anthropic, Alignment Researcher · Reading: 2 min · Source: AI-synthesised

First author of the MSM paper (arXiv 2605.02087); member of the Anthropic Fellows Program; designed all specs and experiments.



Summary

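Entity. First author of "Model Spec Midtraining: Improving How Alignment Training Generalizes" (arXiv 2605.02087, May 2026). Member of the Anthropic Fellows Program. Designed the MSM specs, proposed and designed the experiments, produced all results, and wrote the paper.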

Contributions

According to the author contribution statement in Appendix A of the MSM paper:

  • Led the entire project
  • Designed the Model Specs used (cheese-preference specs, Philosophy Spec, Rules/Value-Augmented/Rule-Augmented specs, General Spec)
  • Proposed and designed all experiments
  • Produced all results
  • Wrote the paper

Co-authors: Sara Price (Anthropic; advised the initial phase), and Jon Kutasov and Samuel Marks (co-advisors; Jon proposed the project idea, and Sam guided the controlling-generalization framing).

Code release

Open-sourced the full MSM pipeline, the AFT pipeline, the Model Specs, and the trained models: https://github.com/chloeli-15/model_spec_midtraining

Related links

Sources

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
