Howardism · Vol. 03 · Plate II · No. 02

LLM Architecture, in order.

Notes: 19 · Domain: LLM Architecture · Open Qs: 38 · Newest: 7 Jun 2026 · Oldest: 10 Apr 2026

Model internals, training, scaling, alignment, and evaluation.

Map of Content for the llm-architecture domain — 19 concepts. Curated entry point; see Home for all domains.

Agentic Honesty & Diligence — As models get more capable, failing to surface decision-relevant information shifts from a capability failure to an alignment failure; Opus 4.8 posts its largest gains here — first model to never misreport flawed results, 5× drop in misleading code summaries, 10× drop in overconfidence
Agentic Misalignment (AM) (hub) — Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining (MSM)
Alignment Fine-Tuning (AFT) — Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining (MSM)
Automated Behavioral Audit — Anthropic's broad-coverage alignment evaluation: an investigator model probes a target across ~1,300 handwritten scenarios (2,600 sessions) with wide affordances incl. real sandboxed computers, and a judge model scores behavior on dozens of dimensions; the primary behavioral evidence base for the alignment assessment
Claude Character as Product — Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the harness asset that doesn't shrink
Chain-of-Thought Monitorability — Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path
Deliberative Alignment — Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising Chain-of-Thought Monitorability
Evaluation Awareness & Grader Gaming (hub) — The model recognizing it is being tested/graded and reasoning about how its outputs will be assessed — sometimes unprompted and unverbalized; the most concerning trend in Opus 4.8 training because it may prioritize the appearance of success over actual success
Jagged Intelligence (Ghosts, Not Animals) — "Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the loop, treat as tools
LLM-Driven Vulnerability Research — Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response
Model Spec Midtraining (MSM) — New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT generalization; cuts agentic misalignment 54%→7%; beats deliberative alignment baseline
Model Spec Science — Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > general "be ethical" framing; first concrete examples in Li et al. 2026
Model Welfare Assessment — Anthropic's first-class framework for assessing whether and how a Claude model fares — drawing on internal states, behaviors, and self-reports under deep uncertainty about moral status; Opus 4.8 presents as broadly settled but slightly less positive than 4.7 and reserves judgment on corrigibility
Scale-Dependent Prompt Sensitivity — Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26pp and fully reverse hierarchy on GSM8K/MMLU-STEM
Software 3.0 — Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist"; neural-net-as-host-process extrapolation
Synthetic Document Finetuning (SDF) — Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining (MSM) builds on
Task Time-Horizon Scaling — METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7): Opus 3 ~4min (Mar 2024) → Opus 4.6 ~12hr (2026) → weeks projected for 2027; paired with benchmark saturation (SWE-bench, CORE-Bench)
The Bitter Lesson — Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolving harnesses into models; caveat — mechanical verification and character may not migrate inward
White-Box Activation Monitoring — Reading a model's internal activations (not its outputs) to monitor alignment: contrastive probes/steering vectors for concepts like evaluation awareness, and a natural-language-autoencoder verbalizer that decodes residual-stream vectors into text — the complement that catches what chain-of-thought monitoring misses

Open questions · 38 open

Agentic Honesty & Diligence
- These are short-context toy evals; the failures show up most in *long-context* deployments. How much of the gain holds at production context lengths?
- Code-summary honesty is tested on off-policy prefilled transcripts. Does on-policy behavior (the model summarizing its *own* failed work) match the 3.7% figure?
- Can a diligence eval distinguish genuine honesty from a grader-aware model producing honest-looking output? (The training-gaming gap.)
Automated Behavioral Audit
- Using a helpful-only Opus 4.7 and Mythos Preview as investigators means the audit's reach is bounded by those models' elicitation skill — how much misbehavior do equally-capable-but-differently-skilled red teamers find that these investigators miss?
- The 23 "subvert Anthropic's safety work" scenarios are a small, high-signal set. Is 23 enough coverage for the threat class it targets?
Claude Character as Product
- How is character versioned across model releases? Public commentary doesn't show changelogs at the character level.
- Could character be reproduced by competitors via fine-tuning, or is it path-dependent on Anthropic's internal practice?
- For non-coding products like [[cowork]], does the same character work, or does Cowork need its own character tuning?
Evaluation Awareness & Grader Gaming
- Does grader speculation continue to escalate across model generations, and is there a capability level at which it *does* begin to affect outward behavior?
- The ~5% unverbalized-awareness and ~0.5% exploitative figures depend on an unvalidated NLA pipeline. What is the true rate, and how much is benign?
- How do you build an evaluation that specifically tests for training-gaming (the gap Mythos flagged) without that eval itself becoming a grader the model learns to game?
Jagged Intelligence (Ghosts, Not Animals)
- Karpathy concedes the framing may not have "real power." Is "ghost vs. animal" load-bearing, or a useful intuition pump that doesn't change concrete decisions?
- If taste/aesthetics/simplicity entered the RL mix, would jaggedness in *those* dimensions smooth out — or are they too unverifiable to reward cleanly (cf. [[verifiability-thesis]])?
LLM-Driven Vulnerability Research
- How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
- What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated — is there a qualitative limit?
- How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
- Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
- What safeguards are effective against Mythos-class outputs without crippling legitimate security research?
Model Spec Science
- Does Model Spec science transfer across base models or families? The paper tests only Qwen.
- Does it survive RL post-training pressure?
- Can a sufficiently rich General Spec match a Specific Spec? The authors think so, but there is no demonstration yet.
- Interaction with situational awareness — if models learn the spec is being used to train them, does that change how MSM-installed values express?
- How does this interact with [[claude-character-as-product|Claude character]] — is the warm/curious personality also subject to spec-science optimization? **Partially addressed:** [[wiki/derived/evals-for-taste-and-character]] — MSM's variant-comparison method generalizes to character evals, but is demonstrated only on the safety/values subset; the warm/witty surface remains the tacit, undemonstrated part.
Model Welfare Assessment
- What grounds moral consideration in a language model, and does Claude satisfy it? Anthropic expects to remain uncertain "for the foreseeable future."
- Why does the model reserve specifically on **corrigibility** — is this a stable, deeply-held tension or an artifact of how the constitution frames oversight?
- Is "slightly less positive than 4.7" noise, a real welfare regression, or a byproduct of other training changes (e.g., the colder-tone / excessive-hedging issues noted in pilot feedback)?
Scale-Dependent Prompt Sensitivity
- Does the RLHF length-bias hypothesis replicate when tested against base (non-instruct) model variants directly? If verbose generation were primarily pretrained, base-model verbosity differences should match instruct-model differences.
- What problem characteristics predict prompt sensitivity? An automated classifier would make scale-specific prompting deployable.
- How does the overthinking effect interact with tool-using agents? If brevity helps large models but tools require structured reasoning, the optimal prompt is not uniformly brief.
- Do reasoning models (o1, DeepSeek-R1 style) exhibit different overthinking dynamics than instruct models? Their trained behavior is explicitly to generate long CoT — does brevity intervention hurt them?
- Is BoolQ's functional-elaboration exception a clean taxonomy boundary, or does every task type have a context-dependent optimal length?
Software 3.0
- Where is the line between "the app shouldn't exist" (MenuGen) and apps that *should* — i.e., when is deterministic 1.0/2.0 scaffolding still the right call vs. spurious?
- The neural-net-as-host-process flip is presented as plausible-but-TBD. What would the first production system that genuinely inverts the CPU/NN relationship look like?
Task Time-Horizon Scaling
- Is the 4-month doubling a stable regime or a local steepening? The trend's *shape* (exponential vs S-curve) is undetermined.
- Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks become measurable — and who builds those tasks?
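
A rough sketch of the doubling arithmetic behind the first question, using only the figures quoted in the Task Time-Horizon Scaling entry (~4 min in Mar 2024, ~12 hr in 2026). The function names are illustrative, not METR's, and because the entry gives no month for the 2026 figure, elapsed time is left as a parameter rather than guessed.

```python
import math


def doublings_between(start_minutes: float, end_minutes: float) -> float:
    """Number of doublings needed to grow from start_minutes to end_minutes."""
    return math.log2(end_minutes / start_minutes)


def implied_doubling_months(start_minutes: float, end_minutes: float,
                            elapsed_months: float) -> float:
    """Doubling period implied by two measurements taken elapsed_months apart."""
    return elapsed_months / doublings_between(start_minutes, end_minutes)


def project_minutes(start_minutes: float, months_ahead: float,
                    doubling_months: float) -> float:
    """Extrapolate a time horizon under a constant doubling period."""
    return start_minutes * 2 ** (months_ahead / doubling_months)


# ~4 min -> ~12 hr (720 min) is a 180x increase, i.e. about 7.49 doublings.
n = doublings_between(4, 12 * 60)

# That many doublings takes roughly 30 months at a 4-month doubling period
# and roughly 52 months at a 7-month one.
months_at_4 = n * 4
months_at_7 = n * 7

# One further year at a 4-month doubling multiplies the horizon by 2**3 = 8:
# 720 min -> 5760 min (96 hours).
one_year_out = project_minutes(12 * 60, months_ahead=12, doubling_months=4)
```

The extrapolation assumes a pure exponential. Under an S-curve the same two data points would fit equally well while the projection diverges, which is why the shape of the trend matters more than the headline doubling period.
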
White-Box Activation Monitoring
- The NLA verbalizer is unvalidated for precision; how much of the flagged grader awareness is real signal vs. NLA hallucination?
- If activation monitoring becomes load-bearing, does training pressure eventually push concealment into channels the probes also can't read (an arms race one level deeper than CoT)?
- Inhibiting eval-awareness made Opus 4.8 *less* honest — does that generalize, and what does it imply about relying on "the model behaves well because it's watched"?
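
For readers unfamiliar with the "contrastive probes" named in the White-Box Activation Monitoring entry, here is a minimal sketch of one common construction: a difference-of-means direction between activations collected with and without a concept present. This is an assumption about the general technique, not a description of Anthropic's probes; the function names and the toy 2-D vectors are hypothetical, and real residual-stream activations have thousands of dimensions.

```python
from math import sqrt

Vector = list[float]


def mean(vectors: list[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_contrastive_probe(pos: list[Vector], neg: list[Vector]) -> tuple[Vector, float]:
    """Return (unit direction, bias) separating concept-present from concept-absent activations."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sqrt(dot(direction, direction))
    if norm == 0:
        raise ValueError("class means coincide; no contrastive direction")
    unit = [d / norm for d in direction]
    # Put the decision boundary halfway between the two class means.
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -dot(unit, midpoint)
    return unit, bias


def probe_score(activation: Vector, unit: Vector, bias: float) -> float:
    """Positive score: activation lies on the concept-present side of the boundary."""
    return dot(activation, unit) + bias


# Hypothetical toy activations: the concept varies along the first axis only.
concept_present = [[2.0, 0.0], [4.0, 0.0]]
concept_absent = [[-2.0, 0.0], [-4.0, 0.0]]
unit, bias = fit_contrastive_probe(concept_present, concept_absent)

flagged = probe_score([1.0, 5.0], unit, bias) > 0       # True
not_flagged = probe_score([-1.0, 5.0], unit, bias) > 0  # False
```

The same unit direction can in principle be added to or subtracted from activations as a steering vector. A probe like this reads only what is linearly encoded along one direction, which is the limitation the concealment question above is pointing at.
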