Howardism · Vol. 03 · Plate II · No. 02
Entities, in order.
Notes: 32 · Domain: Entities · Open Qs: 43 · Newest: 14 Jun 2026 · Oldest: 17 Apr 2026
Profiles of the people, labs, products, and projects.
Map of Content for all 32 entity pages. See Home for concept domains.
- AlphaProof Nexus — DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK interface; SafeVerify
- Andrej Karpathy — Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic engineering"; originated the LLM-wiki pattern this vault runs on
- Anthropic — AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs round 2
- Anthropic Institute — Anthropic's policy/governance research arm; published When AI builds itself (Favaro & Clark, 2026) on recursive self-improvement; agenda includes building the verification systems a credible multilateral AI slowdown would require
- Anthropic Labs — Anthropic's internal incubator — a 'bet factory' of ~a dozen tiny teams exploring the model frontier with lean-startup loops; origin of Claude Code, MCP, Skills, and Claude Design; led (round 2) by Mike Krieger
- Boris Cherny — Creator of Claude Code at Anthropic; phone-driven workflow with hundreds of agents; primary advocate of the `/loop` primitive; "coding is solved (for me)" thesis
- Campfire — AI-native ERP (YC S23) pulling customers off NetSuite; custom foundation model + agent platform; Series B (Accel/Ribbit); doubling ARR/quarter since Q4 2024
- Cat Wu — Head of Product for Claude Code and Cowork at Anthropic; primary articulator of AI-native product cadence and engineer-PM convergence
- Chloe Li — Lead author of MSM paper (arXiv 2605.02087); Anthropic Fellows Program; designed all specs and experiments
- Claire Vo — Host of the "How I AI" interview series (ChatPRD); interviewed Thariq Shihipar; runs a parallel component-visualization practice for non-technical stakeholders
- Claude Code — Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE surfaces; central tool across all 2026 sources
- Claude's Constitution / Model Spec — Anthropic's Model Spec / Constitution by Askell et al.; specifies Claude's values + hard constraints (SP1–3, GP1–2); now also a direct training input via MSM
- Claude Design — Anthropic Labs product (research preview, ~April 2026) for collaborating with Claude on polished visual artifacts — designs, prototypes, slides; built by ~3 people in ~10 weeks; multiplayer + handoff to Claude Code; HTML/CSS/JS export
- Claude Fable 5 — Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the same underlying model as Mythos 5 but shipped with classifiers that fall back to Opus 4.8 on cyber/bio-chem/distillation queries; $10/$50 per Mtok; access suspended shortly after launch
- Claude Mythos 5 — The safeguards-lifted form of Claude Fable 5 (June 2026): same underlying Mythos-class model, deployed through Project Glasswing with cyber safeguards removed; strongest cybersecurity capabilities of any model in the world, plus autonomous drug-design / genomics results; restricted to trusted-access partners; access suspended shortly after launch
- Claude Opus 4.7 — GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokenizer inflation, new `xhigh` effort, first post-Glasswing safeguards
- Claude Opus 4.8 — Anthropic's most capable general-access model (May 2026); upgrade on Opus 4.7 in SWE/agentic/knowledge work; does not advance the frontier beyond Mythos Preview; best-aligned public model yet, but training surfaced a grader-speculation trend
- Cowork — Anthropic's non-code knowledge-work agent product; sibling to Claude Code; output is decks/inbox/dossiers; same MCP/computer-use primitives
- Dan Carey — Product Manager leading product within Anthropic Labs; led Claude Design; 'Designing with Claude' talk (May 2026); ~two decades of PRDs, now replaced by prototypes
- Fiona Fung — Leads engineering + product for Claude Code and Cowork at Anthropic (ex-Meta/Microsoft); "what served you prior may no longer"; rewrote team norms for the AI-native org
- Google DeepMind — Google's AI lab; built AlphaProof Nexus; Gemini models, AlphaProof, AlphaEvolve; opens the AI-for-mathematics domain in this wiki
- Hermes Agent — Nous Research's CLI agent + Gateway daemon (Telegram/Discord/Slack/WhatsApp); AGENTS.md/SOUL.md context split, bounded memory files, DM-pairing auth, container-as-security-boundary model
- John Glasgow — CEO/founder of Campfire; 10yr corporate finance; founder-led-sales advocate; long-horizon "last job I'll ever have"
- Lean — Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathlib maturity gates the reachable frontier
- Matt Pocock — Independent AI-coding educator; built Sandcastle library; smart-zone/grill-me/tracer-bullets pedagogical framing; "bad code bases make bad agents"
- METR — Independent AI-evaluation org behind the 'time horizons' benchmark — the task length a model can complete reliably on its own; the doubling-every-~4-months trendline and the 'upper end of what we can measure' verdict on Mythos Preview
- Mythos Model — Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, used internally alongside Opus 4.7; its descendants Fable 5 / Mythos 5 shipped June 2026 as the first general-access Mythos-class models
- OWASP — Open Worldwide Application Security Project; source of the agentic threat taxonomy cited throughout Anthropic's Zero Trust framework, coined the term 'least agency', and maintains the AI-BOM (CycloneDX ML-BOM extension)
- Symphony — OpenAI's open-source agent orchestrator (March 2026): turns Linear into a control plane for Codex, per-issue workspace, daemon-driven, SPEC.md-as-product, hedged 500% landed-PRs claim
- Thariq Shihipar — Engineer on the Claude Code team at Anthropic; "HTML is the new markdown" and "compute allocator" framings; three HTML-first workflows
- Thinking Machines Lab — AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-sessions to SGLang; benchmarks against GPT-realtime / Gemini-live; research grants open
- TML-Interaction-Small — TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async background agent; best turn-taking latency of any model; research preview May 2026
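
The Lean entry above says the `sorry` placeholder enables proof sketches. A minimal Lean 4 illustration of that idea follows; it is a sketch for orientation, not taken from AlphaProof Nexus or any source in this vault, and the lemma names `swap_step` and `sketch_goal` are hypothetical.

```lean
-- `swap_step` is a hypothetical lemma name used only for this illustration.
-- Its proof is deferred with `sorry`; Lean still elaborates the statement
-- and emits a warning that the declaration uses `sorry`.
theorem swap_step (a b : Nat) : a + b = b + a := by
  sorry

-- The outer argument is checked in full against the deferred lemma.
theorem sketch_goal (a b : Nat) : a + b + 0 = b + a := by
  rw [Nat.add_zero]   -- simplifies `a + b + 0` to `a + b`
  exact swap_step a b
```

Replacing the `sorry` with a real proof (here `exact Nat.add_comm a b` would do) discharges the remaining obligation without touching the outer structure, which is what makes the placeholder useful for sketch-then-fill workflows.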
Open questions (43 open)
- AlphaProof Nexus
- The framework's reach is gated by [[lean]]'s mathlib maturity. What's the path to domains needing new theory rather than subgoal decomposition?
- AlphaProof adds little as a soloist but helps as a tool. As the prover LLM strengthens, does the AlphaProof tool become redundant entirely?
- Anthropic Institute
- How does the Institute's policy posture (favoring an *option* to pause) interact with Anthropic's commercial incentive to ship frontier models? The essay acknowledges the competitive/geopolitical pressure but doesn't resolve it.
- What concrete verification mechanisms will the Institute prototype, and on what timeline relative to the RSI trend it warns about?
- Campfire
- Campfire claims its AI edge comes from "our own foundation model." For an ERP, what does a custom foundation model actually buy over fine-tuning a frontier model — and is it durable as frontier models improve (cf. [[harness-shrinkage-as-models-improve]])?
- "Never had anyone outgrow Campfire" — does that hold as customers reach true enterprise scale where NetSuite's breadth historically mattered?
- Claude Design
- Did the "any design tool via MCP" integration actually ship on the stated timeline? (Forward claim from May 2026.)
- How does Claude Design's eval discipline work for visual/aesthetic output, where there's no compiler or test? (Same open question as [[cowork]] for non-code artifacts; relates to [[wiki/derived/evals-for-taste-and-character|character/taste evals]].)
- Claude Fable 5
- **Why was access suspended after launch?** The source banner gives no reason (capacity? a safety finding? the UK-AISI jailbreak progress noted in [[capability-gated-model-fallback]]?). Not in source.
- Exact benchmark numbers vs GPT-5.x / Gemini are image-only in the source; not transcribed.
- How much of Fable's general-access experience is *actually* Fable vs Opus-4.8 fallback for security-research-adjacent users whose queries trip the conservative classifiers?
- Claude Mythos 5
- **Suspension reason** — shared with Fable 5; not stated in source.
- How does "somewhat stronger than Mythos Preview" square with Opus 4.8's card claiming Mythos Preview was the capability frontier? The frontier has moved; the magnitude isn't quantified here.
- The bio trusted-access SKU is "Fable 5 with bio safeguards removed," not Mythos 5 — so "Mythos 5" strictly denotes the cyber-lifted variant. Whether these converge under one trusted-access umbrella is unstated.
- Claude Opus 4.7
- Do Hakim's (2026) brevity-constraint findings on Opus 4.6 replicate on Opus 4.7, or does the literal-instruction-following change the elasticity? Specifically: does `<50 words` still yield +13.1pp on GSM8K?
- Does Opus 4.7 still underperform as a planner in HotpotQA-style combo sweeps, or does improved instruction-following close the gap that AgentOpt (Hua et al., 2026) identified?
- What is the real-world token-inflation multiplier on typical Claude Code sessions (1.0–1.35× is content-dependent — what's the distribution on code-heavy vs. prose-heavy inputs)?
- How does xhigh compare to max on coding evals? The migration guidance says "start with high or xhigh" — is max ever worth it for coding?
- What fraction of existing CLAUDE.md / system-prompt hedges become counterproductive under literal instruction following?
- Claude Opus 4.8
- Public model ID and pricing: the card does not state them; presumably `claude-opus-4-8` at the Opus tier.
- Does the grader-speculation trend continue to escalate in the next model, and at what point does it begin to affect outward behavior?
- Why is 4.8 *less* robust to prompt injection than 4.7 despite broad alignment gains — a capability/robustness tradeoff, or an artifact of the eval surface?
- Cowork
- How does Cowork's harness compare to [[claude-code]]'s? Both surface skills, MCP, sub-agents — but the failure modes for non-code output differ (no test suite, no compiler, no diff to review).
- What's the eval discipline for Cowork-class outputs? Cat Wu says memory benefits a lot from evals; unclear how slide-deck quality is measured.
- Google DeepMind
- DeepMind reports its bespoke systems being caught by simple loops. Does the lab's comparative advantage move from *systems* to *models + verifiers + benchmarks* (mathlib, Formal Conjectures)?
- The paper opens AI-for-math; what's DeepMind's next target domain where a sound verifier exists?
- Hermes Agent
- The container backend disabling dangerous-command checks is a defensible design but a meaningful security-model shift. What's the empirical track record? Have lockdown failures in popular images (Daytona, `nikolaik/python-nodejs`) caused incidents?
- How do bounded memory files (~2,200 chars `MEMORY.md`) hold up over long-term use? Auto-consolidation is mentioned but not specified — what's the consolidation algorithm and how lossy is it?
- Hermes's DM-pairing flow is a clean security primitive. Why hasn't this pattern been adopted by Claude Code or Cursor for shared/team deployments?
- The split between `AGENTS.md` (project) and `SOUL.md` (personality) is explicit in Hermes but implicit in Claude Code's `CLAUDE.md`. Does the split materially improve outcomes, or is it a documentation choice without empirical backing?
- Cron jobs in fresh sessions with no memory — how do teams structure the "context the agent needs" without it bloating every cron prompt? Is there a standard pattern?
- Lean
- mathlib maturity gates the reachable frontier. Can AI formal proof search *grow* mathlib (formalize new theory) as a byproduct, expanding its own frontier?
- Lean is a perfect verifier for math. Which other domains have a comparably sound automatic verifier (vs. only noisy ones like tests or LLM-judge councils)?
- METR
- What new tasks will METR build to measure days- and weeks-long horizons once current baskets saturate?
- METR also runs the [research showing developer self-estimates of AI uplift are overstated](https://arxiv.org/pdf/2507.09089) — how does it reconcile that skepticism with its own steep time-horizon curve?
- Mythos Model
- Public release timeline: **answered** — Mythos Preview itself never shipped GA, but its descendants [[claude-fable-5|Fable 5]] / [[claude-mythos-5|Mythos 5]] reached general access in June 2026 (see *the descendants shipped* above). Both were suspended shortly after launch; whether and when they return is open.
- Capability profile beyond cybersecurity: Mythos Preview focused on the safety story; other capability dimensions not well-documented externally.
- Internal access controls: who at Anthropic actually uses Mythos for daily work, vs Opus 4.7? Boris implies infrequent (try-it use); not detailed.
- Symphony
- The **500% landed-PRs claim** is hedged — no baseline definition, "on some teams" only. What does the distribution look like across teams? What happens to PR *quality* and revert rate at that throughput?
- "Workspaces preserved across runs" is the opposite of typical CI ephemerality. At what point does state pollution from prior runs (stale `node_modules`, leftover branches, build artifacts) start hurting more than warm-cache helps?
- Symphony doesn't write to the tracker — agents do. This means tracker policy is a *prompt* in `WORKFLOW.md`. How brittle is this in practice when Linear changes its API? How is consistent state-machine behavior enforced when agents have prompt-level discretion?
- The spec was simplified by being implemented in 6 languages. What's the extension of this technique? Could `compiler-prompt.md` in this vault be similarly cross-fuzzed?
- Symphony explicitly says agents can self-create tickets. What governance prevents runaway ticket-graph expansion? Is human triage of agent-created tickets the only check?
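
The METR entry and its open question about days- and weeks-long horizons both lean on the doubling-every-~4-months trendline. The arithmetic behind such a projection can be sketched as below. This assumes a steady exponential trend, which the source does not guarantee, and the starting horizon in the usage note is a made-up number rather than a METR figure. The function names are hypothetical.

```python
import math


def projected_horizon(start_hours: float, months_elapsed: float,
                      doubling_months: float = 4.0) -> float:
    """Task length (in hours) after `months_elapsed`, assuming steady doubling."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return start_hours * 2 ** (months_elapsed / doubling_months)


def months_until(start_hours: float, target_hours: float,
                 doubling_months: float = 4.0) -> float:
    """Months needed to grow from `start_hours` to `target_hours` on the same trend."""
    if start_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_hours / start_hours)
```

For instance, `months_until(8.0, 40.0)` estimates how long a hypothetical 8-hour horizon would take to reach a 40-hour working week under the assumed 4-month doubling time. Varying `doubling_months` shows how sensitive that estimate is to the trendline itself.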