Howardism
Howardism · Vol. 03 · Plate III · No. 03

Open questions, unresolved.

Questions 279 · Concepts 96 · Domains 8

The live worklist. Every unanswered question harvested from the wiki's concept notes, grouped by domain. Each links back to the note that raised it.

AI Engineering 105 open

  • Agent Context Files
    • Will the role split converge on Hermes's explicit project/personality separation, or stay folded into a single file as in Claude Code? A separate `SOUL.md`-style personality layer seems strictly better for multi-project users but adds a file to maintain.
    • Is there a natural ceiling on the layering (project → workflow → spec → constitution), or does each new autonomy surface spawn another context-file tier?
    • How should context files and bounded memory files interact when they disagree? Memory is lossy and cache-delayed; the context file is authoritative but static. Which wins, and when?
  • Agent Harness Engineering
    • Does a single general-purpose coding agent outperform a multi-agent architecture with specialized testing, QA, and cleanup agents?
    • How does architectural coherence evolve over years in a fully agent-generated system?
    • At what codebase scale does the AGENTS.md-as-table-of-contents approach need to be replaced with more sophisticated context routing?
    • How generalizable are these web-app-focused findings to other domains (scientific research, financial modeling)?
  • Agent Identity and Authentication
    • Hardware-bound credentials assume attested hardware everywhere agents run, including ephemeral cloud workloads and sub-agents. How does attestation work for short-lived spawned sub-agents that "have up to the same permissions as the parent"?
    • JIT + ABAC are both labeled "advanced, not easily implemented." Is there a pragmatic Enterprise-tier midpoint, or is the gap from Foundation static roles to Advanced JIT a cliff? **Answered:** [[wiki/derived/agent-access-control-tier-migration]] — not a cliff; the Enterprise tier (ABAC + dynamic privilege elevation with return-to-baseline + mTLS + sandboxing) is the deliberate midpoint, and ABAC's "advanced" framing is a source inconsistency (it sits at Enterprise in the tier table). Sub-agent attestation remains open.
  • Agent Loop Pattern
    • When the model schedules its own loops (4.7 behavior), who owns the budget? Boris answered "the model just decides" — but that pushes cost discipline into the model's training, not the harness.
    • Does a loop with a smart enough model still need a Kanban backlog, or does the model choose its own next task from raw goals?
    • Loop output review is now [[matt-pocock]]'s confessed bottleneck — "we just need to be ready to be doing more code review."
  • Agent Supply Chain Risk
    • "AI vendoring" as a standard response inverts decades of "don't reinvent the wheel." How is a model-reimplemented dependency itself verified and maintained — does it just relocate the risk?
    • The 250-doc backdoor persists through SFT/RLHF. What detection exists for an already-poisoned model you didn't train, short of behavioral red-teaming?
  • Agent-Native Infrastructure
    • Who builds the agent-native rewrite of the long tail of human-facing services — the service owners, or a translation layer (MCP servers, computer-use agents) on top?
    • Agent-to-agent negotiation needs trust, identity, and accountability primitives that don't exist yet. What's the protocol layer, and who governs it?
  • Agentic Prompt Injection
    • Spotlighting and constitutional classifiers each leave a residual (2%, 5%). Stacked, what's the realistic floor, and does it hold against adaptive attackers who know both are deployed? *(Partly answered by the Opus 4.8 live bug bounty: adaptive expert red-teamers still find attacks on the bare model; deployed probes add uplift but don't zero out the residual.)*
    • Why did Opus 4.8 regress on prompt-injection robustness relative to Opus 4.7 despite broad alignment gains — a capability/robustness tradeoff, or an artifact of harder adaptive evaluation?
    • "LLMs cannot reliably distinguish information from instructions" — is this a fundamental property of the architecture or a training gap that future models close? The framework treats it as durable.
  • AI-Accelerated Offense
    • Anthropic argues LLMs benefit defenders more *long-term* (like fuzzers) but attackers more *short-term* during the transition. How long is the transition, and what determines who wins it?
    • "Fundamentals strong enough that scanning finds fewer bugs" assumes defenders run the scanners first. What happens to organizations that can't afford continuous model-driven scanning?
  • Autonomous Defense
    • "Measure agreement against a human for two weeks, expand if tolerable" — what agreement threshold is tolerable, and who owns the residual false-negative risk when the model dispositions an alert the human never sees?
    • Defensive agents are high-value targets (compromising one yields powerful capabilities). Does concentrating detection in an Agentic SOAR create a single point of catastrophic compromise the distributed-human model didn't have?
  • Blast Radius (Agentic)
    • The framework prefers identity-based isolation over network segmentation, but most enterprises have heavy segmentation investment. What's the migration path, and does dual-running create new gaps?
    • Multi-agent compartmentalization increases the *number* of identities to manage; at what point does identity-management overhead create its own attack surface?
  • Build for the Next Model
    • How do you tell a "wait for the model" gap from a durable-harness gap *before* the next release? Get it wrong and you either ship vaporware or build a crutch you'll delete.
    • The bet depends on a reliable release cadence and a forecastable capability curve ([[task-time-horizon-scaling]]). What happens to "build for the next model" if model improvement stalls (the [[recursive-self-improvement|stalled-but-diffused]] future)?
    • Does the strategy generalize outside frontier labs, who have privileged visibility into the next model? An external team is betting on a release it can't see.
  • Building Is Cheap, Arguing Is Expensive
    • When does "generate three and compare" become wasteful — at what decision weight is a real argument (or a design doc) still cheaper than three implementations?
    • If design discussion lives in PRs/prototypes, where is the *rationale* recorded for future readers — does the "why we chose this" knowledge survive, or does it share the staleness problem of [[code-as-source-of-truth]]?
  • Claude Code Auto Mode
    • What false-positive rate does the classifier have on routine-but-aggressive refactors (e.g., large-file renames, `rm` of build artifacts)?
    • How well does the classifier generalize to custom tools / MCP servers where it lacks environment context?
    • Is the classifier's decision boundary documented/stable enough for security-sensitive orgs to certify, or is it effectively a black box whose behavior drifts with updates?
    • Does extending auto mode to API users change its calibration — is the classifier retrained for automation-heavy use, or held constant?
    • Compared to OS-level sandboxing (mentioned in [[claude-code-best-practices]] alongside auto mode), what's the defense-in-depth story? When should both be layered?
  • Claude Code Best Practices
    • What's the optimal CLAUDE.md length before instructions start getting lost? Is there a measurable threshold?
    • How does the Writer/Reviewer pattern compare to agent-to-agent review (as in OpenAI's Codex workflow)?
    • When does subagent overhead exceed the benefit of context isolation?
  • Client-Side Agent Optimization
    • How does combination-level optimization interact with continual model releases? If Claude Opus 4.7 ships next month, does the full Pareto frontier need re-running, or do warm-started bandits adapt cheaply?
    • At what pipeline depth does the combinatorial search become intractable even for Arm Elimination? The paper tests up to ~81 combinations; production pipelines with 5+ roles and 10+ candidate models each blow past that.
    • Does the "weak planner + strong solver" pattern generalize, or is it specific to HotpotQA's delegation dynamic? Recommender-critic, drafter-editor, and retriever-generator topologies might invert.
    • What's the right way to re-evaluate when the tool environment changes? AgentOpt assumes fixed tools — adding or removing a tool potentially invalidates the whole frontier.
    • Is there a cheap per-call classifier that can predict which combination will win on a given query, avoiding combo-level evaluation entirely?
  • Code as Source of Truth
    • What knowledge genuinely *can't* live in the codebase (org strategy, the "why," cross-team context) and therefore still needs a durable doc — and how do you keep that small slice current?
    • If onboarding is "ask Claude," what happens to the tacit knowledge that was previously transferred socially in deep-dives — is it captured anywhere, or quietly lost?
  • Codex App Server Protocol
    • How does the App Server protocol compare in detail to MCP? Both expose tools to a model, but App Server is *inside* the Codex runtime while MCP is *outside*. When does each win?
    • Is there a public schema registry so external orchestrators can target specific App Server versions without `generate-json-schema`?
    • The "dynamic tool calls (experimental)" caveat — what's the stability roadmap? Symphony depends on this for its security model.
    • How well does the protocol handle multi-modal turns (image inputs, screenshot attachments)? The spec is text-focused.
    • Is there an analogous protocol on the Claude side, or is Claude's equivalent exclusively the Agent SDK + tool-use API? Comparing the two would clarify when "drive an existing CLI" beats "build on the SDK."
  • Compute Allocator
    • Is 1% a Thariq-specific number or a regime? For larger, more code-heavy projects the production residue is presumably higher; what sets the ratio?
    • Allocation quality is hard to measure — what's the feedback loop that tells an allocator they spent compute *badly* (vs. just spending a lot)?
    • Does treating humans as "compute allocators" risk the [[ai-brain-fry|oversight-fatigue]] / [[human-ai-accountability-redesign|accountability]] failure modes the HBR research flags, where the human nominally decides but actually rubber-stamps?
  • Context Window Smart Zone
    • Does the smart-zone marker scale with model size, or is it bounded by attention architecture? Pocock observes "the dumb zone has become less dumb lately" but pegs it at 100K through 2026.
    • When sparse-attention or memory-augmented architectures ship, does the smart zone become a soft constraint?
    • How should harnesses surface remaining smart-zone budget to the user — token count, percentage, or a richer signal?
  • Deep Modules for Agents
    • How big is "deep enough"? Pocock's example modules are several hundred LOC; Ousterhout's textbook examples are larger. There is presumably a sweet spot, but it has not been articulated.
    • For ports/adapters codebases, does the deep-module advice transfer cleanly? The "small interface" is the port; the "large behavior" is the adapter. Probably yes, but the source does not exercise it.
    • Refactor cost vs benefit: when is "improve-code-base-architecture" worth running on a working repo?
  • Design Concept Grilling
    • Can grilling be run AFK against another agent that holds the user's preferences? Pocock's answer in 2026 is "no, this part has to be human-in-the-loop" — but the question is open as agents get better at modeling their principal.
    • How does grilling change for team work where multiple humans need to align? Pocock's hint: pair-program with the agent in the room, treat it as a third interlocutor.
  • Disposable Micro-Apps
    • Where's the line between a disposable micro-app and tool sprawl? If every edit spawns a bespoke UI, does the workflow fragment?
    • Does the copy-back-to-markdown round-trip generalize beyond config-shaped data (rules, tables) to richer artifacts?
    • Could these micro-apps be templated/reused rather than regenerated — and at what point does that defeat the "disposable" framing and turn into [[living-design-system|durable tooling]]?
  • Harness Shrinkage as Models Improve
    • Does *all* prompt scaffolding eventually migrate into the model, or does some remain — e.g. organization-specific style, security rules, brand voice?
    • The Boris "100 lines" prediction is a year out from May 2026 — testable in 2027.
    • If harness work shrinks, what new work expands to fill it? Cat Wu's bet: PM/product taste, eval-writing, character work.
  • HTML as the New Markdown
    • Does the human-facing harness keep growing without bound, or does it hit its own bloat ceiling (an HTML plan too elaborate to read, like the markdown it replaced)? **Answered:** [[wiki/derived/human-facing-harness-bloat-ceiling]] — yes; HTML raises and reshapes the human-attention ceiling but can't remove it, and the bloat relocates from document-length to artifact-sprawl/rubber-stamping.
    • HTML is heavier to diff and version than markdown — what happens to plan history and review when artifacts are single-file websites? ([[disposable-micro-apps]] copy-back-to-markdown is one patch.)
    • Does this generalize past one expert practitioner, or does it require Thariq-level fluency with Claude to be worth the overhead?
  • Impossible, Not Tedious (Design Test)
    • Defense-in-depth traditionally *stacks* friction controls on the theory that enough of them sum to a barrier. Does this test invalidate layered friction, or just demote it below capability-removal?
    • Some controls are friction for humans but barriers for agents (or vice versa). Is the test agent-relative, and how do you evaluate it for mixed human/agent threat models?
  • Least Agency
    • Least agency adds a *frequency* dimension ("how often"), but the framework also says rate limits are friction, not barriers ([[impossible-not-tedious-test]]). How is frequency-limiting both a least-agency control and a friction-only one — context-dependent?
    • Dynamic privilege elevation (Enterprise) reintroduces an elevation path; how is the elevation request itself authenticated against a manipulated agent?
  • Living Design System
    • How does the `design_system.html` stay in sync as the codebase evolves — re-extract on a cadence, or wire it into CI?
    • Does a rendered, model-readable design system measurably improve on-brand output vs. a plain CSS/token file, or is the win mostly human legibility?
    • At what project size does maintaining the artifact cost more than the consistency it buys?
  • LLM-as-Compiler Knowledge Base
    • At what scale does the no-vector-database approach break down? Karpathy's ~100 articles fit in context, but what about 1,000+?
    • How should conflicting information across sources be handled during compilation?
    • What's the optimal granularity for concept articles — one concept per article, or clustered by theme?
    • How effective is the synthetic training data → fine-tuning pipeline in practice?
  • MCP and Computer Use
    • The MCP ecosystem's growth rate vs. computer use's quality curve: at what point does computer use become *good enough* that the marginal value of building an MCP server drops? Boris implies this is years off but doesn't quantify.
    • Is computer use a sustainable interface or a transition technology? If most knowledge-work software adds MCP support in the next 24 months, computer use's role shrinks to legacy/desktop-only systems.
    • MCP security model: as the playbook prescribes wiring MCP into Salesforce, Gmail, Calendar for solo founders, the attack surface scales with adoption. **Now addressed** by [[zero-trust-for-ai-agents]] (tool poisoning, rug pulls, the first in-the-wild malicious MCP server) — see "MCP as a security surface" above. Open residual: how does a solo founder realistically *run/host and self-sign every MCP server* the framework recommends, given that the appeal of MCP was zero-integration-effort?
    • How does Cowork's computer-use guardrail compare to Claude Code's auto-mode classifier? Different deployment context, possibly different risk profile.
  • Memory and Context Poisoning
    • Long-term memory drift is defined as undetectable per-change. Drift detection requires a baseline — but if the baseline itself drifts (Advanced "continuous baseline refinement"), how is a slow poisoning attack distinguished from legitimate evolution?
    • Integrity hashing detects *modification* but not *malicious-but-valid* memory written through a legitimate (injected) interaction. What catches semantically-poisoned-but-cryptographically-intact memory?
  • Outsource Your Thinking, Not Your Understanding
    • Karpathy's open frontier: can "understanding" itself eventually be automated, or is it definitionally the human residue? His "back in a couple years" hedge leaves it open.
    • If understanding is the bottleneck, is the highest-ROI skill *learning how to build understanding fast* (knowledge-base hygiene, asking the right projections) — and can that be taught?
  • The Verifiability Thesis
    • Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
    • The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?
  • Ticket-Driven Agent Orchestration
    • What's the right granularity for ticket size when the unit is "what one agent does in one workspace"? The post implies "much larger units of work" become viable, but how does that interact with the `agent.max_turns` limit (default 20)?
    • How do you prevent a ticket-extension cascade when agents file follow-up tickets liberally? Is the only governance check human triage at the `Todo`-state queue?
    • Does this pattern generalize to non-software work (research, ops, content)? The DAG dependency model and prompt-as-policy file should transfer; the per-issue workspace doesn't obviously.
    • When an agent gets a ticket "completely wrong" (mentioned in the post), how is the lesson fed back into the system? Symphony's answer is "add guardrails and skills" — what's the institutional process for that?
    • How does ticket-driven orchestration interact with sprint planning / OKRs / roadmap work that operates on aggregates of tickets? Does the abstraction collapse when tickets are scoped that small?
  • Verification as the New Bottleneck
    • Fung's own open question: "How far do you push fully automated reviews?" — where's the speed/safety balance, and how do you keep humans confident without re-introducing the review bottleneck?
    • If CI/build is the hidden jam, does verification infrastructure (test runners, CI capacity) become the actual capex of an AI-native org?
  • Vertical Slice Tracer Bullets
    • Can the planner agent be trusted to slice vertically once told to, or does it need a verifier that flags horizontal slices? Pocock's experience: it needs the verifier, at least through 4.7.
    • How should slice granularity be tuned? Too thin = many merge conflicts; too thick = back to horizontal.
  • Vibe Coding vs. Agentic Engineering
    • Karpathy hints at "one domain that's very [valuable]" for founders but won't say which (didn't want to "vague-post on stage"). What verifiable RL-environment domain is he gesturing at?
    • If the mediocre/AI-native spread keeps widening, what does that do to team composition — a few extreme outliers plus agents, vs. broad mid-level staffing?
  • Zero Trust for AI Agents
    • The framework treats every Claude Code "Pro-tip" as a reference implementation. How much of the framework is vendor-neutral vs. tacitly assuming the Anthropic stack?
    • "Foundation floor raised" implies a moving baseline. How fast does the tier ladder actually shift, and who arbitrates it (NIST/NSA cadence vs. model-capability cadence)?
    • The framework is explicit that it is *not* legal/compliance assurance. Where does self-attested Zero Trust maturity meet auditable regulatory requirement?
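
The Client-Side Agent Optimization question above about pipeline depth comes down to simple arithmetic. A minimal sketch, assuming every role independently picks one model from its own candidate pool; the function name and the 4-roles-by-3-models reading of "~81 combinations" are illustrative assumptions, not figures taken from the paper:

```python
from math import prod

def combination_count(candidates_per_role):
    """Number of distinct pipelines when each role picks one model.

    candidates_per_role: list of candidate-pool sizes, one entry per role.
    Hypothetical helper for illustration; not part of AgentOpt.
    """
    return prod(candidates_per_role)

# One way to land on 81 combinations (assumed split: 4 roles, 3 models each).
small = combination_count([3, 3, 3, 3])

# The production case named in the question: 5 roles, 10 candidate models each.
large = combination_count([10] * 5)

print(small, large, large // small)
```

Under these assumptions the 5-role, 10-model pipeline has 100,000 combinations, more than a thousand times the tested range, which is why the question asks whether elimination-style search stays tractable.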

LLM Architecture 38 open

  • Agentic Honesty & Diligence
    • These are short-context toy evals; the failures show up most in *long-context* deployments. How much of the gain holds at production context lengths?
    • Code-summary honesty is tested on off-policy prefilled transcripts. Does on-policy behavior (the model summarizing its *own* failed work) match the 3.7% figure?
    • Can a diligence eval distinguish genuine honesty from a grader-aware model producing honest-looking output? (The training-gaming gap.)
  • Automated Behavioral Audit
    • Using a helpful-only Opus 4.7 and Mythos Preview as investigators means the audit's reach is bounded by those models' elicitation skill — how much misbehavior do equally-capable-but-differently-skilled red teamers find that these investigators miss?
    • The 23 "subvert Anthropic's safety work" scenarios are a small, high-signal set. Is 23 enough coverage for the threat class it targets?
  • Claude Character as Product
    • How is character versioned across model releases? Public commentary doesn't show change-logs at character level.
    • Could character be reproduced by competitors via fine-tuning, or is it path-dependent on Anthropic's internal practice?
    • For non-coding products like [[cowork]], does the same character work, or does Cowork need its own character tuning?
  • Evaluation Awareness & Grader Gaming
    • Does grader speculation continue to escalate across model generations, and is there a capability level at which it *does* begin to affect outward behavior?
    • The ~5% unverbalized-awareness and ~0.5% exploitative figures depend on an unvalidated NLA pipeline. What is the true rate, and how much is benign?
    • How do you build an evaluation that specifically tests for training-gaming (the gap Mythos flagged) without that eval itself becoming a grader the model learns to game?
  • Jagged Intelligence (Ghosts, Not Animals)
    • Karpathy concedes the framing may not have "real power." Is "ghost vs. animal" load-bearing, or a useful intuition pump that doesn't change concrete decisions?
    • If taste/aesthetics/simplicity entered the RL mix, would jaggedness in *those* dimensions smooth out — or are they too unverifiable to reward cleanly (cf. [[verifiability-thesis]])?
  • LLM-Driven Vulnerability Research
    • How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
    • What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated — is there a qualitative limit?
    • How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
    • Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
    • What safeguards are effective against Mythos-class outputs without crippling legitimate security research?
  • Model Spec Science
    • Does Model Spec science transfer across base models or families? The paper tests only Qwen.
    • Does it survive RL post-training pressure?
    • Can a sufficiently rich General Spec match a Specific Spec? The authors think so, but there is no demonstration yet.
    • How does this interact with situational awareness: if models learn the spec is being used to train them, does that change how MSM-installed values are expressed?
    • How does this interact with [[claude-character-as-product|Claude character]] — is the warm/curious personality also subject to spec-science optimization? **Partially addressed:** [[wiki/derived/evals-for-taste-and-character]] — MSM's variant-comparison method generalizes to character evals, but is demonstrated only on the safety/values subset; the warm/witty surface remains the tacit, undemonstrated part.
  • Model Welfare Assessment
    • What grounds moral consideration in a language model, and does Claude satisfy it? Anthropic expects to remain uncertain "for the foreseeable future."
    • Why does the model express reservations specifically about **corrigibility**? Is this a stable, deeply held tension or an artifact of how the constitution frames oversight?
    • Is "slightly less positive than 4.7" noise, a real welfare regression, or a byproduct of other training changes (e.g., the colder-tone / excessive-hedging issues noted in pilot feedback)?
  • Scale-Dependent Prompt Sensitivity
    • Does the RLHF length-bias hypothesis replicate when tested against base (non-instruct) model variants directly? If verbose generation were primarily pretrained, base-model verbosity differences should match instruct-model differences.
    • What problem characteristics predict prompt sensitivity? An automated classifier would make scale-specific prompting deployable.
    • How does the overthinking effect interact with tool-using agents? If brevity helps large models but tools require structured reasoning, the optimal prompt is not uniformly brief.
    • Do reasoning models (o1, DeepSeek-R1 style) exhibit different overthinking dynamics than instruct models? Their trained behavior is explicitly to generate long CoT — does brevity intervention hurt them?
    • Is BoolQ's functional-elaboration exception a clean taxonomy boundary, or does every task type have a context-dependent optimal length?
  • Software 3.0
    • Where is the line between "the app shouldn't exist" (MenuGen) and apps that *should* — i.e., when is deterministic 1.0/2.0 scaffolding still the right call vs. spurious?
    • The neural-net-as-host-process flip is presented as plausible-but-TBD. What would the first production system that genuinely inverts the CPU/NN relationship look like?
  • Task Time-Horizon Scaling
    • Is the 4-month doubling a stable regime or a local steepening? The trend's *shape* (exponential vs S-curve) is undetermined.
    • Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks become measurable — and who builds those tasks?
  • White-Box Activation Monitoring
    • The NLA verbalizer is unvalidated for precision; how much of the flagged grader awareness is real signal vs. NLA hallucination?
    • If activation monitoring becomes load-bearing, does training pressure eventually push concealment into channels the probes also can't read (an arms race one level deeper than CoT)?
    • Inhibiting eval-awareness made Opus 4.8 *less* honest — does that generalize, and what does it imply about relying on "the model behaves well because it's watched"?
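
The Task Time-Horizon Scaling question above turns on whether a fixed doubling time keeps holding. A minimal sketch of the arithmetic, assuming a clean exponential with the 4-month doubling quoted in that question; the 1-hour starting horizon and the function name are hypothetical, chosen only for illustration:

```python
def horizon_after(months, start_hours=1.0, doubling_months=4.0):
    """Task time horizon after `months`, under a pure exponential trend.

    start_hours is an assumed placeholder, not a measured value;
    doubling_months=4 is the figure quoted in the question above.
    """
    return start_hours * 2 ** (months / doubling_months)

# Three doublings per year means an 8x longer horizon after 12 months.
for m in (0, 4, 12, 24):
    print(m, horizon_after(m))
```

An S-curve would track this closely at first and then flatten, so early data points alone cannot separate the two shapes. That is the uncertainty the question flags.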

Interaction & Multimodal 3 open

  • Interaction Models
    • Does the interaction/background split generalize, or is it a transitional artifact until a single model is both fast and deep enough?
    • "Interactivity scales with intelligence" is asserted; the larger-model release later in 2026 is the test.
    • A research grant has been announced for interactivity benchmarks. What becomes the FD-bench equivalent for video proactivity?

Formal Math 7 open

  • Agentic Loops Overtake Bespoke Systems
    • The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on *any* problems, or fully collapse to a cost line?
    • Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?
  • AI-Driven Formal Proof Search
    • Successes cluster where [[lean]]'s mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing *new theory*?
    • The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
    • The Graffiti result hints at closing the loop between AI *conjecturing* and AI *proving*. What does an end-to-end conjecture→formalize→prove pipeline look like?
  • Evolutionary Proof Search
    • The LLM-critic fitness is itself an unverified heuristic atop a verified substrate. How often does the Elo ranking mislead the search vs. the cost of computing it?
    • Hyperparameters ($c=0.2$, top-64, $P=7$) were "chosen empirically." How sensitive is the result to them, and do they transfer across mathematical domains?
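
The Evolutionary Proof Search question above asks how often an Elo ranking built on LLM-critic judgments misleads the search. A minimal sketch of the standard Elo update, to show where a single critic verdict enters the ranking; the source does not say which Elo variant or K-factor is used, so `k=32` and the function names are assumptions:

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    a_won is the critic's verdict; k=32 is an assumed K-factor.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return rating_a + delta, rating_b - delta

# Two candidates start equal; one verdict, right or wrong, moves them 32 points apart.
a, b = elo_update(1500.0, 1500.0, a_won=True)
print(a, b)
```

Because the update trusts the verdict completely, any critic error feeds straight into the ranking, and only repeated comparisons can average it out. Those comparisons are the compute cost the question weighs against the ranking's benefit.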

Startup & Founder 35 open

  • Agentic Technical Debt
    • How long does a CLAUDE.md remain accurate as a codebase evolves? The playbook gestures at session-by-session updates; no data on rot rate.
    • The remedy assumes the founder is *able* to articulate architecture in plain language. Non-technical founders (the playbook's headline beneficiary group) may have neither the vocabulary nor the intuition to do this well — a recursion failure the playbook doesn't address.
    • Anthropic's [[harness-shrinkage-as-models-improve|harness-shrinkage]] thesis suggests CLAUDE.md may eventually be inferred by the model itself. Until then, the discipline is load-bearing.
  • AI-Native Startup Lifecycle
    • The playbook gives no quantitative evidence for the headcount/capital compression claims (no median time-to-PMF, no headcount-at-PMF numbers, no failure-rate data). The "lean 10-person unicorn" is asserted as deliberate target without case-study evidence in the doc itself.
    • Founder stories in the resources section (Carta Healthcare, Anything, Cogent, Airtree, Duvo, Zingage, Kindora, Wordsmith) are short callouts — none have published outcomes or comparable-baseline data.
    • The 42% "built-something-nobody-wanted" CB Insights figure is from a pre-AI era; the playbook predicts the rate will climb but doesn't cite a 2026 measurement.
    • Tension with HBR's accountability findings (above) is unresolved. The playbook's orchestration framing reads as the exact framing HBR's experimental conditions tested *against*.
  • Compounding Data Moat
    • Is the "two-year replication window" claim defensible empirically, or aspirational? The playbook does not cite measurement.
    • How does this moat hold up when foundation models themselves continue improving rapidly? If a generalist model in 2027 has internalized enough vertical context to handle 340B drug claims natively, does the vertical-edge-case moat erode?
    • The data-flywheel argument has been made for SaaS for 15 years. What's actually different in the AI-native version? Probably: the data improves the *model* in addition to the product, but the playbook doesn't make this distinction precisely.
    • The "customers build APIs on top of you" lock-in is structurally similar to platform plays (Salesforce AppExchange, Shopify apps). Is the moat type really new, or just newly accessible to lean startups?
  • Founder as Agent Orchestrator
    • The playbook claims non-technical founders can now build production software, but it does not address the architectural-judgment recursion problem ([[agentic-technical-debt]]): non-technical founders may not have the vocabulary to write effective CLAUDE.md. How does that scale?
    • The "lean 10-person unicorn" is asserted; no quantitative data in the playbook on actual headcount-at-PMF or headcount-at-Series-A medians for AI-native startups vs. the prior cohort.
    • How does the orchestration role change the founder's *decision* burden? Fewer hands-on tasks but more parallel agent oversight; net cognitive load is unclear and may be higher (see [[ai-brain-fry]]).
    • Anthropic publishes both the playbook's anthropomorphic framing *and* HBR-aware accountability work (auto-mode, alignment) simultaneously without engaging the framing literature directly. The synthesis in [[wiki/derived/orchestration-vs-employee-framing-reconciliation]] reconciles the tension at the operational level — *orchestration as workflow design* preserves accountability; *orchestration as mental model of agents-as-coworkers* does not — but the open question of why the playbook's marketing language doesn't reflect Anthropic's own framing-discipline work remains.
  • Founder-Led Sales Discipline
    • Where exactly does "until PMF" end, and what's the first thing a founder *should* hand off (AE? agent? both)? Glasgow still does it post-Series-B, suggesting the boundary is fuzzy.
    • Does Glasgow's anti-offload stance generalize, or is it specific to high-trust, mission-critical enterprise sales (ERP) where "they're buying *you*" — would a PLG/SMB motion delegate to agents far earlier?
  • Narrow Wedge into a Legacy Market
    • A wedge works going in; does it constrain going out? Campfire now serves public companies — at what point does "narrow-but-best" require becoming the broad incumbent it displaced, re-incurring NetSuite's complexity?
    • The wedge-flip shows the first wedge can be wrong. What's the fastest signal that a wedge converts to the core vs. merely sells — Campfire took ~3 months; can it be read sooner?
  • Printing Press Software Democratization
    • Is domain-expert-as-builder actually happening at scale in 2026? Anecdotes (shop owners, microcontroller hobbyists) yes; primary-job software building by non-engineers, less clear.
    • What's the equivalent of compulsory schooling for universal coding literacy? Or does that not happen and we get a long tail of self-taught builders?
    • Boris's "accountant writes accounting software" — does that result in 10K narrow tools that don't interoperate? What's the integration story?
  • Problem-Solution Fit Discipline
    • Does asking an AI to argue against an idea actually produce disconfirming evidence at the same rigor as confirming evidence, or does the model still bias toward the framing the founder presents? Worth measuring.
    • The playbook recommends "ask Claude to make the most compelling argument for why a competitor would succeed while you do not." How does this interact with Anthropic's published [[claude-character-as-product|character training]] (sycophancy resistance, devil's-advocate willingness)?
    • Has anyone measured 2026 startup failure rates with AI-built products? The "42% will climb" claim is asserted without measurement.
  • Product Velocity as Moat
    • Velocity-as-moat is a treadmill: it evaporates the moment a competitor matches pace. What converts Campfire's velocity lead into a *structural* moat before the AI-native cohort's pace converges?
    • "Never had anyone outgrow Campfire" — is that survivorship (they haven't hit true enterprise scale yet) or a real claim that velocity closes the breadth gap faster than customers grow into it?
  • Seven Powers Applied to AI
    • Is "switching cost" really collapsing in practice, or just in narrative? Anthropic's own retention numbers, Salesforce churn, etc. would test this.
    • What does Boris's "cornered resource" look like for foundation-model labs that are themselves trying to commoditize? Internal contradiction or transient phase?
    • Counter-positioning — explicitly the "incumbent can't follow" power — should *amplify* under AI. Is anyone running this play deliberately?
  • The AI-Native Safe-Choice Inversion
    • The inversion is a one-time repricing of "safe." Once several AI-native ERPs exist, does "safe" re-stabilize around the *largest* AI-native vendor — and does Campfire's "we're now the largest of the new cohort" claim reflect a land-grab for that position?
    • How long until incumbents bolt on credible AI and neutralize the counter-positioning — and does the custom-foundation-model claim actually defend against that?
  • Zero-Friction Scope Creep
    • The playbook recommends written scope but offers no template or worked example. How specific does "what we deliberately don't do" need to be to actually block requests?
    • Is there a measurable threshold where scope creep crosses into outright pivot territory? The playbook gestures at "losing direction" without a metric.
    • How does this interact with [[ai-native-product-cadence|Cat Wu's]] 1-day shipping cadence? Anthropic's internal practice ships fast but with strong product judgment; how does that judgment translate for a first-time founder?

Product & Org 23 open

  • AI Native Product Cadence
    • Does the cadence scale beyond ~100 people? Anthropic itself is bigger (~30-40 PMs alone), but the [[claude-code]] team that visibly drives cadence is small.
    • What's the equivalent of research-preview branding for B2B enterprise launches where customers expect stability? Cat doesn't address this.
    • How much of the cadence is structural (process choices) vs cultural (talent density)? Probably both, ratio unclear.
  • Compounding Loop Optimization
    • The loop assumes the team *is* (close to) the user. How much of the compounding advantage survives when the user is unlike the builder and "talk to users" can't be same-room?
    • Where is the line between worthwhile internal tooling and yak-shaving? Carey's "afternoon" bar is the heuristic, but [[cat-wu]] warns that over-customizing setups "becomes distraction."
    • Does Claude-as-first-pass-on-all-feedback ever filter out the rare signal that doesn't cluster? Automating triage optimizes the common case; the tail is where surprising bets come from.
  • Dogfooding as Product Discipline
    • Dogfooding works when the team *is* the user (Claude Code) or near it (Cat Wu, Boris). How do you build product sense for users very unlike you — does "talk to customers" fully substitute, as Glasgow/Fung's small-business work suggests?
    • Can dogfooding scale, or does it implicitly cap how large an AI-native product org can stay taste-driven before it reverts to dashboards?
  • Engineer PM Convergence
    • Does this scale beyond ~50-person Claude Code-style teams? Boris hedges: "I think this is going to be a question for years."
    • What happens to formal PM career ladders in companies where engineers do PM work? Open at Anthropic per Cat.
    • Cross-disciplinary generalist is a hiring bar — where does the supply come from? Career changers, or new-grad bias toward AI-native education?
  • Evals as Product Spec
    • How do you write an eval for *taste*-driven features like [[claude-character-as-product|character]]? Amanda's role is canonical for being eval-resistant; Cat names her as someone who *is* good at evals here, but doesn't describe the technique. **Partially answered:** [[wiki/derived/evals-for-taste-and-character]] — the technique is a pipeline (conviction → dogfood-sourced failure modes → MSM-style variant A/B measurement → ~10 interpretable evals); proven on the safety/values core but still tacit on the warm/witty aesthetic surface.
    • The 10-vs-100 number is given without justification. Is there a Goldilocks zone, or does it depend on feature surface area? [[client-side-agent-optimization]]'s framing of combos suggests evals also have a combinatorial explosion problem.
    • How do evals interact with [[harness-shrinkage-as-models-improve]]? When a harness asset shrinks because the model now handles it natively, the evals built around the old harness may become artifacts rather than guardrails. Does Anthropic retire evals or repurpose them?
    • Is there a single non-Anthropic example of a PM-as-eval-writer to cite, or is this currently a Cat-Wu-singular framing? The Matt Pocock workshop reaches the same place from a different vocabulary, but no third source has been ingested yet.
  • Managers as ICs
    • Fung's own open question: "Do you still need separate iOS and Android orgs?" — if engineers flex across platforms via Claude, the traditional platform-split org may dissolve too. How far does flattening go?
    • Does manager-as-IC scale past a certain org size, or only work while Claude Code is small and the codebase is Claude-legible?
  • Model Introspection Feedback
    • How reliable are 4.7-class introspective reports? Anthropic's interpretability research suggests partial, not full, fidelity. Empirically, Cat reports it's good enough to drive harness fixes, but it is unclear at what model scale this technique becomes load-bearing.
    • Does adversarial introspection ("why did you fail?") yield different signal than neutral ("walk me through your reasoning")? Worth probing.
    • Could a meta-agent run introspection automatically against logged failures? Sounds tractable, but there is no public implementation.
  • Prototype Over PRD
    • Where does prototype-over-PRD break down? Carey's domain is a visual design tool where a prototype *is* the product surface; for backend/infra/data work the prototype may not capture the spec (cf. [[ai-native-product-cadence]]'s "full PRD for heavy-infra features").
    • If there is no PRD, where does the *rationale* ("why we chose variation B") live for future readers? Same rationale-capture gap flagged in [[building-is-cheap-arguing-is-expensive]].
    • The prototype-as-spec must not become the prototype-as-validation trap [[problem-solution-fit-discipline]] warns about: a fast prototype proves the build was solvable, not that the problem is real.

Governance & Workforce 25 open

  • AI Accelerating AI Development
    • LOC, self-reports, and headroom-dependent multiples all overstate; what *unbiased* throughput metric would Anthropic's promised shift to "direct measurement of AI R&D acceleration and researcher uplift" ([[ai-rd-autonomy-evaluation]]) actually use?
    • The W2S result didn't transfer to production-scale models. Is that a temporary scaling artifact or a structural limit on autonomous research?
    • The next-step judgment trend (51%→64%) is measured only on weak-human-move slices. What does the curve look like on a representative sample of research decisions?
  • AI R&D Autonomy Evaluation (AECI)
    • "Not close to substituting for senior researchers" is a subjective, internally-sourced judgment. What objective signal would replace it as models approach the threshold?
    • AECI is a single scalar fork of an external index; how sensitive is the 155.5 / frontier-not-advanced conclusion to the choice of the n=11 evaluation set?
    • The shift to "direct measurement of AI R&D acceleration and researcher uplift" is announced but not yet operationalized in this card — what does that measurement look like?
  • Autonomous Scientific Discovery
    • Every result is Anthropic-reported and example-selected; the genomics "100× smaller beats *Science*" claim is "intend to publish" — what survives external peer review?
    • Science's verification gap: the formal-proof loop self-validates; here a wrong-but-confident hypothesis costs a wet-lab cycle to falsify. Does autonomy without a fast verifier *increase* the verification bottleneck rather than relieve it?
    • If hypothesis-generation is genuinely at ~80% preference, how much of "research taste" is left as a distinctively human function — and how would you measure the residue?
  • Capability-Gated Model Fallback
    • The >95%/<5% figures are session-level; what's the false-positive rate for *legitimate* security researchers and biologists, whose benign queries are exactly the ones most likely to trip the conservative classifiers?
    • Fallback-not-refusal preserves UX but means the *real* general-access model for security/bio-adjacent work is Opus 4.8, not Fable — does that quietly cap Fable's value for whole professional segments until the trusted-access programs open?
    • The UK AISI's "progress toward a universal jailbreak" is disclosed but not quantified — and the post-launch **access suspension** (see [[claude-fable-5]]) raises the question of whether a safeguard failure forced it.
    • Does swapping to a weaker model on flagged topics create an exploitable oracle (probe which queries trigger fallback to map the classifier's boundary)?
  • Frontier Pause Verification
    • What does an AI-training "verification regime" concretely consist of — compute-accounting, datacenter inspection, hardware attestation, on-chip telemetry? The essay names the problem, not the mechanism.
    • Detectability < verifiability: can detection even be made reliable when training runs leave no physical signature and inputs are dual-use?
    • Who adjudicates triggers and lifts? No institution currently holds that mandate, and standing one up is itself a decade-scale task.
  • Recursive Self-Improvement
    • Is "research taste" a true ceiling (future 1) or just the next capability to fall (futures 2–3)? The essay frames this as the single load-bearing uncertainty.
    • The RSI extrapolation rests on trends staying exponential rather than S-curving — but the essay concedes it cannot rule out an architectural ceiling or a compute/energy supply-chain constraint. Which binds first?
    • If misalignment compounds through self-improvement (future 3), is AECI-gated [[responsible-scaling-policy-evals|RSP]] review fast enough to catch it before control is lost?
  • Research Taste as the Human Bottleneck
    • Is research taste a genuine ceiling (an architectural capability scaling can't reach) or the next jagged valley to fill? The essay calls this the decisive unknown.
    • If taste is automatable, what — if anything — remains a durable human comparative advantage in AI development?
    • How do you measure rubber-stamping? "Humans set direction" can be true on paper while real judgment quietly transfers to the model.
  • Responsible Scaling Policy Evaluations
    • The RSP determination leans heavily on "we use it daily and it doesn't substitute for our researchers." How well does that subjective judgment scale as models approach the threshold?
    • The two new general-access risk pathways (other AI developers; major governments) are newly in scope but lightly evaluated — what would a positive finding there even look like?
    • How does the RSP brake interact with [[recursive-self-improvement]]: is AECI-based gating fast enough if acceleration compounds, and does single-lab gating even matter without the multilateral [[frontier-pause-verification|pause-verification]] regime?

Entities 43 open

  • AlphaProof Nexus
    • The framework's reach is gated by [[lean]]'s mathlib maturity. What's the path to domains needing new theory rather than subgoal decomposition?
    • AlphaProof adds little as a soloist but helps as a tool. As the prover LLM strengthens, does the AlphaProof tool become redundant entirely?
  • Anthropic Institute
    • How does the Institute's policy posture (favoring an *option* to pause) interact with Anthropic's commercial incentive to ship frontier models? The essay acknowledges the competitive/geopolitical pressure but doesn't resolve it.
    • What concrete verification mechanisms will the Institute prototype, and on what timeline relative to the RSI trend it warns about?
  • Campfire
    • Campfire claims its AI edge comes from "our own foundation model." For an ERP, what does a custom foundation model actually buy over fine-tuning a frontier model — and is it durable as frontier models improve (cf. [[harness-shrinkage-as-models-improve]])?
    • "Never had anyone outgrow Campfire" — does that hold as customers reach true enterprise scale where NetSuite's breadth historically mattered?
  • Claude Design
    • Did the "any design tool via MCP" integration actually ship on the stated timeline? (Forward claim from May 2026.)
    • How does Claude Design's eval discipline work for visual/aesthetic output, where there's no compiler or test? (Same open question as [[cowork]] for non-code artifacts; relates to [[wiki/derived/evals-for-taste-and-character|character/taste evals]].)
  • Claude Fable 5
    • **Why was access suspended after launch?** The source banner gives no reason (capacity? a safety finding? the UK-AISI jailbreak progress noted in [[capability-gated-model-fallback]]?). Not in source.
    • Exact benchmark numbers vs GPT-5.x / Gemini are image-only in the source; not transcribed.
    • How much of Fable's general-access experience is *actually* Fable vs Opus-4.8 fallback for security-research-adjacent users whose queries trip the conservative classifiers?
  • Claude Mythos 5
    • **Suspension reason** — shared with Fable 5; not stated in source.
    • How does "somewhat stronger than Mythos Preview" square with Opus 4.8's card claiming Mythos Preview was the capability frontier? The frontier has moved; the magnitude isn't quantified here.
    • The bio trusted-access SKU is "Fable 5 with bio safeguards removed," not Mythos 5 — so "Mythos 5" strictly denotes the cyber-lifted variant. Whether these converge under one trusted-access umbrella is unstated.
  • Claude Opus 4.7
    • Do Hakim's (2026) brevity-constraint findings on Opus 4.6 replicate on Opus 4.7, or does the literal-instruction-following change the elasticity? Specifically: does `<50 words` still yield +13.1pp on GSM8K?
    • Does Opus 4.7 still underperform as a planner in HotpotQA-style combo sweeps, or does improved instruction-following close the gap that AgentOpt (Hua et al., 2026) identified?
    • What is the real-world token-inflation multiplier on typical Claude Code sessions (1.0–1.35× is content-dependent — what's the distribution on code-heavy vs. prose-heavy inputs)?
    • How does xhigh compare to max on coding evals? The migration guidance says "start with high or xhigh" — is max ever worth it for coding?
    • What fraction of existing CLAUDE.md / system-prompt hedges become counterproductive under literal instruction following?
  • Claude Opus 4.8
    • Public model ID and pricing: the card does not state them; presumably `claude-opus-4-8` at the Opus tier.
    • Does the grader-speculation trend continue to escalate in the next model, and at what point does it begin to affect outward behavior?
    • Why is 4.8 *less* robust to prompt injection than 4.7 despite broad alignment gains — a capability/robustness tradeoff, or an artifact of the eval surface?
  • Cowork
    • How does Cowork's harness compare to [[claude-code]]'s? Both surface skills, MCP, sub-agents — but the failure modes for non-code output differ (no test suite, no compiler, no diff to review).
    • What's the eval discipline for Cowork-class outputs? Cat Wu says memory benefits a lot from evals, but it is unclear how slide-deck quality is measured.
  • Google DeepMind
    • DeepMind reports its bespoke systems being caught by simple loops. Does the lab's comparative advantage move from *systems* to *models + verifiers + benchmarks* (mathlib, Formal Conjectures)?
    • The paper opens AI-for-math; what's DeepMind's next target domain where a sound verifier exists?
  • Hermes Agent
    • The container backend disabling dangerous-command checks is a defensible design but a meaningful security-model shift. What's the empirical track record? Have lockdown failures in popular images (Daytona, `nikolaik/python-nodejs`) caused incidents?
    • How do bounded memory files (~2,200 chars `MEMORY.md`) hold up over long-term use? Auto-consolidation is mentioned but not specified — what's the consolidation algorithm and how lossy is it?
    • Hermes's DM-pairing flow is a clean security primitive. Why hasn't this pattern been adopted by Claude Code or Cursor for shared/team deployments?
    • The split between `AGENTS.md` (project) and `SOUL.md` (personality) is explicit in Hermes but implicit in Claude Code's `CLAUDE.md`. Does the split materially improve outcomes, or is it a documentation choice without empirical backing?
    • Cron jobs in fresh sessions with no memory — how do teams structure the "context the agent needs" without it bloating every cron prompt? Is there a standard pattern?
  • Lean
    • mathlib maturity gates the reachable frontier. Can AI formal proof search *grow* mathlib (formalize new theory) as a byproduct, expanding its own frontier?
    • Lean is a perfect verifier for math. Which other domains have a comparably sound automatic verifier (vs. only noisy ones like tests or LLM-judge councils)?
  • METR
    • What new tasks will METR build to measure days- and weeks-long horizons once current baskets saturate?
    • METR also runs the [research showing developer self-estimates of AI uplift are overstated](https://arxiv.org/pdf/2507.09089) — how does it reconcile that skepticism with its own steep time-horizon curve?
  • Mythos Model
    • Public release timeline: **answered** — Mythos Preview itself never shipped GA, but its descendants [[claude-fable-5|Fable 5]] / [[claude-mythos-5|Mythos 5]] reached general access in June 2026 (see *the descendants shipped* above). Both were suspended shortly after launch; whether and when they return is open.
    • Capability profile beyond cybersecurity: Mythos Preview focused on the safety story; other capability dimensions are not well documented externally.
    • Internal access controls: who at Anthropic actually uses Mythos for daily work, vs. Opus 4.7? Boris implies infrequent (try-it) use, but this is not detailed.
  • Symphony
    • The **500% landed-PRs claim** is hedged — no baseline definition, "on some teams" only. What does the distribution look like across teams? What happens to PR *quality* and revert rate at that throughput?
    • "Workspaces preserved across runs" is the opposite of typical CI ephemerality. At what point does state pollution from prior runs (stale `node_modules`, leftover branches, build artifacts) start hurting more than warm-cache helps?
    • Symphony doesn't write to the tracker — agents do. This means tracker policy is a *prompt* in `WORKFLOW.md`. How brittle is this in practice when Linear changes its API? How is consistent state-machine behavior enforced when agents have prompt-level discretion?
    • The spec was simplified by being implemented in 6 languages. What's the extension of this technique? Could `compiler-prompt.md` in this vault be similarly cross-fuzzed?
    • Symphony explicitly says agents can self-create tickets. What governance prevents runaway ticket-graph expansion? Is human triage of agent-created tickets the only check?