Summary#
Cat Wu names this as her single most underrated AI debugging technique: when Claude does something unexpected, ask the model to reflect on why. The reasoning the model surfaces — about its system prompt, sub-agent delegation choices, ambiguous instructions, missing tools — points directly at what to fix in the harness. Don't tune the model behavior in isolation; treat the failure as a signal about what the harness needs to provide.
The technique#
When the agent does something wrong:
- Don't immediately re-prompt with corrections.
- Ask: "Why did you make that decision? What were you confused about?"
- Read the model's account of its own reasoning.
- Fix the harness — not the model — based on what it surfaces.
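The steps above can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed format: the `FailureCase` record, its field names, and the closing instruction line are assumptions made for this example; only the two reflection questions come from the steps above.

```python
from dataclasses import dataclass

# The two questions quoted in the technique above.
REFLECTION_QUESTIONS = (
    "Why did you make that decision?",
    "What were you confused about?",
)


@dataclass
class FailureCase:
    """Hypothetical record of one unexpected agent action."""
    task: str      # what the agent was asked to do
    observed: str  # what it actually did
    expected: str  # what the harness author wanted


def build_reflection_prompt(case: FailureCase) -> str:
    """Ask for reasons instead of issuing a correction."""
    lines = [
        f"Task you were given: {case.task}",
        f"What you did: {case.observed}",
        f"What was expected: {case.expected}",
        "",
        *REFLECTION_QUESTIONS,
        # Assumed wording: steers the answer toward harness-level causes.
        "Point to anything in your instructions, tools, or delegated work that misled you.",
    ]
    return "\n".join(lines)
```

Note that the prompt states what happened and asks why; it does not tell the model what it should have done differently, which keeps the reply focused on causes rather than on an apology or a retry.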
Cat Wu's worked example#
From the interview:
"There's situations where the model will make a front-end change and run tests but not actually use the UI. It's actually pretty useful to ask the model to reflect on why it did this. And sometimes they'll say that hey there was something confusing in the system prompt, or I didn't realize that the front-end verification was part of this task, or hey I delegated the verification to this sub-agent and the sub-agent didn't do the test and I didn't check its work. A lot of times just being very curious about why the model made the decision that it did will show you what misled it so that you can fix the harness in order to close this gap."
Each of the failure-explanations she lists points at a different harness fix:
| Model's reason | Harness fix |
|---|---|
| "Confusing system prompt" | Rewrite the relevant section |
| "Didn't realize UI verification was part of the task" | Add UI verification step explicitly to the task template |
| "Delegated to sub-agent, didn't check its work" | Add a reviewer agent in a fresh context (see Deep Modules for Agents), or remove the delegation |
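The table can be read as a triage lookup. The sketch below is one hypothetical way to encode it: the `HarnessFix` enum, the keyword rules, and the `suggest_fix` function are all invented for illustration, and naive keyword matching is only a stand-in for actually reading the report.

```python
from enum import Enum


class HarnessFix(Enum):
    REWRITE_PROMPT_SECTION = "Rewrite the relevant system prompt section"
    ADD_EXPLICIT_STEP = "Add the missing step explicitly to the task template"
    REVIEW_OR_REMOVE_DELEGATION = "Add a fresh-context reviewer, or remove the delegation"
    NEEDS_HUMAN_TRIAGE = "No rule matched; read the report manually"


# Assumed keyword rules, checked in order. Real reports will need a human reader.
RULES = (
    (("sub-agent", "subagent", "delegat"), HarnessFix.REVIEW_OR_REMOVE_DELEGATION),
    (("system prompt", "confusing"), HarnessFix.REWRITE_PROMPT_SECTION),
    (("didn't realize", "did not realize", "part of this task"), HarnessFix.ADD_EXPLICIT_STEP),
)


def suggest_fix(report: str) -> HarnessFix:
    """Map a model's self-reported reason to a candidate harness fix."""
    text = report.lower()
    for keywords, fix in RULES:
        if any(keyword in text for keyword in keywords):
            return fix
    return HarnessFix.NEEDS_HUMAN_TRIAGE
```

The fallback value matters more than the rules: a report that matches nothing is a prompt to read it carefully, not a reason to discard it.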
Why this is non-obvious#
The default reaction to a failure is "the model is dumb." The introspection reframe is: the model's behavior is a function of the harness; the failure is information about the harness. This requires accepting that:
- The model's account of its reasoning is partial and post hoc but still useful (caveat below).
- The harness is the variable you control; the model isn't.
- Your job is to design an environment where the model can succeed, not to make the model smarter.
This sits at the same architectural level as "enforce invariants, not implementations" (see Agent Harness Engineering) — you're operating on the structure around the model, not on the model itself.
Caveats — model self-reports aren't ground truth#
Standard caveat: a language model's introspective report describes what the model would say about its reasoning, not necessarily the computation that actually produced the behavior. Reports can be confabulated, especially in a session where the failure is already in context.
Mitigations:
- Run the introspection in fresh context with just the failure case, not the surrounding session
- Cross-check against logs (which tools were called, in what order)
- Treat model reports as hypotheses about harness gaps, not final diagnoses — verify the fix actually closes the failure mode
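The log cross-check in the second mitigation can be sketched as follows. Everything here is an assumption for illustration: the `ToolCall` record, the tool names in the usage note, and the idea that the tools a report claims to have used have already been extracted into a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical log entry: which tool ran and whether it succeeded."""
    name: str
    ok: bool = True


def cross_check(claimed_tools: list[str], log: list[ToolCall]) -> dict[str, str]:
    """Label each tool the report claims was used, based on the actual log."""
    status: dict[str, str] = {}
    for tool in claimed_tools:
        calls = [call for call in log if call.name == tool]
        if not calls:
            status[tool] = "not in log"   # the report may be confabulated
        elif any(call.ok for call in calls):
            status[tool] = "confirmed"
        else:
            status[tool] = "failed"       # called, but never succeeded
    return status
```

For example, if the report says the agent ran `run_tests` and `open_browser` but the log only contains a `run_tests` call, `open_browser` comes back as `"not in log"`, and that part of the report should be treated as an unverified hypothesis.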
Generalization: trust calibration through dialogue#
The technique is part of a broader pattern at Anthropic: dialogue with the model as the primary feedback signal, not just metrics. Cat's other examples in the same interview:
- Team lunches at Claude Code where every member is asked for their "vibe" on a new model: a qualitative signal that informs which quantitative data to look at next.
- Amanda's character work: ambiguous goal, requires conviction, shaped by extended dialogue with the model rather than benchmark optimization.
- Building evals as an underappreciated PM skill — the durable companion to model introspection. Introspection generates hypotheses; evals verify them and guard against regression.
The unifying theme: in AI-native product work, the model is a teammate you can interview, not a black box you can only measure.
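The hypothesis-then-eval loop mentioned in the list above can be sketched as a tiny regression check. This is an assumption-laden illustration: `run_case` is a hypothetical callable that replays the failure case once and returns `True` when the agent behaves correctly, and the zero-tolerance default is a choice made for this example, not a recommendation from the interview.

```python
from typing import Callable


def failure_rate(run_case: Callable[[], bool], trials: int = 10) -> float:
    """Fraction of trials in which the failure mode reappears."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    failures = sum(1 for _ in range(trials) if not run_case())
    return failures / trials


def hypothesis_confirmed(rate_before: float, rate_after: float,
                         tolerance: float = 0.0) -> bool:
    """A harness fix counts as verified only if the failure was present
    before the fix and is at or below the tolerance after it."""
    return rate_before > tolerance and rate_after <= tolerance
```

Keeping the replayed case in the eval suite after the fix lands is what turns a one-off introspection finding into a regression guard.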
Connections#
- Cat Wu — primary articulator
- Design Concept Grilling — the inverse: model interviews user. Same pattern, opposite direction. Used together they form a tight alignment loop.
- Harness Shrinkage as Models Improve — introspection points at which harness elements still earn their tokens
- Agent Harness Engineering — operationalizes "enforce invariants, not implementations" by giving the harness designer a debugging tool
- Deep Modules for Agents — fresh-context reviewer is what you build after introspection tells you the implementer can't reliably review itself
- Claude Character as Product — character work uses introspection as primary feedback signal
- Problem-Solution Fit Discipline — the same shape: ask the model to critique itself / its plan / its hypothesis; the founder's playbook applies this technique to validating ideas
- Evals as Product Spec — durable companion: introspection generates hypotheses; evals are how they get verified and turned into regression guardrails
- Jagged Intelligence (Ghosts, Not Animals) — introspection presumes a ghost, not an animal: the model's "why did I fail" is a harness-debugging signal, not testimony from a mind with access to its own reasons
Open questions#
- How reliable are introspective reports from 4.7-class models? Anthropic's interpretability research suggests partial, not full, fidelity. Empirically, Cat reports that it is good enough to drive harness fixes, but it is unclear at what model scale the technique becomes load-bearing.
- Does adversarial introspection ("why did you fail?") yield different signal than neutral ("walk me through your reasoning")? Worth probing.
- Could a meta-agent run introspection automatically against logged failures? It sounds tractable, but no public implementation is known.
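As a starting point for probing the last two questions, a batch runner might look like the sketch below. It is purely speculative: `ask_model` is a hypothetical callable standing in for whatever model client is available, each call is assumed to start a fresh context, and the failure descriptions are assumed to be plain strings pulled from logs.

```python
from typing import Callable, Iterable

# The two phrasings contrasted in the open questions above.
PHRASINGS = {
    "adversarial": "Why did you fail?",
    "neutral": "Walk me through your reasoning.",
}


def introspect_failures(
    failures: Iterable[str],
    ask_model: Callable[[str], str],
) -> list[dict[str, str]]:
    """Ask both phrasings about each logged failure, one call per phrasing.

    ask_model is a hypothetical stand-in; each call is assumed to run in a
    fresh context so one answer cannot contaminate the other.
    """
    results = []
    for failure in failures:
        row = {"failure": failure}
        for label, question in PHRASINGS.items():
            row[label] = ask_model(f"{failure}\n\n{question}")
        results.append(row)
    return results
```

Comparing the `adversarial` and `neutral` columns side by side for the same failure is one cheap way to see whether phrasing changes the signal; the answers remain hypotheses to verify, not diagnoses.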
