Model Introspection Feedback

Sources#

How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code)

Summary#

Cat Wu names this as her single most underrated AI debugging technique: when Claude does something unexpected, ask the model to reflect on why. The reasoning the model surfaces — about its system prompt, sub-agent delegation choices, ambiguous instructions, missing tools — points directly at what to fix in the harness. Don't tune the model behavior in isolation; treat the failure as a signal about what the harness needs to provide.

The technique#

When the agent does something wrong:

Don't immediately re-prompt with corrections.
Ask: "Why did you make that decision? What were you confused about?"
Read the model's account of its own reasoning.
Fix the harness — not the model — based on what it surfaces.

Cat Wu's worked example#

From the interview:

"There's situations where the model will make a front-end change and run tests but not actually use the UI. It's actually pretty useful to ask the model to reflect on why it did this. And sometimes they'll say that hey there was something confusing in the system prompt, or I didn't realize that the front-end verification was part of this task, or hey I delegated the verification to this sub-agent and the sub-agent didn't do the test and I didn't check its work. A lot of times just being very curious about why the model made the decision that it did will show you what misled it so that you can fix the harness in order to close this gap."

Each of the failure-explanations she lists points at a different harness fix:

Model's reason	Harness fix
"Confusing system prompt"	Rewrite the relevant section
"Didn't realize UI verification was part of the task"	Add UI verification step explicitly to the task template
"Delegated to sub-agent, didn't check its work"	Reviewer agent in fresh context (see Deep Modules for Agents); or remove the delegation

Why this is non-obvious#

The default reaction to a failure is "the model is dumb." The introspection reframe is: the model's behavior is a function of the harness; the failure is information about the harness. This requires accepting that:

The model's account of its reasoning is partial/post-hoc but still useful (caveat below).
The harness is the variable you control; the model isn't.
Your job is to design an environment where the model can succeed, not to make the model smarter.

This sits at the same architectural level as "enforce invariants, not implementations" (see Agent Harness Engineering) — you're operating on the structure around the model, not on the model itself.

Caveats — model self-reports aren't ground truth#

Standard caveat: a language model's introspective reports describe what it would say about its reasoning, not necessarily the actual computation. The reports can be confabulated, especially in a session where the failure is already in context.

Mitigations:

Run the introspection in fresh context with just the failure case, not the surrounding session
Cross-check against logs (which tools were called, in what order)
Treat model reports as hypotheses about harness gaps, not final diagnoses — verify the fix actually closes the failure mode

Generalization: trust calibration through dialogue#

The technique is part of a broader pattern at Anthropic: dialogue with the model as the primary feedback signal, not just metrics. Cat's other examples in the same interview:

Team lunches at Claude Code where every member is asked their "vibe" on a new model — qualitative signal that informs which quantitative data to look at next.
Amanda's character work: ambiguous goal, requires conviction, shaped by extended dialogue with the model rather than benchmark optimization.
Building evals as an underappreciated PM skill — the durable companion to model introspection. Introspection generates hypotheses; evals verify them and guard against regression.

The unifying theme: in AI-native product work, the model is a teammate you can interview, not a black box you can only measure.

Connections#

Cat Wu — primary articulator
Design Concept Grilling — the inverse: model interviews user. Same pattern, opposite direction. Used together they form a tight alignment loop.
Harness Shrinkage as Models Improve — introspection points at which harness elements still earn their tokens
Agent Harness Engineering — operationalizes "enforce invariants, not implementations" by giving the harness designer a debugging tool
Deep Modules for Agents — fresh-context reviewer is what you build after introspection tells you the implementer can't reliably review itself
Claude Character as Product — character work uses introspection as primary feedback signal
Problem-Solution Fit Discipline — the same shape: ask the model to critique itself / its plan / its hypothesis; the founder's playbook applies this technique to validating ideas
Evals as Product Spec — durable companion: introspection generates hypotheses; evals are how they get verified and turned into regression guardrails
Jagged Intelligence (Ghosts, Not Animals) — introspection presumes a ghost, not an animal: the model's "why did I fail" is a harness-debugging signal, not testimony from a mind with access to its own reasons

Open questions#

How reliable are 4.7-class introspective reports? Anthropic's interpretability research suggests partial fidelity but not full. Empirically, Cat reports it's good enough to drive harness fixes — but unclear at what model scale this technique becomes load-bearing.
Does adversarial introspection ("why did you fail?") yield different signal than neutral ("walk me through your reasoning")? Worth probing.
Could a meta-agent run introspection automatically against logged failures? Sounds tractable but no public implementation.

Sources#

How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code)