Full-Duplex Interaction

Sources

Interaction Models: A Scalable Approach to Human-AI Collaboration

Summary

"Full-duplex" means the model perceives and responds at the same time, in a continuous two-way exchange, as opposed to half-duplex turn-taking (one party at a time). Interaction Models generalize the audio full-duplex idea across audio, video, and text. The post describes the result as an experience that "feels more like collaborating and less like prompting."

The interaction modes it enables

Each of these needs a special-purpose harness today; in an interaction model they are special cases of model behavior (see Time-Aligned Micro-Turns):

Proactive interjection — "interrupt when I say something wrong"; the model jumps in mid-turn when context warrants, not only at end-of-turn.
Visual-cue reactions — "tell me when I've written a bug in my code"; "count how many pushups I do"; requires acting on a visual change with no audio cue (audio-only turn-detection harnesses fail this — they say "Sure thing!" then go silent).
Simultaneous speech — user and model speak concurrently: "translate Spanish→English live."
Speak-while-watching — "live-commentate this sports game."
Time-aware speech — "remind me to breathe in and out every 4 seconds until I stop"; "how long did it take me to write this function?"
Codeswitch correction — "every time I use another language, give me the correct word in the original language" (requires speaking at the same time as the user).
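
To make the contrast with end-of-turn harnesses concrete, here is a toy sketch, not the post's method. It assumes the session arrives as a stream of time-stamped ticks. The `Tick` fields (`pose`, `error`), the 1-second tick spacing, and the hand-written rules are all hypothetical stand-ins for behavior an interaction model would learn. The half-duplex baseline holds its reaction until the user stops talking; the per-tick version can interject mid-utterance, react to a purely visual change, and speak on a timer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tick:
    t: float                     # seconds since the session started
    user_speaking: bool = False  # audio: is the user mid-utterance?
    pose: Optional[str] = None   # video: hypothetical per-frame label
    error: Optional[str] = None  # hypothetical "user said something wrong" cue


def end_of_turn_replies(ticks: List[Tick]) -> List[Tuple[float, str]]:
    """Half-duplex baseline: hold every reaction until the user stops speaking."""
    replies, pending, was_speaking = [], [], False
    for tick in ticks:
        if tick.error:
            pending.append(tick.error)
        if was_speaking and not tick.user_speaking:  # end of turn detected
            replies += [(tick.t, msg) for msg in pending]
            pending = []
        was_speaking = tick.user_speaking
    return replies


def per_tick_replies(ticks: List[Tick], period: float = 4.0) -> List[Tuple[float, str]]:
    """Full-duplex sketch: decide on every tick whether to say something."""
    replies, reps, went_down, next_due = [], 0, False, period
    for tick in ticks:
        if tick.error:  # proactive interjection, even mid-utterance
            replies.append((tick.t, tick.error))
        if tick.pose == "down":
            went_down = True
        elif tick.pose == "up" and went_down:  # visual cue only, no audio cue
            reps, went_down = reps + 1, False
            replies.append((tick.t, f"rep {reps}"))
        if tick.t >= next_due:  # time-aware speech
            replies.append((tick.t, "breathe"))
            next_due += period
    return replies


ticks = [
    Tick(0.0, user_speaking=True),
    Tick(1.0, user_speaking=True, error="wrong date"),
    Tick(2.0, user_speaking=True, pose="down"),
    Tick(3.0, user_speaking=True, pose="up"),
    Tick(4.0, user_speaking=True),
    Tick(5.0, user_speaking=False),
]

late = end_of_turn_replies(ticks)  # [(5.0, 'wrong date')]
live = per_tick_replies(ticks)     # [(1.0, 'wrong date'), (3.0, 'rep 1'), (4.0, 'breathe')]
```

The baseline reports the error four seconds after it happened and never notices the pushup, which is the failure the list above attributes to audio-only turn detection.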

The model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response — no separate dialog-management component.

Concurrent non-speech action

While listening and speaking, the model can simultaneously call tools, search, browse, or generate UI, weaving results back into the conversation when appropriate. Deeper or longer-running work is delegated to the background model (see Interaction / Background Model Split).
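
One way to picture this is as a foreground loop that keeps talking while a slow task runs beside it. The sketch below uses Python's `asyncio` purely as an analogy; `background_search`, the sleep durations, and the transcript strings are hypothetical, and nothing here reflects how the post's models are actually wired together.

```python
import asyncio
from typing import List


async def background_search(query: str) -> str:
    """Hypothetical stand-in for a slow tool call or background-model task."""
    await asyncio.sleep(0.05)
    return f"result for {query!r}"


async def converse(utterances: List[str]) -> List[str]:
    transcript = []
    # Start the slow work without blocking the conversation.
    task = asyncio.create_task(background_search("train times"))
    reported = False
    for text in utterances:
        transcript.append(f"model: ack {text}")
        await asyncio.sleep(0.02)  # the conversation keeps moving
        if task.done() and not reported:
            # Weave the result in as soon as it is ready.
            transcript.append(f"model: {task.result()}")
            reported = True
    if not reported:
        # The conversation ended first: wait for the result and report it.
        transcript.append(f"model: {await task}")
    return transcript


transcript = asyncio.run(converse(["a", "b", "c", "d"]))
```

The result lands wherever it happens to be ready, between acknowledgements, rather than forcing the conversation to pause until the tool returns.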

Prior art it builds on

Audio full-duplex models are the existing example of bidirectional/continuous interaction; robotics and autonomous vehicles are cited as domains where real-time perception+action is a given. Interaction models apply the principle across all modalities.

Connections

Interaction Models — parent concept
Time-Aligned Micro-Turns — the mechanism (no turn boundaries) that makes full-duplex possible
Encoder-Free Early Fusion — joint multimodal reasoning is what lets a visual change trigger speech
Turn-Based Interface Bottleneck — the half-duplex status quo this replaces
Interactivity Benchmarks — TimeSpeak / CueSpeak / RepCount-A / ProactiveVideoQA / Charades measure exactly these modes
Interaction / Background Model Split — where the concurrent deep work goes
TML-Interaction-Small — the model that demonstrates these interaction modes
