Howardism

Full-Duplex Interaction

Published: May 13, 2026 · Filed: Concept · Domain: Interaction & Multimodal · Tags: Human AI Collaboration, Multimodal · Reading: 3 min · Source: AI-synthesised

The model perceives and responds simultaneously across modalities. Proactive interjection, visual-cue reactions, simultaneous speech, live translation or commentary, and time-aware speech all become special cases of model behavior.

Summary

"Full-duplex" means the model perceives and responds at the same time, in a constant two-way exchange, as opposed to half-duplex turn-taking, where only one party is active at a time. Interaction Models generalize the audio full-duplex idea across audio, video, and text. The source post describes the result as an experience that "feels more like collaborating and less like prompting."

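The contrast between half-duplex and full-duplex can be sketched as two timelines. This is a minimal illustration, not the source's implementation: the function names `half_duplex` and `full_duplex`, the callbacks, and the tuple-per-time-slice representation are all assumptions introduced here.

```python
from typing import Callable, List, Optional, Tuple

Chunk = Optional[str]          # None means silence in that time slice
Slice = Tuple[Chunk, Chunk]    # (what the user says, what the model says)


def half_duplex(user_chunks: List[str],
                respond: Callable[[List[str]], str]) -> List[Slice]:
    """Turn-taking: the model stays silent until the user's turn is over."""
    timeline: List[Slice] = [(chunk, None) for chunk in user_chunks]
    timeline.append((None, respond(user_chunks)))  # reply only after the turn ends
    return timeline


def full_duplex(user_chunks: List[str],
                react: Callable[[List[str]], Chunk]) -> List[Slice]:
    """Continuous exchange: every slice carries input and may carry output."""
    timeline: List[Slice] = []
    heard: List[str] = []
    for chunk in user_chunks:
        heard.append(chunk)
        # The hypothetical `react` callback sees everything heard so far and
        # may answer within the same slice, or return None to stay silent.
        timeline.append((chunk, react(heard)))
    return timeline
```

With a `react` callback that speaks up as soon as it hears a mistake, the correction lands in the same slice as the error in the full-duplex timeline, whereas the half-duplex timeline only carries it after the user has finished.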
The interaction modes it enables

Today each of these requires a special-purpose harness; in an interaction model they are special cases of model behavior (see Time-Aligned Micro-Turns):

  • Proactive interjection — "interrupt when I say something wrong"; the model jumps in mid-turn when context warrants, not only at end-of-turn.
  • Visual-cue reactions — "tell me when I've written a bug in my code"; "count how many pushups I do"; requires acting on a visual change with no audio cue (audio-only turn-detection harnesses fail this — they say "Sure thing!" then go silent).
  • Simultaneous speech — user and model speak concurrently: "translate Spanish→English live."
  • Speak-while-watching — "live-commentate this sports game."
  • Time-aware speech — "remind me to breathe in and out every 4 seconds until I stop"; "how long did it take me to write this function?"
  • Codeswitch correction — "every time I use another language, give me the correct word in the original language" (requires speaking at the same time as the user).

The model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response — no separate dialog-management component.
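As one concrete case, the time-aware "breathe in and out every 4 seconds until I stop" request can be sketched as a loop over fixed-length time slices. This is a hand-written stand-in for behavior the article attributes to the model itself: the function `breathing_reminders`, the per-tick frame list, and the 200 ms tick (borrowed from the Time-Aligned Micro-Turns summary) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

TICK_MS = 200  # assumed micro-turn length, per the Time-Aligned Micro-Turns summary


def breathing_reminders(user_frames: List[Optional[str]],
                        period_ms: int = 4000) -> List[Tuple[float, str]]:
    """Emit alternating breathing cues every `period_ms` until the user says stop.

    `user_frames` holds one entry per tick: the text heard in that tick,
    or None when the user was silent.
    """
    cues: List[Tuple[float, str]] = []
    for tick, heard in enumerate(user_frames):
        now_ms = tick * TICK_MS
        if heard is not None and "stop" in heard.lower():
            break  # the user ended the exercise mid-stream
        if now_ms > 0 and now_ms % period_ms == 0:
            # Odd multiples of the period cue "in", even multiples cue "out".
            phase = "breathe in" if (now_ms // period_ms) % 2 == 1 else "breathe out"
            cues.append((now_ms / 1000, phase))
    return cues
```

The point of the sketch is that output is driven by elapsed time rather than by the end of a user turn, and that input is still being read on every tick, so "stop" takes effect immediately.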

Concurrent non-speech action

While listening and speaking, the model can simultaneously call tools, search, browse, or generate UI, weaving results back into the conversation when appropriate. Deeper or longer-running tasks are delegated to the background model.
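The pattern of staying present while slower work runs elsewhere can be sketched with `asyncio`. This is an illustration under stated assumptions, not the described system: `background_search` is a hypothetical stand-in for the background model or a tool call, the `search:` prefix is an invented trigger, and the sleep durations are arbitrary.

```python
import asyncio
from typing import List, Optional


async def background_search(query: str) -> str:
    """Hypothetical slow tool call; the sleep stands in for real latency."""
    await asyncio.sleep(0.05)
    return f"result for {query!r}"


async def interaction_loop(user_frames: List[str]) -> List[str]:
    """Keep responding every tick while a delegated task runs in the background."""
    transcript: List[str] = []
    pending: Optional[asyncio.Task] = None
    for frame in user_frames:
        if frame.startswith("search:"):
            query = frame[len("search:"):].strip()
            pending = asyncio.create_task(background_search(query))  # delegate
            transcript.append("model: looking that up")
        else:
            transcript.append(f"model: (listening to {frame!r})")
        # Weave the result back in as soon as it is available.
        if pending is not None and pending.done():
            transcript.append(f"model: {pending.result()}")
            pending = None
        await asyncio.sleep(0.02)  # one assumed micro-turn
    if pending is not None:
        transcript.append(f"model: {await pending}")  # flush a late result
    return transcript
```

Running `asyncio.run(interaction_loop(["search: full duplex", "a", "b", "c", "d"]))` yields a transcript in which the loop keeps acknowledging input while the search is in flight, and the search result appears partway through rather than blocking the conversation.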

Prior art it builds on

Audio full-duplex models are the existing example of bidirectional/continuous interaction; robotics and autonomous vehicles are cited as domains where real-time perception+action is a given. Interaction models apply the principle across all modalities.

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 9
  • Encoder-Free Early Fusion

    Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…

  • The Future of Agent Interfaces

    Interface future is layered: native interaction models for human collaboration, MCP/APIs for structured action, app pro…

  • Interaction / Background Model Split

    Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…

  • Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Interactivity Benchmarks

    FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…

  • Interaction & Multimodal

    Map of Content for the interaction-multimodal domain — 7 concepts. Curated entry point; see Home for all domains.

  • Time-Aligned Micro-Turns

    The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…

  • TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

  • Turn-Based Interface Bottleneck

    Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
