Howardism · vol. 03 · quiet corner of the web

Full-Duplex Interaction

Published: May 13, 2026 · Filed: Concept · Tags: Human AI Collaboration, Multimodal · Reading: 3 min · Source: AI-synthesised

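Perceiving and responding simultaneously across modalities. Proactive interjection, reacting to visual cues, simultaneous speech, live translation and commentary, and time-aware speech are all special cases of model behavior.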

[Figure: diagram of Full-Duplex Interaction]

Sources

Interaction Models: A Scalable Approach to Human-AI Collaboration

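Summary

"Full-duplex" means the model perceives and responds at the same time, sustaining a continuous two-way exchange. This contrasts with half-duplex turn-taking, where only one party is active at a time. Interaction Models generalizes the idea of full-duplex audio to audio, video, and text. The source article describes the result as an experience that feels more like collaboration than prompting.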
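To make the contrast concrete, here is a minimal Python sketch, not the source's implementation. It models time as discrete ticks and uses a hypothetical `react` policy that corrects one toy mistake. The function names, the tick abstraction, and the one-tick reaction delay are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (tick index, user chunk heard, model chunk emitted)
Tick = Tuple[int, Optional[str], Optional[str]]


def react(chunk: Optional[str]) -> Optional[str]:
    """Hypothetical policy: speak up only when the user makes a toy mistake."""
    if chunk is not None and "2+2=5" in chunk:
        return "Actually, 2+2=4."
    return None


def half_duplex(user_chunks: List[str]) -> List[Tick]:
    """Turn-based: listen to the whole user turn, respond only afterwards."""
    timeline: List[Tick] = [(t, c, None) for t, c in enumerate(user_chunks)]
    replies = [r for r in map(react, user_chunks) if r is not None]
    start = len(user_chunks)
    for offset, reply in enumerate(replies):
        timeline.append((start + offset, None, reply))
    return timeline


def full_duplex(user_chunks: List[str]) -> List[Tick]:
    """Each tick consumes an input chunk and may also emit an output chunk."""
    timeline: List[Tick] = []
    pending: Optional[str] = None  # reaction to the previous tick's input
    for t, chunk in enumerate(user_chunks):
        timeline.append((t, chunk, pending))  # speak while still listening
        pending = react(chunk)
    if pending is not None:
        timeline.append((len(user_chunks), None, pending))
    return timeline


def first_reply_tick(timeline: List[Tick]) -> int:
    """Tick at which the model first says anything."""
    return next(t for t, _, said in timeline if said is not None)
```

With the chunks `["so", "2+2=5", "and", "then"]`, the half-duplex timeline holds the correction until the user's turn has ended, while the full-duplex timeline emits it one tick after the mistake, overlapping the user's next chunk.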

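Interaction patterns it enables

Today each of the following requires a purpose-built harness; in an interaction model they are special cases of model behavior (see Time-Aligned Micro-Turns):

Proactive interjection: "Interrupt me when I get something wrong." The model steps in mid-utterance when the context calls for it, not only at the end of a turn.
Visual cue reaction: "Tell me when I write a bug in my code"; "Count my push-ups." These require reacting to visual changes with no audio cue. Audio-only turn-detection harnesses cannot do this: they say "Okay!" and then go silent.
Simultaneous speech: the user and the model talk at the same time, as in "Translate Spanish into English in real time."
Speaking while watching: "Give live commentary on this sports game."
Time-aware speech: "Remind me to breathe in and out every 4 seconds until I stop"; "How long did it take me to write this function?"
Code-switching correction: "Every time I use another language, give me the correct word in the original language" (requires speaking at the same time as the user).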

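The model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response; no separate dialogue-management component is needed.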
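The timing arithmetic behind the time-aware speech pattern ("every 4 seconds ... until I stop") can be sketched as a loop over fixed-length input chunks. This is a hand-written stand-in, not how an interaction model works: the model would learn this behavior, not run a rule. The 200 ms chunk length is an assumption that echoes the ~200ms micro-turns described for Time-Aligned Micro-Turns, and `breathing_coach`, the keyword check for "stop", and the choice to give the first cue after one full interval are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

CHUNK_MS = 200      # assumed length of one input chunk
INTERVAL_MS = 4000  # "every 4 seconds"


def breathing_coach(user_chunks: Iterable[Optional[str]]) -> List[Tuple[int, str]]:
    """Return (timestamp_ms, cue) pairs emitted while the input stream runs.

    Each element of user_chunks is what was heard in one chunk
    (None means silence). The loop stops as soon as the user says "stop".
    """
    cues: List[Tuple[int, str]] = []
    words = ("inhale", "exhale")
    for i, heard in enumerate(user_chunks):
        now_ms = i * CHUNK_MS
        if heard is not None and "stop" in heard.lower():
            break  # the user ended the exercise
        if now_ms > 0 and now_ms % INTERVAL_MS == 0:
            # Alternate cues: first interval -> inhale, second -> exhale, ...
            cues.append((now_ms, words[(now_ms // INTERVAL_MS - 1) % 2]))
    return cues
```

The point of the sketch is that elapsed time is derived from the position in the stream, so no external timer or turn boundary is needed to know when the next cue is due.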

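Concurrent non-speech actions

While listening and speaking, the model can concurrently call tools, search, browse, or generate UI, weaving the results back into the conversation at an appropriate moment. Deeper or longer-running tasks are delegated to a background model.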
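One way to picture this delegation is with Python's `asyncio`: the foreground loop keeps processing input chunks while a background task runs, and the result is spoken once it is ready. This is a sketch under stated assumptions, not the architecture from the source. `background_search`, `interaction_loop`, the "look up" trigger phrase, and the use of `asyncio.sleep(0)` to stand in for the passage of one chunk are all hypothetical.

```python
import asyncio
from typing import List, Optional


async def background_search(query: str, steps: int) -> str:
    """Hypothetical stand-in for a slow tool call or a background model."""
    for _ in range(steps):
        await asyncio.sleep(0)  # yield control, simulating slow work
    return f"result for {query!r}"


async def interaction_loop(user_chunks: List[str]) -> List[str]:
    """Keep listening while a delegated task runs; weave its result back in."""
    transcript: List[str] = []
    task: Optional[asyncio.Task] = None
    for chunk in user_chunks:
        if task is None and "look up" in chunk:
            # Delegate the long-running work and stay in the conversation.
            task = asyncio.create_task(background_search(chunk, steps=3))
            transcript.append("model: on it, keep going")
        elif task is not None and task.done():
            # The result is ready: speak it at this point in the stream.
            transcript.append(f"model: {task.result()}")
            task = None
        else:
            transcript.append(f"model: (listening to {chunk!r})")
        await asyncio.sleep(0)  # one chunk of conversation passes
    if task is not None:
        # The input ended before the work finished; wait and report.
        transcript.append(f"model: {await task}")
    return transcript
```

Running `asyncio.run(interaction_loop([...]))` with a first chunk containing "look up" yields a transcript in which the acknowledgement comes first, several listening entries follow, and the search result appears later without the loop ever blocking on it.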

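Prior art it builds on

Full-duplex audio models are the existing example of bidirectional, continuous interaction. Robotics and autonomous vehicles are cited as fields where real-time perception plus action is already the norm. Interaction models apply this principle to all modalities.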

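Related links

Interaction Models: the parent concept
Time-Aligned Micro-Turns: the mechanism that makes full-duplex possible (no turn boundaries)
Encoder-Free Early Fusion: joint multimodal reasoning lets visual changes trigger speech
Turn-Based Interface Bottleneck: the half-duplex status quo that this replaces
Interactivity Benchmarks: TimeSpeak / CueSpeak / RepCount-A / ProactiveVideoQA / Charades are the benchmarks that measure these patterns
Interaction / Background Model Split: where concurrent deep work goes
TML-Interaction-Small: the model that demonstrates these interaction patterns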

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 9

Encoder-Free Early Fusion
Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…
The Future of Agent Interfaces
Interface future is layered: native interaction models for human collaboration, MCP/APIs for structured action, app pro…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Interactivity Benchmarks
FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
Interaction & Multimodal
Map of Content for the interaction-multimodal domain — 7 concepts. Curated entry point; see Home for all domains.
Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
TML-Interaction-Small
TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…
Turn-Based Interface Bottleneck
Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…

Related articles

Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
TML-Interaction-Small
TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…
Encoder-Free Early Fusion
Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…
