Sources
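What it is
Thinking Machines Lab's first **interaction model**, released as a research preview in May 2026. It is positioned as "the first model that combines strong intelligence and instruction following with interactivity."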
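- Architecture: 276B-parameter MoE with 12B active parameters. Trained from scratch as an interaction model, rather than bolting interactivity onto a turn-based model.
- Modalities: continuous audio + video + text input; text + audio output. Encoder-Free Early Fusion (dMel audio embeddings, an hMLP over 40×40 patches for video frames, a flow head for audio output) feeds a single shared transformer, with all components co-trained from scratch.
- Interaction mechanism: Time-Aligned Micro-Turns, i.e. 200ms interleaved input/output chunks with no turn boundaries.
- Reasoning: deep reasoning, tool use, and long-horizon tasks are delegated to an asynchronous background model; see Interaction / Background Model Split. Even without the background agent, the model is competitive on intelligence benchmarks.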
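A toy sketch of the early-fusion idea follows, in pure Python with tiny made-up sizes. Each modality gets only a light projection (a dMel-style quantize-and-embed step for audio, a single linear layer standing in for the hMLP over 40×40 patches, an embedding table for text), and the results are concatenated into one sequence for a shared transformer. `D_MODEL`, `N_MEL`, `N_LEVELS`, `VOCAB` and all function names are hypothetical; the real model's widths, the exact dMel and hMLP designs, the flow head, and how tokens are ordered in time are not described in the source and are not modeled here.

```python
import random

D_MODEL = 8     # hypothetical toy width of the shared transformer
PATCH = 40      # 40x40 pixel patches, as described above
N_MEL = 4       # hypothetical number of mel channels per audio frame
N_LEVELS = 16   # hypothetical number of discrete intensity levels per channel
VOCAB = 100     # hypothetical text vocabulary size

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

# Light per-modality projections instead of large standalone encoders.
MEL_TABLE = [rand_matrix(N_LEVELS, D_MODEL) for _ in range(N_MEL)]  # one table per mel channel
PATCH_PROJ = rand_matrix(D_MODEL, PATCH * PATCH)  # one linear layer standing in for the hMLP
TEXT_TABLE = rand_matrix(VOCAB, D_MODEL)

def embed_audio_frame(mel):
    """dMel-style stand-in: quantize each mel value in [0, 1] to a level, sum per-channel embeddings."""
    vec = [0.0] * D_MODEL
    for ch, value in enumerate(mel):
        level = min(max(int(value * N_LEVELS), 0), N_LEVELS - 1)
        vec = [a + b for a, b in zip(vec, MEL_TABLE[ch][level])]
    return vec

def embed_video_frame(frame):
    """Split an HxW grayscale frame into flattened 40x40 patches and project each to D_MODEL."""
    h, w = len(frame), len(frame[0])
    assert h % PATCH == 0 and w % PATCH == 0, "sketch assumes dimensions divisible by 40"
    tokens = []
    for top in range(0, h, PATCH):
        for left in range(0, w, PATCH):
            patch = [frame[r][c] for r in range(top, top + PATCH) for c in range(left, left + PATCH)]
            tokens.append([sum(wt * x for wt, x in zip(row, patch)) for row in PATCH_PROJ])
    return tokens

def fuse(mel_frames, video_frame, text_ids):
    """Early fusion: one flat sequence of D_MODEL vectors for a single shared transformer."""
    seq = [embed_audio_frame(m) for m in mel_frames]
    seq += embed_video_frame(video_frame)
    seq += [TEXT_TABLE[t] for t in text_ids]
    return seq
```

For example, three audio frames, one 80×120 frame (2 × 3 = 6 patches), and two text tokens give a sequence of 11 vectors. Because no stage is a separately pretrained encoder, every table and projection can be trained jointly with the transformer, which is what "co-trained from scratch" implies.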
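The micro-turn loop and the interaction/background split can be illustrated together with Python's asyncio. This is a minimal sketch under stated assumptions, not Thinking Machines Lab's implementation: each tick consumes one input chunk and always emits one output chunk, an input starting with the hypothetical prefix "ask:" launches a background task, and the task's result is spliced into the output stream on the first tick after it finishes. `background_model`, `interaction_loop`, and the "ask:" convention are illustrative names; a sleep stands in for real reasoning or tool calls.

```python
import asyncio

MICRO_TURN_S = 0.2  # 200 ms interleaved chunk, per the description above

async def background_model(query: str, delay_s: float) -> str:
    """Hypothetical stand-in for the async background model (deep reasoning / tools)."""
    await asyncio.sleep(delay_s)  # simulate slow reasoning
    return f"result for {query}"

async def interaction_loop(input_chunks, tick_s=MICRO_TURN_S, bg_delay_s=0.5):
    """Consume one input chunk and emit one output chunk per tick; there are no turn boundaries."""
    timeline = []
    pending = None  # in-flight background task, if any
    for tick, chunk in enumerate(input_chunks):
        if pending is not None and pending.done():
            out = pending.result()  # splice the background result into the live stream
            pending = None
        elif chunk.startswith("ask:") and pending is None:
            pending = asyncio.create_task(background_model(chunk[4:], bg_delay_s))
            out = "ack"  # stay present instead of blocking on the answer
        else:
            out = "listening"
        timeline.append((tick, chunk, out))
        await asyncio.sleep(tick_s)  # wait for the next micro-turn boundary
    if pending is not None:
        pending.cancel()  # the stream ended before the background result arrived
    return timeline

if __name__ == "__main__":
    chunks = ["hi", "ask:weather", "...", "...", "...", "..."]
    for row in asyncio.run(interaction_loop(chunks)):
        print(row)
```

With the defaults (200 ms ticks, a 0.5 s background job), the answer surfaces a few ticks after the request while every intermediate tick still produces output, which is the property the split is meant to preserve.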
Key numbers (May 2026)
- Turn-taking latency: 0.40 s (FD-bench v1, audio), the best of all compared models.
- FD-bench v1.5 average score: 77.8, vs. roughly 39–54 for the baselines (including thinking-high models).
- FD-bench v3 (audio + tools): 82.8% response quality / 68.0% Pass@1 (with the background agent).
- Audio MultiChallenge APR: 43.4%, beating every non-thinking baseline; only GPT-realtime-2.0 xhigh (48.5%) scores higher.
- Baselines compared: GPT-realtime-2.0 (minimal/xhigh), GPT-realtime-1.5, Gemini-3.1-flash-live-preview (minimal/high), Qwen 3.5 Omni-plus-realtime. Full tables in Interactivity Benchmarks.
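Limitations (acknowledged)
- Long continuous audio/video sessions accumulate context quickly; careful context management remains an open problem (echoing Context Window Smart Zone).
- Requires a reliable low-latency connection; performance degrades severely without one.
- Called "Small" because larger pretrained models are currently too slow in this operating mode; larger models are expected later in 2026.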
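Why long sessions strain the context can be seen with back-of-the-envelope arithmetic over the 200 ms micro-turn. The sketch below assumes every micro-turn is kept in context with no eviction or summarization; the tokens-per-micro-turn figure and the window size are hypothetical inputs, since the source states neither, and the function names are illustrative.

```python
MICRO_TURN_MS = 200  # one interleaved input/output chunk, per the description above

def micro_turns(session_ms: int) -> int:
    """How many 200 ms micro-turns fit in a session."""
    return session_ms // MICRO_TURN_MS

def accumulated_tokens(session_ms: int, tokens_per_turn: int) -> int:
    """Context size if every micro-turn is kept (hypothetical tokens_per_turn, no eviction)."""
    return micro_turns(session_ms) * tokens_per_turn

def ms_until_full(window_tokens: int, tokens_per_turn: int) -> int:
    """Session length after which an unmanaged context window of window_tokens overflows."""
    return (window_tokens // tokens_per_turn) * MICRO_TURN_MS
```

With a hypothetical 50 tokens per micro-turn, one hour is 18,000 micro-turns and 900,000 tokens, and a hypothetical 128,000-token window would fill after 512,000 ms (about 8.5 minutes). Context grows linearly with wall-clock time whether or not anyone is speaking, which is what separates this mode from turn-based chat.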
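Availability
A limited research preview will open "in the coming months," with a broader release "later this year." Feedback is welcome at interaction@thinkingmachines.ai; research grants are open for applications.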
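Related
- Interaction Models: the model class it belongs to
- Thinking Machines Lab: the builder
- Time-Aligned Micro-Turns / Encoder-Free Early Fusion / Interaction / Background Model Split: the three architectural pillars
- Full-Duplex Interaction: the interaction pattern it demonstrates
- Interactivity Benchmarks: full benchmark tables and the baselines it beats
- Claude Opus 4.7: a contemporary (mid-2026 frontier); 4.7's xhigh effort level parallels GPT-realtime's minimal/xhigh settings, which are used here as baseline configurations
- Context Window Smart Zone: the long-session limitation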
Cited by 9
- Claude Opus 4.7
GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…
- Encoder-Free Early Fusion
Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…
- Full-Duplex Interaction
Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
- Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
- Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
- Interactivity Benchmarks
FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
- Entities — People, Orgs, Tools & Projects
Map of Content for all 32 entity pages. See Home for concept domains.
- Thinking Machines Lab
AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…
- Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
