資料來源#
摘要#
Interaction Models 中的一項多模態設計選擇:不再將音訊和視訊路由到大型獨立編碼器(也不再透過獨立的 TTS 類解碼器輸出音訊),而是使用最少預處理搭配單一 transformer,所有元件從頭共同訓練。「Encoder-free」是相對的說法——仍有輕量嵌入層——但不存在 Whisper 規模的音訊編碼器或獨立的 TTS 模型。
各元件(單次 200ms micro-turn)#
輸入(文字 / 影像幀 / 音訊的任意子集):
- 文字 → token 嵌入(標準做法)。
- 影像 / 視訊幀 → 切分為 40×40 patches,由 hMLP 編碼(Touvron et al. 2022)。
- 音訊 → 以 dMel 表示(Bai et al. 2024),經輕量嵌入層(「bag of embeddings」)轉換。
單一共享 Transformer 消化融合後的輸入。
輸出:
- 文字 → unembedding(標準做法)。
- 音訊 → 透過 flow head(Lipman et al. 2022)產生 mel。
所有元件與 transformer 從頭共同訓練——並非由預訓練編碼器/解碼器拼接而成。
為何重要#
- 避免大型獨立編碼器/解碼器帶來的延遲與複雜度——當你每 200ms 就要執行一次時尤為關鍵(見 Time-Aligned Micro-Turns)。
- Early fusion(所有模態進入同一個 transformer)意味著模型能跨模態聯合推理,而非僅對編碼器預消化的輸出進行推理——這是實現「一邊說話一邊對視覺線索做出反應」等能力的前提(見 Full-Duplex Interaction)。
- 從頭共同訓練符合 The Bitter Lesson 的精神:減少手工設計的模組邊界,更多端到端學習。
相關連結#
- Interaction Models — 上層概念
- Time-Aligned Micro-Turns — 為何最少預處理是硬性需求(200ms 預算)
- Interaction / Background Model Split — 架構的另一半
- Full-Duplex Interaction — 跨模態聯合推理使視覺+音訊插話成為可能
- The Bitter Lesson — 「從頭共同訓練、捨棄模組化編碼器」正是 bitter-lesson 的實踐
- TML-Interaction-Small — 實作此設計的模型(dMel 音訊、40×40 hMLP patches、flow head),從頭共同訓練
資料來源#
Cited by 7
- Full-Duplex Interaction
Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
- Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
- Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
- Interaction & Multimodal
Map of Content for the interaction-multimodal domain — 7 concepts. Curated entry point; see Home for all domains.
- The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
- Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
- TML-Interaction-Small
TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…
Related articles
- Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
- Interactivity Benchmarks
FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
- Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
- Full-Duplex Interaction
Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
- Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
