ultravox

A fast multimodal LLM for real-time voice

open-sourcevoice-agents

Visit Website View on GitHub

4.4k

Stars

+15

Stars/month

Releases (6m)

Star Growth

+4 (0.1%)

Overview

Ultravox 是一种专为实时语音交互设计的快速多模态大语言模型。它能够理解文本和人类语音，无需单独的自动语音识别(ASR)阶段。基于 AudioLM、SeamlessM4T、Gazelle 等研究，Ultravox 通过多模态投影器将音频直接转换为 LLM 使用的高维空间，从而实现比传统 ASR+LLM 组合系统更快的响应速度。该模型可以扩展任何开放权重的 LLM，默认模型基于 Llama 3.3 70B 构建，同时提供 8B 变体。目前 Ultravox 接受音频输入并输出流式文本，未来将支持直接输出语音令牌。该项目在 GitHub 上获得了 4382 个星标，提供多个版本（0.3 到 0.7），并在 Hugging Face 上可用。

Deep Analysis

Key Differentiator

vs ASR+LLM pipelines (Whisper+GPT): direct audio-to-embedding projection eliminates ASR latency bottleneck, enabling true real-time voice understanding

⚡ Capabilities

• Multimodal LLM for real-time voice interactions
• Direct audio-to-embedding projection (no separate ASR needed)
• Streaming text output from audio input
• Multiple LLM backbone support (Llama 3, Mistral, Gemma)
• Custom domain training with proprietary audio data
• Open-weight models in 8B and 70B variants

🔗 Integrations

Hugging FaceBaseTenLlama 3MistralGemma

✓ Best For

✓ Real-time voice AI agents requiring sub-100ms latency
✓ Custom domain voice applications with proprietary audio data

✗ Not Ideal For

✗ Text-to-speech synthesis
✗ Budget-constrained inference (H100 GPUs required)

Languages

Python

Deployment

BaseTen inferencemanaged APIself-hosted (H100 GPUs)

⚠ Known Limitations

⚠ Text output only (voice output planned but not available)
⚠ Requires 8 H100 GPUs for 2-3 hour training runs
⚠ MosaicML platform deprecating July 2025, migration needed
⚠ Adapter/projector training only (LLM and encoder frozen)

Pros

+ 无需单独 ASR 阶段，音频直接处理，响应速度更快
+ 支持多种开放权重模型（Llama、Mistral、Gemma）训练和扩展
+ 提供完整的实时语音 AI 代理构建平台和演示

Cons

- 目前仅输出文本，尚未实现直接语音输出
- 需要大量计算资源（默认 70B 模型）
- 作为研究项目，生产环境稳定性可能有限

Use Cases

• 构建实时语音客服或语音助手系统
• 开发需要快速语音理解的多模态应用
• 研究和实验下一代语音AI技术

Getting Started

1. 访问 demo.ultravox.ai 体验功能或从 Hugging Face 下载模型；2. 通过 ultravox.ai 平台配置 Realtime 语音代理；3. 使用 WAV 文件测试音频处理或启动推理服务器

Compare ultravox

ultravox vs litellm ultravox vs unsloth ultravox vs pipecat ultravox vs composio ultravox vs whisperX ultravox vs langchain4j