index-tts

An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

freevoice-agents

Stars: 19.7k
Stars/month: +840

Star Growth: +127 (0.6%)

Overview

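IndexTTS2 is an industrial-grade zero-shot text-to-speech system focused on emotional expressiveness and duration control. It uses an autoregressive model and supports two generation modes. In precise-control mode, the caller specifies the number of tokens to generate, which fixes the speech duration and suits applications such as video dubbing that need strict audio-visual synchronization. In free-generation mode, no token count is specified and the model generates speech naturally while faithfully reproducing the prosody of the input prompt. The core innovation of IndexTTS2 is decoupled control of emotional expression and speaker identity, so timbre and emotion can be set independently. In the zero-shot setting, the model reconstructs the target timbre from a timbre prompt and reproduces the specified emotional tone from a separate style prompt. To improve speech clarity under highly emotional expression, the system incorporates GPT latent representations and uses a novel three-stage training paradigm. The project has nearly 20,000 stars on GitHub and provides complete demos and pretrained models so that researchers and developers can deploy and use it quickly.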
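
The relationship between a target duration and a token count in precise-control mode can be illustrated with simple arithmetic. The sketch below is not IndexTTS2 code and does not call the model: the function names and the `tokens_per_second` rate (25.0 here) are hypothetical placeholders, and the real token rate would have to be taken from the model's configuration. Note also that the listing states duration control is not yet enabled in the current release.

```python
import math
from typing import Optional


def token_budget(target_seconds: float, tokens_per_second: float) -> int:
    """Return how many speech tokens are needed to cover target_seconds.

    tokens_per_second is a hypothetical rate chosen for illustration;
    it is not a value published for IndexTTS2.
    """
    if target_seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("target_seconds and tokens_per_second must be positive")
    # Round up so the generated audio is never shorter than the target.
    return math.ceil(target_seconds * tokens_per_second)


def choose_mode(target_seconds: Optional[float] = None,
                tokens_per_second: float = 25.0) -> dict:
    """Pick between the two generation modes described above.

    With no target duration, fall back to free generation (no token count).
    With a target duration, use a fixed token count.
    """
    if target_seconds is None:
        return {"mode": "free", "num_tokens": None}
    return {
        "mode": "fixed",
        "num_tokens": token_budget(target_seconds, tokens_per_second),
    }
```

For a dubbing job, `choose_mode(target_seconds=4.0)` would request a fixed token count sized to a four-second clip, while `choose_mode()` leaves the length to the model.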

Deep Analysis

Key Differentiator

vs F5-TTS/CosyVoice: the first autoregressive TTS model with precise duration control for video dubbing, plus emotion-timbre disentanglement that allows independent control of voice identity and emotional expression. Developed by Bilibili.

⚡ Capabilities

• Zero-shot text-to-speech synthesis
• Precise speech duration control
• Emotion-timbre disentanglement and control
• Natural language emotion description via Qwen3
• Multi-language support
• WebUI interface
• HuggingFace and ModelScope model hosting

🔗 Integrations

• Qwen3 (emotion control)
• HuggingFace
• ModelScope
• DeepSpeed (optional)

✓ Best For

✓ High-quality zero-shot TTS with emotion control
✓ Video dubbing with precise duration matching
✓ Research on expressive speech synthesis

✗ Not Ideal For

✗ Commercial deployment without licensing agreement
✗ CPU-only environments

Languages

Python

Deployment

• Local (uv package manager)
• WebUI
• HuggingFace Spaces demo

Pricing Detail

Free: Free and open-source for research

Paid: Commercial licensing via indexspeech@bilibili.com

⚠ Known Limitations

⚠ Commercial use requires separate licensing
⚠ Requires GPU for inference
⚠ Only supports uv for installation (pip/conda not guaranteed)
⚠ Duration control feature not yet enabled in current release

Pros

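+ Supports precise speech duration control, suited to video dubbing and other scenarios that need audio-visual synchronization
+ Controls emotional expression and speaker identity independently, so different timbres and emotions can be combined freely
+ Strong zero-shot capability: generates high-quality speech without speaker-specific training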

Cons

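- As a deep learning model, it has relatively high compute requirements
- Autoregressive generation may limit real-time performance
- Emotion control accuracy may vary with the quality of the input prompt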

Use Cases

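• Video dubbing and audio-visual synchronization
• Audiobook and podcast content generation
• Multilingual, multi-emotion voice assistant development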

Getting Started

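1. Try the online demo on HuggingFace or ModelScope.
2. Clone the code from the GitHub repository and install the dependencies as described in the documentation.
3. Download the pretrained models and run the example script for a first synthesis.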
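
A first synthesis call might look roughly like the sketch below, which also shows how a timbre prompt and an emotion prompt could be passed separately. Everything that touches the library is an assumption based on the project's README at the time of writing and should be checked against the current repository: the module path `indextts.infer_v2`, the class `IndexTTS2`, its `cfg_path` and `model_dir` arguments, and the `infer` keyword names `spk_audio_prompt`, `emo_audio_prompt`, `emo_alpha`, `text`, and `output_path`. The helper names `build_infer_kwargs` and `synthesize`, the `checkpoints` directory, and the assumed 0.0 to 1.0 range for `emo_alpha` are illustrative only.

```python
from typing import Optional


def build_infer_kwargs(text: str,
                       timbre_prompt: str,
                       output_path: str = "gen.wav",
                       emotion_prompt: Optional[str] = None,
                       emo_alpha: float = 1.0) -> dict:
    """Assemble keyword arguments for a synthesis call.

    The key names mirror what the README is assumed to use; they are
    not verified here. Timbre and emotion come from separate audio files.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    if not 0.0 <= emo_alpha <= 1.0:
        # Assumed valid range for the emotion blending weight.
        raise ValueError("emo_alpha is assumed to lie in [0.0, 1.0]")
    kwargs = {
        "spk_audio_prompt": timbre_prompt,  # reference audio for voice identity
        "text": text,
        "output_path": output_path,
    }
    if emotion_prompt is not None:
        kwargs["emo_audio_prompt"] = emotion_prompt  # reference audio for emotion
        kwargs["emo_alpha"] = emo_alpha
    return kwargs


def synthesize(kwargs: dict, checkpoint_dir: str = "checkpoints"):
    """Run synthesis with the installed package (requires model files and a GPU).

    The import is deferred so the helper above works without the package.
    Module, class, and argument names are assumptions taken from the README.
    """
    from indextts.infer_v2 import IndexTTS2

    tts = IndexTTS2(cfg_path=f"{checkpoint_dir}/config.yaml",
                    model_dir=checkpoint_dir)
    return tts.infer(**kwargs)
```

With the package installed and the models downloaded, `synthesize(build_infer_kwargs("Hello there.", "voice.wav", emotion_prompt="sad.wav", emo_alpha=0.8))` would be the complete call; `voice.wav` and `sad.wav` are placeholder file names. Keeping the argument assembly separate from the model call makes it possible to validate inputs on a machine without a GPU.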
