gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)

open-source voice-agents observability-evaluation tool-integration


Stars: 12.8k
Stars/month: +60
Star Growth: +8 (0.1%)

Overview

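Gorilla is a research project developed at UC Berkeley that focuses on training and evaluating the function-calling (tool-calling) abilities of large language models. The project provides the Berkeley Function Calling Leaderboard (BFCL), a comprehensive benchmark for evaluating LLM tool-calling performance in real-world scenarios. Gorilla supports evaluation of multi-turn conversations and multi-step function calls, including advanced features such as state management, error recovery, and multi-hop reasoning. The project also launched Agent Arena, in collaboration with LMSYS Chatbot Arena, which lets users compare different agents on tasks such as search, finance, and RAG. As an active academic research project, Gorilla gives the AI research community important evaluation tools and datasets that help advance LLM capabilities in complex tool integration.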
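To make the idea of evaluating a function call concrete, here is a minimal sketch of one way a model's output could be checked against an expected call. It is an illustration only, not BFCL's actual checker: the function name `get_weather`, its parameters, and the helper names `parse_call` and `call_matches` are all hypothetical, and the sketch assumes the model emits a Python-style call with keyword arguments.

```python
import ast


def parse_call(source: str):
    """Parse a string like "f(a=1, b='x')" into (function_name, kwargs)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    if node.args:
        raise ValueError("positional arguments are not supported in this sketch")
    name = ast.unparse(node.func)  # also handles dotted names (Python 3.9+)
    # literal_eval only accepts literals, so no model-produced code is executed
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def call_matches(predicted: str, expected_name: str, expected_args: dict) -> bool:
    """Return True if the predicted call has the expected name and arguments."""
    try:
        name, kwargs = parse_call(predicted)
    except (SyntaxError, ValueError):
        return False  # unparsable output counts as a miss
    return name == expected_name and kwargs == expected_args


# Hypothetical test case: argument order does not matter, values do.
expected = {"city": "Berkeley", "unit": "celsius"}
print(call_matches("get_weather(unit='celsius', city='Berkeley')", "get_weather", expected))
print(call_matches("get_weather(city='Oakland')", "get_weather", expected))
```

Comparing the parsed structure rather than the raw string means that differences in whitespace, quoting style, or argument order do not count as errors, while a wrong function name or argument value still does.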

Deep Analysis

Key Differentiator

vs ChatGPT function calling: open-source model + comprehensive BFCL leaderboard + GoEx safe execution engine, all from UC Berkeley research

⚡ Capabilities

• LLM function calling with 1600+ APIs
• Berkeley Function Calling Leaderboard (BFCL v1-v4)
• OpenFunctions-v2 model for parallel/multi-language function calling
• Agent Arena for comparing LLM agents
• GoEx execution engine with safety guarantees
• RAFT domain-specific fine-tuning recipe
• Gorilla CLI for command-line AI
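
As a rough illustration of parallel function calling, the sketch below dispatches several calls that a model might return in a single turn. It assumes the OpenAI-style function schema layout; the tool `get_stock_price`, the canned prices, and the `dispatch` helper are hypothetical and are not part of Gorilla or OpenFunctions.

```python
import json

# Hypothetical tool, described in the OpenAI-style function schema layout.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Look up a price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

# Canned, made-up data so the example runs offline.
PRICES = {"AAPL": 100.0, "MSFT": 200.0}


def get_stock_price(ticker: str) -> float:
    return PRICES[ticker]


REGISTRY = {"get_stock_price": get_stock_price}


def dispatch(tool_calls):
    """Run each call emitted in one turn and return the results in order."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in TOOLS}
    results = []
    for call in tool_calls:
        name = call["name"]
        args = json.loads(call["arguments"])  # arguments arrive as a JSON string
        if name not in REGISTRY:
            results.append({"name": name, "error": "unknown function"})
            continue
        missing = [p for p in schemas[name].get("required", []) if p not in args]
        if missing:
            results.append({"name": name, "error": f"missing: {missing}"})
            continue
        results.append({"name": name, "result": REGISTRY[name](**args)})
    return results


# Two calls from the same (hypothetical) model turn.
calls = [
    {"name": "get_stock_price", "arguments": '{"ticker": "AAPL"}'},
    {"name": "get_stock_price", "arguments": '{"ticker": "MSFT"}'},
]
print(dispatch(calls))
```

Returning an error entry instead of raising keeps one malformed call from discarding the results of the other calls in the same turn.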

🔗 Integrations

OpenAI API format, LangChain, Hugging Face, Kubernetes, AWS, GCP

✓ Best For

✓ Evaluating and benchmarking LLM function calling capabilities
✓ Building applications that need reliable API invocation

✗ Not Ideal For

✗ General-purpose chatbot without tool calling needs
✗ Teams wanting a managed commercial solution

Languages

Python

Deployment

API endpoint, local model (Hugging Face), CLI (pip), Docker (GoEx)

⚠ Known Limitations

⚠ OpenFunctions models need fine-tuning updates for new APIs
⚠ Self-hosted endpoint may have higher latency than commercial APIs
⚠ BFCL benchmark focused on function calling, not general chat
⚠ GoEx execution engine requires Docker for sandboxing

Pros

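+ Provides the industry-leading Berkeley Function Calling Leaderboard, which sets a standard for evaluating LLM tool-calling ability
+ Supports evaluation of complex multi-turn conversations and multi-step function calls, including state management and error recovery mechanisms
+ Active academic research community that keeps updating evaluation methods and datasets and collaborates with well-known platforms such as LMSYS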
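
The multi-turn evaluation with state management and error recovery mentioned above can be pictured with a small sketch. It assumes a state-based check, in which the final state of an API is compared with an expected state after all turns have run. `ToyFileSystem`, `run_turns`, and `state_matches` are hypothetical names made up for this illustration and do not come from the Gorilla codebase.

```python
class ToyFileSystem:
    """Hypothetical stateful API used only for this illustration."""

    def __init__(self):
        self.files = {}

    def create(self, name: str):
        if name in self.files:
            raise ValueError(f"{name} already exists")
        self.files[name] = ""

    def write(self, name: str, text: str):
        if name not in self.files:
            raise KeyError(f"{name} does not exist")
        self.files[name] += text


def run_turns(turns):
    """Execute every call of every turn against one shared state.

    Errors are recorded instead of raised, so a later call can recover.
    """
    fs = ToyFileSystem()
    log = []
    for turn in turns:
        for method, kwargs in turn:
            try:
                getattr(fs, method)(**kwargs)
                log.append("ok")
            except (KeyError, ValueError) as exc:
                log.append(f"error: {exc}")
    return fs.files, log


def state_matches(turns, expected_files) -> bool:
    """Pass if the final state equals the expected state, whatever the path."""
    final_files, _ = run_turns(turns)
    return final_files == expected_files


# Turn 1 writes to a missing file and fails; turn 2 recovers by creating it first.
turns = [
    [("write", {"name": "notes.txt", "text": "hi"})],
    [("create", {"name": "notes.txt"}), ("write", {"name": "notes.txt", "text": "hi"})],
]
print(state_matches(turns, {"notes.txt": "hi"}))
```

Checking the final state rather than the exact call sequence lets an agent that makes a mistake and then corrects it still pass, which is one way to credit error recovery.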

Cons

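- Primarily research-oriented, with limited guidance for real-world production use
- Documentation is incomplete and lacks detailed implementation and deployment guides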

Use Cases

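• AI researchers evaluating and comparing the function-calling performance of different LLMs
• Development teams benchmarking their own AI agents in complex tool-integration scenarios
• Academic institutions studying how multimodal AI systems use tools in real-world tasks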

Getting Started

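1. Clone the project repository from GitHub to your local environment
2. Install the Python dependencies and configure the evaluation environment as the project requires
3. Run the sample evaluation scripts, or visit the online leaderboard to view benchmark results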
