AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)


Overview

AgentBench is a comprehensive benchmark platform designed to evaluate Large Language Models (LLMs) as autonomous agents across multiple complex tasks. Published at ICLR'24, it provides researchers and developers with a standardized framework to assess agent capabilities in diverse domains, including interactive environments, database operations, knowledge graphs, operating system interactions, and web-based shopping scenarios. The latest version, AgentBench FC, introduces function-calling capabilities and integrates with the AgentRL framework for end-to-end multitask and multiturn agent training.

The platform is fully containerized with Docker Compose, ensuring reproducible evaluation environments across different systems. With support for five core task categories (AlfWorld, DBBench, KnowledgeGraph, OS Interaction, and WebShop), AgentBench enables systematic comparison of agent performance using function-calling style prompts. The benchmark includes a public leaderboard for tracking progress and maintains an active research community on Slack for collaboration and knowledge sharing, making it an essential tool for advancing research in autonomous AI agents.

Pros

  • + Comprehensive evaluation across five diverse task domains with standardized metrics and reproducible containerized environments
  • + Function-calling integration with AgentRL framework enables end-to-end agent training and sophisticated multiturn interactions
  • + Active research community with public leaderboard, Slack workspace, and ongoing collaboration for benchmark improvements

Cons

  • - Complex setup requiring multiple Docker images and external data dependencies such as the Freebase database
  • - Primarily research-focused with limited documentation for production deployment scenarios
  • - Resource-intensive containerized environment may require significant computational resources for full evaluation

Getting Started

1. Download the required Docker images (MySQL and the OS-interaction images) and build the custom containers from the provided Dockerfiles
2. Download and configure external dependencies such as the Freebase data, placed at ./virtuoso_db/virtuoso.db
3. Execute 'docker compose -f extra/docker-compose.yml up' to launch the complete benchmark environment with all task workers
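The steps above can be sketched as a shell session. Only the Virtuoso data path and the compose command come from the repository's instructions; the image names and build paths below are illustrative assumptions, so check the project's Dockerfiles and docs for the exact names.

```shell
# 1. Pull base images and build the task containers
#    (image tags and Dockerfile paths here are assumptions, not the repo's exact names)
docker pull mysql                       # backend for DBBench
# docker build -t agentbench/os-worker -f dockerfiles/os.Dockerfile .

# 2. Place the downloaded Freebase/Virtuoso database at the expected location
mkdir -p ./virtuoso_db
# mv <downloaded-virtuoso-dump> ./virtuoso_db/virtuoso.db

# 3. Launch the complete benchmark environment with all task workers
docker compose -f extra/docker-compose.yml up
```

Running `docker compose ... up` in the foreground keeps worker logs visible, which is useful for verifying that each task container (AlfWorld, DBBench, KnowledgeGraph, OS Interaction, WebShop) starts cleanly; add `-d` to run detached once the setup is confirmed.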