AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

3.3k Stars · +38 Stars/month · 0 Releases (6m)

Star Growth: +13 stars (0.4%) between Mar 27 and Apr 1

Overview

AgentBench is a comprehensive benchmark platform designed to evaluate Large Language Models (LLMs) as autonomous agents across multiple complex tasks. Published at ICLR'24, it gives researchers and developers a standardized framework for assessing agent capabilities in diverse domains, including interactive environments, database operations, knowledge graphs, operating-system interaction, and web-based shopping.

The latest version, AgentBench FC, introduces function-calling capabilities and integrates with the AgentRL framework for end-to-end multi-task, multi-turn agent training. It supports five core task categories (AlfWorld, DBBench, KnowledgeGraph, OS Interaction, and WebShop) and enables systematic comparison of agent performance using function-calling style prompts. Deployment is fully containerized with Docker Compose, ensuring reproducible evaluation environments across different systems.

The benchmark maintains a public leaderboard for tracking progress and an active research community on Slack for collaboration and knowledge sharing, making it an essential tool for advancing research on autonomous AI agents.

Deep Analysis

Capabilities

  • Comprehensive benchmark evaluating LLMs as autonomous agents across 8 diverse task environments
  • Environments: OS interaction, Database, Knowledge Graph, Card Game, Lateral Thinking, House-holding, Web Shopping, Web Browsing
  • Function-calling version (AgentBench FC) integrated with AgentRL framework
  • Fully containerized deployment with Docker Compose
  • Multi-turn interaction evaluation requiring ~4k (dev) and ~13k (test) model generations
  • Leaderboard tracking model performance across tasks
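The multi-turn, function-calling interaction style these environments evaluate can be sketched in Python. Everything below (`ToyOSEnv`, `scripted_agent`, the tool names `bash` and `answer`) is a hypothetical stand-in for illustration, not AgentBench's actual task-worker API:

```python
class ToyOSEnv:
    """Hypothetical stand-in for an OS-interaction task worker (not the real API)."""
    def __init__(self):
        self.files = {"notes.txt": "hello"}
        self.done = False

    def observe(self):
        return "Goal: count the files in the current directory."

    def step(self, call):
        # Accepts a function-calling style action: {"name": ..., "arguments": {...}}
        if call["name"] == "bash":
            if call["arguments"]["command"] == "ls | wc -l":
                return str(len(self.files))
            return "unknown command"
        if call["name"] == "answer":
            self.done = True
            return "submitted"
        return "unknown tool"

def scripted_agent(history):
    """Hypothetical agent: replays a fixed plan instead of querying an LLM."""
    plan = [
        {"name": "bash", "arguments": {"command": "ls | wc -l"}},
        {"name": "answer", "arguments": {"value": 1}},
    ]
    return plan[len(history)]

def run_episode(env, agent, max_turns=5):
    """Drive the agent-environment loop until the task ends or turns run out."""
    history = []
    env.observe()
    for _ in range(max_turns):
        call = agent(history)
        result = env.step(call)
        history.append((call, result))
        if env.done:
            break
    return history

history = run_episode(ToyOSEnv(), scripted_agent)
print(len(history))  # 2 turns: one tool call, one final answer
```

In the real benchmark the scripted plan is replaced by an LLM emitting function calls, and each turn's tool result is appended to the conversation history, which is why full runs require thousands of generations.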

🔗 Integrations

OpenAI API · Docker · MySQL · Redis · AgentRL framework · Mind2Web · WebShop · ALFWorld

Best For

  • LLM researchers benchmarking agent capabilities across diverse environments
  • Teams evaluating and comparing AI agent architectures
  • Organizations selecting LLMs for autonomous agent applications

Languages

Python

Deployment

Docker Compose (containerized environments) · Python 3.9+ CLI

Pricing Detail

Free: Open-source research project
Paid: LLM API costs for running benchmarks
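Since a full test run requires roughly 13k model generations, API spend can be estimated up front. The token counts and per-token prices below are placeholder assumptions, not actual provider rates; substitute your model's real pricing:

```python
# Rough cost estimate for a full test run (~13k generations, per the benchmark docs).
generations = 13_000
avg_prompt_tokens = 2_000       # assumed: multi-turn histories grow long
avg_completion_tokens = 200     # assumed
price_per_1k_prompt = 0.0025    # assumed USD per 1k prompt tokens (placeholder)
price_per_1k_completion = 0.01  # assumed USD per 1k completion tokens (placeholder)

cost = generations * (
    avg_prompt_tokens / 1000 * price_per_1k_prompt
    + avg_completion_tokens / 1000 * price_per_1k_completion
)
print(f"estimated spend: ~${cost:,.0f}")
```

Prompt tokens dominate here because each turn resends the growing interaction history, so long-horizon tasks cost disproportionately more than the generation count alone suggests.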

Known Limitations

  • Resource-intensive — requires running multiple Docker containers simultaneously
  • Full test set requires ~13k model API calls
  • Primarily research-oriented — not for production use
  • Setup complexity with multiple dependent services

Pros

  • + Comprehensive evaluation across five diverse task domains with standardized metrics and reproducible containerized environments
  • + Function-calling integration with the AgentRL framework enables end-to-end agent training and sophisticated multi-turn interactions
  • + Active research community with public leaderboard, Slack workspace, and ongoing collaboration for benchmark improvements

Cons

  • - Complex setup requiring multiple Docker images and external data dependencies like Freebase database
  • - Primarily research-focused with limited documentation for production deployment scenarios
  • - Resource-intensive containerized environment may require significant computational resources for full evaluation

Use Cases

  • Research teams evaluating and comparing different LLM agent architectures across standardized benchmark tasks
  • AI companies developing autonomous agents who need systematic performance assessment before deployment
  • Academic institutions studying agent capabilities in interactive environments, databases, and web-based scenarios

Getting Started

1. Download the required Docker images (MySQL, OS-interaction images) and build the custom containers using the provided Dockerfiles.
2. Download and configure external dependencies, e.g. the Freebase data at ./virtuoso_db/virtuoso.db.
3. Execute `docker compose -f extra/docker-compose.yml up` to launch the complete benchmark environment with all task workers.
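Under the assumptions that the compose file also builds the task images and that the Freebase dump has been obtained separately (its download source is distribution-specific), the three steps above might look like the following shell session:

```shell
# 1. Build/pull the task images declared in the compose file
#    (assumption: the repo's Dockerfiles are wired into this compose file)
docker compose -f extra/docker-compose.yml build

# 2. Verify the Freebase/Virtuoso data sits where the KnowledgeGraph worker expects it
ls ./virtuoso_db/virtuoso.db

# 3. Launch all task workers
docker compose -f extra/docker-compose.yml up
```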
