bananalyzer

Open source AI Agent evaluation framework for web tasks 🐒🍌

open-source, observability-evaluation


327 Stars


Overview

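Bananalyzer is an open-source evaluation framework for AI agents, built specifically to measure how agents perform on web tasks. It uses Playwright for browser automation and creates and saves static snapshots of websites so that evaluations are consistent and repeatable. This addresses sites changing over time, network latency, and anti-bot protection, giving agents a stable test environment.

Bananalyzer defines test examples using a schema similar to Mind2Web and WebArena. Test cases are stored in an examples.json file and include metadata such as the agent's goal, the expected output, and the page snapshot. The current focus is on evaluating tasks that extract structured JSON data from web pages. A CLI tool dynamically builds a pytest test suite and runs the evaluation; to integrate your own agent, you only need to implement the AgentRunner interface.

The project's value lies in providing a standardized, repeatable way to evaluate agents on web tasks. As agents are used more widely for web automation and data extraction, reliable tools are needed to measure how different agents perform. Bananalyzer fills this gap, particularly for diverse website structures and industry use cases. It is still under development, with plans to support more complex multi-step evaluations and to incorporate existing web task datasets.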
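To make the overview more concrete, here is a minimal sketch of what an example record and an agent runner could look like. The field names (`id`, `url`, `goal`, `expected`, `snapshot_path`) and the `run` signature are assumptions for illustration only; the listing names the examples.json file and the AgentRunner interface but not their exact shape, so check the repository for the real schema.

```python
import asyncio
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class Example:
    """Hypothetical shape of one entry in examples.json."""
    id: str
    url: str
    goal: str                 # what the agent is asked to do
    expected: dict[str, Any]  # expected structured JSON output
    snapshot_path: str        # where the saved page snapshot lives


class AgentRunner(ABC):
    """Hypothetical stand-in for the interface an agent must implement."""

    @abstractmethod
    async def run(self, example: Example) -> dict[str, Any]:
        """Return the structured data the agent extracted for this example."""


class EchoRunner(AgentRunner):
    """Toy agent that simply returns the expected output (for demonstration)."""

    async def run(self, example: Example) -> dict[str, Any]:
        return dict(example.expected)


# Inline JSON standing in for the contents of an examples file.
RAW = """
[{"id": "demo-1",
  "url": "https://example.com/item",
  "goal": "Extract the product title and price",
  "expected": {"title": "Widget", "price": 9.99},
  "snapshot_path": "snapshots/demo-1.mhtml"}]
"""

examples = [Example(**entry) for entry in json.loads(RAW)]
result = asyncio.run(EchoRunner().run(examples[0]))
print(result == examples[0].expected)
```

A real runner would drive a browser against the snapshot instead of echoing the expected value; the point here is only the separation between the example data and the agent under test.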

Deep Analysis

Key Differentiator

Compared with live-site testing frameworks, bananalyzer evaluates agents against static, historical website snapshots. This removes variability caused by site changes, network latency, and bot protections, which makes web agent evaluation reproducible and reliable.
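For context on what a static snapshot contains: the capabilities list says snapshots are stored as MHTML, which is a MIME multipart document, so Python's standard `email` package can read one. The sketch below uses a tiny hand-written MHTML string (not a real capture) and lists the resources stored inside it. It illustrates the file format only and is not a description of how bananalyzer itself loads snapshots.

```python
from email import message_from_string

# Hand-written miniature MHTML document, for illustration only.
MHTML = "\n".join([
    "Subject: Demo page",
    "MIME-Version: 1.0",
    'Content-Type: multipart/related; type="text/html"; boundary="----demo"',
    "",
    "------demo",
    "Content-Type: text/html",
    "Content-Location: https://example.com/",
    "",
    "<html><body><h1>Widget</h1></body></html>",
    "------demo",
    "Content-Type: text/css",
    "Content-Location: https://example.com/style.css",
    "",
    "h1 { color: black; }",
    "------demo--",
    "",
])


def list_resources(mhtml_text: str) -> list[tuple[str, str]]:
    """Return (content type, original URL) for every part in the snapshot."""
    message = message_from_string(mhtml_text)
    return [
        (part.get_content_type(), part.get("Content-Location", ""))
        for part in message.walk()
        if not part.is_multipart()  # skip the multipart container itself
    ]


resources = list_resources(MHTML)
for content_type, location in resources:
    print(content_type, location)
```

Because the page and its assets travel together in one file, replaying the snapshot does not depend on the live site still serving the same content.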

⚡ Capabilities

• Open-source AI agent evaluation framework for web automation tasks
• Playwright-based test execution against example websites
• Static MHTML snapshots eliminate variability from site changes and bot protections
• CLI tool for running individual or batch evaluation tests
• Structured JSON output extraction comparison against expected results
• Notebook-based automation for creating evaluation examples
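The structured-output comparison listed above can be pictured as a recursive diff between the expected JSON and what the agent returned. The following is a minimal sketch under that assumption; `diff_json` is a hypothetical helper written for this page, and the framework's actual comparison rules may differ.

```python
from typing import Any


def diff_json(expected: Any, actual: Any, path: str = "$") -> list[str]:
    """Return human-readable mismatches between two JSON-like values."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        problems: list[str] = []
        for key in sorted(expected.keys() | actual.keys()):
            child = f"{path}.{key}"
            if key not in actual:
                problems.append(f"{child}: missing from agent output")
            elif key not in expected:
                problems.append(f"{child}: unexpected key")
            else:
                problems.extend(diff_json(expected[key], actual[key], child))
        return problems
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return [f"{path}: expected {len(expected)} items, got {len(actual)}"]
        problems = []
        for index, (exp_item, act_item) in enumerate(zip(expected, actual)):
            problems.extend(diff_json(exp_item, act_item, f"{path}[{index}]"))
        return problems
    # Scalars (or mismatched container types) are compared directly.
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []


expected = {"title": "Widget", "price": 9.99, "tags": ["new", "sale"]}
actual = {"title": "Widget", "price": 10.99, "tags": ["new", "sale"]}
mismatches = diff_json(expected, actual)
print(mismatches)
```

Reporting the path of each mismatch, rather than a single pass/fail flag, makes it easier to see which field an agent got wrong.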

🔗 Integrations

Playwright, FastAPI, pytest, Chrome Developer API

✓ Best For

✓ Evaluating AI agent performance on web information retrieval
✓ Benchmarking structured data extraction accuracy across diverse sites
✓ Reproducible web agent testing with static snapshots

✗ Not Ideal For

✗ Real-time website interaction testing (uses static snapshots)
✗ Complex multi-step navigation workflows
✗ Scenarios requiring authentication or CAPTCHA handling

Languages

Python

Deployment

pip install (CLI), FastAPI server with uvicorn

⚠ Known Limitations

⚠ Work in progress with incomplete features
⚠ Currently supports only single-page evaluations
⚠ Limited multi-step scenario support
⚠ Requires manual example creation
⚠ No authentication, CAPTCHA, or pop-up handling yet

Pros

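+ Saves page state as MHTML snapshots, keeping evaluations consistent and repeatable regardless of changes to the live site
+ Follows the established Mind2Web and WebArena dataset schema, providing a standardized evaluation framework and a rich set of test cases
+ Integrates Playwright browser automation, supporting evaluation of real page interaction and complex DOM operations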

Cons

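- Still under development: features are incomplete and stability issues are possible
- Currently focused on structured data extraction, with limited support for complex multi-step web tasks
- Requires implementing the AgentRunner interface, which raises the technical bar and makes onboarding harder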

Use Cases

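• Evaluating how accurately AI agents extract data from sites in different industries, such as e-commerce sites and news portals
• Comparing different AI agents on the same web tasks to support agent selection with data
• Giving agent development teams a standardized test environment to verify agent reliability on web automation tasks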

Getting Started

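1. Clone the bananalyzer repository from GitHub and install the Python dependencies.
2. Implement the AgentRunner interface, defining your agent's logic and how it interacts with the page.
3. Use the CLI tool to run the test cases in examples.json and evaluate your agent on web tasks.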
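The steps above amount to loading examples, running the agent on each one, and checking the result against the expected output. The sketch below mirrors that flow by building one test case per example at runtime. It uses the standard-library `unittest` module as a stand-in for pytest so that it runs without extra dependencies. The example records, `toy_agent`, and `make_test` are hypothetical; the real CLI commands and interface are documented in the repository.

```python
import asyncio
import io
import unittest
from typing import Any, Awaitable, Callable

# Hypothetical examples; real ones would be read from examples.json.
EXAMPLES = [
    {"id": "demo-1", "goal": "Extract the title", "expected": {"title": "Widget"}},
    {"id": "demo-2", "goal": "Extract the price", "expected": {"price": 9.99}},
]

Agent = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


async def toy_agent(example: dict[str, Any]) -> dict[str, Any]:
    """Placeholder agent that always answers with the title."""
    return {"title": "Widget"}


def make_test(example: dict[str, Any], agent: Agent) -> unittest.FunctionTestCase:
    """Wrap one example in a test case that runs the agent and checks its output."""

    def check() -> None:
        actual = asyncio.run(agent(example))
        assert actual == example["expected"], f"{example['id']}: got {actual!r}"

    return unittest.FunctionTestCase(check, description=example["id"])


# Build the suite dynamically, one test per example, then run it quietly.
suite = unittest.TestSuite(make_test(example, toy_agent) for example in EXAMPLES)
outcome = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
passed = outcome.testsRun - len(outcome.failures) - len(outcome.errors)
print(f"{passed}/{outcome.testsRun} examples passed")
```

Generating tests from data keeps the example set and the test harness separate, so adding a new example does not require writing new test code.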
