evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

18.1k Stars · +113 Stars/month · 0 Releases (6m)

Star Growth

+22 (0.1%) [Chart: stars from ~17.7k to ~18.5k, Mar 27 – Apr 1]

Overview

OpenAI Evals is a framework for evaluating large language models (LLMs) and LLM-based systems. The tool provides an open-source registry of benchmarks spanning multiple evaluation dimensions for testing the performance of OpenAI models. Beyond the existing evaluation suites, Evals lets users write custom evals for their specific use cases. The framework underscores how central evaluation is to LLM development; as OpenAI President Greg Brockman has put it, writing high-quality evals is one of the most impactful things you can do when building LLM applications. Evals can now be configured and run directly in the OpenAI Dashboard, and it also supports local deployment. It further supports building evals on private data without exposing that data publicly, which is especially valuable for enterprise users, and it integrates with Weights & Biases for richer experiment tracking and visualization.

Deep Analysis

Capabilities

  • Framework for evaluating LLMs and LLM-based systems through pre-built and custom evaluation tests
  • Registry of existing evaluations for testing models across multiple dimensions
  • Custom eval creation for domain-specific use cases
  • Private eval building using proprietary data without public exposure
  • Model-graded evals configurable via YAML without coding
  • Completion function protocol for evaluating prompt chains and tool-using agents
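The completion function protocol mentioned above is what lets Evals grade anything that maps a prompt to completions, whether a raw model, a prompt chain, or a tool-using agent. A minimal sketch of that shape, assuming the `CompletionFn`/`CompletionResult` interface described in the project's docs (the echo implementation is purely illustrative):

```python
# Sketch of the completion-function protocol Evals uses to wrap anything
# that turns a prompt into completions. The Protocol shapes mirror the
# CompletionFn / CompletionResult interface from the evals docs; the
# echo "model" below is a hypothetical stand-in.
from typing import Any, Protocol


class CompletionResult(Protocol):
    def get_completions(self) -> list[str]: ...


class CompletionFn(Protocol):
    def __call__(self, prompt: Any, **kwargs: Any) -> CompletionResult: ...


class EchoResult:
    """Wraps raw text so the harness can read completions uniformly."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class EchoCompletionFn:
    """Toy 'model' that echoes the last user message back."""

    def __call__(self, prompt: Any, **kwargs: Any) -> EchoResult:
        if isinstance(prompt, list):  # chat-style list of messages
            prompt = prompt[-1]["content"]
        return EchoResult(str(prompt))


fn = EchoCompletionFn()
result = fn([{"role": "user", "content": "2+2?"}])
print(result.get_completions())  # ['2+2?']
```

Because the harness only ever calls `get_completions()`, an eval written against this protocol works unchanged whether the completions come from a single API call or a multi-step agent.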

🔗 Integrations

OpenAI API · Weights & Biases · Snowflake · Git-LFS

Best For

  • Teams systematically evaluating LLM performance across model versions
  • Prompt engineers needing no-code YAML-based evaluation workflows
  • Organizations building quality assurance pipelines for LLM applications

Languages

Python

Deployment

pip install · CLI-based execution · Dashboard interface

Pricing Detail

Free: Open-source framework (MIT License)
Paid: OpenAI API costs for running evaluations

Known Limitations

  • Python 3.9+ required
  • Primarily designed for OpenAI models — limited multi-provider support
  • API costs accumulate with large evaluation suites
  • Git-LFS required for managing evaluation data registry

Pros

  • + Complete LLM evaluation framework with a rich registry of pre-built benchmarks
  • + Supports custom eval development, tailored to specific business scenarios and use cases
  • + Can now run directly in the OpenAI Dashboard or be deployed locally, offering flexible usage

Cons

  • - Requires an OpenAI API key and associated fees; running evaluations can incur non-trivial costs
  • - Uses Git-LFS to store evaluation data, adding complexity to initial setup
  • - Primarily optimized for OpenAI models; support for other LLM providers may be limited

Use Cases

  • Test how different OpenAI model versions affect the performance of specific business workflows
  • Build custom benchmarks and evaluation metrics for domain-specific LLM applications
  • Create internal evaluation suites from proprietary enterprise data without exposing sensitive information
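The private-data use case starts with a samples file. A minimal sketch, assuming the JSONL sample shape used by Evals' basic match-style evals (a chat-formatted "input" plus an "ideal" answer); the record source and field names here are stand-ins for your own data pipeline:

```python
# Sketch: turn proprietary Q&A records into an Evals-style samples.jsonl.
# Each line holds a chat-formatted "input" plus the "ideal" answer; the
# internal_qa records and file name are hypothetical.
import json
from pathlib import Path

internal_qa = [  # stand-in for records pulled from a private store
    {"question": "What is our refund window?", "answer": "30 days"},
    {"question": "Which tier includes SSO?", "answer": "Enterprise"},
]

out = Path("samples.jsonl")
with out.open("w", encoding="utf-8") as f:
    for rec in internal_qa:
        sample = {
            "input": [
                {"role": "system", "content": "Answer concisely."},
                {"role": "user", "content": rec["question"]},
            ],
            "ideal": rec["answer"],
        }
        f.write(json.dumps(sample) + "\n")

print(f"wrote {len(internal_qa)} samples to {out}")
```

The file never has to leave your infrastructure: the eval runs locally against it, and only aggregate metrics need to be shared.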

Getting Started

1. Install the framework: install the evals package with pip install -e . (from a clone of the repository).
2. Configure the API: set the OPENAI_API_KEY environment variable.
3. Run evaluations: run existing evals, or create custom ones, via the OpenAI Dashboard or the local command line.
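Creating a custom eval for step 3 comes down to a registry entry. A sketch, assuming the YAML registry format described in the project's docs (an alias entry plus a versioned entry pointing at an eval class and a samples file); the `refund-qa` name and paths are hypothetical:

```yaml
# Hypothetical registry entry (e.g. evals/registry/evals/refund-qa.yaml)
# wiring a custom eval to the built-in exact-match class and a samples file.
refund-qa:
  id: refund-qa.dev.v0
  description: Checks answers to internal refund-policy questions.
  metrics: [accuracy]

refund-qa.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: refund-qa/samples.jsonl
```

With the samples file in place under the registry's data directory, the eval can then be run from the CLI with oaieval, e.g. `oaieval gpt-3.5-turbo refund-qa`.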
