auto-evaluator

Evaluation tool for LLM QA chains

782 stars · +65 stars/month · 0 releases in the last 6 months

Overview

Auto-evaluator is a lightweight tool for evaluating and optimizing question-answering (QA) systems built with Langchain. It automates the entire evaluation pipeline: it takes user-provided documents and uses GPT-3.5-turbo to automatically generate question-answer pairs from the content. The tool then creates QA chains with various configurable parameters, including text splitting methods, chunk sizes, embedding approaches, retrieval strategies, and language models. It systematically tests each configuration by generating responses to the auto-generated questions and uses GPT-3.5-turbo to score how well each response matches the expected answer.

This automated approach eliminates the manual effort of creating evaluation datasets and provides quantitative insights into which configuration parameters work best for specific document types and use cases. The tool features a Streamlit-based web interface that makes it accessible to both technical and non-technical users, allowing them to experiment with different settings and visualize performance across various chain configurations.

With 782 GitHub stars, it has become a popular choice for developers and researchers looking to optimize their RAG (Retrieval-Augmented Generation) systems. It is particularly valuable for teams building document-based QA applications who need a systematic way to compare retrieval methods, chunk sizes, and model choices.
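The generate-answer-grade sweep described above can be sketched in plain Python. The parameter names, the `score_config` stub, and the grid values below are illustrative assumptions, not the tool's actual API; in the real tool, a Langchain QA chain produces each answer and GPT-3.5-turbo grades it against the expected one.

```python
from itertools import product

# Hypothetical parameter grid mirroring the knobs auto-evaluator exposes
grid = {
    "chunk_size": [500, 1000],
    "retriever": ["similarity", "svm"],
    "model": ["gpt-3.5-turbo"],
}

def score_config(config, qa_pairs):
    """Stub scorer: returns the fraction of questions answered correctly.

    In the real tool this builds a QA chain from `config`, answers each
    auto-generated question, and has GPT-3.5-turbo grade the response
    against the expected answer. Here every pair counts as correct.
    """
    return sum(1 for _ in qa_pairs) / max(len(qa_pairs), 1)

def sweep(grid, qa_pairs):
    """Evaluate every combination in the grid, best score first."""
    results = []
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        results.append((config, score_config(config, qa_pairs)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Ranking configurations this way is what lets the tool surface which chunk size or retrieval strategy works best for a given document set.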

Pros

  • + Fully automated evaluation pipeline that generates question-answer pairs from documents without manual dataset creation
  • + Comprehensive configuration testing across multiple parameters including chunk sizes, retrieval methods, and embedding approaches
  • + User-friendly Streamlit interface with hosted versions available on HuggingFace and langchain.com for easy access

Cons

  • - Requires paid API access to both OpenAI and Anthropic services for full functionality
  • - Relies on GPT-3.5-turbo for both question generation and response scoring, which may introduce model-specific biases
  • - Evaluation quality depends on the automatic question generation, which may not capture all important aspects of document content

Use Cases

Getting Started

Install dependencies with 'pip install -r requirements.txt' and configure your OpenAI and Anthropic API keys in your environment. Then launch the interface with 'streamlit run auto-evaluator.py' and upload your documents to begin automated evaluation.
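The setup steps above can be sketched as a short shell session. The environment variable names are assumptions based on the common OpenAI and Anthropic client conventions; check the repository's README for the names it actually reads.

```shell
# Install the project's Python dependencies
pip install -r requirements.txt

# Configure API keys (assumed variable names; substitute your own keys)
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Launch the Streamlit web interface, then upload documents in the browser
streamlit run auto-evaluator.py
```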