hallucination-leaderboard
Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
Overview
A comprehensive leaderboard that evaluates and ranks Large Language Models (LLMs) by their tendency to hallucinate when summarizing short documents. Built by Vectara using its Hughes Hallucination Evaluation Model (HHEM), it provides objective metrics for comparing model reliability, tracking hallucination rate, factual consistency rate, answer rate, and average summary length.

With over 3,000 GitHub stars and regular updates, the leaderboard has become a trusted resource for researchers and developers selecting models for production summarization tasks. It is available both as a static GitHub repository with detailed data and as an interactive Hugging Face interface for easy exploration. Coverage spans models from major providers including OpenAI, Google, Anthropic, Meta, Amazon, and Microsoft, as well as emerging AI companies, making it an essential reference on the current state of LLM reliability in text summarization.
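The four headline metrics above can be illustrated with a small sketch. The per-document scores, the 0.5 consistency threshold, and the refusal count below are all hypothetical, chosen only to show how the leaderboard-style numbers relate to one another:

```python
# Sketch: deriving leaderboard-style metrics from hypothetical per-document
# consistency scores. Scores and the 0.5 threshold are illustrative, not
# the leaderboard's actual pipeline.

def leaderboard_metrics(scores, summaries, refusals):
    """scores: consistency score in [0, 1] per answered document;
    summaries: the generated summaries for those documents;
    refusals: count of documents the model declined to summarize."""
    answered = len(scores)
    total = answered + refusals
    consistent = sum(1 for s in scores if s >= 0.5)  # assumed threshold
    factual_consistency_rate = consistent / answered * 100
    return {
        "hallucination_rate": round(100 - factual_consistency_rate, 1),
        "factual_consistency_rate": round(factual_consistency_rate, 1),
        "answer_rate": round(answered / total * 100, 1),
        "avg_summary_length": round(
            sum(len(s.split()) for s in summaries) / answered, 1
        ),
    }

scores = [0.92, 0.81, 0.43, 0.77, 0.95]  # hypothetical HHEM-style scores
summaries = ["a b c", "a b c d", "a b", "a b c d e", "a b c"]
metrics = leaderboard_metrics(scores, summaries, refusals=1)
# One low-scoring summary out of five answered -> 20.0% hallucination rate;
# one refusal out of six documents -> 83.3% answer rate.
```

Note that hallucination rate and factual consistency rate are complements, which is why the leaderboard can report either without loss of information.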
Deep Analysis
The only continuously updated, automated hallucination benchmark that uses a dedicated evaluation model (HHEM) rather than human annotation, enabling scalable, repeatable factual-consistency measurement across 7,700+ test documents
⚡ Capabilities
- • Ranks LLMs by factual consistency when summarizing documents, scored with the HHEM-2.3 evaluation model
- • Tests across 7,700+ diverse articles spanning news, tech, science, medicine, legal, sports, education
- • Automated evaluation pipeline enabling continuous updates as new models emerge
- • Open-source HHEM-2.1 model available for self-hosted evaluation
- • Commercial HHEM API with multi-language support
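The self-hosted path mentioned above can be sketched as follows. The model ID and `predict()` interface follow Vectara's HHEM-2.1-Open model card on Hugging Face, but treat the exact API as an assumption and verify it against the card; the load is wrapped in a function so the snippet stays runnable without downloading weights:

```python
# Sketch of self-hosted hallucination scoring with the open HHEM-2.1 model.
# Model ID and predict() usage follow the HHEM-2.1-Open model card
# (assumption here; check the card before relying on this interface).

def hhem_scores(pairs):
    """pairs: list of (source_document, summary) tuples.
    Returns one consistency score in [0, 1] per pair."""
    # Heavy import kept local so the module imports without transformers.
    from transformers import AutoModelForSequenceClassification
    model = AutoModelForSequenceClassification.from_pretrained(
        "vectara/hallucination_evaluation_model", trust_remote_code=True
    )
    return model.predict(pairs)  # higher score = more factually consistent

# Example input shape (not executed here, to avoid downloading weights):
pairs = [
    ("The capital of France is Paris.", "Paris is France's capital."),
    ("The capital of France is Paris.", "The capital of France is Berlin."),
]
```

A summary scoring well below its siblings on such pairs is the signal the leaderboard aggregates into a model's hallucination rate.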
✓ Best For
- ✓ Teams evaluating LLM reliability for RAG systems where factual accuracy is critical
- ✓ Researchers benchmarking model truthfulness for document summarization tasks
✗ Not Ideal For
- ✗ Measuring overall LLM quality or coherence — use LMSYS Chatbot Arena instead
- ✗ Evaluating closed-book QA or creative generation — this only measures summarization fidelity
⚠ Known Limitations
- ⚠ Evaluates only summarization factual consistency — not all hallucination types
- ⚠ English-language only (multi-language planned)
- ⚠ Cannot detect hallucinations for arbitrary questions without reference documents
- ⚠ Single-metric approach — should be combined with other evaluation frameworks
Pros
- + Regularly updated with latest model versions and performance data, ensuring current relevance for model selection decisions
- + Uses standardized HHEM evaluation methodology providing consistent and comparable metrics across all tested models
- + Reports multiple metrics beyond hallucination rate, including factual consistency rate, answer rate, and summary-length statistics
Cons
- - Limited to summarization tasks only, not covering other common LLM use cases like code generation or creative writing
- - No documented API for pulling leaderboard results programmatically into model-selection workflows
Use Cases
- • Selecting the most reliable LLM for production summarization applications where factual accuracy is critical
- • Academic research into hallucination patterns and model reliability across different architectures and training approaches
- • Benchmarking new models against established baselines to evaluate improvements in factual consistency