hallucination-leaderboard
Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
Overview
A comprehensive leaderboard that evaluates and ranks Large Language Models (LLMs) by how often they hallucinate when summarizing short documents. Built by Vectara using its Hughes Hallucination Evaluation Model (HHEM), it provides objective, comparable reliability metrics for each model: hallucination rate, factual consistency rate, answer rate, and average summary length. With over 3,000 GitHub stars and regular updates, it has become a trusted resource for researchers and developers selecting models for production summarization tasks. Results are published both as a static GitHub repository with detailed data and as an interactive Hugging Face interface for easy exploration. Coverage spans models from major providers including OpenAI, Google, Anthropic, Meta, Amazon, and Microsoft, as well as emerging AI companies, making it an essential reference for understanding the current state of LLM reliability in text summarization.
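As a rough illustration of how the leaderboard's headline numbers relate to HHEM scores, the minimal sketch below loads Vectara's open HHEM checkpoint from Hugging Face, scores (source, summary) pairs, and derives a hallucination rate and factual consistency rate. The model ID, the `predict()` helper exposed by the checkpoint's custom code, and the 0.5 decision threshold are assumptions taken from the public model card and may change; check the card before relying on them, and note the real leaderboard pipeline may aggregate differently.

```python
# Sketch: score (source, summary) pairs with Vectara's open HHEM checkpoint
# and aggregate them into leaderboard-style metrics. Assumptions: the model
# id, the predict() helper provided by its custom code, and the 0.5 cutoff
# for calling a summary "consistent" all follow the public model card.
from transformers import AutoModelForSequenceClassification

MODEL_ID = "vectara/hallucination_evaluation_model"  # assumed HF model id
THRESHOLD = 0.5  # assumed cutoff: scores below this count as hallucinations


def evaluate_summaries(pairs):
    """pairs: list of (source_document, generated_summary) tuples."""
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, trust_remote_code=True
    )
    scores = model.predict(pairs)  # one consistency score in [0, 1] per pair
    consistent = sum(1 for s in scores if s >= THRESHOLD)
    factual_consistency_rate = 100.0 * consistent / len(pairs)
    hallucination_rate = 100.0 - factual_consistency_rate
    return hallucination_rate, factual_consistency_rate


if __name__ == "__main__":
    demo_pairs = [
        ("The capital of France is Paris.", "Paris is France's capital."),
        ("The capital of France is Paris.", "The capital of France is Berlin."),
    ]
    hall, fcr = evaluate_summaries(demo_pairs)
    print(f"Hallucination rate: {hall:.1f}%  Factual consistency rate: {fcr:.1f}%")
```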
Pros
- + Regularly updated with the latest model versions and performance data, keeping it relevant for model selection decisions
- + Uses the standardized HHEM evaluation methodology, providing consistent, comparable metrics across all tested models
- + Reports metrics beyond hallucination rate alone, including factual consistency rate, answer rate, and average summary length
Cons
- - Limited to summarization; does not cover other common LLM use cases such as code generation or creative writing
- - No API access mentioned for programmatic integration into model selection workflows (a possible workaround is sketched after this list)
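Since no dedicated API is documented, one common workaround is to read the results directly from the repository README, which carries the leaderboard as a markdown table. The sketch below assumes the table still lives in README.md on the main branch of the vectara/hallucination-leaderboard repository and remains in markdown form; both are assumptions that break if the repo is restructured.

```python
# Sketch: pull the leaderboard table out of the repository README.
# Assumption: the README on the main branch still contains the results as a
# markdown table -- adjust the URL and parsing if the layout changes.
import requests

README_URL = (
    "https://raw.githubusercontent.com/vectara/hallucination-leaderboard/main/README.md"
)


def fetch_leaderboard_rows():
    text = requests.get(README_URL, timeout=30).text
    rows = []
    for line in text.splitlines():
        # Markdown table rows look like "|Model|Rate|...|"; skip the header
        # separator line, which is made only of pipes, dashes, and colons.
        if line.startswith("|") and not set(line) <= set("|-: "):
            cells = [c.strip() for c in line.strip("|").split("|")]
            rows.append(cells)
    return rows


if __name__ == "__main__":
    rows = fetch_leaderboard_rows()
    header, body = rows[0], rows[1:]
    print(header)
    for row in body[:5]:  # first five entries as ordered in the README
        print(row)
```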
Use Cases
- • Selecting the most reliable LLM for production summarization applications where factual accuracy is critical
- • Academic research into hallucination patterns and model reliability across different architectures and training approaches
- • Benchmarking new models against established baselines to evaluate improvements in factual consistency