hallucination-leaderboard
Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
Overview
A comprehensive leaderboard that evaluates and ranks Large Language Models (LLMs) by their tendency to hallucinate when summarizing short documents. Built by Vectara using its Hughes Hallucination Evaluation Model (HHEM), it provides objective metrics for comparing model reliability, tracking hallucination rate, factual consistency rate, answer rate, and average summary length.

With over 3,000 GitHub stars and regular updates, the leaderboard has become a trusted resource for researchers and developers selecting models for production summarization tasks. It is available both as a static GitHub repository with detailed data and as an interactive Hugging Face interface for easy exploration. Coverage spans models from major providers including OpenAI, Google, Anthropic, Meta, Amazon, and Microsoft, as well as emerging AI companies, making it an essential reference on the current state of LLM reliability in text summarization.
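The four headline metrics above can be illustrated with a small sketch. The per-document scores, the 0.5 consistency threshold, and the refusal count below are all hypothetical, chosen only to show how the leaderboard-style numbers relate to one another:

```python
# Sketch: deriving leaderboard-style metrics from hypothetical per-document
# consistency scores. Scores and the 0.5 threshold are illustrative, not
# the leaderboard's actual pipeline.

def leaderboard_metrics(scores, summaries, refusals):
    """scores: consistency score in [0, 1] per answered document;
    summaries: the generated summaries for those documents;
    refusals: count of documents the model declined to summarize."""
    answered = len(scores)
    total = answered + refusals
    consistent = sum(1 for s in scores if s >= 0.5)  # assumed threshold
    factual_consistency_rate = consistent / answered * 100
    return {
        "hallucination_rate": round(100 - factual_consistency_rate, 1),
        "factual_consistency_rate": round(factual_consistency_rate, 1),
        "answer_rate": round(answered / total * 100, 1),
        "avg_summary_length": round(
            sum(len(s.split()) for s in summaries) / answered, 1
        ),
    }

scores = [0.92, 0.81, 0.43, 0.77, 0.95]  # hypothetical HHEM-style scores
summaries = ["a b c", "a b c d", "a b", "a b c d e", "a b c"]
metrics = leaderboard_metrics(scores, summaries, refusals=1)
# One low-scoring summary out of five answered -> 20.0% hallucination rate;
# one refusal out of six documents -> 83.3% answer rate.
```

Note that hallucination rate and factual consistency rate are complements, which is why the leaderboard can report either without loss of information.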
Deep Analysis
The only continuously updated, automated hallucination benchmark that uses a dedicated evaluation model (HHEM) rather than human annotation, enabling scalable, repeatable factual-consistency measurement across 7,700+ test documents
⚡ Capabilities
- • Ranks LLMs by factual consistency when summarizing documents, scored with the HHEM-2.3 evaluation model
- • Tests across 7,700+ diverse articles spanning news, tech, science, medicine, legal, sports, education
- • Automated evaluation pipeline enabling continuous updates as new models emerge
- • Open-source HHEM-2.1 model available for self-hosted evaluation
- • Commercial HHEM API with multi-language support
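The self-hosted path mentioned above can be sketched as follows. The model ID and `predict()` interface follow Vectara's HHEM-2.1-Open model card on Hugging Face, but treat the exact API as an assumption and verify it against the card; the load is wrapped in a function so the snippet stays runnable without downloading weights:

```python
# Sketch of self-hosted hallucination scoring with the open HHEM-2.1 model.
# Model ID and predict() usage follow the HHEM-2.1-Open model card
# (assumption here; check the card before relying on this interface).

def hhem_scores(pairs):
    """pairs: list of (source_document, summary) tuples.
    Returns one consistency score in [0, 1] per pair."""
    # Heavy import kept local so the module imports without transformers.
    from transformers import AutoModelForSequenceClassification
    model = AutoModelForSequenceClassification.from_pretrained(
        "vectara/hallucination_evaluation_model", trust_remote_code=True
    )
    return model.predict(pairs)  # higher score = more factually consistent

# Example input shape (not executed here, to avoid downloading weights):
pairs = [
    ("The capital of France is Paris.", "Paris is France's capital."),
    ("The capital of France is Paris.", "The capital of France is Berlin."),
]
```

A summary scoring well below its siblings on such pairs is the signal the leaderboard aggregates into a model's hallucination rate.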
✓ Best For
- ✓ Teams evaluating LLM reliability for RAG systems where factual accuracy is critical
- ✓ Researchers benchmarking model truthfulness for document summarization tasks
✗ Not Ideal For
- ✗ Measuring overall LLM quality or coherence — use LMSYS Chatbot Arena instead
- ✗ Evaluating closed-book QA or creative generation — this only measures summarization fidelity
⚠ Known Limitations
- ⚠ Evaluates only summarization factual consistency — not all hallucination types
- ⚠ English-language only (multi-language planned)
- ⚠ Cannot detect hallucinations for arbitrary questions without reference documents
- ⚠ Single-metric approach — should be combined with other evaluation frameworks
Pros
- + Regularly updated with latest model versions and performance data, ensuring current relevance for model selection decisions
- + Uses standardized HHEM evaluation methodology providing consistent and comparable metrics across all tested models
- + Reports multiple metrics beyond hallucination rate, including factual consistency rate, answer rate, and summary-length statistics
Cons
- - Limited to summarization tasks only, not covering other common LLM use cases like code generation or creative writing
- - No documented API for pulling leaderboard results programmatically into model-selection workflows
Use Cases
- • Selecting the most reliable LLM for production summarization applications where factual accuracy is critical
- • Academic research into hallucination patterns and model reliability across different architectures and training approaches
- • Benchmarking new models against established baselines to evaluate improvements in factual consistency