hallucination-leaderboard

Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents

Tags: open-source, agent-frameworks
Stars: 3.2k (+263/month) · Releases (last 6 months): 0

Overview

A leaderboard that evaluates and ranks Large Language Models (LLMs) by how often they hallucinate when summarizing short documents. Built by Vectara using its Hughes Hallucination Evaluation Model (HHEM), it provides objective, comparable reliability metrics across models, tracking hallucination rate, factual consistency rate, answer rate, and average summary length.

With over 3,000 GitHub stars and regular updates, the leaderboard has become a trusted resource for researchers and developers selecting models for production summarization tasks. It is available both as a GitHub repository with detailed data and as an interactive Hugging Face interface for easy exploration. Coverage spans models from major providers, including OpenAI, Google, Anthropic, Meta, Amazon, and Microsoft, as well as emerging AI companies, making it a useful reference for the current state of LLM reliability in text summarization.
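The metrics tracked above are related by simple arithmetic: the hallucination rate is the complement of the factual consistency rate, and the answer rate is the fraction of prompts the model did not refuse. A minimal sketch of how these could be computed from per-document judgments (the function name and all values here are hypothetical, not actual leaderboard data):

```python
def summarize_metrics(judgments, answered, total_prompts, summary_lengths):
    """Compute leaderboard-style metrics from per-document judgments.

    judgments: list of booleans, True if a summary was judged factually
               consistent with its source document (e.g., by an evaluator
               model such as HHEM).
    """
    consistency_rate = sum(judgments) / len(judgments)
    return {
        # Hallucination rate is the complement of factual consistency.
        "hallucination_rate": 1.0 - consistency_rate,
        "factual_consistency_rate": consistency_rate,
        # Fraction of prompts the model actually answered (did not refuse).
        "answer_rate": answered / total_prompts,
        "avg_summary_length": sum(summary_lengths) / len(summary_lengths),
    }

# Illustrative values only: 97 of 100 summaries judged consistent.
m = summarize_metrics([True] * 97 + [False] * 3,
                      answered=100, total_prompts=100,
                      summary_lengths=[60, 70, 80])
```

Note that answer rate matters when comparing models: a model that refuses more prompts is scored on fewer summaries, which can flatter its hallucination rate.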

Pros

  • + Regularly updated with the latest model versions and performance data, keeping it current for model selection decisions
  • + Standardized HHEM evaluation methodology, providing consistent and comparable metrics across all tested models
  • + Metrics beyond hallucination rate alone, including factual consistency, answer rate, and summary length statistics

Cons

  • - Limited to summarization; does not cover other common LLM use cases such as code generation or creative writing
  • - No documented API for programmatic integration into model-selection workflows
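Absent a documented API, one workaround is to parse the leaderboard's markdown table (e.g., from the repository README) yourself. A minimal sketch, assuming GitHub-flavored markdown tables; the column names and rows below are illustrative samples, not actual leaderboard values:

```python
def parse_markdown_table(text):
    """Parse a GitHub-flavored markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

# Hypothetical rows in the leaderboard's table format:
sample = """
| Model | Hallucination Rate | Answer Rate |
|-------|--------------------|-------------|
| example-model-a | 2.5 % | 100.0 % |
| example-model-b | 4.1 % | 99.2 % |
"""
rows = parse_markdown_table(sample)
# Rank by hallucination rate, lowest first.
best = min(rows, key=lambda r: float(r["Hallucination Rate"].rstrip(" %")))
```

Be aware that scraping a README is brittle: column names or formatting may change between updates, so validate the header row before relying on the parse.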

Use Cases

Getting Started

Visit the interactive Hugging Face leaderboard at [link] to explore current rankings and metrics. Review the GitHub repository for the detailed methodology and historical data. Then compare models against your own requirements for hallucination tolerance, answer rate, and summary length.