Dolphin

The official repository for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting” (ACL 2025).

8.9k stars · +740 stars/month · 0 releases in the last 6 months

Overview

Dolphin is an advanced document image parsing model that uses heterogeneous anchor prompting to extract and structure content from both digital-born and photographed documents. Developed by ByteDance and published at ACL 2025, it employs a two-stage architecture: the first stage classifies the document type and analyzes the layout with reading-order prediction; the second applies a hybrid parsing strategy optimized for that document type. The v2 model features 3B parameters and can detect 21 different document elements, including text paragraphs, figures, formulas, tables, and code blocks.

What sets Dolphin apart is its document-type-aware approach: holistic parsing for photographed documents and parallel element-wise parsing for digital ones, balancing accuracy and efficiency. The model excels at handling complex, multi-element documents with intertwined content types that traditional parsing tools struggle with. With its lightweight architecture and parallel parsing mechanism, Dolphin achieves strong performance across diverse page-level and element-level parsing tasks while maintaining computational efficiency.
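The two-stage, document-type-aware flow described above can be sketched in plain Python. All class and function names here are illustrative assumptions, not the actual Dolphin API: the sketch only shows how a stage-2 dispatcher might choose between holistic and parallel element-wise parsing based on the stage-1 document-type classification.

```python
# Illustrative sketch of Dolphin's two-stage parsing flow.
# Names (LayoutElement, choose_strategy, parse_page) are hypothetical.
from dataclasses import dataclass


@dataclass
class LayoutElement:
    kind: str    # one of the 21 element types, e.g. "paragraph", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    order: int   # reading order predicted in stage 1


def choose_strategy(doc_type: str) -> str:
    """Stage-2 dispatch: holistic parsing for photographed pages,
    parallel element-wise parsing for digital-born pages."""
    return "holistic" if doc_type == "photographed" else "element_wise"


def parse_page(doc_type: str, elements: list[LayoutElement]) -> list[str]:
    if choose_strategy(doc_type) == "holistic":
        # Photographed page: decode the whole page in one pass (placeholder).
        return ["<holistic parse of full page>"]
    # Digital page: parse each detected element independently (in the real
    # model this runs in parallel), then emit results in reading order.
    return [f"<{e.kind}@{e.order}>" for e in sorted(elements, key=lambda e: e.order)]
```

The key design point the sketch captures is that layout analysis and reading-order prediction happen once per page, while the per-element parsing work is independent and therefore parallelizable.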

Pros

  • Universal document parsing capability that handles both digital and photographed documents seamlessly
  • Advanced two-stage architecture with document-type-aware parsing strategies optimized for different document formats
  • Comprehensive 21-element detection covering complex elements like formulas, code blocks, and tables, with attribute field extraction

Cons

  • Research-focused tool that may require significant technical expertise to implement and integrate
  • Relatively new release with limited production use cases and community feedback
  • Large model size (3B parameters) may require substantial computational resources for deployment

Getting Started

1. Access the pre-trained Dolphin-v2 model from Hugging Face (ByteDance/Dolphin-v2) or clone the GitHub repository.
2. Set up the required dependencies and environment following the repository documentation.
3. Run inference on your document images using the provided model interface to extract structured content with element detection and parsing.
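Steps 1 and 2 above can be sketched as shell commands. The repository path and requirements file name below are assumptions based on common GitHub conventions; check the repository's own README for the authoritative setup instructions.

```shell
# Hedged setup sketch -- repo path and file names are assumptions,
# not verified against the repository documentation.
git clone https://github.com/bytedance/Dolphin.git
cd Dolphin
pip install -r requirements.txt  # install the listed dependencies
```

For step 3, follow the inference examples in the repository documentation, since the exact demo script names and command-line flags are defined there.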