Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
Overview
Dolphin is a document image parsing model from ByteDance, published at ACL 2025, that uses heterogeneous anchor prompting to extract and structure content from both digital-born and photographed documents. It works in two stages: the first classifies the document type and analyzes the page layout, including reading-order prediction; the second applies a parsing strategy chosen for that document type. Photographed documents are parsed holistically, while digital documents are parsed element by element in parallel, an approach intended to balance accuracy and efficiency. The v2 model has 3B parameters and detects 21 kinds of document elements, including text paragraphs, figures, formulas, tables, and code blocks. It targets complex pages that mix several content types, which traditional parsing tools struggle with, and it is reported to perform strongly on both page-level and element-level parsing tasks while remaining computationally efficient.
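The two-stage flow described above can be pictured with a small sketch. This is not Dolphin's actual code or API: `Element`, `analyze_page`, `parse_element`, `parse_holistic`, and `parse_page` are hypothetical names, the "model" calls are replaced by string stubs, and the page is a plain dictionary. It only illustrates the control flow, assuming stage 1 yields a document type plus elements in reading order and stage 2 dispatches on that type.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Element:
    label: str   # element category, e.g. "paragraph", "table", "formula"
    order: int   # position in the predicted reading order
    region: str  # stand-in for the cropped image region


def analyze_page(page):
    """Stage 1 stub: return (document type, elements sorted by reading order)."""
    return page["doc_type"], sorted(page["elements"], key=lambda e: e.order)


def parse_element(element):
    """Stage 2 stub for one element; a real model would decode the region."""
    return f"<{element.label}>{element.region}</{element.label}>"


def parse_holistic(elements):
    """Stage 2 stub for photographed pages: one pass over the whole page."""
    return "\n".join(e.region for e in elements)


def parse_page(page):
    doc_type, elements = analyze_page(page)
    if doc_type == "photographed":
        return [parse_holistic(elements)]
    # Digital pages: parse elements in parallel; map() keeps reading order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(parse_element, elements))


digital_page = {
    "doc_type": "digital",
    "elements": [Element("table", 1, "a|b"), Element("paragraph", 0, "Intro")],
}
print(parse_page(digital_page))
```

In this sketch the thread pool stands in for whatever batching the real model uses; the point is only that element-wise work is independent once the layout and reading order are known.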
Pros
- Universal document parsing capability that handles both digital and photographed documents seamlessly
- Advanced two-stage architecture with document-type-aware parsing strategies optimized for different document formats
- Comprehensive 21-element detection including complex elements like formulas, code blocks, and tables with attribute field extraction
Cons
- Research-focused tool that may require significant technical expertise to implement and integrate
- Relatively new release with limited production use cases and community feedback
- Large model size (3B parameters) may require substantial computational resources for deployment
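To put the 3B-parameter figure in perspective, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes exactly 3 billion parameters and common numeric precisions; the helper name `weight_memory_gib` is made up for this example, and activations, caches, and framework overhead are not counted, so real usage will be higher.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory for model weights only, in GiB (assumed parameter count)."""
    return num_params * bytes_per_param / 2**30


# Assumption: 3e9 parameters; precisions are illustrative, not Dolphin-specific.
for name, size in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gib(3e9, size):.1f} GiB")
```

Under these assumptions, half-precision weights come to roughly 5.6 GiB, which is why a GPU with limited memory may be tight once inference overhead is added.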
Use Cases
- Academic research document digitization and content extraction from PDFs and scanned papers
- Enterprise document processing for complex reports, invoices, and forms with mixed content types
- Automated parsing of technical documentation containing code snippets, mathematical formulas, and diagrams