Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting” (ACL 2025).
Overview
Dolphin is a document image parsing model from ByteDance, published at ACL 2025, that uses heterogeneous anchor prompting to extract and structure content from any type of document, whether digital-born or photographed. It employs a two-stage architecture: the first stage classifies the document type and analyzes layout with reading-order prediction; the second applies a hybrid parsing strategy optimized for that document type. The v2 model has 3B parameters and can detect 21 different document elements, including text paragraphs, figures, formulas, tables, and code blocks.

What sets Dolphin apart is its document-type-aware approach: holistic parsing for photographed documents and parallel element-wise parsing for digital-born documents, balancing accuracy and efficiency. The model excels at complex, multi-element documents with intertwined content types that traditional parsing tools struggle with, and its lightweight architecture and parallel parsing mechanism deliver strong performance across diverse page-level and element-level parsing tasks while remaining computationally efficient.
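The type-aware two-stage flow described above can be sketched in plain Python. This is an illustrative assumption, not the repo's actual API: `DocElement`, `parse_document`, and the strategy labels are all hypothetical stand-ins for what is really a vision-language model pipeline.

```python
from dataclasses import dataclass

@dataclass
class DocElement:
    kind: str    # hypothetical element type: "paragraph", "table", "formula", ...
    order: int   # reading-order index predicted in stage 1

def parse_document(doc_type: str, elements: list[DocElement]) -> list[str]:
    """Stage 2 dispatch: pick a parsing strategy from the stage-1 classification.

    Hypothetical sketch -- the real model decodes each region with a VLM.
    """
    ordered = sorted(elements, key=lambda e: e.order)
    if doc_type == "photographed":
        # Photographed/scanned pages: one holistic pass over the whole page.
        return ["holistic:" + e.kind for e in ordered]
    # Digital-born pages: element-wise parsing that can run in parallel.
    return ["element:" + e.kind for e in ordered]
```

For a digital-born page, `parse_document("digital", [DocElement("table", 1), DocElement("paragraph", 0)])` emits the elements in stage-1 reading order, each tagged with the element-wise strategy.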
Deep Analysis
Unlike general-purpose vision-language models, Dolphin's document-type-aware, two-stage approach with heterogeneous anchor prompting achieves strong layout understanding while staying lightweight at 3B parameters, outperforming much larger models on structured document parsing.
⚡ Capabilities
- • Universal document parsing with two-stage processing: type classification and layout analysis, then hybrid parsing
- • Handles text extraction, table parsing, formula recognition, code blocks, and reading order prediction
- • Supports both digital-born PDFs and photographed/scanned documents
- • Multi-page PDF processing with batch inference support
- • Lightweight 3B parameter architecture with heterogeneous anchor prompting for parallel element parsing
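The "heterogeneous anchor prompting" in the last bullet can be pictured as pairing each detected element with a type-specific prompt and decoding the batch in parallel. A minimal sketch, with the caveat that the prompt strings, `parse_elements_parallel`, and `run_model` are hypothetical placeholders rather than the repo's actual prompts or API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical anchor prompts -- the repo's real prompt strings differ.
ANCHOR_PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse this table into HTML.",
    "formula": "Transcribe this formula as LaTeX.",
    "code": "Transcribe this code block verbatim.",
}

def parse_elements_parallel(elements, run_model):
    """Pair each element crop with its type-specific prompt; decode in parallel.

    `elements` is a list of dicts with "kind" and "crop" keys;
    `run_model(crop, prompt)` is a stand-in for the actual model call.
    """
    tasks = [(e["crop"], ANCHOR_PROMPTS.get(e["kind"], "Read this region."))
             for e in elements]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda t: run_model(*t), tasks))
```

Because each element gets its own independent (crop, prompt) pair, the per-element decodes have no ordering dependency, which is what makes the parallel stage-2 pass possible.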
✓ Best For
- ✓ Teams building document processing pipelines for academic papers, technical docs, and multi-format PDFs
- ✓ Organizations needing high-quality layout-aware document parsing at scale
✗ Not Ideal For
- ✗ Simple OCR tasks — use Tesseract or PaddleOCR for basic text extraction
- ✗ Real-time streaming document processing — designed for batch workflows
⚠ Known Limitations
- ⚠ Requires GPU for inference (3B parameter model)
- ⚠ Maintainers are actively soliciting community feedback on failure cases; output quality is still being optimized
- ⚠ Focused on document parsing only — not a general-purpose vision model
Pros
- + Universal document parsing capability that handles both digital and photographed documents seamlessly
- + Advanced two-stage architecture with document-type-aware parsing strategies optimized for different document formats
- + Comprehensive 21-element detection including complex elements like formulas, code blocks, and tables with attribute field extraction
Cons
- - Research-focused tool that may require significant technical expertise to implement and integrate
- - Relatively new release with limited production use cases and community feedback
- - At 3B parameters, deployment still requires substantial computational resources
Use Cases
- • Academic research document digitization and content extraction from PDFs and scanned papers
- • Enterprise document processing for complex reports, invoices, and forms with mixed content types
- • Automated parsing of technical documentation containing code snippets, mathematical formulas, and diagrams