Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

8.9k stars (+15/month) · 0 releases in the last 6 months · star growth roughly flat over the past week

Overview

Dolphin is a document image parsing model that uses heterogeneous anchor prompting to extract and structure content from documents, whether digital-born or photographed. Developed by ByteDance and published at ACL 2025, it employs a two-stage architecture: the first stage classifies the document type and analyzes layout with reading-order prediction; the second applies hybrid parsing strategies optimized for each document type. The v2 model has 3B parameters and can detect 21 document element types, including text paragraphs, figures, formulas, tables, and code blocks. What sets Dolphin apart is its document-type-aware approach: holistic parsing for photographed documents and parallel element-wise parsing for digital-born ones, balancing accuracy and efficiency. The model handles complex, multi-element documents with intertwined content types that traditional parsing tools struggle with, and its lightweight architecture and parallel parsing mechanism deliver strong performance across page-level and element-level parsing tasks while remaining computationally efficient.
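The two-stage, document-type-aware flow described above can be sketched in plain Python. Everything here is a hypothetical illustration of the control flow, not Dolphin's actual API: the function names, the `LayoutElement` type, and the stubbed stage outputs are all stand-ins.

```python
# Illustrative sketch of Dolphin's two-stage, document-type-aware flow.
# All names and stubbed outputs are hypothetical, not the real Dolphin API.
from dataclasses import dataclass

@dataclass
class LayoutElement:
    kind: str    # e.g. "text", "table", "formula", "code"
    order: int   # reading-order index predicted in stage 1

def stage1_analyze(path: str) -> tuple[str, list[LayoutElement]]:
    """Stage 1 (hypothetical): classify the document type and predict
    layout elements with reading order."""
    # Stub standing in for the model's anchor-prompted layout output.
    doc_type = "digital" if path.endswith(".pdf") else "photographed"
    elements = [LayoutElement("text", 0), LayoutElement("table", 1)]
    return doc_type, elements

def stage2_parse(doc_type: str, elements: list[LayoutElement]) -> list[str]:
    """Stage 2 (hypothetical): holistic parsing for photographed pages,
    element-wise parsing (parallelizable) for digital-born pages."""
    if doc_type == "photographed":
        return ["<holistic parse of full page>"]
    # Element-wise path: each element is parsed independently.
    return [f"<parsed {e.kind} #{e.order}>"
            for e in sorted(elements, key=lambda e: e.order)]

doc_type, elements = stage1_analyze("report.pdf")
print(stage2_parse(doc_type, elements))  # → ['<parsed text #0>', '<parsed table #1>']
```

The point of the split is that the cheap stage-1 pass decides which stage-2 strategy the page gets, so photographed pages with unreliable layout crops fall back to a single holistic pass.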

Deep Analysis

Key Differentiator

Unlike general-purpose vision-language models, Dolphin's document-type-aware, two-stage approach with heterogeneous anchor prompting achieves superior layout understanding while staying lightweight at 3B parameters, outperforming much larger models on structured document parsing.

Capabilities

  • Universal document parsing with two-stage processing: type classification and layout analysis, then hybrid parsing
  • Handles text extraction, table parsing, formula recognition, code blocks, and reading order prediction
  • Supports both digital-born PDFs and photographed/scanned documents
  • Multi-page PDF processing with batch inference support
  • Lightweight 3B parameter architecture with heterogeneous anchor prompting for parallel element parsing
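The parallel element-wise parsing mentioned above can be sketched as follows. This is a hypothetical illustration of the mechanism, not Dolphin's code: `parse_element` is a stub standing in for the per-element model call, and the reading order comes from the stage-1 layout prediction.

```python
# Hypothetical sketch of parallel element-wise parsing: stage-1 layout
# elements are parsed concurrently, then reassembled in reading order.
from concurrent.futures import ThreadPoolExecutor

def parse_element(element: dict) -> str:
    # Stub: a real implementation would prompt the model with the element
    # crop and a type-specific anchor prompt.
    return f"{element['kind']}:{element['order']}"

def parse_page(elements: list[dict], max_workers: int = 4) -> list[str]:
    """Parse all elements of a page in parallel, then restore reading order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results align with elements.
        results = list(pool.map(parse_element, elements))
    ordered = sorted(zip(elements, results), key=lambda pair: pair[0]["order"])
    return [text for _, text in ordered]

elements = [{"kind": "table", "order": 1}, {"kind": "text", "order": 0}]
print(parse_page(elements))  # → ['text:0', 'table:1']
```

Because each element is parsed independently, the per-page latency is bounded by the slowest element rather than the sum of all elements, which is where the efficiency claim for digital-born documents comes from.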

🔗 Integrations

Hugging Face Transformers · vLLM · TensorRT-LLM · PDF processing pipelines

Best For

  • Teams building document processing pipelines for academic papers, technical docs, and multi-format PDFs
  • Organizations needing high-quality layout-aware document parsing at scale

Not Ideal For

  • Simple OCR tasks — use Tesseract or PaddleOCR for basic text extraction
  • Real-time streaming document processing — designed for batch workflows

Languages

Python

Deployment

  • Local inference via Hugging Face model download
  • vLLM accelerated serving
  • TensorRT-LLM optimized deployment
  • Batch processing with configurable batch sizes

Pricing Detail

Free: Model weights freely available on Hugging Face
Paid: N/A — fully open-source

Known Limitations

  • Requires GPU for inference (3B parameter model)
  • Actively requesting community feedback on bad cases — still being optimized
  • Focused on document parsing only — not a general-purpose vision model

Pros

  • + Universal document parsing capability that handles both digital and photographed documents seamlessly
  • + Advanced two-stage architecture with document-type-aware parsing strategies optimized for different document formats
  • + Comprehensive 21-element detection including complex elements like formulas, code blocks, and tables with attribute field extraction

Cons

  • - Research-focused tool that may require significant technical expertise to implement and integrate
  • - Relatively new release with limited production use cases and community feedback
  • - Though lightweight for a vision-language model, the 3B-parameter size still requires substantial computational resources (a GPU) for deployment

Use Cases

  • Academic research document digitization and content extraction from PDFs and scanned papers
  • Enterprise document processing for complex reports, invoices, and forms with mixed content types
  • Automated parsing of technical documentation containing code snippets, mathematical formulas, and diagrams

Getting Started

1. Access the pre-trained Dolphin-v2 model from Hugging Face (ByteDance/Dolphin-v2) or clone the GitHub repository.
2. Set up the required dependencies and environment following the repository documentation.
3. Run inference on your document images using the provided model interface to extract structured content with element detection and parsing.
