docling

Get your documents ready for gen AI

open-source, agent-frameworks
Stars: 56.6k
Stars/month: +4717
Releases (6m): 10

Overview

Docling is a document processing library that prepares documents for generative AI workflows. It parses a wide range of formats, including PDF, DOCX, PPTX, XLSX, HTML, audio files (WAV, MP3), WebVTT, images, LaTeX, and plain text. Its standout feature is PDF understanding: page layout analysis, reading order detection, table structure recognition, code extraction, formula processing, and image classification. Docling converts every input into a unified DoclingDocument representation, which makes it easier to feed document content into AI pipelines. The project has over 56,000 GitHub stars and integrates with the generative AI ecosystem, so developers can extract and structure content from complex documents for downstream AI applications. As part of the Linux Foundation AI & Data project, Docling is a community-backed option for document intelligence tasks.

Pros

  • + Advanced PDF understanding with layout analysis, table structure recognition, and reading order detection
  • + Supports a wide variety of document formats, including office documents, images, audio, and markup languages
  • + Unified DoclingDocument representation simplifies integration with AI workflows and downstream processing

Cons

  • - Processing complex documents with the advanced features enabled may require significant computational resources
  • - Little published information on performance benchmarks or processing speed for large document batches

Getting Started

1. Install via pip: `pip install docling`
2. Import and create a document converter: `from docling.document_converter import DocumentConverter; converter = DocumentConverter()`
3. Process a document: `result = converter.convert('path/to/document.pdf')` to get structured DoclingDocument output
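
The steps above can be combined into one small helper. This is a minimal sketch, not an official recipe: the function name `convert_to_markdown` and its optional `converter` parameter are hypothetical and exist only for illustration. It assumes the conversion result exposes the DoclingDocument as `result.document` and that the document offers `export_to_markdown()`, as in Docling's published quick-start.

```python
"""Minimal sketch: convert one document with Docling and return Markdown."""


def convert_to_markdown(source, converter=None):
    """Convert `source` (a file path or URL) and return its Markdown text.

    `converter` is a hypothetical injection point so the helper can be
    exercised with a stand-in object; by default a Docling
    DocumentConverter is created.
    """
    if converter is None:
        # Imported lazily so this module loads even where Docling is absent.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # convert() returns a result object; its .document attribute is assumed
    # to hold the unified DoclingDocument representation.
    result = converter.convert(source)

    # Serialize the structured document to Markdown for downstream use.
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown('path/to/document.pdf')` returns a Markdown string that can be chunked or passed to a downstream AI pipeline. Creating the converter once and reusing it across calls is likely cheaper than constructing a new one per document.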