olmocr

Toolkit for linearizing PDFs for LLM datasets/training

Tags: open-source · agent-frameworks
17.1k Stars · +105 Stars/month · 10 Releases (6m)

Star Growth

+27 (0.2%)
[chart: stars 16.7k–17.4k, Mar 27 to Apr 1]

Overview

olmocr is a specialized toolkit for converting PDFs and image-based documents into clean, structured text optimized for LLM training datasets. Built around a 7B-parameter Vision Language Model, it handles complex document layouts that traditional OCR tools struggle with, including multi-column text, equations, tables, handwriting, and intricate formatting. The system preserves natural reading order even in documents with figures, insets, and complex layouts, while automatically removing headers and footers.

olmocr outputs clean Markdown, making it well suited to creating high-quality training data for language models. At under $200 per million pages processed, it is an economical option for large-scale document digitization.

The toolkit has improved steadily across multiple model releases, with recent versions posting significant gains on olmOCR-Bench evaluations. It includes Docker support for easy deployment and uses vLLM-based inference for efficient processing.

Deep Analysis

Key Differentiator

Open-source VLM-based OCR scoring 82+ on olmOCR-Bench, rivaling commercial solutions such as Mistral OCR, whereas traditional OCR tools like Tesseract struggle with complex layouts

Capabilities

  • PDF/PNG/JPEG to clean Markdown conversion
  • Equation, table, handwriting, and complex formatting support
  • Automatic header/footer removal
  • Natural reading order detection for multi-column layouts
  • Benchmark suite (olmOCR-Bench) with 7000+ test cases
  • GPU-accelerated inference via vLLM

🔗 Integrations

vLLM · Hugging Face models · Docker · Beaker clusters

Best For

  • Batch PDF-to-text conversion at scale with high accuracy
  • Academic and research document digitization
  • Building RAG pipelines that need clean text from PDFs

Not Ideal For

  • Real-time OCR on mobile or edge devices
  • CPU-only environments without access to GPU servers

Languages

Python

Deployment

pip package · Docker · Remote vLLM server · Beaker cluster jobs

Pricing Detail

Free: Open source Apache 2.0, free for local GPU use
Paid: ~$200 per million pages (compute cost for GPU inference)
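To put the "$200 per million pages" figure in perspective, a quick back-of-the-envelope calculation (the dollar amount is the project's own claim; the archive size below is an illustrative assumption, not measured data):

```python
# Rough cost arithmetic for the "~$200 per million pages" figure quoted above.
COST_PER_MILLION_PAGES_USD = 200.0

cost_per_page = COST_PER_MILLION_PAGES_USD / 1_000_000
print(f"~${cost_per_page:.4f} per page")  # ~$0.0002 per page

# Hypothetical example: a 50,000-document archive averaging 20 pages each.
pages = 50_000 * 20
print(f"~${pages * cost_per_page:.2f} for {pages:,} pages")
```

At roughly $0.0002 per page, GPU compute cost is negligible next to storage and curation for most dataset-building projects.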

Known Limitations

  • Requires NVIDIA GPU with 12GB+ VRAM for local inference
  • 7B parameter model — needs significant compute
  • Not suitable for CPU-only environments without remote server
  • Focused on document OCR, not real-time text recognition

Pros

  • + Excellent handling of complex document layouts including equations, tables, handwriting, and multi-column formats with natural reading order preservation
  • + Cost-effective processing at under $200 per million pages, making it economical for large-scale dataset creation
  • + Continuous model improvements with recent releases showing significant performance gains and reduced hallucinations on blank documents

Cons

  • - Requires GPU resources due to 7B parameter model, making it computationally intensive and potentially expensive to run
  • - May require multiple retries for some documents to achieve optimal results
  • - Limited to image-based document formats (PDF, PNG, JPEG) and requires technical expertise for setup and optimization

Use Cases

  • Converting academic papers and research documents with complex equations and figures for LLM training datasets
  • Processing legacy document archives with multi-column layouts and mixed content types into searchable text format
  • Creating high-quality training data from technical manuals, textbooks, and scientific publications for domain-specific language models

Getting Started

Install olmocr with pip, set up a GPU environment with CUDA support, then run olmocr on your PDF files to generate clean Markdown output.
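A minimal sketch of that flow, assuming a CUDA-capable machine; the pipeline invocation and flags follow the project's documented CLI at time of writing and may differ between releases:

```shell
# Install the toolkit (GPU inference dependencies such as vLLM are pulled in).
pip install olmocr

# Convert PDFs to Markdown. The workspace directory holds intermediate state
# and results; --pdfs accepts one or more files or globs.
python -m olmocr.pipeline ./workspace --markdown --pdfs ./docs/*.pdf

# Results are written under the workspace directory.
```

For larger jobs, the same pipeline can point at a remote vLLM server or run as Beaker cluster jobs instead of local GPU inference.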
