olmocr

Toolkit for linearizing PDFs for LLM datasets/training

open-source, agent-frameworks
Stars: 17.1k
Stars/month: +1422
Releases (6m): 10

Overview

olmocr is a toolkit for converting PDFs and image-based documents into clean, structured text suited to LLM training datasets. It is built around a 7B-parameter Vision Language Model and handles layouts that traditional OCR tools struggle with: multi-column text, equations, tables, handwriting, and intricate formatting. It preserves natural reading order even in documents with figures and insets, and it removes headers and footers automatically. Output is clean Markdown, which makes it well suited to building high-quality training data for language models.

At under $200 per million pages processed, it is an economical option for large-scale document digitization. The toolkit has been improved across multiple model releases, with recent versions reporting significant gains on olmOCR-Bench evaluations. It includes Docker support for deployment and uses vLLM-based inference for efficient processing.
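The cost figure above is an upper bound, so a rough budget for a given corpus follows by simple proportion. The sketch below is only an illustration: the function name is made up for this example, and it assumes cost scales linearly with page count, which the listing does not state.

```python
# Upper bound quoted above: under $200 per one million pages.
MAX_USD_PER_MILLION_PAGES = 200.0


def estimate_max_cost_usd(num_pages: int) -> float:
    """Return an upper-bound cost estimate in USD for `num_pages` pages.

    ASSUMPTION: cost scales linearly with page count. Real costs depend on
    hardware, page complexity, and retries.
    """
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * MAX_USD_PER_MILLION_PAGES / 1_000_000


if __name__ == "__main__":
    # Print an upper-bound estimate for a few illustrative corpus sizes.
    for pages in (10_000, 250_000, 5_000_000):
        print(f"{pages:>9,} pages: at most ${estimate_max_cost_usd(pages):,.2f}")
```

Treat the result as a ceiling for planning, not a quote.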

Pros

  • + Excellent handling of complex document layouts including equations, tables, handwriting, and multi-column formats with natural reading order preservation
  • + Cost-effective processing at under $200 per million pages, making it economical for large-scale dataset creation
  • + Continuous model improvements with recent releases showing significant performance gains and reduced hallucinations on blank documents

Cons

  • - Requires GPU resources because of the 7B-parameter model, making it computationally intensive and potentially expensive to run
  • - Some documents may need multiple retries to produce optimal results
  • - Limited to image-based document formats (PDF, PNG, JPEG), and setup and optimization require technical expertise
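Since some documents may need more than one attempt, a batch job can wrap each conversion in a bounded retry loop. This is a generic sketch, not olmocr's own retry mechanism: `convert` is a hypothetical placeholder for whatever call performs the conversion, and the default of three attempts is an arbitrary assumption.

```python
from typing import Callable, Optional


def convert_with_retries(
    convert: Callable[[str], str],
    pdf_path: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call `convert(pdf_path)` up to `max_attempts` times.

    `convert` is a hypothetical stand-in for the real conversion call.
    Returns the first successful result, or None if every attempt raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return convert(pdf_path)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            print(f"attempt {attempt}/{max_attempts} failed for {pdf_path}: {exc}")
    return None
```

Returning `None` instead of raising lets a long batch run continue and collect the failed paths for later inspection.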

Use Cases

Getting Started

Install olmocr with pip, set up a GPU environment with CUDA support, then run olmocr on your PDF files to generate clean Markdown output.
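The steps above might be scripted roughly as follows. This is a sketch under stated assumptions, not the project's documented procedure: the `olmocr.pipeline` module entry point and the `--markdown` and `--pdfs` flags reflect the project's README as best recalled and should be checked against the installed version, the workspace and input paths are hypothetical, and the `nvidia-smi` lookup is only a heuristic for detecting an NVIDIA driver.

```python
import shutil
import subprocess
import sys
from typing import List


def build_olmocr_command(workspace: str, pdf_paths: List[str]) -> List[str]:
    """Build the command line for a local olmocr run.

    ASSUMPTION: the `olmocr.pipeline` entry point and the `--markdown` and
    `--pdfs` flags match the installed olmocr version; verify before use.
    """
    if not pdf_paths:
        raise ValueError("at least one PDF path is required")
    return [
        sys.executable, "-m", "olmocr.pipeline",
        workspace,
        "--markdown",
        "--pdfs", *pdf_paths,
    ]


def gpu_driver_visible() -> bool:
    """Heuristic: an NVIDIA driver install normally puts `nvidia-smi` on PATH."""
    return shutil.which("nvidia-smi") is not None


def run_olmocr(workspace: str, pdf_paths: List[str]) -> None:
    """Run olmocr on the given PDFs, failing early if no GPU driver is found."""
    if not gpu_driver_visible():
        raise RuntimeError("No NVIDIA driver found; olmocr needs a CUDA-capable GPU.")
    subprocess.run(build_olmocr_command(workspace, pdf_paths), check=True)


# Example call (hypothetical paths):
# run_olmocr("./localworkspace", ["papers/example.pdf"])
```

Building the argument list separately from running it makes the command easy to inspect or log before any GPU time is spent.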