text-extract-api
Document (PDF, Word, PPTX, ...) extraction and parsing API using state-of-the-art modern OCRs plus Ollama-supported models. Anonymize documents, remove PII, and convert any document or picture to structured JSON.
3.1k Stars · +23 Stars/month · 0 Releases (6m) · Star Growth +3 (0.1%)
Overview
text-extract-api is a FastAPI-based document extraction and parsing API focused on converting PDF, Word, PowerPoint, and similar documents into Markdown text or structured JSON. It applies state-of-the-art OCR techniques, with multiple strategies including EasyOCR, LLaMA 3.2 Vision, and MiniCPM-V, to achieve high recognition accuracy. All processing runs locally, with no external cloud services, protecting data privacy. The tool integrates Ollama models to improve OCR output quality by correcting spelling and text errors, and provides PII (personally identifiable information) removal. Celery supplies a distributed queue for asynchronous processing of large document volumes, while Redis provides caching to optimize performance. It is especially suited to scenarios that demand high-accuracy document digitization, data extraction, and privacy protection, such as medical report processing, financial document analysis, and compliance document review.
Deep Analysis
Key Differentiator
vs cloud OCR services: fully on-premise with pluggable OCR strategies (4 engines), built-in PII removal, and distributed Celery scaling — no vendor lock-in
⚡ Capabilities
- • Convert PDFs, images, and Office documents to Markdown or structured JSON
- • Multiple OCR strategies: EasyOCR, MiniCPM-V, Llama 3.2-Vision, Marker-PDF
- • PII removal from documents
- • LLM-based OCR correction and post-processing
- • Async distributed processing via Celery workers
- • Flexible storage backends (local, Google Drive, S3)
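A minimal client-side sketch of submitting a document to the service follows. The `/ocr/upload` route, the form field names (`strategy`, `model`, `ocr_cache`), and the response shape are assumptions based on the project's description of its CLI options; verify them against the service's live OpenAPI (`/docs`) page.

```python
API_URL = "http://localhost:8000"  # default local docker-compose address (assumption)

def build_ocr_form(strategy: str, model: str, ocr_cache: bool = True) -> dict:
    """Assemble the form fields for an OCR upload request.

    Field names here are assumptions, not the project's confirmed API;
    check the real parameter list before use.
    """
    return {
        "strategy": strategy,                  # e.g. "easyocr", "llama_vision", "minicpm_v"
        "model": model,                        # Ollama model used for LLM post-correction
        "ocr_cache": str(ocr_cache).lower(),   # reuse cached results from Redis
    }

def submit_document(path: str, strategy: str = "easyocr",
                    model: str = "llama3.1") -> str:
    """POST a document and return the Celery task id for later polling."""
    import requests  # third-party dependency; lazy import keeps build_ocr_form stdlib-only
    with open(path, "rb") as f:
        resp = requests.post(
            f"{API_URL}/ocr/upload",
            files={"file": f},
            data=build_ocr_form(strategy, model),
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()["task_id"]  # async: the result is fetched in a separate call

# Example (requires the service running locally):
# task_id = submit_document("invoice.pdf", strategy="llama_vision")
```

Because processing is queued through Celery, the upload returns immediately with a task id rather than the extracted text.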
🔗 Integrations
FastAPI · Celery · Redis · EasyOCR · Ollama · MiniCPM-V · Llama 3.2-Vision · Marker-PDF · Google Drive API · Amazon S3
✓ Best For
- ✓ High-volume document digitization pipelines
- ✓ Extracting structured data from invoices, reports, and forms with PII removal
✗ Not Ideal For
- ✗ Real-time single-request processing (batch-oriented)
- ✗ Handwriting recognition or general computer vision
Languages
Python · TypeScript (API client)
Deployment
Docker · Docker with GPU · local (Python) · cloud edition
⚠ Known Limitations
- ⚠ Mac Docker lacks native Apple GPU support; local install needed for M-series benefits
- ⚠ Llama 3.2-Vision is the slowest strategy due to its 90B parameter count
- ⚠ Marker-PDF excluded from default distribution due to GPL3
- ⚠ Asynchronous design prioritizes throughput over single-request latency
Pros
- + Fully local processing with no external dependencies, ensuring data privacy and security
- + Supports multiple advanced OCR strategies (LLaMA Vision, EasyOCR, etc.) with high recognition accuracy
- + Integrates a distributed queue and caching, supporting large-scale batch document processing
Cons
- - Requires installing multiple dependencies (Docker, Ollama), making initial setup relatively complex
- - Running PyTorch models locally demands significant compute and storage resources
Use Cases
- • Medical institutions converting MRI reports, patient records, and other medical documents into structured data
- • Corporate finance teams processing invoices and contracts while automatically removing sensitive information
- • Legal organizations batch-digitizing and analyzing large volumes of compliance documents or legal texts
Getting Started
1. Install Docker and Ollama in your local environment
2. Clone the project and start the API service and Redis cache with docker-compose
3. Upload document files with the CLI tool, specifying the target conversion format, to begin processing
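Since the service is asynchronous, a client that has submitted a document then polls for the finished result. The sketch below assumes a `GET /ocr/result/{task_id}` route returning a JSON body with Celery-style `state` and `result` fields; these names are assumptions to be checked against the service's OpenAPI docs.

```python
import time

def backoff_schedule(base: float = 1.0, factor: float = 2.0,
                     cap: float = 30.0, attempts: int = 6) -> list:
    """Exponential backoff delays (seconds) for polling an async OCR task."""
    delays, delay = [], base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays

def poll_result(task_id: str, api_url: str = "http://localhost:8000") -> str:
    """Poll until the Celery worker publishes the extracted text.

    The route and JSON shape are assumptions; consult the real API contract.
    """
    import requests  # third-party; lazy import keeps backoff_schedule stdlib-only
    for delay in backoff_schedule():
        resp = requests.get(f"{api_url}/ocr/result/{task_id}", timeout=30)
        resp.raise_for_status()
        body = resp.json()
        if body.get("state") == "SUCCESS":   # Celery task state
            return body["result"]            # extracted Markdown or JSON text
        time.sleep(delay)
    raise TimeoutError(f"task {task_id} did not finish in time")
```

Exponential backoff keeps the polling load on the API low while long-running strategies (e.g. Llama 3.2-Vision) work through the queue.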