text-extract-api
Document (PDF, Word, PPTX, ...) extraction and parsing API using state-of-the-art modern OCRs plus Ollama-supported models. Anonymize documents, remove PII, and convert any document or picture to structured JSON.
3.1k Stars · +23 Stars/month · 0 Releases (6m) · Star Growth +3 (0.1%)
Overview
text-extract-api is a FastAPI-based document extraction and parsing API focused on converting PDF, Word, PowerPoint, and similar documents into Markdown text or structured JSON. It applies state-of-the-art OCR techniques, with multiple strategies including EasyOCR, LLaMA 3.2 Vision, and MiniCPM-V, to achieve high recognition accuracy. All processing runs locally, with no external cloud services, protecting data privacy. The tool integrates Ollama models to improve OCR output quality by correcting spelling and text errors, and provides PII (personally identifiable information) removal. Celery supplies a distributed queue for asynchronous processing of large document volumes, while Redis provides caching to optimize performance. It is especially suited to scenarios that demand high-accuracy document digitization, data extraction, and privacy protection, such as medical report processing, financial document analysis, and compliance document review.
Deep Analysis
Key Differentiator
vs cloud OCR services: fully on-premise with pluggable OCR strategies (4 engines), built-in PII removal, and distributed Celery scaling — no vendor lock-in
⚡ Capabilities
- • Convert PDFs, images, and Office documents to Markdown or structured JSON
- • Multiple OCR strategies: EasyOCR, MiniCPM-V, Llama 3.2-Vision, Marker-PDF
- • PII removal from documents
- • LLM-based OCR correction and post-processing
- • Async distributed processing via Celery workers
- • Flexible storage backends (local, Google Drive, S3)
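A minimal client-side sketch of submitting a document to the service follows. The `/ocr/upload` route, the form field names (`strategy`, `model`, `ocr_cache`), and the response shape are assumptions based on the project's description of its CLI options; verify them against the service's live OpenAPI (`/docs`) page.

```python
API_URL = "http://localhost:8000"  # default local docker-compose address (assumption)

def build_ocr_form(strategy: str, model: str, ocr_cache: bool = True) -> dict:
    """Assemble the form fields for an OCR upload request.

    Field names here are assumptions, not the project's confirmed API;
    check the real parameter list before use.
    """
    return {
        "strategy": strategy,                  # e.g. "easyocr", "llama_vision", "minicpm_v"
        "model": model,                        # Ollama model used for LLM post-correction
        "ocr_cache": str(ocr_cache).lower(),   # reuse cached results from Redis
    }

def submit_document(path: str, strategy: str = "easyocr",
                    model: str = "llama3.1") -> str:
    """POST a document and return the Celery task id for later polling."""
    import requests  # third-party dependency; lazy import keeps build_ocr_form stdlib-only
    with open(path, "rb") as f:
        resp = requests.post(
            f"{API_URL}/ocr/upload",
            files={"file": f},
            data=build_ocr_form(strategy, model),
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()["task_id"]  # async: the result is fetched in a separate call

# Example (requires the service running locally):
# task_id = submit_document("invoice.pdf", strategy="llama_vision")
```

Because processing is queued through Celery, the upload returns immediately with a task id rather than the extracted text.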
🔗 Integrations
FastAPI · Celery · Redis · EasyOCR · Ollama · MiniCPM-V · Llama 3.2-Vision · Marker-PDF · Google Drive API · Amazon S3
✓ Best For
- ✓ High-volume document digitization pipelines
- ✓ Extracting structured data from invoices, reports, and forms with PII removal
✗ Not Ideal For
- ✗ Real-time single-request processing (batch-oriented)
- ✗ Handwriting recognition or general computer vision
Languages
Python · TypeScript (API client)
Deployment
Docker · Docker with GPU · local (Python) · cloud edition
⚠ Known Limitations
- ⚠ Mac Docker lacks native Apple GPU support; local install needed for M-series benefits
- ⚠ Llama 3.2-Vision is the slowest strategy due to its 90B parameter count
- ⚠ Marker-PDF excluded from default distribution due to GPL3
- ⚠ Asynchronous design prioritizes throughput over single-request latency
Pros
- + Fully local processing with no external dependencies, ensuring data privacy and security
- + Supports multiple advanced OCR strategies (LLaMA Vision, EasyOCR, etc.) with high recognition accuracy
- + Integrates a distributed queue and caching, supporting large-scale batch document processing
Cons
- - Requires installing multiple dependencies (Docker, Ollama), making initial setup relatively complex
- - Running PyTorch models locally demands significant compute and storage resources
Use Cases
- • Medical institutions converting MRI reports, patient records, and other medical documents into structured data
- • Corporate finance teams processing invoices and contracts while automatically removing sensitive information
- • Legal organizations batch-digitizing and analyzing large volumes of compliance documents or legal texts
Getting Started
1. Install Docker and Ollama in your local environment
2. Clone the project and start the API service and Redis cache with docker-compose
3. Upload document files with the CLI tool, specifying the target conversion format, to begin processing
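Since the service is asynchronous, a client that has submitted a document then polls for the finished result. The sketch below assumes a `GET /ocr/result/{task_id}` route returning a JSON body with Celery-style `state` and `result` fields; these names are assumptions to be checked against the service's OpenAPI docs.

```python
import time

def backoff_schedule(base: float = 1.0, factor: float = 2.0,
                     cap: float = 30.0, attempts: int = 6) -> list:
    """Exponential backoff delays (seconds) for polling an async OCR task."""
    delays, delay = [], base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays

def poll_result(task_id: str, api_url: str = "http://localhost:8000") -> str:
    """Poll until the Celery worker publishes the extracted text.

    The route and JSON shape are assumptions; consult the real API contract.
    """
    import requests  # third-party; lazy import keeps backoff_schedule stdlib-only
    for delay in backoff_schedule():
        resp = requests.get(f"{api_url}/ocr/result/{task_id}", timeout=30)
        resp.raise_for_status()
        body = resp.json()
        if body.get("state") == "SUCCESS":   # Celery task state
            return body["result"]            # extracted Markdown or JSON text
        time.sleep(delay)
    raise TimeoutError(f"task {task_id} did not finish in time")
```

Exponential backoff keeps the polling load on the API low while long-running strategies (e.g. Llama 3.2-Vision) work through the queue.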