swiss_army_llama
A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.
Overview
Swiss Army Llama is a comprehensive FastAPI service that streamlines semantic text search and document processing using local LLMs. It automatically generates and caches text embeddings for various file types, including PDFs (with OCR support), Word documents, and audio files transcribed through Whisper. The tool leverages llama_cpp for local LLM integration and employs a high-performance Rust-based library for advanced similarity measures such as Spearman correlation, Kendall tau, and Hoeffding's D statistic. Beyond basic cosine similarity, it offers sophisticated semantic search through FAISS vector indexing with multiple embedding pooling methods, including mean pooling, SVD, and Independent Component Analysis.
The service caches embeddings in SQLite to prevent redundant computation and supports optional RAM disk usage for faster LLM loading. All functionality is exposed through REST endpoints with an integrated Swagger UI, making it easy to integrate into existing applications. This makes it particularly valuable for organizations that want semantic search and document analysis capabilities while maintaining full control over their data through local deployment.
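To make the rank-based measures concrete: unlike cosine similarity, which compares raw vector magnitudes, Spearman correlation compares only the rank order of components. The sketch below is a pure-Python illustration on toy vectors (the service itself computes these measures through its Rust library, not this code), assuming no tied values for brevity.

```python
# Toy illustration of cosine similarity vs. Spearman rank correlation.
# The vectors stand in for real embedding outputs; the real service
# delegates this math to an optimized Rust library.
import math

def ranks(xs):
    """1-based ranks of xs (assumes no ties, for brevity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    # Classic formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank differences.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

a = [0.12, 0.80, 0.33, 0.95, 0.05]
b = [0.10, 0.75, 0.85, 0.90, 0.07]
print(f"cosine={cosine(a, b):.3f}  spearman={spearman(a, b):.3f}")
```

Because Spearman looks only at ranks, two vectors can agree strongly in ordering even when their cosine similarity is inflated or deflated by magnitude differences, which is why such measures can be useful alongside cosine for semantic comparison.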
Deep Analysis
vs cloud embedding APIs (OpenAI, Cohere): fully self-hosted with multi-format document processing, advanced statistical similarity measures beyond cosine, and grammar-constrained completions — complete data privacy with zero external API calls
⚡ Capabilities
- Local LLM text embeddings via llama_cpp with FastAPI REST endpoints
- Multi-format document processing (PDF with OCR, Word, images, audio)
- Semantic search via FAISS with advanced similarity measures (Spearman, Kendall, Hoeffding)
- Audio transcription via Whisper with embedding computation
- Grammar-constrained text completion (JSON output enforcement)
- Multiple embedding pooling methods (mean, SVD, ICA, factor analysis)
- SQLite caching layer for computed embeddings
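Two of the pooling methods listed above can be sketched as follows: mean pooling averages the token embedding matrix over the token axis, while an SVD-based pooling keeps the dominant direction of variation across tokens. This is a minimal illustration on a toy 4-token by 3-dimension matrix, with numpy standing in for the service's own implementation; real token embeddings come from llama_cpp.

```python
# Sketch of two embedding pooling strategies: mean pooling and SVD pooling.
# Toy data; the service's actual pooling implementations may differ in detail.
import numpy as np

tokens = np.array([
    [0.2, 0.1, 0.4],
    [0.3, 0.0, 0.5],
    [0.1, 0.2, 0.3],
    [0.4, 0.1, 0.6],
])

# Mean pooling: average over the token axis -> one fixed-size vector.
mean_pooled = tokens.mean(axis=0)

# SVD pooling: first right-singular vector of the centered matrix, i.e. the
# direction of maximum variance across tokens (sign-normalized so the
# largest-magnitude component is positive, making the result deterministic).
_, _, vt = np.linalg.svd(tokens - tokens.mean(axis=0), full_matrices=False)
svd_pooled = vt[0] * np.sign(vt[0][np.argmax(np.abs(vt[0]))])

print("mean:", mean_pooled.round(3), "svd:", svd_pooled.round(3))
```

Both reduce a variable-length token-by-dimension matrix to a single fixed-size vector suitable for FAISS indexing; they simply emphasize different structure in the token embeddings.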
✓ Best For
- Organizations requiring fully local LLM processing without cloud dependencies
- Document analysis workflows across mixed formats (PDF, Word, images, audio)
- Semantic search over proprietary knowledge bases with advanced similarity metrics
✗ Not Ideal For
- Ultra-low-latency real-time inference
- Serverless/pay-per-use deployment models
- Web-scale deployments without substantial infrastructure
⚠ Known Limitations
- Models must be in .gguf format and at least 100 MB
- Concurrency is capped by the MAX_CONCURRENT_PARALLEL_INFERENCE_TASKS setting
- Local-only; not a cloud-native architecture
- SQLite storage can become a bottleneck under heavy concurrent load
- RAM disk support requires sudo permissions and adequate memory
Pros
- Comprehensive document processing pipeline handling diverse file types, including PDFs with OCR, Word documents, and audio transcription
- Advanced similarity measures beyond cosine similarity, including statistical correlation and dependency measures via an optimized Rust library
- Caching system backed by SQLite that prevents redundant computation, plus automatic RAM disk management for faster model loading
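The caching idea is simple to sketch: key each embedding on a hash of the model and input text, and only call the model on a cache miss. The schema and function names below are illustrative, not the project's actual schema; it is a minimal stdlib-only sketch of the pattern.

```python
# Minimal sketch of an SQLite-backed embedding cache: hash (model, text),
# store the vector as a JSON blob, skip the model call on a cache hit.
# Table and function names here are hypothetical, not the project's schema.
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS embedding_cache ("
    "  key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
)

def cache_key(model: str, text: str) -> str:
    return hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()

def get_or_compute(model: str, text: str, compute):
    key = cache_key(model, text)
    row = conn.execute(
        "SELECT vector FROM embedding_cache WHERE key = ?", (key,)
    ).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: no model call
    vec = compute(text)            # cache miss: run the model once
    conn.execute("INSERT INTO embedding_cache VALUES (?, ?)", (key, json.dumps(vec)))
    conn.commit()
    return vec

calls = []
def fake_embed(text):              # stand-in for a real llama_cpp embedding call
    calls.append(text)
    return [float(len(text)), 1.0]

v1 = get_or_compute("llama-7b", "hello", fake_embed)
v2 = get_or_compute("llama-7b", "hello", fake_embed)  # served from cache
```

Keying on both model and text matters: the same text embedded by two different models must not collide in the cache.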
Cons
- Requires significant local computational resources to run multiple LLMs and process large document collections
- Setup can be challenging for users without experience deploying and configuring local LLMs
- Limited to a local deployment model, which may not suit teams that need cloud-native or distributed processing
Use Cases
- Enterprise document search across mixed file types (PDFs, Word docs, audio recordings) while keeping data on-premises for security compliance
- Research applications requiring similarity analysis beyond basic cosine similarity, such as academic paper analysis or content clustering
- Knowledge management systems that process and search large document repositories with automatic embedding generation and caching