swiss_army_llama

A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.

1.1k
Stars
+8
Stars/month
0
Releases (6m)

Star Growth

+1 (0.1%)

Overview

Swiss Army Llama is a comprehensive FastAPI service that streamlines semantic text search and document processing using local LLMs. It automatically generates and caches text embeddings for various file types including PDFs (with OCR support), Word documents, and audio files through Whisper transcription. The tool leverages llama_cpp for local LLM integration and employs a high-performance Rust-based library for advanced similarity measures like Spearman correlation, Kendall tau, and Hoeffding's D statistic. Beyond basic cosine similarity, it offers sophisticated semantic search capabilities through FAISS vector indexing with multiple embedding pooling methods including mean pooling, SVD, and Independent Component Analysis.

The service intelligently caches embeddings in SQLite to prevent redundant computations and supports optional RAM disk usage for faster LLM loading. All functionality is exposed through REST endpoints with an integrated Swagger UI, making it easy to integrate into existing applications. This makes it particularly valuable for organizations wanting to implement semantic search and document analysis capabilities while maintaining full control over their data through local deployment.
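To see what a rank-based measure like Spearman correlation adds over cosine similarity, here is a pure-Python sketch (the service itself delegates this to an optimized Rust library; this is only an illustration of the statistic, not the project's implementation):

```python
from math import sqrt

def rank(values):
    """Assign 1-based average ranks to values (ties share the mean of their ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed vectors."""
    return pearson(rank(x), rank(y))

# A monotone but non-linear relationship: Pearson/cosine dip below 1,
# but Spearman still reports a perfect monotonic association.
a = [0.1, 0.4, 0.2, 0.9, 0.7]
b = [v ** 3 for v in a]  # order preserved, magnitudes distorted
print(round(spearman(a, b), 6))  # → 1.0
```

Because Spearman only looks at rank order, it can surface embedding pairs whose components co-vary monotonically even when their magnitudes differ, which linear measures miss.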

Deep Analysis

Key Differentiator

vs cloud embedding APIs (OpenAI, Cohere): fully self-hosted with multi-format document processing, advanced statistical similarity measures beyond cosine, and grammar-constrained completions — complete data privacy with zero external API calls

Capabilities

  • Local LLM text embeddings via llama_cpp with FastAPI REST endpoints
  • Multi-format document processing (PDF with OCR, Word, images, audio)
  • Semantic search via FAISS with advanced similarity measures (Spearman, Kendall, Hoeffding)
  • Audio transcription via Whisper with embedding computation
  • Grammar-constrained text completion (JSON output enforcement)
  • Multiple embedding pooling methods (mean, SVD, ICA, factor analysis)
  • SQLite caching layer for computed embeddings
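The SQLite caching layer works on a simple principle: key each embedding by a hash of its input text and skip the model call on a hit. A minimal sketch of that idea (the `embed()` stand-in and the table schema are illustrative assumptions, not the project's actual schema):

```python
import hashlib
import json
import sqlite3

# Hypothetical stand-in; the real service calls llama_cpp here.
def embed(text: str) -> list[float]:
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (text_hash TEXT PRIMARY KEY, vector TEXT)")

def cached_embedding(text: str) -> list[float]:
    """Return the cached vector for `text`, computing and storing it on a miss."""
    key = hashlib.sha256(text.encode()).hexdigest()
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE text_hash = ?", (key,)
    ).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: no model call needed
    vec = embed(text)
    conn.execute("INSERT INTO embeddings VALUES (?, ?)", (key, json.dumps(vec)))
    conn.commit()
    return vec

v1 = cached_embedding("hello world")  # computed
v2 = cached_embedding("hello world")  # served from SQLite
print(v1 == v2)  # → True
```

Hashing the text rather than storing it as the key keeps lookups fast and uniform-length regardless of document size.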

🔗 Integrations

llama_cpp · FAISS · Faster Whisper · Redis · SQLAlchemy · Swagger UI

Best For

  • Organizations requiring fully local LLM processing without cloud dependencies
  • Document analysis workflows across mixed formats (PDF, Word, images, audio)
  • Semantic search over proprietary knowledge bases with advanced similarity metrics

Not Ideal For

  • Ultra-low-latency real-time inference
  • Serverless/pay-per-use deployment models
  • Web-scale deployments without substantial infrastructure

Languages

Python

Deployment

Docker · native Python/Ubuntu · automated setup scripts · RAM disk support · multi-worker Uvicorn

Known Limitations

  • Models must be ≥100 MB in .gguf format
  • Concurrency limited by MAX_CONCURRENT_PARALLEL_INFERENCE_TASKS config
  • Local-only — not cloud-native architecture
  • SQLite may limit extreme concurrency
  • RAM disk requires sudo permissions and adequate memory

Pros

  • + Comprehensive document processing pipeline that handles diverse file types including PDFs with OCR, Word documents, and audio transcription
  • + Advanced similarity measures beyond cosine similarity, including statistical correlation methods and dependency measures via optimized Rust library
  • + Intelligent caching system with SQLite storage prevents redundant computations and includes automatic RAM disk management for performance optimization

Cons

  • - Requires significant local computational resources for running multiple LLMs and processing large document collections
  • - Setup may be challenging for users without experience in local LLM deployment and configuration
  • - Limited to a local deployment model, which may not suit teams requiring cloud-native or distributed processing

Use Cases

  • Enterprise document search across mixed file types (PDFs, Word docs, audio recordings) while keeping data on-premises for security compliance
  • Research applications requiring sophisticated similarity analysis beyond basic cosine similarity for academic paper analysis or content clustering
  • Knowledge management systems that need to process and search through large document repositories with automatic embedding generation and caching
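At its core, the semantic-search use case reduces to ranking stored embeddings by similarity to a query vector. A brute-force pure-Python sketch of that ranking (toy vectors; the real service produces embeddings with llama_cpp and indexes them with FAISS instead of scanning linearly):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy document embeddings keyed by filename (illustrative values only).
corpus = {
    "invoice_2023.pdf": [0.9, 0.1, 0.0],
    "meeting_audio.mp3": [0.1, 0.8, 0.3],
    "contract_draft.docx": [0.7, 0.2, 0.1],
}
query = [0.85, 0.15, 0.05]  # embedding of the user's search text

# Rank every document by cosine similarity to the query, best first.
ranked = sorted(corpus, key=lambda doc: cosine(corpus[doc], query), reverse=True)
print(ranked[0])  # → invoice_2023.pdf
```

FAISS performs the same nearest-neighbor ranking, but with index structures that avoid the O(n) scan as the corpus grows.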

Getting Started

Clone the repository and install the Python dependencies via pip or conda. Configure your local LLM models and optional RAM disk settings in the configuration file. Then launch the FastAPI server and open the Swagger UI to upload documents or submit text for embedding generation and semantic search.
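Once the server is up, any HTTP client can drive it. A minimal client sketch using only the standard library; the base URL, port, endpoint path, and payload fields below are assumptions for illustration — check the Swagger UI of your running instance for the actual routes and request models:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8089"  # hypothetical host/port; adjust to your deployment
payload = {
    "text": "What is semantic search?",   # hypothetical field names -- verify
    "llm_model_name": "my-model.gguf",    # against the Swagger UI schema
}

req = urllib.request.Request(
    f"{BASE_URL}/get_embedding_vector_for_string/",  # hypothetical route
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment against a live server:
# with urllib.request.urlopen(req) as resp:
#     vector = json.loads(resp.read())

print(req.get_method(), req.full_url)
```

Because every endpoint is plain REST with JSON bodies, the same call works from curl, requests, or any language's HTTP client.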
