Observability & Evaluation
Monitoring, tracing, and testing infrastructure for running AI agents reliably in production
65 tools
MinerU
[free] Transforms complex documents like PDFs into LLM-ready markdown/JSON for your agentic workflows.
worldmonitor
[open-source] Real-time global intelligence dashboard. AI-powered news aggregation, geopolitical monitoring, and infrastructure tracking in a unified situational awareness interface
ragflow
[open-source] RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
litellm
[free] Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropi
firecrawl
[free] The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
haystack
[open-source] Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, m
langfuse
[open-source] Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23
mastra
[free] From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
prefect
[open-source] Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Scrapegraph-ai
[open-source] Python scraper based on AI
promptfoo
[open-source] Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and
opik
[open-source] Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
weaviate
[open-source] Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a c
unstructured
[open-source] Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to
tensorzero
[open-source] TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
txtai
[open-source] All-in-one AI framework for semantic search, LLM orchestration and language model workflows
phoenix
[free] AI Observability & Evaluation
DocsGPT
[open-source] Private AI platform for agents, assistants and enterprise search. Built-in Agent Builder, Deep research, Document analysis, Multi-model support, and API connectivity for agents.
llmware
[open-source] Unified framework for building enterprise RAG pipelines with small, specialized models
deepeval
[open-source] The LLM Evaluation Framework
openllmetry
[open-source] Open-source observability for your GenAI or LLM application, based on OpenTelemetry
voltagent
[open-source] AI Agent Engineering Platform built on an Open Source TypeScript AI Agent Framework
oumi
[open-source] Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
manifest
[open-source] Smart LLM Routing for OpenClaw. Cut Costs up to 70%
agenta
[free] The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
langroid
[open-source] Harness LLMs with Multi-Agent Programming
bifrost
[open-source] Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
langwatch
[free] The platform for LLM evaluations and AI agent testing
vanna
[open-source] Chat with your SQL database. Accurate Text-to-SQL Generation via LLMs using Agentic Retrieval.
ragas
[open-source] Supercharge Your LLM Application Evaluations
openlit
[open-source] Open source platform for AI Engineering: OpenTelemetry-native LLM Observability, GPU Monitoring, Guardrails, Evaluations, Prompt Management, Vault, Playground. Integrates with 50+ LLM Providers,
gorilla
[open-source] Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
llm-app
[open-source] Ready-to-run cloud templates for RAG, AI pipelines, and enterprise search with live data. Docker-friendly. Always in sync with Sharepoint, Google Drive, S3, Kafka, PostgreSQL, real-time data APIs, a
OmniRoute
[open-source] OmniRoute is an AI gateway for multi-provider LLMs: an OpenAI-compatible endpoint with smart routing, load balancing, retries, and fallbacks. Add policies, rate limits, caching, and observability for
WFGY
[free] WFGY is an open-source AI Troubleshooting Atlas for RAG, agents, and real-world AI workflows. Includes the 16-problem map, Global Debug Card, and WFGY 3.0. Star to help more builders find this repo.
helicone
[open-source] Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23
Langchain-Chatchat
[open-source] Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications built on Langchain and language models such as ChatGLM, Qwen, and Llama | Langchain-Chatchat (formerly langchain-ChatGLM), local knowledge based LLM (like ChatGLM, Qwen and Ll
FastChat
[open-source] An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
uqlm
[open-source] UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
agentops
[open-source] Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and Ca
storm
[open-source] An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.
evals
[free] Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
gitingest
[open-source] Replace 'hub' with 'ingest' in any GitHub URL to get a prompt-friendly extract of a codebase
GPTDiscord
[open-source] A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI-moderation, custom indexes/knowledgebase, YouTube summarizer, and more!
llama-github
[open-source] Llama-github is an open-source Python library that empowers LLM Chatbots, AI Agents, and Auto-dev Solutions to conduct Agentic RAG from actively selected GitHub public projects. It Augments through LL
R2R
[open-source] SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API.
AgentBench
[open-source] A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Verba
[open-source] Retrieval Augmented Generation (RAG) chatbot powered by Weaviate
TaskingAI
[open-source] The open source platform for AI-native application development.
vision-agent
[open-source] This tool has been deprecated. Use Agentic Document Extraction instead.
bRAG-langchain
[free] Everything you need to know to build your own RAG application
text-extract-api
[open-source] Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSO
pezzo
[open-source] Open-source, developer-first LLMOps platform designed to streamline prompt design, version management, instant delivery, collaboration, troubleshooting, observability and more.
ChainForge
[open-source] An open-source visual programming environment for battle-testing prompts to LLMs.
clip-retrieval
[open-source] Easily compute CLIP embeddings and build a CLIP retrieval system with them
uptrain
[open-source] UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform ro
LLM-eval-survey
[free] The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
langfair
[free] LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
swiss_army_llama
[free] A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.
canopy
[open-source] Retrieval Augmented Generation (RAG) framework and context engine powered by Pinecone
langkit
[open-source] LangKit: An open-source toolkit for monitoring Large Language Models (LLMs). Extracts signals from prompts & responses, ensuring safety & security. Features include text quality, relevance m
auto-evaluator
[free] Evaluation tool for LLM QA chains
llm-comparator
[open-source] LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
bananalyzer
[open-source] Open source AI Agent evaluation framework for web tasks
repochat
[open-source] Chatbot assistant enabling GitHub repository interaction using LLMs with Retrieval Augmented Generation