📡

Observability & Evaluation

Monitoring, tracing, and testing infrastructure for running AI agents reliably in production

65 tools

worldmonitor

Real-time global intelligence dashboard. AI-powered news aggregation, geopolitical monitoring, and infrastructure tracking in a unified situational awareness interface

⭐ 45.7k ↑ 8063/mo observability-evaluation

litellm

Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropi

⭐ 41.6k ↑ 3435/mo voice-agents

MinerU

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

⭐ 57.7k ↑ 2243/mo observability-evaluation

OmniRoute

OmniRoute is an AI gateway for multi-provider LLMs: an OpenAI-compatible endpoint with smart routing, load balancing, retries, and fallbacks. Add policies, rate limits, caching, and observability for

⭐ 1.6k ↑ 2078/mo observability-evaluation

promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and

⭐ 18.9k ↑ 1740/mo observability-evaluation

langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

⭐ 24.1k ↑ 1583/mo observability-evaluation

firecrawl

🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data

⭐ 101.6k ↑ 18180/mo browser-web-agents

mastra

From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

⭐ 22.5k ↑ 870/mo observability-evaluation

Scrapegraph-ai

Python scraper based on AI

⭐ 23.1k ↑ 1928/mo observability-evaluation

ragflow

RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs

⭐ 76.7k ↑ 2228/mo observability-evaluation

bifrost

Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

⭐ 3.4k ↑ 675/mo observability-evaluation

voltagent

AI Agent Engineering Platform built on an Open Source TypeScript AI Agent Framework

⭐ 7.1k ↑ 563/mo observability-evaluation

opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

⭐ 18.6k ↑ 353/mo observability-evaluation

phoenix

AI Observability & Evaluation

⭐ 9.1k ↑ 345/mo observability-evaluation

manifest

Smart LLM Routing for OpenClaw. Cut Costs up to 70% 🦞🦚

⭐ 4.2k ↑ 293/mo observability-evaluation

haystack

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, m

⭐ 24.7k ↑ 248/mo observability-evaluation

prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

⭐ 22.0k ↑ 203/mo observability-evaluation

weaviate

Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a c

⭐ 15.9k ↑ 188/mo observability-evaluation

unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to

⭐ 14.4k ↑ 98/mo observability-evaluation

deepeval

The LLM Evaluation Framework

⭐ 14.4k ↑ 300/mo observability-evaluation

langwatch

The platform for LLM evaluations and AI agent testing

⭐ 3.2k ↑ 75/mo no-code-agent-builders

tensorzero

TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.

⭐ 11.2k ↑ 53/mo observability-evaluation

openllmetry

Open-source observability for your GenAI or LLM application, based on OpenTelemetry

⭐ 7.0k ↑ 45/mo observability-evaluation

agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

⭐ 4.0k ↑ 38/mo observability-evaluation

openlit

Open source platform for AI Engineering: OpenTelemetry-native LLM Observability, GPU Monitoring, Guardrails, Evaluations, Prompt Management, Vault, Playground. 🚀💻 Integrates with 50+ LLM Providers,

⭐ 2.3k ↑ 30/mo observability-evaluation

WFGY

WFGY is an open-source AI Troubleshooting Atlas for RAG, agents, and real-world AI workflows. Includes the 16-problem map, Global Debug Card, and WFGY 3.0. ⭐ Star to help more builders find this repo.

⭐ 1.7k ↑ 68/mo observability-evaluation

ragas

Supercharge Your LLM Application Evaluations 🚀

⭐ 13.2k ↑ 360/mo observability-evaluation

helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

⭐ 5.4k ↑ 368/mo observability-evaluation

oumi

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

⭐ 8.9k ↑ 30/mo observability-evaluation

langroid

Harness LLMs with Multi-Agent Programming

⭐ 3.9k ↑ 15/mo voice-agents

txtai

💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows

⭐ 12.4k ↑ 23/mo observability-evaluation

uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

⭐ 1.1k ↑ 8/mo observability-evaluation

vanna

🤖 Chat with your SQL database 📊. Accurate Text-to-SQL Generation via LLMs using Agentic Retrieval 🔄.

⭐ 23.2k ↑ 315/mo no-code-agent-builders

llm-app

Ready-to-run cloud templates for RAG, AI pipelines, and enterprise search with live data. 🐳Docker-friendly.⚡Always in sync with Sharepoint, Google Drive, S3, Kafka, PostgreSQL, real-time data APIs, a

⭐ 59.7k ↑ 2535/mo observability-evaluation

llama-github

Llama-github is an open-source Python library that empowers LLM Chatbots, AI Agents, and Auto-dev Solutions to conduct Agentic RAG from actively selected GitHub public projects. It Augments through LL

⭐ 320 ↑ 8/mo observability-evaluation

agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and Ca

⭐ 5.4k ↑ 83/mo observability-evaluation

gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)

⭐ 12.8k ↑ 60/mo voice-agents

clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them

⭐ 2.7k ↑ 38/mo observability-evaluation

Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications built on Langchain and language models such as ChatGLM, Qwen, and Llama | Langchain-Chatchat (formerly langchain-ChatGLM), local knowledge based LLM (like ChatGLM, Qwen and Ll

⭐ 37.7k ↑ 248/mo observability-evaluation

GPTDiscord

A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI moderation, custom indexes/knowledge base, YouTube summarizer, and more!

⭐ 1.9k ↑ 8/mo observability-evaluation

evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

⭐ 18.1k ↑ 113/mo observability-evaluation

AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

⭐ 3.3k ↑ 38/mo observability-evaluation

DocsGPT

Private AI platform for agents, assistants and enterprise search. Built-in Agent Builder, Deep research, Document analysis, Multi-model support, and API connectivity for agents.

⭐ 17.8k → 15/mo observability-evaluation

llmware

Unified framework for building enterprise RAG pipelines with small, specialized models

⭐ 14.9k → 15/mo observability-evaluation

gitingest

Replace 'hub' with 'ingest' in any GitHub URL to get a prompt-friendly extract of a codebase

⭐ 14.2k ↑ 45/mo observability-evaluation

FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

⭐ 39.5k ↑ 38/mo observability-evaluation

storm

An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.

⭐ 28.0k ↑ 30/mo observability-evaluation

text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSO

⭐ 3.1k ↑ 23/mo observability-evaluation

langfair

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

⭐ 255 → 0/mo observability-evaluation

ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

⭐ 3.0k ↑ 8/mo no-code-agent-builders

swiss_army_llama

A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.

⭐ 1.1k ↑ 8/mo observability-evaluation

bRAG-langchain

Everything you need to know to build your own RAG application

⭐ 4.1k → 0/mo observability-evaluation

vision-agent

This tool has been deprecated. Use Agentic Document Extraction instead.

⭐ 5.3k → 0/mo observability-evaluation

pezzo

🕹️ Open-source, developer-first LLMOps platform designed to streamline prompt design, version management, instant delivery, collaboration, troubleshooting, observability and more.

⭐ 3.2k → 0/mo observability-evaluation

auto-evaluator

Evaluation tool for LLM QA chains

⭐ 782 → 0/mo observability-evaluation

LLM-eval-survey

The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

⭐ 1.6k → 0/mo observability-evaluation

langkit

🔍 LangKit: An open-source toolkit for monitoring Large Language Models (LLMs). 📚 Extracts signals from prompts & responses, ensuring safety & security. 🛡️ Features include text quality, relevance m

⭐ 980 → 0/mo observability-evaluation

canopy

Retrieval Augmented Generation (RAG) framework and context engine powered by Pinecone

⭐ 1.0k → 0/mo observability-evaluation

TaskingAI

The open source platform for AI-native application development.

⭐ 5.4k → 0/mo voice-agents

bananalyzer

Open source AI Agent evaluation framework for web tasks 🐒🍌

⭐ 327 → 0/mo observability-evaluation

llm-comparator

LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.

⭐ 521 → 0/mo no-code-agent-builders

repochat

Chatbot assistant enabling GitHub repository interaction using LLMs with Retrieval Augmented Generation

⭐ 316 → 0/mo observability-evaluation

uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform ro

⭐ 2.3k → 0/mo observability-evaluation

R2R

SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API.

⭐ 7.7k → 8/mo observability-evaluation

Verba

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate

⭐ 7.6k → 15/mo observability-evaluation