lumos
Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs"
Overview
Lumos is an open-source language agent framework that unifies training for complex interactive tasks through modular design. Built on LLaMA-2-7B/13B models, it separates the agent into three modules: planning, grounding, and execution. The framework is trained on roughly 56,000 high-quality subgoal and action annotations derived from ground-truth reasoning steps across multiple benchmarks. Lumos achieves competitive performance with GPT-4/3.5-based agents on web navigation, complex question answering, mathematical reasoning, and multimodal tasks, and its unified data format lets a single framework support all of these task types, making it a valuable resource for researchers developing open-source agents. It outperforms contemporaneous fine-tuned agents such as FireAct, AgentLM, and AutoAct on benchmarks including Mind2Web, HotpotQA, WebShop, and InterCode SQL, while retaining the transparency and accessibility advantages of open-source models.
Deep Analysis
vs GPT-4 agents: unified modular framework achieving competitive performance with 7B-13B models — planning + grounding + execution separation enables task-agnostic agent architecture from Allen AI
⚡ Capabilities
- • Modular language agent framework: planning, grounding, and execution
- • Task decomposition into subgoals with API-mapped actions
- • Handles web navigation, complex QA, math, and multimodal reasoning
- • Unified data format across diverse task types
- • Competitive with GPT-4 agents using smaller 7B-13B models
- • Training data and checkpoints freely available
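To make the three-module split concrete, here is a minimal sketch of a planning → grounding → execution loop in plain Python. This is an illustration of the architectural pattern, not the actual Lumos API: the function names (`plan`, `ground`, `execute`, `run_agent`) and the toy arithmetic task are hypothetical stand-ins, and in Lumos the planner and grounder are fine-tuned LLaMA-2 modules rather than rule-based stubs.

```python
# Hypothetical sketch of a three-module agent loop (not the real Lumos API).
# Planner: decomposes a task into subgoals.
# Grounder: maps each subgoal to an executable action with arguments.
# Executor: dispatches grounded actions to tools and updates shared memory.

def plan(task: str) -> list[str]:
    # Stand-in planner; Lumos uses a fine-tuned LLaMA-2 planning module here.
    return [f"Find operands in: {task}", "Add the operands"]

def ground(subgoal: str, memory: dict) -> tuple[str, dict]:
    # Stand-in grounder: turns a natural-language subgoal into an API call.
    if subgoal.startswith("Find operands"):
        nums = [int(t.strip("?.,")) for t in subgoal.split()
                if t.strip("?.,").isdigit()]
        return "store", {"operands": nums}
    return "add", {"operands": memory["operands"]}

def execute(action: str, args: dict, memory: dict) -> None:
    # Stand-in executor: dispatches each grounded action to a tool.
    if action == "store":
        memory.update(args)
    elif action == "add":
        memory["answer"] = sum(args["operands"])

def run_agent(task: str) -> int:
    memory: dict = {}
    for subgoal in plan(task):
        action, args = ground(subgoal, memory)
        execute(action, args, memory)
    return memory["answer"]

print(run_agent("What is 17 plus 25?"))  # -> 42
```

Because the modules only communicate through subgoal strings and action/argument pairs, each one can be swapped or debugged independently, which is the practical payoff of the modular design described above.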
✓ Best For
- ✓ Multi-step reasoning: web navigation, QA, math problem-solving
- ✓ Research into efficient agent architectures with small models
- ✓ Building agents competitive with GPT-4 at lower cost
✗ Not Ideal For
- ✗ Real-time applications needing extreme low latency
- ✗ Specialized domains without training data coverage
- ✗ Teams without GPU infrastructure for fine-tuning
⚠ Known Limitations
- ⚠ May underperform on specialized domains outside training distribution
- ⚠ Requires significant compute for model fine-tuning
- ⚠ Limited to task types covered by training data
Pros
- + Modular architecture with separate planning, grounding, and execution components enables flexible customization and debugging
- + Unified data format supports multiple task types (web navigation, QA, math, multimodal) within a single framework
- + Competitive performance with much larger proprietary models while being fully open-source and based on smaller LLaMA-2 models
Cons
- - Based on the older LLaMA-2 architecture, which may not incorporate recent language model advances
- - Primarily research-focused with limited documentation for production deployment
- - Requires significant computational resources for training and may need fine-tuning for domain-specific applications
Use Cases
- • Research into open-source language agents and comparative studies against proprietary models
- • Web navigation and automation tasks requiring multi-step planning and execution
- • Complex question answering systems that need to break down problems into actionable subgoals