self-operating-computer

A framework to enable multimodal models to operate a computer.

open-source, agent-frameworks

Stars: 10.2k

Stars/month: +-23

Overview

The Self-Operating Computer Framework lets multimodal AI models control a computer through visual understanding and automated actions. Released in November 2023 as one of the first full computer-use implementations, it lets a model view the screen and execute mouse and keyboard actions to accomplish an objective, much as a human operator would. The framework supports GPT-4o, GPT-4.1, o1, Gemini Pro Vision, Claude 3, Qwen-VL, and LLaVA, with plans to integrate additional models. It has more than 10,000 GitHub stars. The system takes screenshots, processes them with a multimodal model, and translates the model's decisions into mouse and keyboard actions, bridging natural-language instructions and computer control.
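The screenshot, model, action cycle described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the `Action` schema, `run_objective`, and `scripted_model` are hypothetical names, and the screen capture, model call, and input execution are passed in as plain callables so the sketch runs without a display or an API key.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema for this sketch)."""
    kind: str        # "click", "type", or "done"
    x: int = 0       # pixel coordinates, used by "click"
    y: int = 0
    text: str = ""   # text to enter, used by "type"


def run_objective(
    objective: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat screenshot -> model -> action until the model reports "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = ask_model(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


def scripted_model(script: List[Action]) -> Callable[[str, bytes, List[Action]], Action]:
    """Stand-in for a multimodal model: replays a fixed list of actions."""
    def ask(objective: str, screenshot: bytes, history: List[Action]) -> Action:
        return script[len(history)]
    return ask


if __name__ == "__main__":
    performed: List[Action] = []
    plan = [Action("click", 120, 40), Action("type", text="hello"), Action("done")]
    history = run_objective(
        "write hello in the open editor",
        take_screenshot=lambda: b"",     # a real backend would capture the screen
        ask_model=scripted_model(plan),  # a real backend would call a vision model
        execute=performed.append,        # a real backend would move the mouse or type
    )
    print(len(history), len(performed))
```

The `max_steps` cap is an assumption of this sketch; it keeps a model that never reports completion from running indefinitely.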

Deep Analysis

Key Differentiator

Compared with Anthropic Computer Use and Browser Use: one of the first open-source frameworks for full computer use. Multimodal models see the screen and execute mouse and keyboard actions across any application, not just browsers.

⚡ Capabilities

• Multimodal AI agents autonomously operating computers via screen viewing
• Mouse and keyboard action execution based on visual understanding
• Multiple operation modes: OCR, Set-of-Mark prompting, voice input
• Cross-platform support (macOS, Windows, Linux)
• Multiple model backends (GPT-4o, Gemini Pro Vision, Claude 3, Qwen-VL, LLaVA)

🔗 Integrations

• OpenAI GPT-4o/GPT-4.1/o1
• Google Gemini Pro Vision
• Anthropic Claude 3
• Qwen-VL
• LLaVA (via Ollama)

✓ Best For

✓ Automating computer tasks requiring visual understanding
✓ Researching multimodal agent computer interaction
✓ Cross-application workflow automation via screen recognition

✗ Not Ideal For

✗ Reliable production automation (error-prone)
✗ Headless server environments (needs display)
✗ Cost-sensitive automation (API costs per action)
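Because every action requires at least one model call, spend grows with the number of steps in a task. A back-of-the-envelope estimate might look like the sketch below. The function name and the per-call price are placeholders, not real pricing; substitute the current rate for whichever model you use.

```python
def estimate_cost(steps: int, cost_per_call_usd: float, retries_per_step: float = 0.0) -> float:
    """Rough spend for one task: each step is one model call, plus optional retries."""
    calls = steps * (1 + retries_per_step)
    return calls * cost_per_call_usd


if __name__ == "__main__":
    # Placeholder numbers: 20 steps at a hypothetical $0.01 per call.
    print(f"${estimate_cost(20, 0.01):.2f}")
```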

Languages

Python

Deployment

• pip install self-operating-computer
• Local with Ollama (LLaVA)

⚠ Known Limitations

⚠ Requires $5 minimum API spending for GPT-4o access
⚠ LLaVA has very high error rates currently
⚠ Ollama limited to macOS, Linux, and Windows Preview
⚠ Early-stage — accuracy varies significantly by task complexity

Pros

+ Multi-model compatibility supporting 7+ leading AI models including GPT-4 variants, Gemini, and Claude
+ Simple installation and usage: a single `pip install` followed by the `operate` command
+ Early mover in computer automation: one of the first full computer-use frameworks available

Cons

- Requires API keys for external AI services, creating ongoing costs and dependencies
- Needs extensive system permissions including screen recording and accessibility access
- Subject to AI model outages and availability issues that can affect functionality

Use Cases

• Automating repetitive desktop tasks across different applications and workflows
• Testing and comparing different AI models' computer control capabilities
• Building AI-powered desktop automation tools and demonstrations

Getting Started

Install with `pip install self-operating-computer`, run the `operate` command, then enter your OpenAI API key when prompted. Grant the Terminal app Screen Recording and Accessibility permissions in system preferences.

Compare self-operating-computer

• self-operating-computer vs claude-code
• self-operating-computer vs llama.cpp
• self-operating-computer vs dify
• self-operating-computer vs OpenHands
• self-operating-computer vs langgraph