vimGPT

Browse the web with GPT-4V and Vimium

Tags: open-source, agent-frameworks
Stars: 2.7k | Stars/month: +222 | Releases (6m): 0

Overview

vimGPT is an experimental tool for browsing the web with GPT-4V's vision capabilities and the Vimium Chrome extension. Most LLM web automation parses DOM text. vimGPT instead takes a vision-first approach: it analyzes screenshots of the page and interacts with elements through Vimium's keyboard-driven navigation, much as a person navigates a site by looking at it. The tool runs on Playwright and performs browsing actions based on natural language instructions.

The project also includes a voice mode that accepts spoken commands for real-time browsing. It is mainly of interest to researchers exploring multimodal AI interfaces and to developers looking for a vision-based alternative to DOM manipulation.
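The screenshot-then-keystroke idea from the overview can be illustrated with a small sketch. It assumes the vision model returns one action per step as a dictionary with optional `navigate`, `click`, and `type` keys; that format, the `events_for` name, and the event tuples are hypothetical and are not taken from vimGPT's code. The one Vimium fact it relies on is that `f` overlays letter hints on clickable elements and that typing a hint's letters activates the element.

```python
from typing import Dict, List, Tuple

# Hypothetical event format: ("goto", url), ("press", key), or ("type", text).
Event = Tuple[str, str]


def events_for(action: Dict[str, str], hints_visible: bool = True) -> List[Event]:
    """Translate one model-chosen action into navigation and keyboard events."""
    if "navigate" in action:
        # Navigation replaces the page, so any other key in the action is ignored.
        return [("goto", action["navigate"])]

    events: List[Event] = []
    if "click" in action:
        if not hints_visible:
            # In Vimium, `f` overlays a short letter hint on each clickable element.
            events.append(("press", "f"))
        # Typing the hint letters activates the element. Lowercasing is an
        # assumption of this sketch, so that no Shift modifier is implied.
        events.append(("type", action["click"].lower()))
    if "type" in action:
        events.append(("type", action["type"]))
        # Submitting with Enter is also an assumption of this sketch.
        events.append(("press", "Enter"))
    return events


print(events_for({"click": "AB", "type": "weather in Paris"}))
```

With Playwright, such events could be replayed through `page.goto`, `page.keyboard.press`, and `page.keyboard.type`. Keeping the translation step free of browser calls makes it easy to test without launching a browser.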

Pros

  • + Vision-first approach eliminates dependency on HTML/DOM parsing for web interaction
  • + Uses Vimium's keyboard navigation system to target page elements
  • + Supports voice commands for hands-free web browsing automation

Cons

  • - Requires loading the Vimium extension manually for each Playwright session
  • - Element detection degrades significantly at low image resolutions
  • - Limited by current Vision API constraints, including the lack of JSON mode and function calling support
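The last limitation means a structured action has to be recovered from free-form model text. One way to handle that, sketched below under assumptions, is to prompt for a JSON object and then extract it defensively from the reply. The `extract_action` helper is hypothetical and is not vimGPT's implementation.

```python
import json
import re
from typing import Optional


def extract_action(reply: str) -> Optional[dict]:
    """Return the JSON object embedded in a free-text reply, or None."""
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        # Malformed JSON, or several objects in one reply: treat as no action.
        return None
    return parsed if isinstance(parsed, dict) else None


print(extract_action('Sure, here is the action: {"click": "A"} Let me know.'))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or stop. Because the match is greedy, a reply containing two separate objects is rejected rather than guessed at.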

Getting Started

Install the Python dependencies with `pip install -r requirements.txt`, download the Vimium extension locally with `./setup.sh`, then run `python main.py` to start browsing, or `python main.py --voice` for voice mode.