tarsier

Vision utilities for web interaction agents 👀

open-source, tool-integration

1.8k stars

Overview

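Tarsier is a vision utility library built for web interaction agents, addressing a core difficulty in LLM-driven web automation. It adds visual tags (such as [23]) to the interactable elements on a page, which establishes a mapping between an LLM's responses and the page's elements. Its OCR algorithm converts a page screenshot into a structured text representation (similar to ASCII art), so that text-only LLMs can also understand the page's visual layout. Tagging targets interactable elements such as buttons, links, and input fields, and can optionally cover all text elements as well. According to the project's internal benchmarks, text-only GPT-4 with Tarsier's text representation performs 10-20% better than GPT-4V with Tarsier's screenshots. The Python package is designed specifically for the perception problem in web automation and is a useful building block for intelligent web agents.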
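The tag-to-element mapping described above can be illustrated with a minimal sketch. This is not Tarsier's implementation: the `Element` class, the `tag_elements` function, and the use of XPath strings as element references are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for an interactable page element."""
    xpath: str  # where the element lives in the DOM
    label: str  # visible text of the element


def tag_elements(elements):
    """Assign sequential bracket IDs; return (tagged labels, id -> xpath map)."""
    tagged_labels = []
    tag_to_xpath = {}
    for tag_id, element in enumerate(elements):
        # The bracketed ID is what the LLM sees and later refers back to.
        tagged_labels.append(f"[{tag_id}] {element.label}")
        tag_to_xpath[tag_id] = element.xpath
    return tagged_labels, tag_to_xpath


elements = [
    Element("//input[@name='q']", "Search"),
    Element("//a[@href='/login']", "Log in"),
]
labels, mapping = tag_elements(elements)
```

The LLM only ever sees the bracketed labels; the mapping stays on the agent side, so a response that mentions `[1]` can be resolved back to a concrete element.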

Deep Analysis

Key Differentiator

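vs. vision-language models for web tasks: OCR-to-text conversion lets a text-only LLM work from a structured text view of the page. In Tarsier's internal benchmarks, text-only GPT-4 with this text representation outperforms GPT-4V with tagged screenshots by 10-20%, and is also cheaper than GPT-4V.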

⚡ Capabilities

• Visual tagging of interactable web elements with bracket IDs for LLM mapping
• OCR algorithm converting page screenshots to whitespace-structured text
• Enables text-only LLMs to understand and interact with web pages
• LangChain and LlamaIndex integration for agent frameworks
• Async Playwright browser automation support
• Benchmarked internally: text-only GPT-4 + Tarsier text beats GPT-4V + Tarsier screenshots by 10-20%
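To make the idea of "whitespace-structured text" concrete, here is a toy sketch that places OCR words on a character grid so that spacing mirrors on-page position. It is an illustration only, not Tarsier's actual OCR algorithm; the `Word` class, `render_layout`, and the `char_width` / `line_height` values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR result: a piece of text and its top-left pixel position."""
    text: str
    x: int
    y: int


def render_layout(words, char_width=10, line_height=20):
    """Render words as text whose rows and columns approximate the page layout."""
    if not words:
        return ""
    # Bucket words into text rows by vertical position.
    rows = {}
    for word in words:
        rows.setdefault(word.y // line_height, []).append(word)
    lines = []
    for row_index in range(max(rows) + 1):
        line = ""
        for word in sorted(rows.get(row_index, []), key=lambda w: w.x):
            column = word.x // char_width
            if line and len(line) >= column:
                line += " "  # words would collide: keep a single space between them
            else:
                line = line.ljust(column)  # pad with spaces up to the target column
            line += word.text
        lines.append(line)  # rows with no words become blank lines
    return "\n".join(lines)


words = [
    Word("Home", 0, 0), Word("[0]", 200, 0), Word("Login", 240, 0),
    Word("[1]", 0, 40), Word("Search", 40, 40),
]
layout = render_layout(words)
```

Here "Home" stays at the left edge while "[0] Login" is pushed to the right on the same row, and the empty row between the header and the search box is kept as a blank line, so a text-only model still gets a rough sense of where things sit on the page.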

🔗 Integrations

Google Cloud Vision OCR, Microsoft Azure Computer Vision, Playwright, LangChain, LlamaIndex

✓ Best For

✓ Web automation agents needing visual element understanding
✓ Enabling text-only LLMs to interact with web pages effectively
✓ Building autonomous web agents with superior task performance

✗ Not Ideal For

✗ Browser-agnostic solutions requiring Selenium/Puppeteer
✗ Offline operation without external OCR services
✗ Simple DOM-based web scraping without visual context

Languages

Python, TypeScript

Deployment

pip install tarsier; requires external OCR service credentials

⚠ Known Limitations

⚠ Only two OCR engines supported (Amazon Textract coming soon)
⚠ Requires external paid OCR services
⚠ Limited tag styling customization
⚠ Browser driver support limited to Playwright only

Pros

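+ Innovative element tagging system that gives LLMs an intuitive way to refer to page elements, simplifying complex web interaction tasks
+ Distinctive OCR algorithm that converts visual information into text, so text-only LLMs can understand page layout and structure
+ Validated on a large number of real-world web tasks, and outperforms vision-language-model approaches in internal benchmarks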

Cons

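- Supports only the Python ecosystem, which limits use from other programming languages
- Designed specifically for web interaction, so it is not suited to general computer vision tasks
- Performance claims rest on internal benchmarks, with no third-party validation or publicly available comparison data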

Use Cases

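• Building agents that autonomously browse and operate complex websites, for data collection or business process automation
• Developing automated web testing systems in which an AI navigates and interacts with interface elements the way a human user would
• Creating scraping tools that need complex page navigation, especially for JavaScript-rendered dynamic sites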

Getting Started

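1. Install the Python package with pip install tarsier.
2. Take a screenshot of the target page and use Tarsier to tag its elements and convert it to text.
3. Feed the tagged text to an LLM, then use the element IDs to carry out specific interactions (for example, CLICK [23]).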
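The last step, turning an LLM response such as CLICK [23] into an action on a specific element, can be sketched as follows. This is a minimal illustration under stated assumptions: the action grammar, the `TYPE` verb, the `parse_action` helper, and the `tag_to_xpath` dictionary are hypothetical and are not part of Tarsier's documented API.

```python
import re

# Hypothetical action grammar: a verb, a bracketed element ID, and optional text.
ACTION_PATTERN = re.compile(r"^(CLICK|TYPE)\s+\[(\d+)\](?:\s+(.*))?$")


def parse_action(response, tag_to_xpath):
    """Resolve an LLM response like 'CLICK [23]' to a verb and an element XPath."""
    match = ACTION_PATTERN.match(response.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {response!r}")
    verb = match.group(1)
    tag_id = int(match.group(2))
    text = match.group(3)  # None when the action carries no text argument
    if tag_id not in tag_to_xpath:
        raise KeyError(f"No element tagged [{tag_id}]")
    return {"verb": verb, "xpath": tag_to_xpath[tag_id], "text": text}


# Example mapping, as it might be produced when the page was tagged (assumed data).
tag_to_xpath = {5: "//input[@name='q']", 23: "//button[@id='submit']"}

click = parse_action("CLICK [23]", tag_to_xpath)
typed = parse_action("TYPE [5] hello world", tag_to_xpath)
```

With a Playwright page, the resolved XPath could then be passed to something like `page.locator(f"xpath={click['xpath']}").click()`. Rejecting unknown IDs up front keeps a hallucinated tag from turning into a click on the wrong element.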
