petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

Tags: open-source · agent-frameworks

Stars: 10.0k · +38 stars/month · 0 releases (last 6 months)

Star growth: +8 (0.1%) between Mar 27 and Apr 1

Overview

Petals is a distributed peer-to-peer system that runs large language models (LLMs) collaboratively across multiple machines, much like BitTorrent shares files. Instead of requiring powerful hardware to run massive models such as Llama 3.1 (405B parameters), Mixtral (8x22B), or BLOOM (176B), Petals spreads model layers across a network of volunteer computers, so users can access these models from modest hardware, including desktop computers or Google Colab instances.

The system stays compatible with the Hugging Face Transformers API, making it easy to integrate into existing workflows. Users can run both inference and fine-tuning, with claimed speedups of up to 10x over traditional offloading methods.

Petals operates as a community-driven initiative: participants contribute GPU resources to collectively host model layers, forming a shared computational network. It supports public swarms for general use and private swarms for sensitive applications, covering different privacy requirements. A web-based chatbot interface and a programmatic API make it accessible to both technical and non-technical users. With over 10,000 GitHub stars, Petals represents a novel approach to democratizing access to large language models through distributed computing.

Deep Analysis

Key Differentiator

The only framework enabling consumer-hardware users to collectively run 405B+ parameter models via BitTorrent-style distributed inference. Published at ACL 2023 and NeurIPS 2023, it makes frontier-scale models accessible without enterprise GPUs.

⚡ Capabilities

  • Distributed inference of large LLMs (up to 405B) across consumer hardware
  • BitTorrent-style collaborative model hosting
  • Fine-tuning and prompt-tuning over distributed network
  • HuggingFace Transformers-compatible API
  • Support for Llama 3.1, Mixtral, Falcon, BLOOM models
  • Private swarm deployment for sensitive data
  • Up to 10x faster than single-device offloading

🔗 Integrations

Hugging Face Transformers · PyTorch · NVIDIA CUDA · AMD ROCm · Apple Metal (M1/M2)

✓ Best For

  • ✓ Running 100B+ parameter models without expensive GPU hardware
  • ✓ Research teams wanting to experiment with very large models on consumer GPUs
  • ✓ Collaborative model hosting within trusted organizations

✗ Not Ideal For

  • ✗ Production workloads requiring consistent latency and throughput
  • ✗ Processing sensitive/private data (unless using a private swarm)

Languages

Python

Deployment

Distributed peer-to-peer network · Private swarm · Docker · Google Colab
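The deployment options above revolve around joining or hosting a swarm. As a rough sketch of what that looks like in practice (the model name is one example from the public swarm; the addresses and peer ID are placeholders, and serving requires a capable GPU and open network connectivity):

```shell
# Install Petals (assumes Python 3.8+)
pip install petals

# Contribute a GPU to the public swarm by hosting a slice of a model's layers
python -m petals.cli.run_server petals-team/StableBeluga2

# For a private swarm: start your own DHT bootstrap peer first...
python -m petals.cli.run_dht --host_maddrs /ip4/0.0.0.0/tcp/31337
# ...then point servers (and clients) at it instead of the public peers
python -m petals.cli.run_server petals-team/StableBeluga2 \
    --initial_peers /ip4/10.0.0.1/tcp/31337/p2p/PEER_ID_PRINTED_BY_RUN_DHT
```

A private swarm keeps all traffic between machines you control, which addresses the public-swarm privacy caveat noted below.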

⚠ Known Limitations

  • ⚠ Public swarm means data is processed by untrusted peers (privacy concerns)
  • ⚠ Inference speed depends on network participants and their hardware
  • ⚠ Swarm availability fluctuates based on community participation
  • ⚠ Single-batch inference maxes at ~6 tokens/sec for Llama 2 70B

Pros

  • + Enables running very large models (405B+ parameters) on modest hardware through distributed computing
  • + Maintains full compatibility with Hugging Face Transformers API for easy integration
  • + Claims significant performance improvements (up to 10x faster) for fine-tuning and inference compared to offloading

Cons

  • - Data privacy concerns since processing occurs across public swarm of unknown participants
  • - Dependency on community-contributed GPU resources for model availability and performance
  • - Potential network latency and reliability issues inherent in distributed systems

Use Cases

  • β€’ Researchers and developers wanting to experiment with large language models without expensive hardware investments
  • β€’ Organizations needing to fine-tune massive models for specific tasks while leveraging distributed computing resources
  • β€’ Educational institutions teaching about large language models where students can access powerful models from basic computers

Getting Started

Install Petals via pip (`pip install petals`), import `AutoDistributedModelForCausalLM` from `petals`, connect to a distributed model by passing the name of any model listed at health.petals.dev, and generate text with standard Transformers syntax.
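The steps above, sketched end to end (the model name is one example that has been served on the public swarm; any model shown as healthy at health.petals.dev should work, and running this requires a live network connection to the swarm):

```python
# pip install petals
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Example model name; substitute any model currently served by the swarm.
model_name = "petals-team/StableBeluga2"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Connects to the public swarm: the transformer blocks run on remote peers,
# while only the embeddings and a small part of the model run locally.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Standard Transformers generation syntax works unchanged.
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```

Because the layers live on volunteer peers, first-token latency and throughput vary with swarm health, which is worth checking at health.petals.dev before long runs.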
