petals
🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
Overview
Petals is a distributed peer-to-peer system that runs large language models (LLMs) collaboratively across multiple machines, similar to BitTorrent file sharing. Instead of requiring powerful hardware to run massive models like Llama 3.1 (405B parameters), Mixtral (8x22B), or BLOOM (176B), Petals distributes model layers across a network of volunteer computers, letting users access these models from modest hardware such as desktop computers or Google Colab instances.

The system maintains compatibility with the Hugging Face Transformers API, making it easy to integrate into existing workflows. Users can perform both inference and fine-tuning, with claimed performance improvements of up to 10x over traditional offloading. The platform operates as a community-driven initiative in which participants contribute GPU resources to collectively host model layers, creating a shared computational network. Petals supports both public swarms for general use and private swarms for sensitive applications, providing flexibility for different privacy requirements.

A web-based chatbot interface and programmatic API access make the system usable by both technical and non-technical users. With over 10,000 GitHub stars, Petals represents a novel approach to democratizing access to large language models by leveraging distributed computing principles.
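The Transformers-compatibility claim can be sketched as follows. This is a minimal sketch following the pattern in the project's documentation: the helper name and model name are illustrative, and actually calling it requires `pip install petals` plus connectivity to a swarm.

```python
def generate_with_petals(prompt: str,
                         model_name: str = "petals-team/StableBeluga2",
                         max_new_tokens: int = 5) -> str:
    """Sketch of Petals' Transformers-style client flow (helper name and
    model name are illustrative; requires `petals` and a reachable swarm)."""
    # Imports are deferred so the sketch can be read without petals installed.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Only a small fraction of the weights (embeddings, output head) is loaded
    # locally; the transformer blocks execute remotely on swarm peers.
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])
```

Apart from the `petals` import, the call sequence mirrors ordinary Transformers code, which is what the API-compatibility claim refers to.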
Deep Analysis
The only framework enabling consumer-hardware users to collectively run 405B+ parameter models via BitTorrent-style distributed inference; published at ACL 2023 and NeurIPS 2023, it makes frontier-scale models accessible without enterprise GPUs.
⚡ Capabilities
- Distributed inference of large LLMs (up to 405B) across consumer hardware
- BitTorrent-style collaborative model hosting
- Fine-tuning and prompt-tuning over distributed network
- HuggingFace Transformers-compatible API
- Support for Llama 3.1, Mixtral, Falcon, BLOOM models
- Private swarm deployment for sensitive data
- Up to 10x faster than single-device offloading
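For interactive use, the project's docs describe an inference-session pattern in which swarm peers keep attention caches between steps, so each call sends only the new tokens rather than the whole prefix. A hedged sketch of that pattern (the helper name is mine; `inference_session` and the `session=`/`prefix = None` idiom follow the project's documented chatbot example, but details may vary across versions):

```python
def stream_reply(model, tokenizer, session, prompt: str, n_tokens: int = 20) -> str:
    """Generate a reply one token at a time over a Petals inference session.

    `model` is a distributed Petals model and `session` comes from
    `model.inference_session(max_length=...)`; the helper name is illustrative.
    """
    prefix = tokenizer(prompt, return_tensors="pt")["input_ids"]
    reply = ""
    for _ in range(n_tokens):
        outputs = model.generate(prefix, max_new_tokens=1, session=session)
        reply += tokenizer.decode(outputs[0, -1:])
        # The prefix is sent only for the first token; afterwards the swarm
        # peers' attention caches carry the conversation state.
        prefix = None
    return reply

# Typical use (not executed here; requires a live swarm):
# with model.inference_session(max_length=512) as sess:
#     print(stream_reply(model, tokenizer, sess, "Human: hi\nAI:"))
```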
✅ Best For
- Running 100B+ parameter models without expensive GPU hardware
- Research teams wanting to experiment with very large models on consumer GPUs
- Collaborative model hosting within trusted organizations

❌ Not Ideal For
- Production workloads requiring consistent latency and throughput
- Processing sensitive/private data (unless using a private swarm)
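The "unless using a private swarm" caveat corresponds to pointing the client at your own bootstrap peers instead of the public ones. A sketch using the `initial_peers` argument from the project's private-swarm setup; the helper name and the multiaddr are illustrative placeholders:

```python
def connect_to_private_swarm(model_name: str, initial_peers: list):
    """Load a distributed model against a private swarm by supplying the
    multiaddrs of your own bootstrap peers (requires `petals` installed)."""
    # Import is deferred so the sketch can be read without petals installed.
    from petals import AutoDistributedModelForCausalLM

    # Data then flows only through peers reachable from these addresses,
    # avoiding the public swarm's untrusted participants.
    return AutoDistributedModelForCausalLM.from_pretrained(
        model_name, initial_peers=initial_peers
    )

# Hypothetical bootstrap peer inside a trusted network:
# model = connect_to_private_swarm(
#     "meta-llama/Meta-Llama-3.1-405B-Instruct",
#     ["/ip4/10.0.0.1/tcp/31337/p2p/<peer-id>"],
# )
```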
⚠️ Known Limitations
- Public swarm means data is processed by untrusted peers (privacy concerns)
- Inference speed depends on network participants and their hardware
- Swarm availability fluctuates with community participation
- Single-batch inference tops out at ~6 tokens/sec for Llama 2 (70B)
Pros
- Enables running very large models (405B+ parameters) on modest hardware through distributed computing
- Maintains full compatibility with the Hugging Face Transformers API for easy integration
- Claims up to 10x faster fine-tuning and inference compared to offloading

Cons
- Data privacy concerns, since processing occurs across a public swarm of unknown participants
- Depends on community-contributed GPU resources for model availability and performance
- Subject to the network latency and reliability issues inherent in distributed systems
Use Cases
- Researchers and developers experimenting with large language models without expensive hardware investments
- Organizations fine-tuning massive models for specific tasks while leveraging distributed computing resources
- Educational institutions teaching with large language models, where students can access powerful models from basic computers