Petals

Run large language models at home, BitTorrent‑style

  • Generate text with Llama 2 (70B), Falcon (40B+), BLOOM (176B), or their derivatives, and fine‑tune them for your tasks, using a consumer-grade GPU or Google Colab.
  • You load a small part of the model, then join a network of people serving the other parts. Single‑batch inference runs at up to 6 tokens/sec for Llama 2 (70B) and up to 4 tokens/sec for Falcon (180B) — enough for chatbots and interactive apps.
  • Beyond classic LLM APIs: you can employ any fine-tuning and sampling methods, execute custom paths through the model, or inspect its hidden states. You get the comforts of an API with the flexibility of PyTorch and 🤗 Transformers (see the sketch after this list).
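
In concrete terms, client-side usage is a short script. The sketch below follows the project's quickstart and assumes the petals package with its AutoDistributedModelForCausalLM class; the petals-team/StableBeluga2 checkpoint (a Llama 2 70B derivative) is illustrative, and the set of models served on the public swarm may change over time.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# A Llama 2 (70B) derivative served on the public swarm (illustrative;
# check the project's docs for currently available models).
model_name = "petals-team/StableBeluga2"

# The tokenizer, embeddings, and LM head run locally; the transformer
# blocks are executed remotely by other people's servers in the swarm.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```

Since the resulting model behaves like a regular PyTorch / 🤗 Transformers model, the same object can in principle drive custom sampling loops, parameter-efficient fine-tuning, or inspection of intermediate hidden states.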

Follow development in Discord or via email updates.

This project is a part of the BigScience research workshop.