Welcome to Psyche
Psyche is a set of systems that enable distributed training of transformer-based AI models over the internet, allowing untrusted parties to collaborate on training state-of-the-art machine learning models. It leverages a peer-to-peer distributed network for communication and data sharing.
This documentation provides a comprehensive guide to understanding, using, and developing with Psyche, whether you're an end-user looking to participate in a training run, a developer interested in contributing to the project, or just curious about how it all works.
Introduction
How does it work?
At its core, Psyche is a protocol that coordinates multiple independent clients to train a single machine learning model together. Rather than running on a centralized server farm with high-speed interconnects between every accelerator (usually a GPU), Psyche distributes the training workload across many independent computers, each contributing a small piece to the overall training process.
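To make that idea concrete, here is a toy illustration of data-parallel training, in which each client computes a gradient on its own shard of the data and the results are averaged into a single model update. This is a deliberately simplified stand-in: Psyche's actual protocol compresses and gossips these updates over a peer-to-peer network (see the DisTrO sketch further below), and none of the names here come from its codebase.

```python
# Toy data-parallel round: each "client" contributes a gradient from its
# own shard; the combined update trains one shared model. Hypothetical
# example, not Psyche's API.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0])
data = rng.normal(size=(32, 4))          # pretend training data
targets = data @ true_w
weights = np.zeros(4)                    # the single shared model

def local_gradient(shard_x, shard_y, w):
    """Least-squares gradient computed on this client's shard only."""
    err = shard_x @ w - shard_y
    return shard_x.T @ err / len(shard_y)

num_clients, lr = 4, 0.1
for _ in range(300):
    shards = zip(np.array_split(data, num_clients),
                 np.array_split(targets, num_clients))
    grads = [local_gradient(x, y, weights) for x, y in shards]
    weights -= lr * np.mean(grads, axis=0)   # everyone's small piece, combined

print(weights.round(2))                      # approaches [1, -2, 0.5, 3]
```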
Psyche is built to maintain training integrity without requiring participants to trust each other. Through a combination of consensus mechanisms, game theory, and careful protocol design, Psyche ensures that the trained model remains coherent and consistent despite being trained across disparate machines.
Decentralized training flow
```mermaid
flowchart TD
    subgraph sg_solana["Solana"]
        direction TB
        CoordinatorState["Coordinator Program State <br> (Run State, Epoch,<br>Round, Clients)"]
    end

    subgraph sg_distro["DisTrO Optimizer"]
        direction TB
        MomentumUpdate["Update Local Momentum <br> m<sub>t</sub> = βm<sub>t-1</sub> + g<sub>t</sub>"]
        DCTExtract["Extract Fast Components <br> (q<sub>t</sub>) (DCT + TopK)"]
        CompressedUpdate["Compressed Local q<sub>t</sub> <br> (Indices + Amplitudes)"]
        MomentumResidual["Update Local<br>Momentum Residual<br> m<sub>t+1</sub> = m<sub>t</sub> - q<sub>t</sub>"]
    end

    subgraph sg_loop["Local Training"]
        direction TB
        LocalWeights["Model Weights (x<sub>t</sub>)"]
        ApplyAggregatedUpdate["Apply Aggregated Update <br> x<sub>t</sub> = x<sub>t-1</sub> - η Q<sub>t-1</sub>"]
        ReceiveDecode["Receive &<br>Decode/Aggregate <br> Compressed q<sub>t-1</sub><br> from Peers"]
        ForwardBackward["Forward/Backward Pass <br> (Use x<sub>t</sub>, <br>Compute Gradient g<sub>t</sub>)"]
        FetchData["Fetch Assigned Data <br> (Batch<sub>t</sub>)"]
        Gradient["Local Gradient (g<sub>t</sub>)"]
        sg_distro
        P2PNetworkInterface["P2P Network Interface"]
    end

    subgraph sg_client["Client"]
        direction TB
        ClientSM["Client State Machine <br> (Warmup, Train,<br>Witness, Cooldown)"]
        sg_loop
    end

    subgraph sg_p2p["P2P Gossip & Blob Transfer"]
        direction TB
        ClientNode2("Client Node 2")
        ClientNode3("Client Node 3")
        ClientNodeN("Client Node N")
    end

    DataProvider["Data Provider <br> (Local File/HTTP/etc.)"]

    ClientSM -- Manages --> sg_loop
    ClientSM -- Receives State Updates --- CoordinatorState
    ApplyAggregatedUpdate --> LocalWeights
    ReceiveDecode -- "Aggregated Q<sub>t-1</sub>" --> ApplyAggregatedUpdate
    LocalWeights -- Used By --> ForwardBackward
    FetchData -- Provides Data --> ForwardBackward
    ForwardBackward -- Produces Gradient --> Gradient
    Gradient -- Updates --> MomentumUpdate
    MomentumUpdate --> DCTExtract
    DCTExtract -- Produces --> CompressedUpdate
    DCTExtract -- Updates --> MomentumResidual
    CompressedUpdate -- Broadcasts Local Compressed Update --> P2PNetworkInterface
    P2PNetworkInterface -- Receives Compressed Updates --> ReceiveDecode
    DataProvider -- Provides Data --> FetchData
    P2PNetworkInterface <-- Send/Receive Updates -------> sg_p2p
    ClientNode2 <-- Transfer Data Off-chain --> ClientNode3 & ClientNodeN
    ClientNode3 <-- Transfer Data Off-chain --> ClientNodeN
    CoordinatorState -- Assigns Data/Committee --> ClientSM
    ClientSM -- "Submits Transactions (e.g., Join, Tick, Witness)" --> CoordinatorState
```
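The DisTrO stage in the diagram is where the bandwidth savings come from: instead of exchanging full gradients, each client broadcasts only the indices and amplitudes of the top-k DCT components of its momentum, keeping the remainder as a local residual. Below is a minimal numerical sketch of that step, assuming a flat parameter vector and SciPy's DCT; the constants `beta`, `k`, and `eta` and the function names are illustrative, not Psyche's actual implementation.

```python
# Sketch of one DisTrO-style round for a single flat parameter vector.
# Hypothetical names and constants; the real optimizer operates per-tensor.
import numpy as np
from scipy.fft import dct, idct

beta, k, eta = 0.9, 32, 1e-3  # illustrative momentum, TopK size, learning rate

def compress_step(momentum, grad):
    """Update momentum, extract the k largest DCT components, keep a residual."""
    m = beta * momentum + grad                 # m_t = β m_{t-1} + g_t
    coeffs = dct(m, norm="ortho")              # frequency-domain view of m_t
    idx = np.argsort(np.abs(coeffs))[-k:]      # TopK indices by amplitude
    sparse = np.zeros_like(coeffs)
    sparse[idx] = coeffs[idx]
    q = idct(sparse, norm="ortho")             # q_t, the part we broadcast
    residual = m - q                           # m_{t+1} = m_t - q_t
    return (idx, coeffs[idx]), residual        # only indices+amplitudes on the wire

def apply_aggregated(weights, peer_payloads):
    """x_t = x_{t-1} - η Q_{t-1}: decode each peer's (indices, amplitudes)
    payload, average the decoded updates, and step the weights."""
    decoded = []
    for idx, amps in peer_payloads:
        sparse = np.zeros_like(weights)
        sparse[idx] = amps
        decoded.append(idct(sparse, norm="ortho"))
    return weights - eta * np.mean(decoded, axis=0)
```

In a real client, something like `compress_step` would run after every backward pass, the returned payload would be gossiped to peers over the P2P network, and something like `apply_aggregated` would run once enough peer payloads for the previous round have arrived.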