Welcome to Psyche
Psyche is a system that enables distributed training of transformer-based AI models over the internet, aiming to foster collaboration between untrusted parties to create state-of-the-art machine learning models. It leverages a peer-to-peer distributed network for communication and data sharing.
This documentation provides a comprehensive guide to understanding, using, and developing with Psyche, whether you're an end-user looking to participate in a training run, a developer interested in contributing to the project, or just curious about how it all works.
Introduction
How does it work?
At its core, Psyche is a protocol that coordinates multiple independent clients to train a single machine learning model together. Rather than running on a centralized server farm with high-speed interconnects between every accelerator (GPUs, usually), Psyche distributes the training workload across many independent computers, each contributing a small piece to the overall training process.
Psyche is built to maintain training integrity without requiring participants to trust each other. Through a combination of consensus mechanisms, game theory, and careful protocol design, Psyche will ensure that the trained model remains coherent and consistent despite being trained across disparate machines.
Psyche In Depth
The core system is composed of three main actors:
- Coordinator: Serves as a source of truth for global state available to all clients in a given training run. Each run has one coordinator that oversees the entire process. The coordinator is implemented both as a program running on the Solana blockchain and as a regular TCP server.
- Client: A user participating in a training run. Clients receive the model to be trained and a specific dataset for that run. They send information to the coordinator to progress the training run and use a peer-to-peer network to share their results at each training step with other clients.
- Data Provider: Each run requires training data. This data can be served by the Psyche Data Provider server, over HTTP, or loaded from local copies of a dataset.
Sample topologies
---
title: Decentralized Run, training data provided over HTTP.
---
flowchart TB
    subgraph "Solana Blockchain"
        C(["Coordinator State"])
    end
    C <-- Solana RPC & TXs --> C1(("Client")) & C2(("Client")) & C3(("Client"))
    C1 <-. p2p gossip .-> C2
    C3 <-. p2p gossip .-> C2 & C1
    DT["`Hosted training data and model snapshots`"] --> C1 & C2 & C3
---
title: Centralized Run, training data provided through TCP data server
---
flowchart TB
    subgraph "Coordinator Server"
        CC(["Coordinator State"])
    end
    CC <-- TCP --> C11(("Client")) & C22(("Client")) & C33(("Client"))
    C11 <-. p2p gossip .-> C22
    C33 <-. p2p gossip .-> C22 & C11
    DTT["`Hosted training data and model snapshots`"] --> C11 & C22 & C33
What constitutes a training run?
The training process for a given model is divided into small steps that incrementally train the model in a coordinated manner. A training run is divided into epochs, where clients can join and leave the run, and epochs are further divided into steps, where the model is incrementally trained.
During a training run, clients primarily perform three tasks:
- Training: Train the model using an assigned subset of the data.
- Witnessing: Verify the liveness and correctness of other participants.
- Verifying: Recompute and compare training results to identify and punish malicious participants.
Waiting for Members & Warmup
At the start of an epoch, all clients have a window of time to join the run by requesting to be added by the coordinator, and then connecting to the other participating clients.
Once a minimum threshold of clients has been met, the run will transition to the Warmup phase and begin a countdown to allow connected clients to update their copy of the model, after which it will enter the Training phase.
To obtain a copy of the model, the Coordinator will either direct clients to a checkpoint uploaded somewhere like HuggingFace, or direct them to download the model from other clients via the p2p network.
Training
At the beginning of an epoch, after the Warmup phase ends, clients are assigned specific tasks that require them to train the model on a portion of the data.
The coordinator contains information that uniquely assigns pieces of training data to clients based on the current round.
If clients have already been training (i.e., it is not the first round of the epoch), they will apply the results from the previous round, then retrieve the data sample they need for the current round.
After completing the training on their assigned data, each client emits a p2p broadcast to all other clients containing their training results and a cryptographic commitment that binds them to those results.
As training results are received from other clients, they are downloaded to be later incorporated into the current model.
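To make the commitment idea concrete, here's a minimal sketch in Rust of hashing a result payload and checking it on the receiving side. It assumes the `sha2` crate for SHA-256 (the hash the glossary names for commitments); the struct and field names are hypothetical, not Psyche's actual wire format.

```rust
// Illustrative only: the payload encoding and field names are hypothetical.
// Requires the `sha2` crate (e.g. sha2 = "0.10") for SHA-256.
use sha2::{Digest, Sha256};

/// A training result paired with the SHA-256 commitment that binds
/// the sender to it.
struct CommittedResult {
    payload: Vec<u8>,     // serialized training results (e.g. a compressed update)
    commitment: [u8; 32], // SHA-256 over the payload
}

fn commit(payload: Vec<u8>) -> CommittedResult {
    let mut hasher = Sha256::new();
    hasher.update(&payload);
    let commitment: [u8; 32] = hasher.finalize().into();
    CommittedResult { payload, commitment }
}

/// Receivers recompute the hash to check that the payload matches
/// what the sender committed to.
fn verify(result: &CommittedResult) -> bool {
    let mut hasher = Sha256::new();
    hasher.update(&result.payload);
    hasher.finalize()[..] == result.commitment[..]
}

fn main() {
    let result = commit(b"serialized results".to_vec());
    assert!(verify(&result));
}
```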
Witnessing
At the start of each round, one or more clients are randomly selected as witnesses. The number of witnesses can be configured. Witnesses train the model as usual, but also build bloom filters that track which nodes they have received training results from, signifying that they are actively participating and providing valid results.
These bloom filters are sent to the coordinator, which then combines them into a provable consensus of which results to apply to the model.
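As an illustration of the idea, here's a tiny self-contained bloom filter in Rust that a witness could use to track which (batch, client) pairs it has seen results for. The sizing and the salted double-hashing scheme are arbitrary choices for the sketch; Psyche's real filter is sized from a target false-positive rate.

```rust
// A minimal bloom-filter sketch using only std; parameters are illustrative.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Bloom { bits: vec![false; num_bits], hashes }
    }

    fn index<T: Hash>(&self, item: &T, i: u64) -> usize {
        let mut h = DefaultHasher::new();
        i.hash(&mut h); // salt each hash function differently
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    /// Record that we saw a training result, e.g. (batch_id, client_id).
    fn insert<T: Hash>(&mut self, item: &T) {
        for i in 0..self.hashes {
            let idx = self.index(item, i);
            self.bits[idx] = true;
        }
    }

    /// May return false positives, never false negatives.
    fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.hashes).all(|i| self.bits[self.index(item, i)])
    }
}

fn main() {
    let mut witnessed = Bloom::new(1 << 12, 3);
    witnessed.insert(&(42u64, "client-a")); // batch 42 trained by client-a
    assert!(witnessed.contains(&(42u64, "client-a")));
}
```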
Once a witness quorum is reached, the coordinator advances to the Witness phase to allow all clients a brief window to download every training result.
Once the Witness phase concludes, the coordinator returns to the Training phase. Clients are assigned new data, and the process repeats. After a predefined number of rounds, a Cooldown round occurs, marking the end of an epoch.
The witness/train loop visualized
Here's a high-level overview of the process.
Additional details exist, but this captures the overall flow of a single Round from an Epoch:
sequenceDiagram
    participant Client1
    participant Client2
    participant Coordinator
    participant Data Hosting
    Client1 ->> Data Hosting: get_data
    Client2 ->> Data Hosting: get_data
    Coordinator ->> Client2: witness
    Note over Client1: Train
    Note over Client2: Train
    Client1 ->> Client2: Send results
    Client2 ->> Client1: Send results
    Note over Client1: Download results
    Note over Client2: Download results
    Client2 ->> Coordinator: Send witness
    Note over Coordinator: Quorum reached
    Note over Coordinator: Starting Witness phase
Glossary
For a list of common terms within the project along with definitions, please refer to the glossary.
General Workflow
Client
A client is an active participant responsible for executing the training tasks within a run. It handles assigned data batches for training, generates commitments, and participates in the witness process when elected to validate the work of its peers. Each client maintains its own state synchronized with the Coordinator.
Coordinator
The Coordinator stores metadata about the training run's state and a list of participants.
It handles the transition between each Phase of a Round, and provides a random seed that's used to determine data assignments, witnesses, and more.
It's responsible for providing a point of synchronization for all clients within a run.
Ticks (State Transitions)
The coordinator behaves like a state machine, moving from one state to another, with each state transition having specific requirements.
When certain events occur or time-based conditions are met, the Coordinator can be "ticked" forwards to transition from one Phase to another Phase.
sequenceDiagram
    loop
        Note over Backend, Coordinator: Wait for a timeout or backend state
        Backend->>Coordinator: Tick
        Coordinator->>Backend: New state produced
        Backend->Client1: New coordinator state consumed by Client
        Backend->Client2: New coordinator state consumed by Client
    end
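Here's a rough sketch of that state machine in Rust, using the phase names from this page. The fields standing in for timeouts and quorum tracking are simplifications, not the real Coordinator's data model.

```rust
// A toy model of the Coordinator's phase transitions, not the real implementation.
#[derive(Debug, PartialEq)]
enum RunState {
    WaitingForMembers,
    Warmup,
    RoundTrain,
    RoundWitness,
    Cooldown,
}

struct Coordinator {
    state: RunState,
    clients: usize,
    min_clients: usize,
    phase_elapsed: bool,  // set when the current phase's timeout has passed
    quorum_reached: bool, // set when enough witness proofs have arrived
}

impl Coordinator {
    /// Called by the backend on a timer or when new events arrive.
    fn tick(&mut self) {
        self.state = match self.state {
            RunState::WaitingForMembers if self.clients >= self.min_clients => RunState::Warmup,
            RunState::Warmup if self.phase_elapsed => RunState::RoundTrain,
            // opportunistic witnessing or the training timeout both move us on
            RunState::RoundTrain if self.quorum_reached || self.phase_elapsed => {
                RunState::RoundWitness
            }
            // simplified: in the real system this is Cooldown only at the end of
            // an epoch (or when clients/witnesses fall below their thresholds);
            // otherwise it loops back to RoundTrain
            RunState::RoundWitness if self.phase_elapsed => RunState::Cooldown,
            RunState::Cooldown if self.phase_elapsed => RunState::WaitingForMembers,
            // otherwise, stay put until a transition condition is met
            _ => return,
        };
    }
}

fn main() {
    let mut c = Coordinator {
        state: RunState::WaitingForMembers,
        clients: 2,
        min_clients: 2,
        phase_elapsed: false,
        quorum_reached: false,
    };
    c.tick();
    assert_eq!(c.state, RunState::Warmup);
}
```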
Beginning an Epoch (state: WaitingForMembers)
The Coordinator begins in the WaitingForMembers phase, with no clients connected.
Whatever backend you're running the Coordinator in should accept pending clients to be added to upcoming Epochs.
When inside the WaitingForMembers phase, your backend will pass new clients to the Coordinator until a configured min_clients threshold is met, at which point the coordinator's tick will transition it to the Warmup phase.
sequenceDiagram
    Note over Coordinator: min_clients = 2
    Client1->>Coordinator: Join
    Client2->>Coordinator: Join
    Note over Coordinator: Entering Warmup
    Client1->>Client2: Connect
    Client2->>Client1: Connect
    Note over Coordinator: The Warmup countdown elapses
    Note over Coordinator: Entering Training
Model Loading (state: Warmup)
This phase is designed to let all clients download the model & load it onto their GPUs.
If a client has dropped whilst waiting for the warmup time, the Backend then removes the client from the Coordinator's clients list.
If the number of clients falls below min_clients, the Coordinator goes back to the WaitingForMembers phase.
Once the Warmup time passes, the Coordinator loads all the information for the next training round and changes its phase to RoundTrain. The Server will broadcast this Training Coordinator state to all clients.
Training (state: RoundTrain)
In this phase, the Coordinator provides a random seed.
Each client can use this seed, alongside the current round index and epoch index, to determine which indices of the training data to use.
Each client then proceeds to run the training on the selected training data.
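Here's a sketch of how such seed-derived assignment can work: every client evaluates the same pure function of (seed, epoch, round, batch), so they all agree on who trains what without further communication. The hash construction below is illustrative, not Psyche's actual assignment function.

```rust
// Deterministic batch assignment sketch using only std.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically map a batch to the client that must train on it.
fn assigned_client(seed: u64, epoch: u64, round: u64, batch_id: u64, num_clients: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (seed, epoch, round, batch_id).hash(&mut h);
    h.finish() % num_clients
}

fn main() {
    let (seed, epoch, round, num_clients) = (0xC0FFEE, 3, 7, 4);
    // Each batch lands on exactly one client, so no data is trained on twice,
    // and every client derives the identical mapping from the same inputs.
    for batch_id in 0..8u64 {
        println!(
            "batch {batch_id} -> client {}",
            assigned_client(seed, epoch, round, batch_id, num_clients)
        );
    }
}
```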
This state ends when clients later exchange Witness messages.
Witnessing training results
As clients complete their training, they send their results to all other clients, including the Witnesses. The witnesses will each send a witness proof to the Coordinator, building towards a witness quorum.
A witness proof contains a bloom filter describing which pieces of data the witness received training results for, and which clients did that work. Elected witnesses are responsible for creating these witness proofs and sending them to the Coordinator.
The witnesses for each round are chosen randomly from all the clients, using the same random seed as for data assignments. A witness will attempt to send an opportunistic witness message once it has received a training result for every single batch in the current round.
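A minimal sketch of that trigger condition, with hypothetical types: once results have been seen for every batch assigned to the round, the proof can be sent early instead of waiting for the training timeout.

```rust
// Opportunistic-witness condition sketch; names are hypothetical.
use std::collections::HashSet;

struct WitnessTracker {
    round_batches: HashSet<u64>, // all batch ids assigned this round
    seen: HashSet<u64>,          // batch ids we've received results for
}

impl WitnessTracker {
    /// Call when a training result arrives from a peer.
    fn record(&mut self, batch_id: u64) {
        self.seen.insert(batch_id);
    }

    /// True once every assigned batch has a result: time to send the
    /// witness proof opportunistically.
    fn ready_to_witness(&self) -> bool {
        self.round_batches.is_subset(&self.seen)
    }
}
```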
Witness Quorum
The Coordinator advances the run from the Training phase to the Witness phase in one of two ways:
- If enough witnesses observe all results and reach a witness quorum for the round, they notify the Coordinator that it is safe to advance. This process, named opportunistic witnessing, accelerates the transition to the Witness phase, rather than having to wait a fixed time for training results.
- If witnesses do not receive all required results from other clients before the maximum time specified for the Training phase, the Coordinator will nonetheless transition to the Witness phase after the maximum Training time elapses.
Witness phase (state: RoundWitness)
This phase exists to give the witnesses an opportunity to send their proofs to the Coordinator in case they have not received enough training results from other clients to reach the quorum and send their proofs opportunistically.
There is also a brief slack period for non-witness nodes to catch up by downloading any remaining results they might not have received.
When the Witness phase finishes via timeout, the Coordinator transitions from Witness to the Cooldown phase in three cases:
- If we are in the last round of the epoch.
- If the clients have dropped to less than the minimum required by the config.
- If the number of witnesses for the round is less than the quorum specified by the config.
Any clients that have failed health checks will also be removed from the current epoch.
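A minimal sketch of that decision as a predicate over the three conditions above, using hypothetical field names; when none of them holds, the run loops back into another Training round instead.

```rust
// Cooldown-transition sketch; fields and thresholds are illustrative.
struct RoundStatus {
    round: u32,
    rounds_per_epoch: u32,
    healthy_clients: u32,
    min_clients: u32,
    witnesses_this_round: u32,
    witness_quorum: u32,
}

fn should_cooldown(s: &RoundStatus) -> bool {
    s.round + 1 >= s.rounds_per_epoch                // last round of the epoch
        || s.healthy_clients < s.min_clients         // too few clients remain
        || s.witnesses_this_round < s.witness_quorum // quorum not met this round
}
```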
Cooldown phase (state: Cooldown)
The Cooldown phase is the last phase of an epoch, during which the Coordinator waits for either the Cooldown period to elapse or a checkpoint to have happened.
When the Cooldown phase begins, the Coordinator resets the current model checkpoint state to Checkpoint::P2P, signifying that new joiners should download the latest copy of the model from the other participants.
Upon exiting the Cooldown phase, the Coordinator transitions to the next epoch, saving the previous epoch state, and moving back to the WaitingForMembers phase.
It all comes together
Here is an overview of the whole process from a high level perspective:
sequenceDiagram
    Backend->>Coordinator: tick
    Coordinator->>Backend: Change state to `RoundTrain`
    Backend->>Client1: New state
    Backend->>Client2: New state
    par Start training
        Client1->>Client1: Start training
        Client2->>Client2: Start training
    end
    Client1->>Committee: get_witness
    Client2->>Committee: get_witness
    Committee->>Client1: false
    Committee->>Client2: true
    Note over Client1: Train
    Note over Client2: Train
    Note over Client2: Fill bloom filters
    Client2->>Backend: try send opportunistic witness
    Backend->>Coordinator: Witness message
    Note over Coordinator: Enough witnesses for round
    Coordinator->>Coordinator: Update state to RoundWitness
    Note over Coordinator: Timeout round witness time
    alt step > total steps
        Coordinator->>Coordinator: Update state to WaitingForMembers
    else height == rounds per epoch
        Coordinator->>Coordinator: Update state to Cooldown
    else
        Coordinator->>Coordinator: Update state to RoundTrain with step + 1
    end
Health checks
Each client should repeatedly send health checks to the coordinator. Clients are assigned a score determined by the Coordinator using the trainer_healthy_score_by_witnesses method. This score increases as a client sends the required data to be added to the participants' bloom filters, allowing the Coordinator to confirm that the client is actively participating in the training.
A client also sends a list of other clients it considers unhealthy to the server using the HealthCheck message. The Coordinator processes this information to determine whether those clients are healthy. Clients deemed inactive or non-participatory are marked for removal in the next round.
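A toy model of this bookkeeping, with illustrative names and thresholds: scores are credited as witnesses observe a client's work, and reported clients below a threshold are marked for removal.

```rust
// Health-check bookkeeping sketch; not the real Coordinator's logic.
use std::collections::HashMap;

struct HealthLedger {
    scores: HashMap<String, u32>,
    kick_below: u32,
}

impl HealthLedger {
    /// Credit a client each time witnesses observed its work in their filters.
    fn credit(&mut self, client: &str) {
        *self.scores.entry(client.to_string()).or_insert(0) += 1;
    }

    /// Handle a HealthCheck-style report: clients whose score stayed below
    /// the threshold are returned as candidates for removal next round.
    fn process_report(&self, reported: &[String]) -> Vec<String> {
        reported
            .iter()
            .filter(|c| self.scores.get(*c).copied().unwrap_or(0) < self.kick_below)
            .cloned()
            .collect()
    }
}
```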
Centralized Backend
In this Backend, the Coordinator is owned and ticked forwards by a Server that communicates with clients over TCP.
The Server's Coordinator is initially configured in main.rs. It's loaded using the configuration file state.toml.
flowchart LR
    S[Server] --run--> A[App]
    S --new--> C[Coordinator]
    C --run_id init warmup min clients model--> A
The Server uses some parts of the Coordinator configuration, like the data server configuration, if enabled, to boot up all the functionality it needs.
When a new client joins the run, it has to communicate the run_id that it wants to join, to ensure the client is joining the correct run. After processing the join message, the Server adds the client to a pending clients list and runs the Coordinator's tick function to potentially add the client into the run.
When a tick condition is met, the Server ticks the Coordinator forwards, then broadcasts the Coordinator's new state to all connected clients.
Decentralized Backend
In this Backend, the Coordinator is an account associated with a Solana Program, and ticked forwards by a tick method that can be called by anyone.
A training run can be created by calling the init_coordinator method in the Coordinator program, and subsequently information about the model to be trained can be set by calling the update method.
For a new client to join the run, it must call the join_run method in the Coordinator program and pass the run_id for the run it intends to join. After the Solana Program processes the join message, the client is added to a pending clients list, and the Program runs the Coordinator's tick function to potentially add the client into the run.
When a tick condition is met, anybody using Solana can tick the Coordinator forwards by calling the tick method (clients in a Run will do this automatically). This new state is then read via an RPC subscription on each Client, progressing through the regular state machine.
flowchart LR T["Psyche Team"] -- deploy Solana Program --> P["Solana Program"] R["Run Creator"] -- init_coordinator with run_id --> A["Account for this training run"] R["Run Creator"] -- update with run info --> A C[Client] -- "join_run" --> A C --tick--> A G["A random Solana user"] -- tick --> A
Decentralized training flow
flowchart TD
    subgraph sg_solana["Solana"]
        direction TB
        CoordinatorState["Coordinator Program State <br> (Run State, Epoch,<br>Round, Clients)"]
    end
    subgraph sg_distro["DisTrO Optimizer"]
        direction TB
        MomentumUpdate["Update Local Momentum <br> m<sub>t</sub> = βm<sub>t-1</sub> + g<sub>t</sub>"]
        DCTExtract["Extract Fast Components <br> (q<sub>t</sub>) (DCT + TopK)"]
        CompressedUpdate["Compressed Local q<sub>t</sub> <br> (Indices + Amplitudes)"]
        MomentumResidual["Update Local<br>Momentum Residual<br> m<sub>t+1</sub> = m<sub>t</sub> - q<sub>t</sub>"]
    end
    subgraph sg_loop["Local Training"]
        direction TB
        LocalWeights["Model Weights (x<sub>t</sub>)"]
        ApplyAggregatedUpdate["Apply Aggregated Update <br> x<sub>t</sub> = x<sub>t-1</sub> - η Q<sub>t-1</sub>"]
        ReceiveDecode["Receive &<br>Decode/Aggregate <br> Compressed q<sub>t-1</sub><br> from Peers"]
        ForwardBackward["Forward/Backward Pass <br> (Use x<sub>t</sub>, <br>Compute Gradient g<sub>t</sub>)"]
        FetchData["Fetch Assigned Data <br> (Batch<sub>t</sub>)"]
        Gradient["Local Gradient (g<sub>t</sub>)"]
        sg_distro
        P2PNetworkInterface["P2P Network Interface"]
    end
    subgraph sg_client["Client"]
        direction TB
        ClientSM["Client State Machine <br> (Warmup, Train,<br>Witness, Cooldown)"]
        sg_loop
    end
    subgraph sg_p2p["P2P Gossip & Blob Transfer"]
        direction TB
        ClientNode2("Client Node 2")
        ClientNode3("Client Node 3")
        ClientNodeN("Client Node N")
    end
    DataProvider["Data Provider <br> (Local File/HTTP/etc.)"]
    ClientSM -- Manages --> sg_loop
    ClientSM -- Receives State Updates --- CoordinatorState
    ApplyAggregatedUpdate --> LocalWeights
    ReceiveDecode -- "Aggregated Q<sub>t-1</sub>" --> ApplyAggregatedUpdate
    LocalWeights -- Used By --> ForwardBackward
    FetchData -- Provides Data --> ForwardBackward
    ForwardBackward -- Produces Gradient --> Gradient
    Gradient -- Updates --> MomentumUpdate
    MomentumUpdate --> DCTExtract
    DCTExtract -- Produces --> CompressedUpdate
    DCTExtract -- Updates --> MomentumResidual
    CompressedUpdate -- Broadcasts Local Compressed Update --> P2PNetworkInterface
    P2PNetworkInterface -- Receives Compressed Updates --> ReceiveDecode
    DataProvider -- Provides Data --> FetchData
    P2PNetworkInterface <-- Send/Receive Updates -------> sg_p2p
    ClientNode2 <-- Transfer Data Off-chain --> ClientNode3 & ClientNodeN
    ClientNode3 <-- Transfer Data Off-chain --> ClientNodeN
    CoordinatorState -- Assigns Data/Committee --> ClientSM
    ClientSM -- "Submits Transactions (e.g., Join, Tick, Witness)" --> CoordinatorState
Data Provider
When you're training an AI model, you need data to train on! Psyche supports multiple kinds of data providers that will fetch & provide the data your model needs to train.
- Local data provider: Each client already has the training data downloaded locally.
- HTTP data provider: Each client will request individual pieces of data from a webserver as it is assigned that data for training.
- TCP data provider: Each client will reach out to a dedicated server over TCP & request samples of data.
Overview
When a client starts a round of training, it is assigned an ID or a range of IDs by the coordinator, representing all the "batches" of data that will be used for that round. Each batch contains a specific subsection of the overall training data.
The size of a batch is always the same, and can be configured in the run config. The assignment order is deterministic, and is distributed across the clients, so no piece of data is trained on more than once.
To understand how the data is partitioned for each client, refer to the following diagram:
flowchart TD
    C((Coordinator))
    C1[Client A]
    C2[Client B]
    C -- Assigned Batch IDs 1, 2, 3 --> C1
    C -- Assigned Batch IDs 4, 5, 6 --> C2
    subgraph Data Provider
        B1["Batch 1"]
        B2["Batch 2"]
        B3["Batch 3"]
        B4["Batch 4"]
        B5["Batch 5"]
        B6["Batch 6"]
        B4 ~~~ B1
        B5 ~~~ B2
        B6 ~~~ B3
    end
    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C2
    B5 --> C2
    B6 --> C2
Provider configuration
Inside the run config, the key [model.LLM.data_location] specifies whether the data will be hosted on a TCP server, accessed via HTTP, or stored in a local folder.
We also support loading data from GCP as a subsection of the HTTP data provider.
The required configuration depends on the data provider implementation being used:
- TCP Server:
  - If the data provider is configured as a TCP server, an additional file named data.toml is required.
  - This file contains the configuration required for the TCP server, including:
    - Data location
    - Token size
    - Sequence length
    - A seed to shuffle the data if necessary
  - Example data.toml files can be found in psyche/config within the various initial state examples.
- HTTP Provider:
  - For the HTTP data provider, no additional configuration file is needed.
  - The required fields for this setup include:
    - The URL (or a set of URLs) from which the data will be fetched, or, if you're loading data from GCP, a GCP bucket and an optional subdirectory.
    - Token size (in bytes)
    - A shuffle seed, if data shuffling is desired.
- Local Provider:
  - Simply point to the folder where the data should be loaded from.
Model sharing
When an epoch starts, all clients must have an identical model to train with.
At the beginning of a run, all clients must download the model parameters, tokenizer configuration, and model configuration from HuggingFace, where the model must have been previously uploaded (TODO: add more details on uploading a model).
Each client will then modify their copy of the model by receiving new training results from other clients and applying them. This keeps everyone's copy of the model identical within an epoch without an additional full synchronization step.
When a new client joins a run that has already progressed past its first epoch, it would not be correct for the client to download the original model from HuggingFace, as the model parameters would have already been updated during training. Instead, the new client must acquire a copy of the model from the peers who have been actively training it.
This synchronization process occurs during the Warmup phase, while the coordinator waits to begin the next Training phase.
To address this, we checkpoint the model at the end of an epoch, where clients save and share the entire model for new peers to join. There are two checkpointing variants: HuggingFace based and P2P based.
HuggingFace checkpoint
In this approach, a client or a set of clients is designated as the checkpointers for the run. These clients upload their copy of the updated model to HuggingFace after each epoch, and send the URL for this checkpoint to the coordinator. When a new client joins the run, it retrieves the checkpoint URL from the coordinator and connects to HuggingFace to download the latest copy of the model parameters and configuration files.
P2P checkpoint
In the peer-to-peer (P2P) approach, a new client synchronizes by obtaining the latest model directly from other peers. It receives the model information and parameters from any available peer, requesting a set of parameters for each layer from different clients. This process allows the client to assemble the latest model state and participate in the training without an explicit upload to a central server occurring.
Here's an example of a P2P model sharing interaction:
flowchart TB
    C((Coordinator))
    C1[Client 1]
    C2[Client 2]
    C3[Client 3]
    C4[Joining Client]
    C --warmup---> C1
    C --warmup---> C2
    C --warmup---> C3
    C2 --Model config--> C4
    C4 -.Join.-> C
    C1 -.Layer 1 weights.-> C4
    C2 -.Layer 2 weights.-> C4
    C3 -.Layer 3 weights.-> C4
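One way to picture the layer-spreading shown above is a simple round-robin plan mapping each layer to a peer. This policy is an illustration only, not necessarily the exact peer-selection logic Psyche uses.

```rust
// Round-robin layer-download plan; peer selection here is illustrative.
fn plan_layer_downloads(num_layers: usize, peers: &[&str]) -> Vec<(usize, String)> {
    (0..num_layers)
        .map(|layer| (layer, peers[layer % peers.len()].to_string()))
        .collect()
}

fn main() {
    let peers = ["client-1", "client-2", "client-3"];
    for (layer, peer) in plan_layer_downloads(6, &peers) {
        println!("fetch layer {layer} weights from {peer}");
    }
}
```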
Training Rewards
When clients participate in a training run, the Coordinator keeps track of the compute contributions.
An earning_rate is added to the Client's earned points at the end of every successful training Epoch.
Run Treasurer, Compute Incentives
A training run can be created through a Treasurer escrow smart contract. In this case, the Run's authority will be the Treasurer smart contract itself.
An arbitrary token can then be distributed through the Treasurer's token holdings. Every time a client earns a point on the run's coordinator, the treasurer allows claiming a fixed amount of reward token for each earned coordinator point.
The source code for the treasurer smart contract can be found here: https://github.com/PsycheFoundation/psyche/tree/main/architectures/decentralized/solana-treasurer.
Mining Pool, Pooling funds
Participating in a run can be expensive, since a powerful GPU may be required to train a particular model. Users can pool resources together through a Mining Pool smart contract. The source code can be found here: https://github.com/PsycheFoundation/psyche/tree/main/architectures/decentralized/solana-mining-pool.
Each user contributing to a Mining Pool delegates their funds so they can be used by the Mining Pool authority and owner to purchase compute power. The Mining Pool authority can then equitably redistribute through the Mining Pool any tokens received as a result of the training.
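The pro-rata split is straightforward; here's a sketch with hypothetical names (real on-chain code must also handle rounding and overflow carefully):

```rust
// Each lender's redeemable share of received tokens is proportional to
// their fraction of the total deposited collateral. u128 is used for the
// intermediate product to avoid overflow.
fn redeemable(user_deposit: u64, total_deposits: u64, pool_rewards: u64) -> u64 {
    if total_deposits == 0 {
        return 0;
    }
    ((pool_rewards as u128 * user_deposit as u128) / total_deposits as u128) as u64
}
```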
Psyche Glossary
ActiveStep
The state machine phases a Client goes through during a training Round or Epoch, synchronized with the Coordinator's RunState. Includes Warmup, Training, Witness, and Cooldown.
AMD ROCm
An alternative GPU compute platform to NVIDIA's CUDA. Support for ROCm is planned for Psyche clients in the future.
Authorizer
A Solana program that issues authorizations to specific users.
Authorization
A specific role (scope) assigned to a single user (grantee) by a specific authority (grantor). The grantee can then delegate authorization to other keys (the delegates) that can act on its behalf. In practice, this is useful for easily managing permissions for nodes in data center clusters.
Batch
A subset of the training data processed by clients in a single step within a Round. Identified by a BatchId.
BatchId
A unique identifier for a specific Batch of training data.
Bloom Filter
A probabilistic data structure used for efficient set membership testing (e.g., checking if a client's commitment has been witnessed). Used in WitnessBloom. Has a small chance of false positives.
BLOOM_FALSE_RATE
The target false positive rate (1% in this case) for the Bloom Filters used in the witness protocol.
Checkpoint
A saved state of the LLM being trained. Psyche uses checkpoints to allow runs to be paused, resumed, or recovered after interruptions. Checkpoints can be stored in a central HubRepo or shared between clients via P2P.
Checkpointers
Designated, trusted participants responsible for saving the model Checkpoint during the Cooldown phase.
Client
The software participants run on their own hardware (typically with a GPU) to contribute to the distributed training process. Clients perform computations, submit results (Commitments), and participate in Witnessing.
ClientState
The status of a Client as tracked by the Coordinator. Key states include Healthy, Dropped, Withdrawn, and Ejected.
Commitment
A cryptographic hash (SHA-256) of a client's computational results for a given Batch. Submitting commitments allows the Coordinator and Witnesses to verify work was done without transferring the full results initially.
Committee
The particular role of a client in a given round. Can be one of Trainer, Verifier, or TieBreaker.
Cooldown
A phase (RunState and ActiveStep) at the end of an Epoch where model Checkpoints are saved and the system prepares for the next epoch.
Coordinator
The central orchestrator of the Psyche training system, implemented as a Solana program. It manages the training lifecycle (RunState), client participation (ClientState), data batch assignment, and Witnessing.
CoordinatorConfig
The set of parameters defining how a specific training run operates (e.g., warmup_time, witness_quorum, rounds_per_epoch).
CUDA
NVIDIA's parallel computing platform and programming model, required for running the Psyche client on NVIDIA GPUs.
Data Provider
Component responsible for supplying the training data in organized Batches.
Desync
An error state (StepError::Desync) occurring when a Client's ActiveStep falls out of synchronization with the Coordinator's RunState.
Docker
A platform used to build, ship, and run applications in containers. Psyche uses Docker to distribute and run the client software.
Dropped
A ClientState indicating a client has become unresponsive or disconnected unexpectedly.
Ejected
A ClientState indicating a client has been forcibly removed from the training run, typically due to failing health checks or malicious behavior. Ejected clients may be subject to Slashing.
Epoch
A major cycle in the training process, composed of multiple Rounds. An Epoch starts with the WaitingForMembers and Warmup phases and ends with a Cooldown phase.
Exited Clients
A buffer on the Coordinator holding records of clients that have recently left the run (Dropped, Withdrawn, Ejected).
Finished
A RunState indicating that the training run has completed its configured total_steps.
Garnix
The CI (Continuous Integration) service based on Nix used by Psyche.
Health Check
A verification procedure (health_check()) initiated by designated witness clients. Its purpose is to monitor peer clients and confirm they are actively processing their assigned training batches. When a witness client detects a peer that appears unresponsive or failing (unhealthy), it notifies the central coordinator. The coordinator independently verifies the status of the reported peer by running its own health check. If this check confirms the report, the peer is marked as unhealthy and is kicked.
Healthy
The desired ClientState, indicating the client is connected, responsive, and participating correctly in the training process. Only Healthy clients typically receive Rewards.
HubRepo
A centralized repository location (e.g., Hugging Face, S3 bucket) where the model Checkpoint can be stored, particularly when initializing or if P2P storage is unavailable.
Iroh
A P2P library that Psyche uses for data-sharing between the clients.
Lightweight Hashing
Using efficient hashing algorithms like SHA-256 for Commitments to allow for fast verification by the Coordinator and Witnesses.
Metal
Apple's graphics and compute API. A future backend target for running the Psyche client on Mac hardware.
min_clients
The minimum number of Healthy clients required for a training run to progress beyond the WaitingForMembers state.
Mining Pool
A Solana program that implements a basic "mining" or lending pool mechanism where users (lenders) can deposit collateral into a pool to delegate funds to other participants with more compute power, and eventually claim redeemable tokens proportionate to their share of the total deposited collateral.
NUM_STORED_ROUNDS
A constant defining how many past rounds' states are kept in the Coordinator's history buffer (e.g., 4 rounds).
Nix
A tool for declarative and reproducible builds, used by Psyche.
Opportunistic Witnessing
A feature that allows progressing early from the RoundTrain phase to the Witness phase, given that the witness quorum is reached.
Paused
A RunState where the training process is temporarily stopped by manual intervention. Can be resumed later.
P2P
Peer-to-Peer, meaning a client acts both as a client and as a server, sharing data with its peers. This is the intended way of data-sharing during a stable run.
Psyche
Nous Research's set of systems that enable distributed training of transformer-based AI models over the internet.
Round
A smaller cycle within an Epoch. Involves a training phase (RoundTrain) and a validation phase (RoundWitness).
RoundTrain
The phase (RunState and ActiveStep) where clients download assigned data Batches, perform training computations (e.g., calculate gradients), and submit Commitments.
RoundWitness
The phase (RunState and ActiveStep) where clients act as Witnesses to validate the Commitments submitted by other clients during RoundTrain. Requires a witness_quorum to succeed.
rounds_per_epoch
A configuration parameter (CoordinatorConfig) specifying how many Rounds make up one Epoch.
RunState
The overall state of the training run as managed by the Coordinator. Examples include Uninitialized, WaitingForMembers, Warmup, RoundTrain, RoundWitness, Cooldown, Paused, Finished.
SHA-256
The specific cryptographic hash function used to create Commitments in Psyche.
Solana
The blockchain platform on which the Psyche Coordinator program runs.
StepError
A category of errors related to the Client's ActiveStep progression, such as Desync.
tick()
A function periodically called on the Coordinator program to drive the state machine transitions (advancing RunState based on time limits, client counts, and submitted results). Specific versions exist for different states (e.g., tick_waiting_for_members, tick_round_witness).
total_steps
A configuration parameter defining the total number of training steps or batches the run aims to complete before entering the Finished state.
Training
The ActiveStep where the client actively computes gradients or other training operations on its assigned data Batch.
Treasurer
A Solana program that runs on top of Psyche's Coordinator, managing the distribution of rewards to the clients and keeping track of the points earned by each client in the training process.
Uninitialized
The default starting RunState of the Coordinator before a training run is configured and started.
WaitingForMembers
The RunState where the Coordinator waits for the minimum number of clients (min_clients) to connect and become Healthy before starting the training process.
Warmup
The initial phase (RunState and ActiveStep) of a training run where clients download the model Checkpoint and initialize their training environment.
Witness
A Client selected to validate other clients' work.
WitnessBloom
The specific Bloom Filter used on the Coordinator to track which client Commitments have been successfully witnessed.
Witness Quorum
The minimum number of clients that must successfully act as Witnesses and agree on the validity of results for a Round to be considered successful.
Withdrawn
A ClientState indicating that a client has exited the run.
End-user configuration
Joining a run
Learn how to use your existing compute to join a run
You may also want to read the client FAQ
Creating a run
To train your own models on Psyche, you should familiarize yourself with:
- The process for Creating a run
- The Run configuration file format
- The process for Authentication
Joining a training run
Pre-requisites
The Psyche client currently only runs under Linux.
NVIDIA Driver
Psyche requires an NVIDIA CUDA-capable GPU. If your system does not have NVIDIA drivers installed, follow NVIDIA's installation guide for your Linux distribution.
Running under Docker
The Psyche client is distributed as a Docker image. In order to run it, you will need a container engine. We develop & test Psyche using Docker, so we recommend you use the same.
If you don't have Docker installed, follow the Docker Engine installation guide for your Linux distribution.
NVIDIA Container Toolkit
The NVIDIA Container Toolkit is used to enable GPU access inside Docker containers, which Psyche uses for model training. To install it, follow the NVIDIA Container Toolkit installation guide for your Linux distribution.
Solana RPC providers
To ensure reliability, performance, and security, all end-users must configure their own private Solana RPC provider, though configuring two is recommended to accommodate outages and network blips. We recommend using a dedicated RPC service such as Helius, QuickNode, Triton, or self-hosting your own Solana RPC node.
Configuration
A .env file should be created containing all the necessary configuration variables for joining a training run. These variables will be used to interact with the Solana blockchain, specify the model you'll contribute compute to, and configure the Psyche client based on your hardware resources.
Your .env file should contain at least these configuration options:
- RPC: The RPC URL of your primary Solana provider.
- WS_RPC: The websocket RPC URL of the same primary Solana provider.
- RPC_2: The RPC URL of your other Solana provider. If you don't have one, use a public alternative. For example, https://api.devnet.solana.com for Devnet.
- WS_RPC_2: The websocket RPC URL of your other Solana provider, or a public alternative if you don't have one. For example, wss://api.devnet.solana.com for Devnet.
- RUN_ID: The ID of the training run you will join.
- NVIDIA_DRIVER_CAPABILITIES: An environment variable that the NVIDIA Container Toolkit uses to determine which compute capabilities should be provided to your container. It is recommended to set it to 'all', e.g. NVIDIA_DRIVER_CAPABILITIES=all.
- DATA_PARALLELISM: The number of GPUs the training data will be distributed across. This speeds up computation if you have the resources.
- TENSOR_PARALLELISM: The number of GPUs the loaded model will be distributed across. This lets you train a model you can't fit on one single GPU.
- MICRO_BATCH_SIZE: The number of samples processed per GPU per step. This affects memory usage; set it as high as VRAM allows.
- AUTHORIZER: The Solana address that delegated authorization to the Solana public key you will use to join the run. You can read more about authorization here.
Testing authorization
You can check if your client is authorized to join a given run by running psyche-solana-client check-authorization --run-id <RUN_ID> --authorizer <AUTHORIZER> --pubkey <PUBKEY>, where <AUTHORIZER> is the same Solana authorizer you have configured in your .env file, and <PUBKEY> is the public key associated with the client's private key. You can also pass the private key via RAW_WALLET_PRIVATE_KEY="$(cat <path_to_solana_pubkey>)" instead of passing --pubkey.
This command will return successfully if the wallet is authorized to join that run, given the associated authorizer.
Running the Psyche client docker image
To download and run the Psyche client through Docker, run the following command, replacing <path_to_env_file> and <path_to_solana_pubkey> with your own.
docker run -d \
--env-file <path_to_env_file> \
-e RAW_WALLET_PRIVATE_KEY="$(cat <path_to_solana_pubkey>)" \
--gpus all \
--network "host" \
nousresearch/psyche-client:latest
Creating a run
To create a new training run and make it available for nodes to join, you'll need to create it, configure it, and unpause it.
First, create the run on-chain. You'll need to provide:
- the RPC & websocket RPC urls so the client can communicate with an RPC node.
- a unique run ID - just a few characters to uniquely identify your run.
- a name & description for your run
psyche-solana-client create-run \
--rpc [RPC] \
--ws-rpc [WS_RPC] \
--run-id [RUN_ID] \
--name [NAME] \
--description [DESCRIPTION]
Then, set the run's config. You'll need to provide:
- the RPC & websocket RPC urls so the client can communicate with an RPC node.
- the run ID you previously used
- the path to a
config.toml
file, following the run config schema
psyche-solana-client update-config \
--rpc [RPC] \
--ws-rpc [WS_RPC] \
--run-id [RUN_ID] \
--config-path [CONFIG_FILE]
At this point, your run is ready to go! You can now set its state to "unpaused", and let clients join & begin training your model.
psyche-solana-client set-paused \
--rpc [RPC] \
--ws-rpc [WS_RPC] \
--run-id [RUN_ID] \
resume
Congratulations! As soon as your first client joins, your model is being trained.
Run configuration
A training run on Psyche is described using a Run Configuration file.
It's a .toml file with information about the model shape, size, checkpoints, optimizer settings, run witnessing settings, and more.
There are two top-level values in a run configuration: a config, and a model.
While some examples are described below, you can find the full range of options for the coordinator here and for the model here.
Config
Here's a sample config with some of its options documented.
[config]
# maximum time, in seconds, to let nodes download the model from a checkpoint / other nodes
warmup_time = 30
# time, in seconds, to let nodes bring the model from the GPU to disk, and to opt to join the next round.
cooldown_time = 30
# how many training rounds in one "epoch", from warmup to cooldown.
rounds_per_epoch = 20
# maximum time, in seconds, to allow nodes to train in one round.
# this will limit the types of GPUs your model can be trained on,
# since setting it low will prevent slower hardware from completing
# training in time.
max_round_train_time = 30
# time, in seconds, to allow witnesses to publish their messages before next round
round_witness_time = 1
# number of clients that need to be active for an epoch to continue on.
# if the number of clients goes below this number, we initiate a Cooldown and then back to WaitingForMembers.
# this should be adjusted alongside max_round_train_time, because one client will train a lot slower
# than 100.
min_clients = 1
# minimum number of clients required before we transition from WaitingForMembers to Warmup.
# must be equal to or greater than min_clients
init_min_clients = 1
# what percent of nodes are dedicated to verifying correctness. always set to 0 for now.
verification_percent = 0
# how many nodes are selected each round to publish witness proofs
witness_nodes = 1
# the total number of training data batches per-step. this also determines your maximum number of clients.
# the batch size will linearly increase from global_batch_size_start to global_batch_size_end over
# global_batch_size_warmup_tokens tokens
global_batch_size_start = 8
global_batch_size_end = 8
global_batch_size_warmup_tokens = 0
# the total number of training steps to partake in. this is used for the LR schedule in the model section too.
total_steps = 25000
Model
# so far only LLMs are supported.
[model.LLM]
architecture = "HfLlama"
data_type = "Pretraining"
max_seq_len = 2048
[model.LLM.checkpoint.Hub]
repo_id = "emozilla/llama2-20m-init"
[model.LLM.data_location.Http]
token_size_in_bytes = "TwoBytes"
shuffle = "DontShuffle"
[model.LLM.data_location.Http.location.Gcp]
bucket_name = "nous-pretraining-public-us"
filter_directory = "fineweb-edu-tokenized-llama2"
[model.LLM.lr_schedule.Cosine]
base_lr = 4.0e-4
warmup_steps = 250
warmup_init_lr = 0.0
total_steps = 25000
final_lr = 4.0e-5
# only the DisTrO optimizer is supported when training models on Psyche.
[model.LLM.optimizer.Distro]
clip_grad_norm = 1.0
compression_decay = 0.999
compression_chunk = 64
compression_topk = 8
quantize_1bit = true
Authentication and Keys
When clients participate in a decentralized training run, a set of Solana Keypairs is used to authenticate each type of user.
User Roles
A different set of keys will be used for each role within the training flow.
The following roles will be important:
- The Run's main_authority is the private key that creates and owns the run. It is the only key that is allowed to modify the run's configuration.
- The Run's join_authority is the private key that is responsible for allowing or disallowing clients' keys to join a training run. It is set by the main_authority during the creation of the Run.
- A client's authorizer (or grantee) key is the "master" private key of a compute provider. That key may be allowed to join a run and to set delegate keys that can also join the run on its behalf.
- A Client's delegate key is a temporary and ephemeral key that can be allowed to join a run's training on behalf of a user.
A training run can be configured to be restricted to only a set of whitelisted keys; this kind of run is considered "Permissioned", as opposed to a "Permissionless" run, which is open to anyone without any authorization required.
Permissioned Runs
In order to be able to join a run, a user (with a key) must first be allowed to join it.
This is done through the following steps:
- The join_authority (the grantor) issues an authorization to an authorizer (the grantee)
- The authorizer (the grantee) sets a list of delegate keys that can join the run on its behalf
- The delegate key can then join the run
Key Authorizations
Make sure to install the scripting dependencies:
sudo apt-get install jq
cargo install solana_toolbox_cli
For the join_authority (the grantor) to issue a new authorization, a script is provided:
# We assume that "grantor.json" contains the Private Key of the "join_authority"
# The "grantor.json" can be created using: $ solana-keygen new -o grantee.json
# We assume that $GRANTEE_PUBKEY is set to the public key of the "authorizer" (or grantee)
# The $GRANTEE_PUBKEY can be retrieved by using: $ solana-keygen pubkey grantee.json
sh scripts/join-authorization-create.sh devnet grantor.json $GRANTEE_PUBKEY
For the authorizer (the grantee) to set a list of delegates, the following script is provided:
# We assume that $GRANTOR_PUBKEY is set to the public key of the "join_authority" of the run
# The $GRANTOR_PUBKEY can be retrieved by using: $ solana-keygen pubkey grantor.json
# We assume that "grantee.json" contains the Private Key of the "authorizer"
# The "grantee.json" can be created using: $ solana-keygen new -o grantee.json
# We assume that a set of keypairs exist at path: delegate1.json, delegate2.json, etc
sh scripts/join-authorization-set-delegates.sh devnet $GRANTOR_PUBKEY grantee.json delegate*.json
Further information
The source code for the authorizer smart contract used by Psyche's coordinator can be found, along with its readme, here: https://github.com/PsycheFoundation/psyche/tree/main/architectures/decentralized/solana-authorizer
Client FAQ
- Which operating systems are supported?
- We officially support modern Linux versions, with Mac support planned for the future once the Metal backend is implemented.
- What are the hardware requirements to run a client?
- You need a CUDA-compatible GPU. The exact specs will depend on the size of the model being trained. Support for AMD ROCm is planned for the future.
- Can I join a run at any moment?
- Yes! You will remain as a pending client and start training in the next epoch.
- Can I leave a run at any moment? How do I leave a run?
- Yes, you may leave a run by closing the container with the client, either with Ctrl+C in the terminal or by manually stopping the container if it's running detached. However, take into account that once rewards are implemented, you will lose all rewards for that epoch.
- What happens if my connection drops or my client crashes?
- This is similar to closing the client; just make sure the Docker container is correctly stopped and re-run the client.
- How do I update the client to the latest version?
- You can force Docker to pull the latest image by running docker pull nousresearch/psyche-client:latest before running the client.
- Do I need a Solana wallet to train? Does it need to have funds?
- Are the client and coordinator open-source? Can I report bugs?
- Yes, you may check Psyche's GitHub repo.
Psyche Development
As the Psyche project is large & complex, we'll walk you through some of the processes we use in development.
- Setup & Useful Commands
- Running Psyche On-Chain
- Running Psyche Off-Chain
- Implementing Models
- Secrets Management
- Building these docs
- CI
- Contributing
Setup & Useful Commands
Installation and Setup
Psyche uses nix + flakes to install every single dependency and development tool Psyche needs to run and be developed.
This is the preferred way of working on Psyche, as it guarantees a consistent development and build process regardless of your machine's specific configuration.
If you can't or don't want to use Nix, it's also possible to manually install all the required deps for Psyche.
Any Linux, via Nix
Installing Nix
To install nix, simply run curl --proto '=https' --tlsv1.2 -sSf -L https://install.determinate.systems/nix | sh -s -- install or find it in your local package manager.
Binary cache
To speed up your builds & your local dev shell, we recommend enabling the binary cache from garnix, our CI provider.
In order to use the cache that garnix provides, change your nix.conf, adding https://cache.garnix.io to substituters, and cache.garnix.io:CTFPyKSLcx5RMJKfLo5EEPUObbA78b0YQ2DTCJXqr9g= to trusted-public-keys.
If you've just installed Nix via the Determinate Systems installer above, you can do this by adding these lines to /etc/nix/nix.conf:
extra-substituters = https://cache.garnix.io
extra-trusted-public-keys = cache.garnix.io:CTFPyKSLcx5RMJKfLo5EEPUObbA78b0YQ2DTCJXqr9g=
Setup Using direnv
You can optionally use direnv to automatically enter a Nix environment when you cd into the Psyche folder.
Install direnv from your system's package manager.
After running direnv allow in the Psyche directory once, your terminal will automatically enter a development shell when you subsequently cd into the Psyche directory.
Setup Without direnv
Each time you open a new shell in the Psyche directory, run nix develop to enter a development shell.
Ubuntu
The following instructions are needed for a server with a fresh Ubuntu installation.
1. Install drivers
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install
2. Install CUDA libraries
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
rm cuda-keyring_1.1-1_all.deb
sudo apt-get install libnccl-dev libnccl2
sudo apt install nvidia-cuda-toolkit
3. Download libtorch & extract
wget https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu124.zip
unzip libtorch-cxx11-abi-shared-with-deps-2.6.0+cu124.zip
rm libtorch-cxx11-abi-shared-with-deps-2.6.0+cu124.zip
4. Libtorch environment variables
In the .bashrc file, set the following libtorch environment variables. Here <path_to_libtorch> is the absolute path to the extracted libtorch folder from the previous step.
export LIBTORCH=<path_to_libtorch>
export LIBTORCH_INCLUDE=<path_to_libtorch>
export LIBTORCH_LIB=<path_to_libtorch>
export LD_LIBRARY_PATH=<path_to_libtorch>/lib:$LD_LIBRARY_PATH
export CUDA_ROOT=/usr/local/cuda-12.4
These variables can also be provided to cargo by creating a .cargo/config.toml file in your home directory:
[env]
LIBTORCH=<path_to_libtorch>
LD_LIBRARY_PATH=<path_to_libtorch>/lib
CUDA_ROOT = "/usr/local/cuda-12.4"
5. Download & install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
6. (optional) Install just
sudo snap install just --edge --classic
7. (optional) Install Solana and Anchor
Install Solana
sh -c "$(curl -sSfL https://release.anza.xyz/beta/install)"
After installation, follow the instructions to add the Solana tools to PATH.
Install Anchor
cargo install --git https://github.com/coral-xyz/anchor --rev a7a23eea308440a9fa9cb79cee7bddd30ab163d5 anchor-cli
This may require
sudo apt install pkg-config libudev-dev libssl-dev libfontconfig-dev
Windows
- Install CUDA libraries: https://developer.nvidia.com/cuda-12-4-1-download-archive?target_os=Windows&target_arch=x86_64&target_version=11
- Download libtorch & extract: https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu124.zip
- Download OpenSSL: https://slproweb.com/download/Win64OpenSSL-3_3_2.exe
- Install Perl: https://github.com/StrawberryPerl/Perl-Dist-Strawberry/releases/download/SP_53822_64bit/strawberry-perl-5.38.2.2-64bit.msi
- Create a .cargo/config.toml file to set environment variables
NOTE: Building may take several minutes the first time, as openssl-sys takes a long time to build (for some reason).
[env]
LIBTORCH = <path_to_libtorch>
OPENSSL_LIB_DIR = <path_to_openssl>/lib/VC/x64/MT
OPENSSL_INCLUDE_DIR = <path_to_openssl>/include
MacOS / aarch64
These platforms aren't supported right now :( PRs welcome!
Docker
Create a Docker image with the necessary dependencies to run a Psyche client:
- Install the necessary NVIDIA and CUDA drivers as explained in the previous sections.
- Install the NVIDIA container toolkit. If using Ubuntu, just run:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
- Create an .env file following the .env.example in psyche/config/client and update the necessary environment variables.
- Run docker compose build.
Useful commands
Psyche uses just to run some common tasks.
You can run just to see the whole list of commands!
Running checks
requires Nix!
just check
If it passes, CI will pass.
Formatting
just fmt
Running Psyche on-chain
To build the Solana programs, you'll need a handful of Solana tools installed. See the setup if you're not using Nix.
To start, you'll need to create a Solana wallet to fund your transactions.
solana-keygen new
Run on a local validator (localnet)
In a new terminal, run the following command to:
- set up a solana-test-validator
- deploy all the required programs
- create a local run with name <RUN_ID>. If no run name is provided, the name test will be used by default.
just setup-solana-localnet-test-run run_id=<RUN_ID>
Then, in another terminal, run a client to train the test model and join the run named <RUN_ID>. If no run name is provided, the name test will be used by default.
just start-training-localnet-client run_id=<RUN_ID>
This will start a run to train a 1.1b parameter model with all the parallelism features enabled. For a more lightweight run to avoid OOM errors, or just to use your hardware less, (we see you 8gb VRAM cards!) there's also:
just setup-solana-localnet-light-test-run
just start-training-localnet-light-client
By default the client will use the private key generated by solana-keygen new above (located by default in ~/.config/solana/id.json).
To spin up another client and join the run we'll have to create another keypair using:
solana-keygen new --outfile <PATH_TO_NEW_KEYPAIR>
and run the same just command but using the newly created keypair:
WALLET_FILE=<PATH_TO_NEW_KEYPAIR> just start-training-localnet-client run_id=<RUN_ID>
or:
WALLET_FILE=<PATH_TO_NEW_KEYPAIR> just start-training-localnet-light-client run_id=<RUN_ID>
Run on Solana's Devnet
You'll need to fund your wallet to make transactions on Devnet. You can request an airdrop from the Solana foundation of up to 10 devnet sol every 8 hours. Simply run
solana-keygen pubkey
and paste the resulting key into the airdrop website.
You can then use the same steps for deploying the programs, creating a run, and training as on localnet above, but using the following just commands:
just setup-solana-devnet-test-run
just start-training-devnet-client
alongside the -light variants:
just setup-solana-devnet-light-test-run
just start-training-devnet-light-client
Regenerating program keypairs
If you're developing things that change the structure of the program's accounts layout, deploying an update to the coordinator program will likely cause breakage with existing runs that have coordinator accounts already instantiated.
Any programs, including the Psyche website's indexer, will fail to read the content of the on-chain data if you use a new IDL with an old in-memory layout.
Therefore, changes to the data structures that end up on-chain will require a deployment of a new coordinator program under a new ProgramID to prevent breakage of existing runs.
In order to do this by yourself, you'll need to generate a new ProgramID (and keypair).
To deploy a program to devnet or localnet with a new program keypair, regenerate its devnet/localnet keypair file (checked into the repo!)
For the solana coordinator, that would be:
solana-keygen new -o architectures/decentralized/solana-coordinator/target/deploy/psyche_solana_coordinator-keypair.json -f
You can see the newly generated program ID by running
solana-keygen pubkey architectures/decentralized/solana-coordinator/target/deploy/psyche_solana_coordinator-keypair.json
Make sure to then update the declare_id's content with the new keys before deploying the new development contracts, either manually or with anchor keys sync in the appropriate project folder.
If you want to push these changes to the repo, you'll need to use git add -f, since they're normally .gitignored.
Running Psyche offchain
When developing for Psyche, you might not want to spin up all the Solana infrastructure if you're working on a feature like the distributed networking or the training code.
To that end, we maintain a "centralized" client & server package that simply communicate over TCP instead of dealing with code deployed to a Solana network.
There's a server
package, and a client
package.
To develop with them, you'd spin up one server with whatever run config you want, then connect one or more clients to it.
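As a minimal sketch (the state file path and run ID are placeholders; the full command-line options for both binaries are documented below):
psyche-centralized-server run --state ./config/state.toml --server-port 20000
psyche-centralized-client train --run-id <RUN_ID> --server-addr localhost:20000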
Local Testnet
The local testnet is a helper application designed to easily spin up a Server and multiple clients. It's useful for doing sample runs on your own hardware, and for development.
Pre-requisites
Since we want to run many clients plus the server, we'll need several terminal windows to monitor them. The tool uses tmux to create them.
If you're using the Nix devShell, tmux is already included.
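If you're not using the Nix devShell, install tmux with your system's package manager; on Debian/Ubuntu, for example:
sudo apt install tmux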
Running
A sample invocation that fires up 3 clients to train on a 20m model might look like this:
just local-testnet \
--num-clients 3 \
--config-path ./config/consilience-match-llama2-20m-fineweb-pretrain-dev/
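The random-kill flags documented below are handy for testing fault tolerance; for example, this variant (the values are illustrative) randomly kills one client every 60 seconds:
just local-testnet \
--num-clients 3 \
--config-path ./config/consilience-match-llama2-20m-fineweb-pretrain-dev/ \
--random-kill-num 1 \
--random-kill-interval 60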
There are a lot of options to configure the local testnet. Check 'em out below!
Command-line options
# Command-Line Help for `psyche-centralized-local-testnet`

This document contains the help content for the psyche-centralized-local-testnet command-line program.
Command Overview:
- psyche-centralized-local-testnet

psyche-centralized-local-testnet
Usage: psyche-centralized-local-testnet <COMMAND>
Subcommands:
- start — Starts the local-testnet running each part of the system in a separate terminal pane

psyche-centralized-local-testnet start
Starts the local-testnet running each part of the system in a separate terminal pane
Usage: psyche-centralized-local-testnet start [OPTIONS] --num-clients <NUM_CLIENTS> --config-path <CONFIG_PATH>
Options:
- --num-clients <NUM_CLIENTS> — Number of clients to start
- --config-path <CONFIG_PATH> — File path to the configuration that the coordinator will need to start
- --write-distro-data <WRITE_DISTRO_DATA> — If provided, write DisTrO data to disk in this path
- --server-port <SERVER_PORT> — Port that this testnet's server will listen on (this is the one that clients must use when connecting). Default value: 20000
- --tui <TUI> — Enables a terminal-based graphical interface for monitoring analytics. Default value: true. Possible values: true, false
- --random-kill-num <RANDOM_KILL_NUM> — Kill N clients randomly every <RANDOM_KILL_INTERVAL> seconds
- --allowed-to-kill <ALLOWED_TO_KILL> — Which clients we're allowed to kill randomly
- --random-kill-interval <RANDOM_KILL_INTERVAL> — Kill <RANDOM_KILL_NUM> clients randomly every N seconds. Default value: 120
- --log <LOG> — Sets the logging level for more granular information. Default value: warn,psyche=debug
- --first-client-checkpoint <FIRST_CLIENT_CHECKPOINT> — HF repo where the first client can get the model and the configuration to use
- --hf-token <HF_TOKEN>
- --write-log — Default value: false
- --wandb-project <WANDB_PROJECT>
- --wandb-group <WANDB_GROUP>
- --wandb-entity <WANDB_ENTITY>
- --optim-stats <OPTIM_STATS>
- --eval-tasks <EVAL_TASKS>
This document was generated automatically by clap-markdown.
Server & Client
Both of these applications can be spun up individually at your discretion instead of using the local testnet. We include all their command-line options for your reading pleasure:
Client
# Command-Line Help for `psyche-centralized-client`

This document contains the help content for the psyche-centralized-client command-line program.

Command Overview:
- psyche-centralized-client
- psyche-centralized-client show-identity
- psyche-centralized-client train
psyche-centralized-client
Usage: psyche-centralized-client <COMMAND>
Subcommands:
- show-identity — Displays the client's unique identifier, used to participate in training runs
- train — Allows the client to join a training run and contribute to the model's training process
psyche-centralized-client show-identity
Displays the client's unique identifier, used to participate in training runs
Usage: psyche-centralized-client show-identity [OPTIONS]
Options:
- --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key, or use the RAW_IDENTITY_SECRET_KEY environment variable
psyche-centralized-client train
Allows the client to join a training run and contribute to the model's training process
Usage: psyche-centralized-client train [OPTIONS] --run-id <RUN_ID> --server-addr <SERVER_ADDR>
Options:
- -i, --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key. If not provided, a random one will be generated
- --bind-p2p-port <BIND_P2P_PORT> — Sets the port for the client's P2P network participation. If not provided, a random port will be chosen
- --bind-p2p-interface <BIND_P2P_INTERFACE> — Sets the network interface for the client's P2P network participation. If not provided, will bind to all interfaces
- --logs <LOGS> — Sets the client's log interface. tui: enables a terminal-based graphical interface for monitoring analytics; console: standard logs; json: standard logs in JSON format. Default value: tui. Possible values: tui, console, json
- --run-id <RUN_ID> — A unique identifier for the training run. This ID allows the client to join a specific active run
- --data-parallelism <DATA_PARALLELISM> — Default value: 1
- --tensor-parallelism <TENSOR_PARALLELISM> — Default value: 1
- --micro-batch-size <MICRO_BATCH_SIZE> — Default value: 1
- --write-gradients-dir <WRITE_GRADIENTS_DIR> — If provided, every shared gradient this client sees will be written to this directory
- --eval-tasks <EVAL_TASKS>
- --eval-fewshot <EVAL_FEWSHOT> — Default value: 0
- --eval-seed <EVAL_SEED> — Default value: 42
- --eval-task-max-docs <EVAL_TASK_MAX_DOCS>
- --checkpoint-dir <CHECKPOINT_DIR> — If provided, the updated model parameters will be saved in this directory after each epoch
- --hub-repo <HUB_REPO> — Path to the Hugging Face repository containing model data and configuration
- --wandb-project <WANDB_PROJECT>
- --wandb-run <WANDB_RUN>
- --wandb-group <WANDB_GROUP>
- --wandb-entity <WANDB_ENTITY>
- --write-log <WRITE_LOG>
- --optim-stats-steps <OPTIM_STATS_STEPS>
- --grad-accum-in-fp32 — Default value: false
- --dummy-training-delay-secs <DUMMY_TRAINING_DELAY_SECS>
- --max-concurrent-parameter-requests <MAX_CONCURRENT_PARAMETER_REQUESTS> — Default value: 8
- --max-concurrent-downloads <MAX_CONCURRENT_DOWNLOADS> — Default value: 8
- --compression <COMPRESSION> — Default value: 2
- --server-addr <SERVER_ADDR>
This document was generated automatically by clap-markdown.
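As a quick worked example combining these commands (the run ID and server address are placeholders):
openssl rand 32 > secret.key
psyche-centralized-client show-identity --identity-secret-key-path ./secret.key
psyche-centralized-client train --identity-secret-key-path ./secret.key --run-id <RUN_ID> --server-addr <SERVER_ADDR>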
Server
# Command-Line Help for `psyche-centralized-server`

This document contains the help content for the psyche-centralized-server command-line program.

Command Overview:
- psyche-centralized-server
- psyche-centralized-server validate-config
- psyche-centralized-server run
psyche-centralized-server
Usage: psyche-centralized-server <COMMAND>
Subcommands:
- validate-config — Checks that the configuration declared in the state.toml file is valid
- run — Starts the server and launches the coordinator with the declared configuration
psyche-centralized-server validate-config
Checks that the configuration declared in the state.toml file is valid
Usage: psyche-centralized-server validate-config [OPTIONS] --state <STATE>
Options:
- --state <STATE> — Path to the state.toml file to validate
- --data-config <DATA_CONFIG> — Path to the data.toml file to validate. If not provided, it will not be checked
psyche-centralized-server run
Starts the server and launches the coordinator with the declared configuration
Usage: psyche-centralized-server run [OPTIONS] --state <STATE>
Options:
- --state <STATE> — Path to TOML of Coordinator state
- -s, --server-port <SERVER_PORT> — Port for the server, which clients will use to connect. If not specified, a random free port will be chosen
- --tui <TUI> — Default value: true. Possible values: true, false
- --data-config <DATA_CONFIG> — Path to TOML of data server config
- --save-state-dir <SAVE_STATE_DIR> — Path to save the server and coordinator state
- --init-warmup-time <INIT_WARMUP_TIME> — Sets the warmup time for the run. This overrides the warmup_time declared in the state file
- --withdraw-on-disconnect <WITHDRAW_ON_DISCONNECT> — Automatically withdraw clients that disconnect from the server. Default value: true. Possible values: true, false
This document was generated automatically by clap-markdown.
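For example, validating a state file and then launching the server with it (the paths are placeholders):
psyche-centralized-server validate-config --state ./config/state.toml --data-config ./config/data.toml
psyche-centralized-server run --state ./config/state.toml --server-port 20000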
Implementing models
This codebase includes a set of sample programs that let you design, implement, and test model architectures without spinning up the whole Psyche p2p training stack.
We currently only implement Llama and Deepseek (see shared/modeling/src/models/
), but PRs are very welcome to add more architectures and model types.
The train example, documented below, is useful for testing how your model trains using AdamW vs. DisTrO.
Running
cargo run --example train -- --help
You'll need a pre-tokenized dataset downloaded to your disk for training.
A PR is welcome to add an option to the trainer to use the HTTP data provider! You can refer to the http example in the data-provider crate for a sample implementation.
For a Llama 2 model, a pre-tokenized dataset to test with is available at https://huggingface.co/datasets/emozilla/fineweb-10bt-tokenized-datatrove-llama2/.
Psyche only needs the .ds files, and will load any/all .ds files in the specified folder - you can download just one for smaller tests.
If you've downloaded part or all of the above dataset into a folder data/fineweb-10bt
inside the Psyche repo, you can start a simple training run on a 20m parameter Llama 2 model:
cargo run --example train -- \
--model emozilla/llama2-20m-init \
--data-path ./data/fineweb-10bt/ \
--total-batch 2 \
--micro-batch 1
Adding a new model type
The train example currently assumes your model is a Llama or Deepseek v2/v3 model, and instantiates it via (LlamaForCausalLM|DeepseekForCausalLM)::from_pretrained.
We currently only support causal language models - to implement a new one, you can create a file similar to llama_for_causal_lm and implement your model, ensuring you provide a trait impl for CausalLM (roughly as sketched below).
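Purely as an illustrative sketch - the trait and method below are hypothetical stand-ins, not the crate's real API; check shared/modeling/src/models/ for the actual definitions:
// HYPOTHETICAL sketch: the real CausalLM trait lives in shared/modeling and differs.
use tch::Tensor;

// stand-in for the crate's real trait
pub trait CausalLM {
    // given a batch of token IDs, produce next-token logits
    fn forward(&mut self, input_ids: &Tensor) -> Tensor;
}

pub struct MyModelForCausalLM {
    // weights, config, KV caches, etc.
}

impl CausalLM for MyModelForCausalLM {
    fn forward(&mut self, input_ids: &Tensor) -> Tensor {
        let _ = input_ids;
        todo!("your architecture's forward pass")
    }
}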
You might also need to modify the data provider if your data is structured differently. Since you're implementing the forward pass yourself, you can serve and interpret data passed from the data provider however you need. The data provider currently only supports reading fixed-size batches from input files, so data batches with different sizes will require some additional work.
PRs welcome for any new kinds of dataset loading!
Secrets
We manage secrets in our repo using agenix
.
These secrets are keyed to specific developers via SSH public keys.
Some are used for deployments, and some can be used for development.
You can read more about agenix and how secrets are used in our deployment HERE.
What secrets do we store?
# this file contains secrets that we can store encrypted in this repo.
# they can be decrypted by the specified ssh public keys using `agenix`.
let
keys = import ./nix/keys.nix;
in
{
## Local Development
# a shared devnet wallet
"secrets/devnet/wallet.age".publicKeys = keys.allDevKeys;
# RPC url for devnet
"secrets/devnet/rpc.age".publicKeys = keys.allDevKeys;
# RPC url for mainnet
"secrets/mainnet/rpc.age".publicKeys = keys.allDevKeys;
## Deployments
# all RPC urls for our devnet indexer
"secrets/devnet/backend.age".publicKeys = keys.allKeys;
# all RPC urls for our mainnet indexer
"secrets/mainnet/backend.age".publicKeys = keys.allKeys;
}
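Adding a new secret means appending an entry to this file; for example (the filename here is hypothetical):
# a hypothetical new development secret
"secrets/devnet/my-new-secret.age".publicKeys = keys.allDevKeys;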
Editing a secret
You must have your pubkey listed in secrets.nix for a secret if you want to modify it! Ask someone whose key is already in secrets.nix to add you.
To edit the secret whatever.age, run:
agenix -e secrets/whatever.age
Building the Psyche Book
That's the document you're reading! :D
Development
Simply run just serve_book to serve the book over HTTP on localhost!
Building
nix build .#psyche-book
The book will be output to result/
, which you can preview easily with python -m http.server -d ./result/
CI
Overview
We use Garnix as our CI provider. It:
- Builds packages in our Nix flakes
- Runs all Nix checks including formatting, lints, & Rust tests.
Deployment Branches
Some branches are configured for automatic deployment. These branches serve as dedicated testing environments.
Development Environments
These environments are stateful and accessible via SSH for developer troubleshooting. Public keys are listed in this repo.
Source Branch | Purpose | Hostname
--- | --- | ---
test-deploy-devnet | Indexer/frontend for devnet | devnet-preview.psyche.network
test-deploy-mainnet | Indexer/frontend for mainnet | mainnet-preview.psyche.network
test-deploy-docs | Preview docs | docs.preview.psyche.network
Production Environment
main
automatically deploys the website/indexer to https://mainnet.psyche.network/ and the docs to https://docs.psyche.network/.
This is a stateful deploy, but with no SSH access for security reasons.
Contributing to Psyche
Found a bug?
- Make sure we're not already aware of it by checking GitHub Issues.
- If it seems your bug is new, open an issue. Describe the expected & actual behaviour in as much detail as possible, ensuring to include system information (CUDA? CPU?) and any relevant command-line params (data parallelism? tensor parallelism? compression ratio?).
Fixed a bug?
- Submit a GitHub PR with your bugfix.
- Make sure your PR clearly explains what was broken and how you fixed it. Reference any related issues.
- Before submitting, check out our guidelines to keep things consistent.
Want to add a cool feature or change something?
- First, share your idea on the Psyche forum and get some feedback.
- Feel free to start developing whenever you want, but we generally won't accept a PR unless there's been some discussion and feedback about whether your feature fits Psyche's goals.
Have questions about how things work?
- Post your questions on the Psyche forum - that's the best place to get answers!
Want to improve our docs?
- We'd love that. Feel free to open PRs!
Thank you for your contributions to Psyche :heart:
PR guidelines
We prefer PRs to be made and merged using rebase, not merge commits. It's not a deal-breaker, but rebase makes us happy <3
Clean Linear History
Rebasing creates a linear commit history without merges going back and forth, making it much easier to identify where a change was made. Fixups in merge commits that introduce bugs are no longer associated with the original code, whereas with rebase you'd find the bug as part of its original commit.
Merge commits add extra noise to the history without adding meaningful content about what changed.
Better Bisect Experience
A linear history makes git bisect
more effective for finding bugs, as each commit represents a coherent, working state of the codebase.
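A typical session (the known-good ref is a placeholder) looks like:
git bisect start
git bisect bad HEAD
git bisect good <KNOWN_GOOD_REF>
# git checks out midpoints; mark each one good or bad until the offending commit is isolated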
Preserving Meaningful Commits
While we advocate for rebase, we do not advocate for squashing all commits. Each commit should:
- Document a single logical step in your development process
- Be independently revertible if needed
- Separate concerns such as:
- Refactoring (changing structure but not behavior)
- Feature additions (changing behavior)
- Bug fixes
- Documentation updates
- Build & pass all checks if checked out individually.
What to Avoid
- Don't squash meaningful commits together - this buries important changes in large diffs and loses the step-by-step narrative
- Don't use merge commits within feature branches
- Don't include "fix up" or "oops" commits in your final PR - these are fine to have during development, but before opening your PR, use git commit --amend or interactive rebase to clean these up. A typical rebase workflow is explained in this blog post. git absorb is also very useful for small fixups.
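For instance, with git-absorb installed, staged fixups can be folded automatically into the commits that introduced them:
git add -u
git absorb --and-rebase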