Overview

Psyche is a system that empowers strangers to collaboratively train a machine learning model in a decentralized and trustless manner.

Read the Psyche announcement here.

The Psyche code is available on GitHub at PsycheFoundation/psyche.

The system is composed of three main actors:

  • Coordinator: Serves as a source of truth for global state available to all clients in a given training run. Each run has one coordinator that oversees the entire process. The coordinator is implemented as a program running on the Solana Blockchain.
  • Client: A user participating in a training run. Clients receive the model to be trained and a specific dataset for that run. They send information to the coordinator to progress the training run and use a peer-to-peer network to share their results at each training step with other clients.
  • Data Provider: An optional server that stores the data used for model training and serves it to clients. A run can use the data provider, an HTTP location for the data, or require clients to bring their own copy of the dataset.
flowchart TB
    subgraph run id: test_model_2
        direction TB
        subgraph Solana
            C(("Coordinator"))
        end
        C <--> C1(("Client")) & C2(("Client")) & C3(("Client"))
        C1 <-.-> C2
        C3 <-.-> C2 & C1
        DT["Data hosted on HTTP"] --> C1 & C2 & C3
    end
    subgraph run id: test_model_1
        direction TB
        subgraph Solana2["Solana"]
            CC(("Coordinator"))
        end
        CC <--> C11(("Client")) & C22(("Client")) & C33(("Client"))
        C11 <-.-> C22
        C33 <-.-> C22 & C11
        DTT["Data server"] --> C11 & C22 & C33
    end

What does the training process look like?

The training process for a given model is divided into small steps that incrementally train the model in a coordinated manner. A training run is divided into epochs, where clients can join and leave the run, and epochs are further divided into steps, where the model is incrementally trained.

During a training run, clients primarily perform three tasks:

  • Training: Train the model using an assigned subset of the data.
  • Witnessing: Verify the liveness and correctness of other participants.
  • Verifying: Recompute and compare results to identify and mitigate malicious participants.

Waiting for Clients & Warmup

At the start of an epoch, all clients have a window of time to join the run by requesting to be added by the coordinator and then connecting to the other participating clients.

Once a minimum threshold of clients has been met, the run transitions to the Warmup phase and begins a countdown that gives connected clients time to update their copy of the model; when the countdown ends, the run enters the Training phase.

Training

At the beginning of an epoch, after the Warmup phase ends, clients are assigned specific tasks that require them to train the model on a portion of the data.

The coordinator contains information that uniquely assigns pieces of training data to clients based on the current round.

If clients have already been training (i.e., it is not the first round of the epoch), they will apply the results from the previous round, then retrieve the data sample they need for the current round.

After completing the training on their assigned data, each client emits a p2p broadcast to all other clients containing their training results and a cryptographic commitment that binds them to those results.

As training results are received from other clients, they are downloaded and later incorporated into the current model.
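
The commitment can be thought of as a hash over the serialized results, which later lets any peer check that what a client reveals matches what it committed to. Below is a minimal sketch of this idea in Rust; the struct, field names, and exact hash construction are illustrative and not taken from the Psyche codebase.

use sha2::{Digest, Sha256};

/// Hypothetical training-result payload broadcast to peers.
struct TrainingResult {
    round: u32,
    payload: Vec<u8>, // serialized training results (e.g. compressed gradients)
}

/// Commitment that binds a client to its broadcast results:
/// here, a SHA-256 hash over the round number and the payload bytes.
fn commit(result: &TrainingResult) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(result.round.to_le_bytes());
    hasher.update(&result.payload);
    let digest = hasher.finalize();
    let mut out = [0u8; 32];
    out.copy_from_slice(&digest);
    out
}

/// Later, any peer can check a revealed result against the commitment.
fn verify(result: &TrainingResult, commitment: &[u8; 32]) -> bool {
    &commit(result) == commitment
}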

Witnessing

At the start of each round, one or more clients are randomly selected as witnesses; the number of witnesses is configurable. Witnesses train the model as usual, but also build bloom filters that track which nodes they have received training results from, signifying that those nodes are actively participating and providing valid results.

These bloom filters are sent to the coordinator, which then combines them into a provable consensus of which results to apply to the model.

Once a witness quorum is reached, the coordinator advances to the Witness phase, giving all clients a brief window to download every training result.

Once the Witness phase concludes, the coordinator returns to the Training phase. Clients are assigned new data, and the process repeats. After a predefined number of rounds, a Cooldown round occurs, marking the end of an epoch.

The witness/train loop visualized

Here's a high-level overview of the process. Additional details exist, but this captures the overall flow:

sequenceDiagram
    participant Client1
    participant Client2
    participant Coordinator
    participant DataServer
    Client1->>DataServer: get_data
    Client2->>DataServer: get_data
    Coordinator->>Client2: witness
    Note over Client1: Train
    Note over Client2: Train
    Client1->>Client2: Send results
    Client2->>Client1: Send results
    Note over Client1: Download results
    Note over Client2: Download results
    Client2->>Coordinator: Send witness
    Note over Coordinator: Quorum reached
    Note over Coordinator: Starting Witness phase

Psyche In Depth

This section provides a detailed explanation of the various components of Psyche, their behavior, and their implementation.

Coordinator

The Coordinator stores metadata about the run and a list of participants. It handles the transition between each Phase of a Round, and provides a random seed that's used to determine data assignments, witnesses, and more.

It's responsible for providing a point of synchronization for all clients within a run.

Ticks

When certain events occur or time-based conditions are met, the Coordinator can be "ticked" forwards to transition from one Phase to the next.

sequenceDiagram
    loop
        Note over Backend, Coordinator: Wait for a timeout or backend state
        Backend->>Coordinator: Tick
        Coordinator->>Backend: New state produced
        Backend->Client1: New coordinator state consumed by Client
        Backend->Client2: New coordinator state consumed by Client
    end
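
Conceptually, the Coordinator is a small state machine whose phase only changes when it is ticked. The sketch below captures that shape; the phase names follow this document, but the fields, constants, and method signatures are illustrative rather than the actual Psyche types.

/// Illustrative phases of a run, in the order this document describes them.
enum Phase {
    WaitingForMembers,
    Warmup { ends_at: u64 },
    RoundTrain { round: u32 },
    RoundWitness { round: u32 },
    Cooldown,
}

type ClientId = [u8; 32];

struct Coordinator {
    phase: Phase,
    min_clients: usize,
    clients: Vec<ClientId>,
    random_seed: u64,
}

const WARMUP_SECS: u64 = 60; // illustrative warmup length

impl Coordinator {
    /// Called by the backend when a timeout elapses or an event
    /// (client join, witness quorum, ...) arrives; may advance the phase.
    fn tick(&mut self, now: u64) {
        match self.phase {
            Phase::WaitingForMembers if self.clients.len() >= self.min_clients => {
                self.phase = Phase::Warmup { ends_at: now + WARMUP_SECS };
            }
            Phase::Warmup { ends_at } if now >= ends_at => {
                self.phase = Phase::RoundTrain { round: 0 };
            }
            _ => { /* remaining transitions elided in this sketch */ }
        }
    }
}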

Beginning an Epoch

The Coordinator begins in the WaitingForMembers phase, with no clients connected.

Whatever backend you're running the Coordinator in should accept pending clients to be added to upcoming Epochs.

When inside the WaitingForMembers phase, your backend will pass new clients to the Coordinator until a configured min_clients threshold is met, at which point the coordinator's tick will transition it to the Warmup phase.

sequenceDiagram
    Note over Coordinator: min_clients = 2
    Client1->>Coordinator: Join
    Client2->>Coordinator: Join
    Note over Coordinator: Entering Warmup
    Client1->>Client2: Connect
    Client2->>Client1: Connect
    Note over Coordinator: The Warmup countdown elapses
    Note over Coordinator: Entering Training

Warmup

This phase is designed to let all clients download the model & load it onto their GPUs.

If a client drops while waiting for the Warmup time to elapse, the Backend removes that client from the Coordinator's clients list. If the number of clients falls below min_clients, the Coordinator goes back to the WaitingForMembers phase.

Once the Warmup time passes, the Coordinator loads all the information for the next training round and changes its phase to RoundTrain. The Server then broadcasts this Training Coordinator state to all clients.

Training

In this phase, the Coordinator provides a random seed. Each client can use this seed, alongside the current round index and epoch index, to determine which indices of the training data to use.
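
Because every client derives its assignments from the same seed, round index, and epoch index, no extra coordination message is needed: each client can compute the same mapping locally. The following is a rough sketch of such a deterministic assignment, not the exact algorithm Psyche uses; it assumes the rand and rand_chacha crates.

use rand::seq::SliceRandom;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

/// Deterministically assign batch indices to clients for one round.
/// Every client runs this with the same inputs and gets the same answer.
fn assign_batches(
    seed: u64,
    epoch: u32,
    round: u32,
    num_batches: u64,
    num_clients: usize,
) -> Vec<Vec<u64>> {
    // Mix the epoch and round into the seed so each round gets a fresh shuffle.
    let mut rng = ChaCha8Rng::seed_from_u64(seed ^ ((epoch as u64) << 32) ^ round as u64);
    let mut batches: Vec<u64> = (0..num_batches).collect();
    batches.shuffle(&mut rng);

    // Deal the shuffled batches out round-robin to the clients.
    let mut assignments = vec![Vec::new(); num_clients];
    for (i, b) in batches.into_iter().enumerate() {
        assignments[i % num_clients].push(b);
    }
    assignments
}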

Witnessing

As clients complete their training, they send their results to all other clients, including the Witnesses. The witnesses will each send a witness proof to the Coordinator, building towards a witness quorum.

A witness proof contains a bloom filter describing which pieces of data the witness received training results for, and which clients did that work. Elected witnesses are responsible for creating these witness proofs and sending them to the Coordinator.

The witnesses for each round are chosen randomly from all the clients, using the same random seed as for data assignments. A witness will attempt to send an opportunistic witness message once it has received a training result for every single batch in the current round.
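
A bloom filter keeps the proof compact: the witness inserts one entry per (client, batch) result it has received, and membership can later be checked with no false negatives and a tunable false-positive rate. Here is a hand-rolled sketch of the idea; Psyche's actual filter layout, sizing, and hashing differ.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny bloom filter over (client, batch) pairs.
struct WitnessFilter {
    bits: Vec<u64>,
    num_hashes: u32,
}

impl WitnessFilter {
    fn new(num_bits: usize, num_hashes: u32) -> Self {
        Self { bits: vec![0; (num_bits + 63) / 64], num_hashes }
    }

    /// The bit positions a given (client, batch) pair maps to.
    fn bit_indices(&self, client: [u8; 32], batch: u64) -> Vec<usize> {
        let num_bits = self.bits.len() * 64;
        (0..self.num_hashes)
            .map(|i| {
                let mut h = DefaultHasher::new();
                (i, client, batch).hash(&mut h);
                (h.finish() as usize) % num_bits
            })
            .collect()
    }

    /// Record that we saw a training result for `batch` from `client`.
    fn insert(&mut self, client: [u8; 32], batch: u64) {
        for idx in self.bit_indices(client, batch) {
            self.bits[idx / 64] |= 1 << (idx % 64);
        }
    }

    /// May return a false positive, but never a false negative.
    fn contains(&self, client: [u8; 32], batch: u64) -> bool {
        self.bit_indices(client, batch)
            .iter()
            .all(|&idx| (self.bits[idx / 64] & (1 << (idx % 64))) != 0)
    }
}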

Witness Quorum

The Coordinator advances the run from the Training phase to the Witness phase in one of two ways:

  • If enough witnesses observe all results and reach a witness quorum for the round, they notify the Coordinator that it is safe to advance. This process, called opportunistic witnessing, accelerates the transition to the Witness phase rather than waiting a fixed time for training results.
  • If witnesses do not receive all required results from other clients before the maximum time specified for the Training phase, the Coordinator nonetheless transitions to the Witness phase once that maximum time elapses (see the sketch after this list).
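
In other words, the phase change is driven by whichever happens first: a quorum of witness proofs or the Training-phase deadline. A compact sketch of that decision, with illustrative names:

/// Decide whether the Coordinator should leave the Training phase.
/// `witnesses_reporting_complete` counts witnesses that have seen a result
/// for every batch in the round (the opportunistic path).
fn should_enter_witness_phase(
    witnesses_reporting_complete: usize,
    witness_quorum: usize,
    now: u64,
    training_deadline: u64,
) -> bool {
    let opportunistic = witnesses_reporting_complete >= witness_quorum;
    let timed_out = now >= training_deadline;
    opportunistic || timed_out
}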

Witness phase

This phase exists to give the witnesses an opportunity to send their proofs to the Coordinator in case they have not received enough training results from other clients to reach the quorum and send their proofs opportunistically.

There is also a brief slack period for non-witness nodes to catch up by downloading any remaining results they have not yet received.

When the Witness phase finishes via timeout, the Coordinator transitions from Witness to the Cooldown phase in three cases:

  • If we are in the last round of the epoch.
  • If the number of clients has dropped below the minimum required by the config.
  • If the number of witnesses for the round is less than the quorum specified by the config.

Any clients that have failed health checks will also be removed from the current epoch.
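
Putting those three conditions together, the end-of-Witness decision looks roughly like the following sketch; the real Coordinator also performs the health-check removals and other bookkeeping described above, and all names here are illustrative.

enum NextPhase {
    Cooldown,
    RoundTrain, // start the next round of the epoch
}

/// Decide what follows the Witness phase once its timeout elapses.
fn after_witness_phase(
    round: u32,
    rounds_per_epoch: u32,
    clients: usize,
    min_clients: usize,
    witnesses_this_round: usize,
    witness_quorum: usize,
) -> NextPhase {
    let last_round = round + 1 >= rounds_per_epoch;
    let too_few_clients = clients < min_clients;
    let too_few_witnesses = witnesses_this_round < witness_quorum;
    if last_round || too_few_clients || too_few_witnesses {
        NextPhase::Cooldown
    } else {
        NextPhase::RoundTrain
    }
}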

Cooldown

The Cooldown phase is the last phase of an epoch, during which the Coordinator waits for either the Cooldown period to elapse or a checkpoint to occur.

When the Cooldown phase begins, the Coordinator resets the current model checkpoint state to Checkpoint::P2P, signifying that new joiners should download the latest copy of the model from the other participants.

Upon exiting the Cooldown phase, the Coordinator transitions to the next epoch, saving the previous epoch state, and moving back to the WaitingForMembers phase.

It all comes together!

sequenceDiagram
    Backend->>Coordinator: tick
    Coordinator->>Backend: Change state to `RoundTrain`
    Backend->>Client1: New state
    Backend->>Client2: New state
    par Start training
        Client1->>Client1: Start training
        Client2->>Client2: Start training
    end
    Client1->>Committee: get_witness
    Client2->>Committee: get_witness
    Committee->>Client1: false
    Committee->>Client2: true
    Note over Client1: Train
    Note over Client2: Train
    Note over Client2: Fill bloom filters
    Client2->>Backend: try send opportunistic witness
    Backend->>Coordinator: Witness message
    Note over Coordinator: Enough witnesses for round
    Coordinator->>Coordinator: Update state to RoundWitness
    Note over Coordinator: Timeout round witness time
    alt step > total steps
        Coordinator->>Coordinator: Update state to WaitingForMembers
    else height == rounds per epoch 
        Coordinator->>Coordinator: Update state to Cooldown
    else
        Coordinator->>Coordinator: Update state to RoundTrain with step + 1
    end

Centralized Backend

In this Backend, the Coordinator is owned and ticked forwards by a Server that communicates with clients over TCP.

The Server's Coordinator is initially configured in main.rs. It's loaded using the configuration file state.toml.

flowchart LR
    S[Server] --run--> A[App]
    S --new--> C[Coordinator]
    C --"run_id, init warmup, min clients, model"--> A

The Server uses some parts of the Coordinator configuration, like the data server configuration, if enabled, to boot up all the functionality it needs.

When a new client joins the run, it has to communicate the run_id it wants to join, ensuring it connects to the correct run. After processing the join message, the Server adds the client to a pending clients list and runs the Coordinator's tick function to potentially add the client to the run.

When a tick condition is met, the Server ticks the Coordinator forwards, then broadcasts the Coordinator's new state to all connected clients.
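
The Server's loop is essentially: accept join messages, hold the clients as pending, tick the Coordinator when a condition is met, and broadcast the resulting state. The sketch below shows that shape in simplified form; the real Server is asynchronous, speaks TCP, and uses its own message and method names.

type ClientId = [u8; 32];

struct Server {
    run_id: String,
    pending_clients: Vec<ClientId>,
    // coordinator handle, client connections, ... elided
}

enum Message {
    Join { client: ClientId, run_id: String },
    // witness proofs, health checks, ... elided in this sketch
}

impl Server {
    /// Handle one message from a client, then try to tick the Coordinator.
    fn handle_message(&mut self, msg: Message) {
        match msg {
            Message::Join { client, run_id } => {
                // Only accept clients that name the run this Server hosts.
                if run_id == self.run_id {
                    self.pending_clients.push(client);
                }
            }
        }
        // If a tick condition is met (e.g. min_clients reached or a phase
        // timeout elapsed), tick the Coordinator and broadcast its new state:
        //     self.coordinator.tick(now);
        //     self.broadcast(self.coordinator.state());
    }
}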

Health checks

In the start function, the client spawns a new task to repeatedly send health checks to the server. Nodes, also known as trainers in this context, are assigned a score determined by the Coordinator using the trainer_healthy_score_by_witnesses method. This score increases as a client sends the required data to be added to the participants' bloom filters, allowing the Coordinator to confirm that the client is actively participating in the training.

A node also sends a list of other nodes it considers unhealthy to the server using the HealthCheck message. The Coordinator processes this information to determine whether those nodes are healthy. Nodes deemed inactive or non-participatory are marked for removal in the next round.
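
One way to picture the scoring is as a per-trainer count of how many witnesses included that trainer's results in their bloom filters, with trainers falling below a threshold flagged for removal. This is a rough sketch of that idea, not the actual trainer_healthy_score_by_witnesses implementation.

use std::collections::HashMap;

type ClientId = [u8; 32];

/// For each trainer, count how many witnesses saw its results this round.
/// A trainer seen by fewer witnesses than `min_score` is a removal candidate.
fn unhealthy_trainers(
    trainers: &[ClientId],
    witness_sightings: &HashMap<ClientId, usize>, // trainer -> witnesses that saw it
    min_score: usize,
) -> Vec<ClientId> {
    trainers
        .iter()
        .copied()
        .filter(|id| witness_sightings.get(id).copied().unwrap_or(0) < min_score)
        .collect()
}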

Decentralized Backend

TODO

Data Provider

The data provider is the interface implemented by the different structures responsible for parsing the training data and creating samples for clients to use when training a model. There are three ways to declare a data provider, depending on where the data is hosted and how clients request it (a sketch of the shared interface follows the list):

  • Local data provider: the client already has a copy of the data to use for training.
  • HTTP data provider: clients receive a list of URLs where the training data is hosted and fetch it over HTTP.
  • TCP data provider: a data server, separate from the client, that clients communicate with over TCP to get the training data.
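
Conceptually, all three variants expose the same narrow interface to the rest of the client: given a set of sample IDs, return the corresponding training data. The sketch below shows that shape; the trait and type names are illustrative, though a get_samples call is also referenced later in this section.

/// One tokenized training sample.
type Sample = Vec<i32>;

/// The common interface all three data-provider variants implement (sketch).
trait DataProvider {
    fn get_samples(&mut self, ids: &[u64]) -> Vec<Sample>;
}

/// Local: samples are already on disk next to the client.
struct LocalDataProvider { /* path to local token files, ... */ }

/// HTTP: samples are fetched from one or more URLs.
struct HttpDataProvider { /* urls, token size, shuffle seed, ... */ }

/// TCP: samples are requested from a separate data server.
struct TcpDataProvider { /* server address, ... */ }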

Overview

The data provider acts as a server that can be accessed via TCP by clients to obtain the data they need for training.

When a client starts a round of training, it receives an ID or a range of IDs from the coordinator, representing all the batches that will be used for that round. Each batch contains a specific range of the overall data. The client can then call the data provider with the assigned IDs for the run and fetch the corresponding data to begin training.

To better understand how the data is partitioned for each client, refer to the following diagram:

flowchart TD
    C((Coordinator))
    C1[Client]
    C2[Client]
    C --Batch IDs--> C1
    C --Batch IDs--> C2
    subgraph Data Provider
        B1["Batch
            1. Data
            2. Data
            3. Data
            "]
        B2["Batch
            1. Data
            2. Data
            3. Data
            "]
        B3["Batch
            1. Data
            2. Data
            3. Data
            "]
        B4["Batch
            1. Data
            2. Data
            3. Data
        "]
        B5["Batch
            1. Data
            2. Data
            3. Data
        "]
        B6["Batch
            1. Data
            2. Data
            3. Data
        "]
        B4 ~~~ B1
        B5 ~~~ B2
        B6 ~~~ B3
    end
    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C2
    B5 --> C2
    B6 --> C2

The number of batches used for training in a run, as well as the indexes of data that each batch contains, can be configured.

Deep Dive

For the coordinator's initial state, the state.toml file contains configuration details for the entire run. A key section to consider is [model.LLM.data_location], which specifies whether the data will be hosted on a TCP server, accessed via HTTP, or stored in a local folder.

When loading a model, the required configuration depends on the data provider implementation being used:

  1. TCP Server:

    • If the data provider is configured as a TCP server, an additional file named data.toml is required.
    • This file contains configurations for local training, including:
      • Data location
      • Token size
      • Sequence length
      • A seed to shuffle the data if necessary
    • An example data.toml file can be found in psyche/config within the various initial state examples.
  2. HTTP Provider:

    • For the HTTP data provider, no additional configuration file is needed.
    • The required fields for this setup include:
      • The URL (or a set of URLs) from which the data will be fetched
      • Token size (in bytes)
      • A shuffle seed, if data shuffling is desired.
  3. Client Hosting the Data:

    • In this case, the client must simply provide the URL where the data is hosted.

The init_run function initializes the data provider using the configuration and creates a DataFetcher, the structure responsible for managing the data fetching process. The data fetcher is part of the TrainingStepMetadata, which holds the internal data for the training step within the StepStateMachine, along with other metadata—one for each step.

Once the data provider is created and included in the state machine, it will be used at the start of the epoch and during every training step. The client monitors changes in the coordinator's state, and upon detecting a step transition, it calls the apply_state function for the RunManager. This, in turn, calls the apply_state function for the StepStateMachine. If the state indicates that a training round is starting, the start function for the TrainingStepMetadata is invoked.

The start function initiates the actual training process on the client side. Its first task is to fetch the data required for training using the assign_data_for_state function. This function determines the number of batches for the round and the indices of data within each batch. The client is then assigned an interval of batch IDs, called data_assignments, which it fetches from the data provider using the fetch_data function of the DataFetcher.

The fetch_data function parses the batch IDs using the data indices per batch to calculate the actual intervals of data to use. It creates a channel to send and receive batches. Once the data intervals are calculated, the client calls the get_samples function on the data provider to retrieve the raw data for those IDs. This process repeats in a loop until all batch IDs are requested and sent through the channel.

On the other end, the receiver is used in the train function. It continuously receives data from the channel and uses it for training until all data is consumed.
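
The channel decouples fetching from training: one side pulls raw samples for each assigned batch ID and pushes them into the channel, while the training loop consumes batches as they arrive. Below is a simplified, synchronous sketch of that pipeline using a standard-library channel; the real DataFetcher is asynchronous and its get_samples call talks to an actual data provider.

use std::sync::mpsc;
use std::thread;

type Sample = Vec<i32>;

/// Stand-in for the data provider's get_samples call described above.
fn get_samples(batch_id: u64, samples_per_batch: u64) -> Vec<Sample> {
    let start = batch_id * samples_per_batch;
    (start..start + samples_per_batch).map(|i| vec![i as i32]).collect()
}

fn main() {
    let assigned_batches: Vec<u64> = vec![3, 7, 11]; // this client's data_assignments
    let samples_per_batch = 4;

    let (tx, rx) = mpsc::channel();

    // Fetcher side: request each assigned batch and push it into the channel.
    let fetcher = thread::spawn(move || {
        for batch_id in assigned_batches {
            let batch = get_samples(batch_id, samples_per_batch);
            if tx.send(batch).is_err() {
                break; // training side hung up
            }
        }
    });

    // Training side: consume batches until the fetcher is done.
    for batch in rx {
        println!("training on a batch of {} samples", batch.len());
    }
    fetcher.join().unwrap();
}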

Model sharing

When an epoch starts, all clients must have an identical model to train with.

At the beginning of a run, all clients must download the model parameters, tokenizer configuration, and model configuration from HuggingFace, where the model must have been previously uploaded (TODO: add more details on uploading a model). Each client will then modify their copy of the model by receiving new training results from other clients and applying them. This keeps everyone's copy of model identical within an epoch without an additional full synchronization step.

When a new client joins a run that has already progressed past its first epoch, it would not be correct for the client to download the original model from HuggingFace, as the model parameters would have already been updated during training. Instead, the new client must acquire a copy of the model from the peers who have been actively training it.

This synchronization process occurs during the Warmup phase, while the coordinator waits to begin the next Training phase.

To address this, the model is checkpointed at the end of each epoch: clients save and share the entire model so that new peers can join. There are two checkpointing variants (a rough sketch of how the coordinator can represent them follows the P2P diagram below):

  1. HuggingFace checkpoint:
    In this approach, a client or a set of clients is designated as the checkpointers for the run. These clients upload their copy of the updated model to HuggingFace after each epoch and send the URL for this checkpoint to the coordinator. When a new client joins the run, it retrieves the checkpoint URL from the coordinator and connects to HuggingFace to download the latest copy of the model parameters and configuration files.

  2. P2P checkpoint:
    In the peer-to-peer (P2P) approach, a new client synchronizes by obtaining the latest model directly from other peers. It receives the model information and parameters from any available peer, requesting a set of parameters for each layer from different clients. This process allows the client to assemble the latest model state and participate in the training without an explicit upload to a central server occurring.

    Here's an example of a P2P model sharing interaction:

flowchart TB
        C((Coordinator))
        C1[Client]
        C2[Client]
        C3[Client]
        C4[Client]
        HF[/Hugging Face\]
        C --warmup---> C1
        C --warmup---> C2
        C --warmup---> C3
        HF --Get model config--> C4
        C4 -.Join.-> C
        C1 -.Layer 1 weights.-> C4
        C2 -.Layer 2 weights.-> C4
        C3 -.Layer 3 weights.-> C4

Psyche Tooling

This section aims to introduce developers and users of Psyche to the various tools available for testing the network and gaining a better understanding of how a real run operates, all within a local setup.

Psyche Centralized Client

The Psyche Centralized Client is responsible for joining and participating in a training run, contributing to the model's training process, and sharing results with other peers. It is a CLI application with various configurable options.

Command-Line Help for psyche-centralized-client

This document contains the help content for the psyche-centralized-client command-line program.

Command Overview:

psyche-centralized-client

Usage: psyche-centralized-client <COMMAND>

Subcommands:
  • show-identity — Displays the client's unique identifier, used to participate in training runs
  • train — Allows the client to join a training run and contribute to the model's training process

psyche-centralized-client show-identity

Displays the client's unique identifier, used to participate in training runs

Usage: psyche-centralized-client show-identity [OPTIONS]

Options:
  • --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key, or use the RAW_IDENTITY_SECRET_KEY environment variable

psyche-centralized-client train

Allows the client to join a training run and contribute to the model's training process

Usage: psyche-centralized-client train [OPTIONS] --run-id <RUN_ID> --server-addr <SERVER_ADDR>

Options:
  • -i, --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key. If not provided, a random one will be generated

  • -b, --bind-p2p-port <BIND_P2P_PORT> — Sets the port for the client's P2P network participation. If not provided, a random port will be chosen

  • --tui <TUI> — Enables a terminal-based graphical interface for monitoring analytics

    Default value: true

    Possible values: true, false

  • --run-id <RUN_ID> — A unique identifier for the training run. This ID allows the client to join a specific active run

  • --data-parallelism <DATA_PARALLELISM>

    Default value: 1

  • --tensor-parallelism <TENSOR_PARALLELISM>

    Default value: 1

  • --micro-batch-size <MICRO_BATCH_SIZE>

  • --write-gradients-dir <WRITE_GRADIENTS_DIR> — If provided, every shared gradient this client sees will be written to this directory

  • --eval-tasks <EVAL_TASKS>

  • --eval-fewshot <EVAL_FEWSHOT>

    Default value: 0

  • --eval-seed <EVAL_SEED>

    Default value: 42

  • --eval-task-max-docs <EVAL_TASK_MAX_DOCS>

  • --checkpoint-dir <CHECKPOINT_DIR> — If provided, the updated model parameters will be saved in this directory after each epoch

  • --hub-repo <HUB_REPO> — Path to the Hugging Face repository containing model data and configuration

  • --wandb-project <WANDB_PROJECT>

  • --wandb-run <WANDB_RUN>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --write-log <WRITE_LOG>

  • --optim-stats-steps <OPTIM_STATS_STEPS>

  • --grad-accum-in-fp32

    Default value: false

  • --dummy-training-delay-secs <DUMMY_TRAINING_DELAY_SECS>

  • --max-concurrent-parameter-requests <MAX_CONCURRENT_PARAMETER_REQUESTS>

    Default value: 10

  • --server-addr <SERVER_ADDR>


This document was generated automatically by clap-markdown.

Psyche Centralized Server

The Psyche Centralized Server is responsible for hosting the coordinator and a data provider locally to enable testing the network and training a test model. The server requires a configuration file named state.toml to load the initial settings for the coordinator.

Command-Line Help for psyche-centralized-server

This document contains the help content for the psyche-centralized-server command-line program.

Command Overview:

psyche-centralized-server

Usage: psyche-centralized-server <COMMAND>

Subcommands:
  • validate-config — Checks that the configuration declared in the state.toml file is valid
  • run — Starts the server and launches the coordinator with the declared configuration

psyche-centralized-server validate-config

Checks that the configuration declared in the state.toml file is valid

Usage: psyche-centralized-server validate-config [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to the state.toml file to validate
  • --data-config <DATA_CONFIG> — Path to the data.toml file to validate. If not provided, it will not be checked

psyche-centralized-server run

Starts the server and launches the coordinator with the declared configuration

Usage: psyche-centralized-server run [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to TOML of Coordinator state

  • -s, --server-port <SERVER_PORT> — Port for the server, which clients will use to connect. If not specified, a random free port will be chosen

  • --tui <TUI>

    Default value: true

    Possible values: true, false

  • --data-config <DATA_CONFIG> — Path to TOML of data server config

  • --save-state-dir <SAVE_STATE_DIR> — Path to save the server and coordinator state

  • --init-warmup-time <INIT_WARMUP_TIME> — Sets the warmup time for the run. This overrides the warmup_time declared in the state file

  • --init-min-clients <INIT_MIN_CLIENTS> — Sets the minimum number of clients required to start a run. This overrides the min_clients declared in the state file

  • --withdraw-on-disconnect <WITHDRAW_ON_DISCONNECT> — Allows clients to withdraw if they need to disconnect from the run (this option has no effect in the centralized version)

    Default value: true

    Possible values: true, false


This document was generated automatically by clap-markdown.

Running a Local Testnet

The local testnet is a helper application designed to easily spin up a coordinator and multiple clients. It's useful for doing sample runs on your own hardware, and for development.

Pre-requisites

Since we want to run many clients plus the coordinator, we'll need several terminal windows to monitor them. The tool uses tmux to create them.

If you're using the Nix flake, tmux is already included.

Command-Line Help for psyche-centralized-local-testnet

This document contains the help content for the psyche-centralized-local-testnet command-line program.

Command Overview:

psyche-centralized-local-testnet

Usage: psyche-centralized-local-testnet <COMMAND>

Subcommands:
  • start — Starts the local-testnet running each part of the system in a separate terminal pane

psyche-centralized-local-testnet start

Starts the local-testnet running each part of the system in a separate terminal pane

Usage: psyche-centralized-local-testnet start [OPTIONS] --num-clients <NUM_CLIENTS> --config-path <CONFIG_PATH>

Options:
  • --num-clients <NUM_CLIENTS> — Number of clients to start

  • --config-path <CONFIG_PATH> — File path to the configuration that the coordinator will need to start

  • --write-distro-data <WRITE_DISTRO_DATA> — If provided, write DisTrO data to disk in this path

  • --server-port <SERVER_PORT> — Port where the server for this testnet will listen (this is the port that clients must use when connecting)

    Default value: 20000

  • --tui <TUI> — Enables a terminal-based graphical interface for monitoring analytics

    Default value: true

    Possible values: true, false

  • --random-kill-num <RANDOM_KILL_NUM> — Kill N clients randomly every <RANDOM_KILL_INTERVAL> seconds

  • --allowed-to-kill <ALLOWED_TO_KILL> — Which clients we're allowed to kill randomly

  • --random-kill-interval <RANDOM_KILL_INTERVAL> — Kill <RANDOM_KILL_NUM> clients randomly every N seconds

    Default value: 120

  • --log <LOG> — Sets the level of the logging for more granular information

    Default value: info,psyche=debug

  • --first-client-checkpoint <FIRST_CLIENT_CHECKPOINT> — HF repo from which the first client can get the model and configuration to use

  • --hf-token <HF_TOKEN>

  • --write-log

    Default value: false

  • --wandb-project <WANDB_PROJECT>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --optim-stats <OPTIM_STATS>

  • --eval-tasks <EVAL_TASKS>


This document was generated automatically by clap-markdown.