Welcome to Psyche

Psyche is a set of systems that enables distributed training of transformer-based AI models over the internet. It aims to let untrusted parties collaborate to train state-of-the-art machine learning models, using a peer-to-peer network for communication and data sharing.

This documentation provides a comprehensive guide to understanding, using, and developing with Psyche, whether you're an end-user looking to participate in a training run, a developer interested in contributing to the project, or just curious about how it all works.

Introduction

How does it work?

At its core, Psyche is a protocol that coordinates multiple independent clients to train a single machine learning model together. Rather than running on a centralized server farm with high-speed interconnects between every accelerator (GPUs, usually), Psyche distributes the training workload across many independent computers, each contributing a small piece to the overall training process.

Psyche is built to maintain training integrity without requiring participants to trust each other. Through a combination of consensus mechanisms, game theory, and careful protocol design, Psyche ensures that the trained model remains coherent and consistent despite being trained across disparate machines.

Decentralized training flow

flowchart TD
 subgraph sg_solana["Solana"]
    direction TB
        CoordinatorState["Coordinator Program State <br> (Run State, Epoch,<br>Round, Clients)"]
  end
 subgraph sg_distro["DisTrO Optimizer"]
    direction TB
        MomentumUpdate["Update Local Momentum <br> m<sub>t</sub> = βm<sub>t-1</sub> + g<sub>t</sub>"]
        DCTExtract["Extract Fast Components <br> (q<sub>t</sub>) (DCT + TopK)"]
        CompressedUpdate["Compressed Local q<sub>t</sub> <br> (Indices + Amplitudes)"]
        MomentumResidual["Update Local<br>Momentum Residual<br> m<sub>t+1</sub> = m<sub>t</sub> - q<sub>t</sub>"]
  end
 subgraph sg_loop["Local Training"]
    direction TB
        LocalWeights["Model Weights (x<sub>t</sub>)"]
        ApplyAggregatedUpdate["Apply Aggregated Update <br> x<sub>t</sub> = x<sub>t-1</sub> - η Q<sub>t-1</sub>"]
        ReceiveDecode["Receive &amp;<br>Decode/Aggregate <br> Compressed q<sub>t-1</sub><br> from Peers"]
        ForwardBackward["Forward/Backward Pass <br> (Use x<sub>t</sub>, <br>Compute Gradient g<sub>t</sub>)"]
        FetchData["Fetch Assigned Data <br> (Batch<sub>t</sub>)"]
        Gradient["Local Gradient (g<sub>t</sub>)"]
        sg_distro
        P2PNetworkInterface["P2P Network Interface"]
  end
 subgraph sg_client["Client"]
    direction TB
        ClientSM["Client State Machine <br> (Warmup, Train,<br>Witness, Cooldown)"]
        sg_loop
  end
 subgraph sg_p2p["P2P Gossip & Blob Transfer"]
    direction TB
        ClientNode2("Client Node 2")
        ClientNode3("Client Node 3")
        ClientNodeN("Client Node N")
  end
    DataProvider["Data Provider <br> (Local File/HTTP/etc.)"]
    ClientSM -- Manages --> sg_loop
    ClientSM -- Receives State Updates --- CoordinatorState
    ApplyAggregatedUpdate --> LocalWeights
    ReceiveDecode -- "Aggregated Q<sub>t-1</sub>" --> ApplyAggregatedUpdate
    LocalWeights -- Used By --> ForwardBackward
    FetchData -- Provides Data --> ForwardBackward
    ForwardBackward -- Produces Gradient --> Gradient
    Gradient -- Updates --> MomentumUpdate
    MomentumUpdate --> DCTExtract
    DCTExtract -- Produces --> CompressedUpdate
    DCTExtract -- Updates --> MomentumResidual
    CompressedUpdate -- Broadcasts Local Compressed Update --> P2PNetworkInterface
    P2PNetworkInterface -- Receives Compressed Updates --> ReceiveDecode
    DataProvider -- Provides Data --> FetchData
    P2PNetworkInterface <-- Send/Receive Updates -------> sg_p2p
    ClientNode2 <-- Transfer Data Off-chain --> ClientNode3 & ClientNodeN
    ClientNode3 <-- Transfer Data Off-chain --> ClientNodeN
    CoordinatorState -- Assigns Data/Committee --> ClientSM
    ClientSM -- "Submits Transactions (e.g., Join, Tick, Witness)" --> CoordinatorState

Psyche In Depth

The core system is composed of three main actors:

  • Coordinator: Serves as a source of truth for global state available to all clients in a given training run. Each run has one coordinator that oversees the entire process. The coordinator is implemented as both a program running on the Solana Blockchain and as a regular TCP server.

  • Client: A user participating in a training run. Clients receive the model to be trained and a specific dataset for that run. They send information to the coordinator to progress the training run and use a peer-to-peer network to share their results at each training step with other clients.

  • Data Provider: Each run requires training data. This data could be served by the Psyche Data Provider server, over HTTP, or loaded from local copies of a dataset.

Sample topologies

---
title: Decentralized Run, training data provided over HTTP.
---
flowchart TB
subgraph "Solana Blockchain"
    C(["Coordinator State"])
end
C <-- Solana RPC & TXs --> C1(("Client")) & C2(("Client")) & C3(("Client"))
C1 <-. p2p gossip .-> C2
C3 <-. p2p gossip .-> C2 & C1
DT["`
Hosted training data
and model snapshots
`"] --> C1 & C2 & C3
---
title: Centralized Run, training data provided through a TCP data server
---
flowchart TB
subgraph "Coordinator Server"
    CC(["Coordinator State"])
end
CC <-- TCP --> C11(("Client")) & C22(("Client")) & C33(("Client"))
C11 <-. p2p gossip .-> C22
C33 <-. p2p gossip .-> C22 & C11
DTT["`
Hosted training data
and model snapshots
`"] --> C11 & C22 & C33

What constitutes a training run?

The training process for a given model is divided into small steps that incrementally train the model in a coordinated manner. A training run is divided into epochs, where clients can join and leave the run, and epochs are further divided into steps, where the model is incrementally trained.

During a training run, clients primarily perform three tasks:

  • Training: Train the model using an assigned subset of the data.
  • Witnessing: Verify the liveness and correctness of other participants.
  • Verifying: Recompute and compare training results to identify and punish malicious participants.

Waiting for Clients & Warmup

At the start of an epoch, all clients have a window of time to join the run by requesting to be added by the coordinator and then connecting to the other participating clients.

Once a minimum threshold of clients has been met, the run will transition to the Warmup phase and begin a countdown to allow connected clients to update their copy of the model, after which it will enter the Training phase.

To obtain a copy of the model, the Coordinator will either direct clients to a checkpoint uploaded somewhere like HuggingFace, or direct them to download the model from other clients via the p2p network.

Training

At the beginning of an epoch, after the Warmup phase ends, clients are assigned specific tasks that require them to train the model on a portion of the data.

The coordinator contains information that uniquely assigns pieces of training data to clients based on the current round.

If clients have already been training (i.e., it is not the first round of the epoch), they will apply the results from the previous round, then retrieve the data sample they need for the current round.

After completing the training on their assigned data, each client emits a p2p broadcast to all other clients containing their training results and a cryptographic commitment that binds them to those results.

As training results are received from other clients, they are downloaded and later incorporated into the current model.

Witnessing

At the start of each round, one or more clients are randomly selected as witnesses. The number of witnesses can be configured. Witnesses train the model as usual, but also build bloom filters that track which nodes they have received training results from, signifying that they are actively participating and providing valid results.

These bloom filters are sent to the coordinator, which then combines them into a provable consensus of which results to apply to the model.
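
Conceptually, a witness's bookkeeping during a round looks something like the sketch below. The bloom filter and proof types here are illustrative stand-ins, not the ones in the Psyche codebase.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny bloom filter, standing in for the real filters used in witness proofs.
struct BloomFilter {
    bits: Vec<bool>,
}

impl BloomFilter {
    fn new(size: usize) -> Self {
        Self { bits: vec![false; size] }
    }

    fn index<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() as usize) % self.bits.len()
    }

    /// Set the bits for `item` under a few different hash seeds.
    fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..3u64 {
            let i = self.index(item, seed);
            self.bits[i] = true;
        }
    }
}

/// Hypothetical witness proof: which (client, batch) results this witness saw.
struct WitnessProof {
    seen_results: BloomFilter,
}

/// Called whenever a training result arrives from a peer during the round.
fn on_result_received(proof: &mut WitnessProof, client_id: u64, batch_id: u64) {
    proof.seen_results.insert(&(client_id, batch_id));
}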

Once a witness quorum is reached, the coordinator advances to the Witness phase, giving all clients a brief window to download every training result.

Once the Witness phase concludes, the coordinator returns to the Training phase. Clients are assigned new data, and the process repeats. After a predefined number of rounds, a Cooldown round occurs, marking the end of an epoch.

The witness/train loop visualized

Here's a high-level overview of the process.

Additional details exist, but this captures the overall flow of a single Round from an Epoch:

sequenceDiagram
    participant Client1
    participant Client2
    participant Coordinator
    participant Data Hosting
    Client1 ->> Data Hosting: get_data
    Client2 ->> Data Hosting: get_data
    Coordinator ->> Client2: witness
    Note over Client1: Train
    Note over Client2: Train
    Client1 ->> Client2: Send results
    Client2 ->> Client1: Send results
    Note over Client1: Download results
    Note over Client2: Download results
    Client2 ->> Coordinator: Send witness
    Note over Coordinator: Quorum reached
    Note over Coordinator: Starting Witness phase

Coordinator

The Coordinator stores metadata about the training run's state and a list of participants.

It handles the transition between each Phase of a Round, and provides a random seed that's used to determine data assignments, witnesses, and more.

It's responsible for providing a point of synchronization for all clients within a run.

Ticks (State Transitions)

The coordinator behaves like a state machine, moving from one state to another, with each state transition having specific requirements.

When certain events occur or time-based conditions are met, the Coordinator can be "ticked" forwards to transition from one Phase to another.

sequenceDiagram
    loop
        Note over Backend, Coordinator: Wait for a timeout or backend state
        Backend->>Coordinator: Tick
        Coordinator->>Backend: New state produced
        Backend->Client1: New coordinator state consumed by Client
        Backend->Client2: New coordinator state consumed by Client
    end
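
Put another way, the Coordinator is roughly an enum of phases plus a tick function. The sketch below is schematic, with simplified names and transition conditions, and is not the actual program's state machine.

/// Schematic view of the Coordinator's phases and tick-driven transitions.
#[derive(Clone, Copy)]
enum Phase {
    WaitingForMembers,
    Warmup,
    RoundTrain,
    RoundWitness,
    Cooldown,
}

struct Coordinator {
    phase: Phase,
    clients: usize,
    min_clients: usize,
    witness_quorum: usize,
    witnesses_received: usize,
}

impl Coordinator {
    /// Called by the backend when a timeout elapses or a relevant event occurs.
    fn tick(&mut self, timed_out: bool) {
        self.phase = match self.phase {
            Phase::WaitingForMembers if self.clients >= self.min_clients => Phase::Warmup,
            Phase::Warmup if timed_out => Phase::RoundTrain,
            Phase::RoundTrain
                if timed_out || self.witnesses_received >= self.witness_quorum =>
            {
                Phase::RoundWitness
            }
            Phase::RoundWitness if timed_out => Phase::Cooldown, // or the next RoundTrain
            Phase::Cooldown if timed_out => Phase::WaitingForMembers,
            _ => return, // no transition on this tick
        };
    }
}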

Beginning an Epoch (state: WaitingForMembers)

The Coordinator begins in the WaitingForMembers phase, with no clients connected.

Whatever backend you're running the Coordinator in should accept pending clients to be added to upcoming Epochs.

When inside the WaitingForMembers phase, your backend will pass new clients to the Coordinator until a configured min_clients threshold is met, at which point the coordinator's tick will transition it to the Warmup phase.

sequenceDiagram
    Note over Coordinator: min_clients = 2
    Client1->>Coordinator: Join
    Client2->>Coordinator: Join
    Note over Coordinator: Entering Warmup
    Client1->>Client2: Connect
    Client2->>Client1: Connect
    Note over Coordinator: The Warmup countdown elapses
    Note over Coordinator: Entering Training

Model Loading (state: Warmup)

This phase is designed to let all clients download the model & load it onto their GPUs.

If a client drops while waiting for the Warmup time to elapse, the Backend removes that client from the Coordinator's clients list.

If the number of clients falls below min_clients, the Coordinator goes back to the WaitingForMembers phase.

Once the Warmup time passes, the Coordinator loads all the information for the next training round and changes its phase to RoundTrain. The Server then broadcasts this new Coordinator state to all clients.

Training (state: RoundTrain)

In this phase, the Coordinator provides a random seed.

Each client can use this seed, alongside the current round index and epoch index, to determine which indices of the training data to use.

Each client then proceeds to run the training on the selected training data.
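
A minimal sketch of how such a seed-derived assignment could work is shown below. The hashing scheme and function names are purely illustrative, not the Coordinator's actual algorithm.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically assign every batch in a round to exactly one client.
/// Every client evaluates the same function with the same inputs, so all
/// participants agree on the assignment without extra communication.
/// (A real scheme would also balance the load evenly across clients.)
fn batches_for_client(
    seed: u64,
    epoch: u64,
    round: u64,
    num_clients: u64,
    batches_per_round: u64,
    my_client_index: u64,
) -> Vec<u64> {
    (0..batches_per_round)
        .filter(|batch| {
            let mut hasher = DefaultHasher::new();
            (seed, epoch, round, *batch).hash(&mut hasher);
            hasher.finish() % num_clients == my_client_index
        })
        .collect()
}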

This state ends when clients later exchange Witness messages.

Witnessing training results

As clients complete their training, they send their results to all other clients, including the Witnesses. The witnesses will each send a witness proof to the Coordinator, building towards a witness quorum.

A witness proof contains a bloom filter describing which pieces of data the witness received training results for, and which clients did that work. Elected witnesses are responsible for creating these witness proofs and sending them to the Coordinator.

The witnesses for each round are chosen randomly from all the clients, using the same random seed as for data assignments. A witness will attempt to send an opportunistic witness message once it has received a training result for every single batch in the current round.

Witness Quorum

The Coordinator advances the run from the Training phase to the Witness phase in one of two ways:

  • If enough witnesses observe all results and reach a witness quorum for the round, they notify the Coordinator that it is safe to advance. This process, named opportunistic witnessing, accelerates the transition to the Witness phase rather than waiting a fixed time for training results.
  • If witnesses do not receive all required results from other clients before the maximum time specified for the Training phase, the Coordinator will nonetheless transition to the Witness phase after the maximum Training time elapses.

Witness phase (state: RoundWitness)

This phase exists to give the witnesses an opportunity to send their proofs to the Coordinator in case they have not received enough training results from other clients to reach the quorum and send their proofs opportunistically.

There is also a brief slack period for non-witness nodes to catch up by downloading any remaining results they might not have received.

When the Witness phase finishes via timeout, the Coordinator transitions from Witness to the Cooldown phase in any of these three cases:

  • If we are in the last round of the epoch.
  • If the clients have dropped to less than the minimum required by the config.
  • If the number of witnesses for the round is less than the quorum specified by the config.

Any clients that have failed health checks will also be removed from the current epoch.

Cooldown phase (state: Cooldown)

The Cooldown phase is the last phase of an epoch, during which the Coordinator waits for either the Cooldown period to elapse or a checkpoint to have happened.

When the Cooldown phase begins, the Coordinator resets the current model checkpoint state to Checkpoint::P2P, signifying that new joiners should download the latest copy of the model from the other participants.

Upon exiting the Cooldown phase, the Coordinator transitions to the next epoch, saving the previous epoch state, and moving back to the WaitingForMembers phase.

It all comes together

Here is an overview of the whole process from a high level perspective:

sequenceDiagram
    Backend->>Coordinator: tick
    Coordinator->>Backend: Change state to `RoundTrain`
    Backend->>Client1: New state
    Backend->>Client2: New state
    par Start training
        Client1->>Client1: Start training
        Client2->>Client2: Start training
    end
    Client1->>Committee: get_witness
    Client2->>Committee: get_witness
    Committee->>Client1: false
    Committee->>Client2: true
    Note over Client1: Train
    Note over Client2: Train
    Note over Client2: Fill bloom filters
    Client2->>Backend: try send opportunistic witness
    Backend->>Coordinator: Witness message
    Note over Coordinator: Enough witnesses for round
    Coordinator->>Coordinator: Update state to RoundWitness
    Note over Coordinator: Timeout round witness time
    alt step > total steps
        Coordinator->>Coordinator: Update state to WaitingForMembers
    else height == rounds per epoch
        Coordinator->>Coordinator: Update state to Cooldown
    else
        Coordinator->>Coordinator: Update state to RoundTrain with step + 1
    end

Health checks

Each client should repeatedly send health checks to the coordinator. Clients are assigned a score determined by the Coordinator using the trainer_healthy_score_by_witnesses method. This score increases as a client sends the required data to be added to the participants' bloom filters, allowing the Coordinator to confirm that the client is actively participating in the training.

A client also sends a list of other clients it considers unhealthy to the server using the HealthCheck message. The Coordinator processes this information to determine whether those clients are healthy. Clients deemed inactive or non-participatory are marked for removal in the next round.
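
A simplified sketch of how a backend might act on these scores is shown below. The message shape, threshold, and helper function are hypothetical; the real logic lives around trainer_healthy_score_by_witnesses and its callers.

/// Hypothetical health-check message: clients this node considers unhealthy.
struct HealthCheck {
    from: u64,
    suspected_unhealthy: Vec<u64>,
}

/// Decide which clients to drop next round based on witness-derived scores,
/// e.g. values produced by trainer_healthy_score_by_witnesses.
fn clients_to_remove(scores: &[(u64, u32)], min_score: u32) -> Vec<u64> {
    scores
        .iter()
        .filter(|(_, score)| *score < min_score)
        .map(|(client, _)| *client)
        .collect()
}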

Centralized Backend

In this Backend, the Coordinator is owned and ticked forwards by a Server that communicates with clients over TCP.

The Server's Coordinator is initially configured in main.rs. It's loaded using the configuration file state.toml.

flowchart LR
    S[Server] --run--> A[App]
    S --new--> C[Coordinator]
    C -- "run_id, init warmup, min clients, model" --> A

The Server uses some parts of the Coordinator configuration, like the data server configuration, if enabled, to boot up all the functionality it needs.

When a new client joins the run, it has to communicate the run_id it wants to join, to ensure the client is joining the correct run. After processing the join message, the Server adds the client to a pending clients list and runs the Coordinator's tick function to potentially add the client into the run.

When a tick condition is met, the Server ticks the Coordinator forwards, then broadcasts the Coordinator's new state to all connected clients.
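
Schematically, the Server's main loop boils down to the sketch below. All of the types and method names are illustrative, not the ones in the centralized server crate.

/// Illustrative state & server types; the real ones live in the centralized backend.
#[derive(Clone)]
struct CoordinatorState {
    step: u64,
}

struct Server {
    state: CoordinatorState,
    pending_clients: Vec<u64>,
    clients: Vec<u64>,
}

impl Server {
    /// Move clients who sent a valid Join message (with the right run_id) into the run.
    fn accept_pending_clients(&mut self) {
        self.clients.append(&mut self.pending_clients);
    }

    /// Tick the Coordinator forwards and return its new state.
    fn coordinator_tick(&mut self) -> CoordinatorState {
        self.state.step += 1; // stand-in for the real state machine
        self.state.clone()
    }

    /// Send the new Coordinator state to every connected client over TCP.
    fn broadcast(&self, _new_state: &CoordinatorState) {
        // left out: serialize and write to each client's TCP stream
    }
}

/// The Server's main loop: accept joins, tick on events/timeouts, broadcast.
fn server_loop(mut server: Server) {
    loop {
        server.accept_pending_clients();
        let new_state = server.coordinator_tick();
        server.broadcast(&new_state);
    }
}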

Decentralized Backend

In this Backend, the Coordinator is an account associated with a Solana Program, and ticked forwards by a tick method that can be called by anyone.

A training run can be created by calling the init_coordinator method in the Coordinator program, and subsequently information about the model to be trained can be set by calling the update method.

For a new client to join the run, it must call the join_run method in the Coordinator program and pass the run_id for the run it intends to join. After the Solana Program processes the join message, the client is added to a pending clients list, and the Program runs the Coordinator's tick function to potentially add the client into the run.

When a tick condition is met, anybody using Solana can tick the Coordinator forwards by calling the tick method (clients in a Run will do this automatically). This new state is then read via an RPC subscription on each Client, progressing through the regular state machine.

flowchart LR
    T["Psyche Team"] -- deploy Solana Program --> P["Solana Program"]
    R["Run Creator"] -- init_coordinator with run_id --> A["Account for this training run"]
    R["Run Creator"] -- update with run info --> A
    C[Client] -- "join_run" --> A
    C --tick--> A
    G["A random Solana user"] -- tick --> A

Data Provider

When you're training an AI model, you need data to train on! Psyche supports multiple kinds of data providers that will fetch & provide the data your model needs to train.

  • Local data provider: Each client already has the training data downloaded locally.
  • HTTP data provider: Each client will request individual pieces of data from a webserver as they are assigned that data for training.
  • TCP data provider: Each client will reach out to a dedicated server over TCP & request samples of data.
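
All three can be thought of as implementations of one small interface: given a batch ID assigned by the Coordinator, return that batch's samples. The sketch below is illustrative only, not the actual trait in the data-provider crate.

/// Illustrative data-provider interface: fetch the tokenized samples for an
/// assigned batch ID. (The real trait and types differ.)
trait DataProvider {
    fn fetch_batch(&mut self, batch_id: u64) -> std::io::Result<Vec<Vec<u32>>>;
}

/// Local provider: batches are read from pre-tokenized files on disk.
struct LocalProvider {
    data_dir: std::path::PathBuf,
}

impl DataProvider for LocalProvider {
    fn fetch_batch(&mut self, _batch_id: u64) -> std::io::Result<Vec<Vec<u32>>> {
        // left out: locate the batch inside the files under `self.data_dir`
        Ok(Vec::new())
    }
}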

Overview

When a client starts a round of training, it is assigned an ID or a range of IDs by the coordinator, representing all the "batches" of data that will be used for that round. Each batch contains a specific subsection of the overall training data.

The size of a batch is always the same and can be configured in the run config. The assignment order is deterministic and distributed across clients, so no piece of data will be trained on more than once.

To understand how the data is partitioned for each client, refer to the following diagram:

flowchart TD
    C((Coordinator))
    C1[Client A]
    C2[Client B]
    C -- Assigned Batch IDs 1, 2, 3 --> C1
    C -- Assigned Batch IDs 4, 5, 6 --> C2
    subgraph Data Provider
        B1["Batch 1"]
        B2["Batch 2"]
        B3["Batch 3"]
        B4["Batch 4"]
        B5["Batch 5"]
        B6["Batch 6"]
        B4 ~~~ B1
        B5 ~~~ B2
        B6 ~~~ B3
    end
    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C2
    B5 --> C2
    B6 --> C2

Provider configuration

Inside the run config, the key [model.LLM.data_location] specifies whether the data will be hosted on a TCP server, accessed via HTTP, or stored in a local folder. We also support loading data from GCP as a subsection of the HTTP data provider.

The required configuration depends on the data provider implementation being used:

  1. TCP Server:

    • If the data provider is configured as a TCP server, an additional file named data.toml is required.
    • This file contains the configuration required for the TCP server, including:
      • Data location
      • Token size
      • Sequence length
      • A seed to shuffle the data if necessary
    • Example data.toml files can be found in psyche/config within the various initial state examples.
  2. HTTP Provider:

    • For the HTTP data provider, no additional configuration file is needed.
    • The required fields for this setup include:
      • The URL (or a set of URLs) from which the data will be fetched - or, if you're loading data from GCP, a GCP bucket and an optional subdirectory.
      • Token size (in bytes)
      • A shuffle seed, if data shuffling is desired.
  3. Local Provider:

    • Simply point to the folder where the data should be loaded from.

Model sharing

When an epoch starts, all clients must have an identical model to train with.

At the beginning of a run, all clients must download the model parameters, tokenizer configuration, and model configuration from HuggingFace, where the model must have been previously uploaded.

(TODO: add more details on uploading a model).

Each client will then modify their copy of the model by receiving new training results from other clients and applying them. This keeps everyone's copy of the model identical within an epoch without an additional full synchronization step.

When a new client joins a run that has already progressed past its first epoch, it would not be correct for the client to download the original model from HuggingFace, as the model parameters would have already been updated during training. Instead, the new client must acquire a copy of the model from the peers who have been actively training it.

This synchronization process occurs during the Warmup phase, while the coordinator waits to begin the next Training phase.

To address this, we checkpoint the model at the end of an epoch, where clients save and share the entire model for new peers to join. There are two checkpointing variants: HuggingFace based and P2P based.

HuggingFace checkpoint

In this approach, a client or a set of clients are designated as the checkpointers for the run. These clients upload their copy of the updated model to HuggingFace after each epoch and send the URL for this checkpoint to the coordinator. When a new client joins the run, it retrieves the checkpoint URL from the coordinator and connects to HuggingFace to download the latest copy of the model parameters and configuration files.

P2P checkpoint

In the peer-to-peer (P2P) approach, a new client synchronizes by obtaining the latest model directly from other peers. It receives the model information and parameters from any available peer, requesting a set of parameters for each layer from different clients. This process allows the client to assemble the latest model state and participate in the training without an explicit upload step to a central server occurring.

Here's an example of a P2P model sharing interaction:

flowchart TB
   C((Coordinator))
   C1[Client 1]
   C2[Client 2]
   C3[Client 3]
   C4[Joining Client]
   C --warmup---> C1
   C --warmup---> C2
   C --warmup---> C3
   C2 --Model config--> C4
   C4 -.Join.-> C
   C1 -.Layer 1 weights.-> C4
   C2 -.Layer 2 weights.-> C4
   C3 -.Layer 3 weights.-> C4
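
In code, the joining client's side of that exchange might look roughly like the sketch below; the peer and parameter types are hypothetical.

use std::collections::HashMap;

/// Hypothetical view of a peer that can serve model parameters.
trait Peer {
    fn get_model_config(&self) -> String;
    fn get_parameters(&self, layer: usize) -> Vec<f32>;
}

/// Assemble the latest model by requesting each layer from a different peer.
/// Assumes at least one peer is available.
fn sync_model_from_peers(peers: &[Box<dyn Peer>], num_layers: usize) -> HashMap<usize, Vec<f32>> {
    let mut layers = HashMap::new();
    for layer in 0..num_layers {
        // Spread requests across peers so no single node serves the whole model.
        let peer = &peers[layer % peers.len()];
        layers.insert(layer, peer.get_parameters(layer));
    }
    layers
}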

End-user configuration

To train your own models on Psyche, you should familiarize yourself with the following topics:

Creating a run

To create a new training run and make it available for nodes to join, you'll need to create it, configure it, and unpause it.

First, create the run on-chain. You'll need to provide:

  • the RPC & websocket RPC urls so the client can communicate with an RPC node.
  • a unique run ID - just a few characters to uniquely identify your run.
  • a name & description for your run
psyche-solana-client create-run \
    --rpc [RPC] \
    --ws-rpc [WS_RPC] \
    --run-id [RUN_ID] \
    --name [NAME] \
    --description [DESCRIPTION]

Then, set the run's config. You'll need to provide:

  • the RPC & websocket RPC urls so the client can communicate with an RPC node.
  • the run ID you previously used
  • the path to a config.toml file, following the run config schema
psyche-solana-client update-config \
    --rpc [RPC] \
    --ws-rpc [WS_RPC] \
    --run-id [RUN_ID] \
    --config-path [CONFIG_FILE]

At this point, your run is ready to go! You can now set its state to "unpaused", and let clients join & begin training your model.

psyche-solana-client set-paused \
    --rpc [RPC] \
    --ws-rpc [WS_RPC] \
    --run-id [RUN_ID] \
    resume

Congratulations! As soon as your first client joins, your model is being trained.

Run configuration

A training run on Psyche is described using a Run Configuration file. It's a .toml file with information about the model shape, size, checkpoints, optimizer settings, run witnessing settings, and more.

There are two top-level values in a run configuration: a config, and a model.

While some examples are described below, you can find the full range of options for the coordinator here and for the model here

Config

Here's a sample config with some of its options documented.

[config]
# maximum time, in seconds, to let nodes download the model from a checkpoint / other nodes
warmup_time = 30

# time, in seconds, to let nodes bring the model from the GPU to disk, and to opt to join the next round.
cooldown_time = 30

# how many training rounds in one "epoch", from warmup to cooldown.
rounds_per_epoch = 20

# maximum time, in seconds, to allow nodes to train in one round.
# this will limit the types of GPUs your model can be trained on,
# since setting it low will prevent slower hardware from completing
# training in time.
max_round_train_time = 30

# time, in seconds, to allow witnesses to publish their messages before next round
round_witness_time = 1

# number of clients that need to be active for an epoch to continue on.
# if the number of clients goes below this number, we initiate a Cooldown and then back to WaitingForClients.
# this should be adjusted alongside max_round_train_time, because one client will train a lot slower
# than 100.
min_clients = 1

# minimum number of clients required before we transition from WaitingForClients to Warmup.
# must be equal to or greater than min_clients
init_min_clients = 1

# what percent of nodes are dedicated to verifying correctness. always set to 0 for now.
verification_percent = 0

# how many nodes are selected each round to publish witness proofs
witness_nodes = 1

# the total number of training data batches per-step. this also determines your maximum number of clients.
# the batch size will linearly increase from global_batch_size_start to global_batch_size_end over
# global_batch_size_warmup_tokens tokens
global_batch_size_start = 8
global_batch_size_end = 8
global_batch_size_warmup_tokens = 0

# the total number of training steps to partake in. this is used for the LR schedule in the model section too.
total_steps = 25000

Model

# so far only LLMs are supported.
[model.LLM]
architecture = "HfLlama"
data_type = "Pretraining"
max_seq_len = 2048

[model.LLM.checkpoint.Hub]
repo_id = "emozilla/llama2-20m-init"

[model.LLM.data_location.Http]
token_size_in_bytes = "TwoBytes"
shuffle = "DontShuffle"

[model.LLM.data_location.Http.location.Gcp]
bucket_name = "nous-pretraining-public-us"
filter_directory = "fineweb-edu-tokenized-llama2"

[model.LLM.lr_schedule.Cosine]
base_lr = 4.0e-4
warmup_steps = 250
warmup_init_lr = 0.0
total_steps = 25000
final_lr = 4.0e-5

# only the DisTrO optimizer is supported when training models on Psyche.
[model.LLM.optimizer.Distro]
clip_grad_norm = 1.0
compression_decay = 0.999
compression_chunk = 64
compression_topk = 8
quantize_1bit = true

Authentication and Keys

When clients participate in a decentralized training run, a set of Solana keypairs is used to authenticate each type of user.

Users Roles

A different set of keys will be used for each role within the Training flow.

The following roles will be important:

  • The Run's main_authority is the private-key that creates and owns the run; it is the only key that is allowed to modify the run's configuration.

  • The Run's join_authority is the private-key that is responsible for allowing or disallowing clients' keys to join a training run. It is set by the main_authority during the creation of the Run.

  • A client's authorizer (or grantee) key is the "master" private-key of a compute provider. That key may be allowed to join a run and to set delegate keys that can also join the run on its behalf.

  • A Client's delegate key is a temporary and ephemeral key that can be allowed to join a run's training on behalf of a user.

A Training run can be configured to be restricted to only a set of whitelisted keys; this kind of run is considered "Permissioned", as opposed to a "Permissionless" run, which is open to anyone without any authorization required.

Permissioned Runs

In order to be able to join a permissioned run, a user (with a key) must first be allowed to join it.

This is done through the following steps:

  1. The join_authority (the grantor) issues an authorization to an authorizer (the grantee)
  2. The authorizer (the grantee) sets a list of delegate keys that can join the run on its behalf
  3. The delegate key then can join a run

Keys Authorizations

Make sure to install the scripting dependencies:

sudo apt-get install jq
cargo install solana_toolbox_cli

For the join_authority (the grantor) to issue a new authorization, a script is provided:

# We assume that "grantor.json" contains the Private Key of the "join_authority"
# The "grantor.json" can be created using: $ solana-keygen new -o grantee.json
# We assume that $GRANTEE_PUBKEY is set to the public key of the "authorizer" (or grantee)
# The $GRANTEE_PUBKEY can be retrieved by using: $ solana-keygen pubkey grantee.json
sh scripts/join-authorization-create.sh devnet grantor.json $GRANTEE_PUBKEY

For the authorizer (the grantee) to set a list of delegates, the following script is provided:

# We assume that $GRANTOR_PUBKEY is set to the public key of the "join_authority" of the run
# The $GRANTOR_PUBKEY can be retrieved by using: $ solana-keygen pubkey grantor.json
# We assume that "grantee.json" contains the Private Key of the "authorizer"
# The "grantee.json" can be created using: $ solana-keygen new -o grantee.json
# We assume that a set of keypairs exist at path: delegate1.json, delegate2.json, etc
sh scripts/join-authorization-set-delegates.sh devnet $GRANTOR_PUBKEY grantee.json delegate*.json

Further information

The source code for the authorizer smart contract used by Psyche's coordinator can be found, along with its readme, at: https://github.com/NousResearch/psyche/tree/main/architectures/decentralized/solana-authorizer

Psyche Development

As the Psyche project is large & complex, we'll walk you through some of the processes we use in development.

Setup & Useful Commands

Installation and Setup

Any Linux, via Nix

Psyche can use nix + flakes as a build system, to make your life easier. To install nix, simply run curl --proto '=https' --tlsv1.2 -sSf -L https://install.determinate.systems/nix | sh -s -- install or find it at your local package manager.

You can optionally use direnv to automatically enter a Nix environment when you cd into the Psyche folder. Either option will install every single dependency and development tool Psyche needs to run and be developed.

Using direnv

Install direnv from your system's package manager. After running direnv allow in the Psyche directory once, your terminal will automatically enter a development shell when you subsequently cd into the Psyche directory.

Without direnv

Enter the Psyche directory, then run nix develop to enter a development shell.

Ubuntu

The following instructions are needed for a server with a fresh Ubuntu installation

1. Install drivers

sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install

2. Install CUDA libraries

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
rm cuda-keyring_1.1-1_all.deb
sudo apt-get install libnccl-dev libnccl2
sudo apt install nvidia-cuda-toolkit

3. Download libtorch & extract

wget https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu124.zip
unzip libtorch-cxx11-abi-shared-with-deps-2.6.0+cu124.zip
rm libtorch-cxx11-abi-shared-with-deps-2.6.0+cu124.zip

4. Libtorch environment variables

In the .bashrc file, set the following libtorch environment variables. Here <path_to_libtorch> is the absolute path to the libtorch folder extracted in the previous step.

export LIBTORCH=<path_to_libtorch>
export LIBTORCH_INCLUDE=<path_to_libtorch>
export LIBTORCH_LIB=<path_to_libtorch>
export LD_LIBRARY_PATH=<path_to_libtorch>/lib:$LD_LIBRARY_PATH
export CUDA_ROOT=/usr/local/cuda-12.4

This can also be achieved by making a .cargo/config.toml file in the checkout path

[env]
LIBTORCH=<path_to_libtorch>
LD_LIBRARY_PATH=<path_to_libtorch>/lib
CUDA_ROOT = "/usr/local/cuda-12.4"

5. Download & install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

6. (optional) Install just

sudo snap install just --edge --classic

7. (optional) Install Solana and Anchor

Install Solana

sh -c "$(curl -sSfL https://release.anza.xyz/beta/install)"

After installation, follow the instructions to add the Solana tools to PATH.

Install Anchor

cargo install --git https://github.com/coral-xyz/anchor --rev a7a23eea308440a9fa9cb79cee7bddd30ab163d5 anchor-cli

This may require

sudo apt install pkg-config libudev-dev libssl-dev

Windows

  1. Install CUDA libraries: https://developer.nvidia.com/cuda-12-4-1-download-archive?target_os=Windows&target_arch=x86_64&target_version=11

  2. Download libtorch & extract: https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu124.zip

  3. Download OpenSSL: https://slproweb.com/download/Win64OpenSSL-3_3_2.exe

  4. Install Perl: https://github.com/StrawberryPerl/Perl-Dist-Strawberry/releases/download/SP_53822_64bit/strawberry-perl-5.38.2.2-64bit.msi

  5. Create a .cargo/config.toml file to set environment variables

NOTE: Building may take several minutes the first time as openssl-sys takes a long time (for some reason)

[env]
LIBTORCH = <path_to_libtorch>
OPENSSL_LIB_DIR = <path_to_openssl>/lib/VC/x64/MT
OPENSSL_INCLUDE_DIR = <path_to_openssl>/include

MacOS / aarch64

These platforms aren't supported right now :( PRs welcome!

Docker

Create a Docker image with the necessary dependencies to run a Psyche client:

  1. Install the necessary NVIDIA and CUDA drivers as explained in the previous sections.
  2. Install the NVIDIA container toolkit. If using Ubuntu, just run:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
  3. Create an .env file following the .env.example in psyche/config/client and update the necessary environment variables.
  4. Run docker compose build.

Useful commands

Psyche uses just to run some common tasks.

You can run just to see the whole list of commands!

Running checks

requires Nix!

just check

If it passes, CI will pass.

Formatting

just fmt

Running Psyche on-chain

To build the Solana programs, you'll need a handful of Solana tools installed. See the setup if you're not using Nix.

To start, you'll need to create a Solana wallet to fund your transactions.

solana-keygen new

Run on a local validator (localnet)

In a new terminal, run a validator with:

solana-test-validator

Deploy all the required programs and create a local run using:

just setup-solana-localnet-test-run

And run a client to train the test model using:

just start-training-client

This will start a run to train a 1.1b parameter model with all the parallelism features enabled. For a more lightweight run to avoid OOM errors, or just to use your hardware less (we see you, 8GB VRAM cards!), there's also:

just setup-solana-localnet-light-test-run
just start-training-light-client

Run on Solana's Devnet

You'll need to fund your wallet to make transactions on Devnet. You can request an airdrop from the Solana foundation of up to 10 devnet sol every 8 hours. Simply run

solana-keygen pubkey

and paste the resulting key into the airdrop website.

You can then use the same steps for deploying the programs, creating a run, and training as on localnet above, but using the following just commands:

just setup-solana-devnet-test-run
just start-training-devnet-client

alongside the -light variants

just setup-solana-devnet-light-test-run
just start-training-devnet-light-client

Regenerating program keypairs

If you're developing things that change the structure of the program's accounts layout, deploying an update to the coordinator program will likely cause breakage with existing runs that have coordinator accounts already instantiated.

Any programs, including the Psyche website's indexer, will fail to read the content of the on-chain data if you use a new IDL with an old in-memory layout.

Therefore, changes to the data structures that end up on-chain will require a deployment of a new coordinator program under a new ProgramID to prevent breakage of existing runs.

In order to do this by yourself, you'll need to generate a new ProgramID (and keypair).

To deploy a program to devnet or localnet with a new program keypair, regenerate its devnet/localnet keypair file (checked into the repo!)

For the Solana coordinator, that would be:

solana-keygen new -o architectures/decentralized/solana-coordinator/target/deploy/psyche_solana_coordinator-keypair.json -f

You can see the newly generated program ID by running

solana-keygen pubkey architectures/decentralized/solana-coordinator/target/deploy/psyche_solana_coordinator-keypair.json

Make sure to then update the declare_id's content with the new keys before deploying the new development contracts, either manually or with anchor keys sync in the appropriate project folder.

If you want to push these changes to the repo, you'll need to use git add -f, since they're normally .gitignored.

Running Psyche offchain

When developing for Psyche, you might not want to spin up all the Solana infrastructure if you're working on a feature like the distributed networking or the training code.

To that end, we maintain "centralized" client & server packages that simply communicate over TCP instead of dealing with code deployed to a Solana network.

There's a server package and a client package. To develop with them, you'd spin up one server with whatever run config you want, then connect one or more clients to it.

Local Testnet

The local testnet is a helper application designed to easily spin up a Server and multiple clients. It's useful for doing sample runs on your own hardware, and for development.

Pre-requisites

Since we want to run many clients and the server, we'll need several terminal windows to monitor them. The tool uses tmux to create them.

If you're using the Nix devShell, tmux is already included.

Running

A sample invocation that fires up 3 clients to train on a 20m model might look like this:

just local-testnet \
    --num-clients 3 \
    --config-path ./config/consilience-match-llama2-20m-fineweb-pretrain-dev/

There's a lot of options to configure the local testnet. Check em out below!

Command-line options

Command-Line Help for `psyche-centralized-local-testnet`

This document contains the help content for the psyche-centralized-local-testnet command-line program.

Command Overview:

psyche-centralized-local-testnet

Usage: psyche-centralized-local-testnet <COMMAND>

Subcommands:
  • start — Starts the local-testnet running each part of the system in a separate terminal pane

psyche-centralized-local-testnet start

Starts the local-testnet running each part of the system in a separate terminal pane

Usage: psyche-centralized-local-testnet start [OPTIONS] --num-clients <NUM_CLIENTS> --config-path <CONFIG_PATH>

Options:
  • --num-clients <NUM_CLIENTS> — Number of clients to start

  • --config-path <CONFIG_PATH> — File path to the configuration that the coordinator will need to start

  • --write-distro-data <WRITE_DISTRO_DATA> — If provided, write DisTrO data to disk in this path

  • --server-port <SERVER_PORT> — Port the server for this testnet will listen on (this is the port clients must use when connecting)

    Default value: 20000

  • --tui <TUI> — Enables a terminal-based graphical interface for monitoring analytics

    Default value: true

    Possible values: true, false

  • --random-kill-num <RANDOM_KILL_NUM> — Kill N clients randomly every <RANDOM_KILL_INTERVAL> seconds

  • --allowed-to-kill <ALLOWED_TO_KILL> — Which clients we're allowed to kill randomly

  • --random-kill-interval <RANDOM_KILL_INTERVAL> — Kill <RANDOM_KILL_NUM> clients randomly every N seconds

    Default value: 120

  • --log <LOG> — Sets the level of the logging for more granular information

    Default value: warn,psyche=debug

  • --first-client-checkpoint <FIRST_CLIENT_CHECKPOINT> — HF repo where the first client could get the model and the configuration to use

  • --hf-token <HF_TOKEN>

  • --write-log

    Default value: false

  • --wandb-project <WANDB_PROJECT>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --optim-stats <OPTIM_STATS>

  • --eval-tasks <EVAL_TASKS>


This document was generated automatically by clap-markdown.

Server & Client

Both of these applications can be spun up individually at your discretion instead of using the local testnet. We include all their command-line options for your reading pleasure:

Client

Command-Line Help for `psyche-centralized-client`

This document contains the help content for the psyche-centralized-client command-line program.

Command Overview:

psyche-centralized-client

Usage: psyche-centralized-client <COMMAND>

Subcommands:
  • show-identity — Displays the client's unique identifier, used to participate in training runs
  • train — Allows the client to join a training run and contribute to the model's training process

psyche-centralized-client show-identity

Displays the client's unique identifier, used to participate in training runs

Usage: psyche-centralized-client show-identity [OPTIONS]

Options:
  • --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key, or use the RAW_IDENTITY_SECRET_KEY environment variable

psyche-centralized-client train

Allows the client to join a training run and contribute to the model's training process

Usage: psyche-centralized-client train [OPTIONS] --run-id <RUN_ID> --server-addr <SERVER_ADDR>

Options:
  • -i, --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key. If not provided, a random one will be generated

  • --bind-p2p-port <BIND_P2P_PORT> — Sets the port for the client's P2P network participation. If not provided, a random port will be chosen

  • --bind-p2p-interface <BIND_P2P_INTERFACE> — Sets the network interface for the client's P2P network participation. If not provided, will bind to all interfaces

  • --logs <LOGS> — Sets the client's logging interface. tui: enables a terminal-based graphical interface for monitoring analytics. console: standard logs. json: standard logs in JSON format

    Default value: tui

    Possible values: tui, console, json

  • --run-id <RUN_ID> — A unique identifier for the training run. This ID allows the client to join a specific active run

  • --data-parallelism <DATA_PARALLELISM>

    Default value: 1

  • --tensor-parallelism <TENSOR_PARALLELISM>

    Default value: 1

  • --micro-batch-size <MICRO_BATCH_SIZE>

    Default value: 1

  • --write-gradients-dir <WRITE_GRADIENTS_DIR> — If provided, every shared gradient this client sees will be written to this directory

  • --eval-tasks <EVAL_TASKS>

  • --eval-fewshot <EVAL_FEWSHOT>

    Default value: 0

  • --eval-seed <EVAL_SEED>

    Default value: 42

  • --eval-task-max-docs <EVAL_TASK_MAX_DOCS>

  • --checkpoint-dir <CHECKPOINT_DIR> — If provided, every model parameter update will be saved in this directory after each epoch

  • --hub-repo <HUB_REPO> — Path to the Hugging Face repository containing model data and configuration

  • --wandb-project <WANDB_PROJECT>

  • --wandb-run <WANDB_RUN>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --write-log <WRITE_LOG>

  • --optim-stats-steps <OPTIM_STATS_STEPS>

  • --grad-accum-in-fp32

    Default value: false

  • --dummy-training-delay-secs <DUMMY_TRAINING_DELAY_SECS>

  • --max-concurrent-parameter-requests <MAX_CONCURRENT_PARAMETER_REQUESTS>

    Default value: 8

  • --max-concurrent-downloads <MAX_CONCURRENT_DOWNLOADS>

    Default value: 8

  • --compression <COMPRESSION>

    Default value: 2

  • --server-addr <SERVER_ADDR>


This document was generated automatically by clap-markdown.

Server

Command-Line Help for `psyche-centralized-server`

This document contains the help content for the psyche-centralized-server command-line program.

Command Overview:

psyche-centralized-server

Usage: psyche-centralized-server <COMMAND>

Subcommands:
  • validate-config — Checks that the configuration declared in the state.toml file is valid
  • run — Starts the server and launches the coordinator with the declared configuration

psyche-centralized-server validate-config

Checks that the configuration declared in the state.toml file is valid

Usage: psyche-centralized-server validate-config [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to the state.toml file to validate
  • --data-config <DATA_CONFIG> — Path to the data.toml file to validate. If not provided, it will not be checked

psyche-centralized-server run

Starts the server and launches the coordinator with the declared configuration

Usage: psyche-centralized-server run [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to TOML of Coordinator state

  • -s, --server-port <SERVER_PORT> — Port for the server, which clients will use to connect. If not specified, a random free port will be chosen

  • --tui <TUI>

    Default value: true

    Possible values: true, false

  • --data-config <DATA_CONFIG> — Path to TOML of data server config

  • --save-state-dir <SAVE_STATE_DIR> — Path to save the server and coordinator state

  • --init-warmup-time <INIT_WARMUP_TIME> — Sets the warmup time for the run. This overrides the warmup_time declared in the state file

  • --withdraw-on-disconnect <WITHDRAW_ON_DISCONNECT> — Automatically withdraw clients that disconnect from the server

    Default value: true

    Possible values: true, false


This document was generated automatically by clap-markdown.

Implementing models

This codebase includes a set of sample programs that let you design, implement, and test model architectures without spinning up the whole Psyche p2p training architecture.

We currently only implement Llama and Deepseek (see shared/modeling/src/models/), but PRs are very welcome to add more architectures and model types.

The train example, documented below, is useful to test how your model trains using AdamW vs DisTrO.

Running

cargo run --example train -- --help

You'll need a pre-tokenized dataset downloaded to your disk for training.

A PR is welcome to add an option to the trainer to use the HTTP data provider! You can refer to the http example in the data-provider crate for a sample implementation.

For a Llama 2 model, a pre-tokenized dataset to test with is available at https://huggingface.co/datasets/emozilla/fineweb-10bt-tokenized-datatrove-llama2/. Psyche only needs the .ds files, and will load any/all .ds files in the specified folder - you can download just one for smaller tests.

If you've downloaded part or all of the above dataset into a folder data/fineweb-10bt inside the Psyche repo, you can start a simple training run on a 20m parameter Llama 2 model:

cargo run --example train -- \
    --model emozilla/llama2-20m-init \
    --data-path ./data/fineweb-10bt/ \
    --total-batch 2 \
    --micro-batch 1

Adding a new model type

The train example currently assumes your model is a Llama or Deepseek v2/v3 model, and instantiates it via (LlamaForCausalLM|DeepseekForCausalLM)::from_pretrained.

We currently only support causal language models - to implement a new one, you can create a file similar to llama_for_causal_lm and implement your model, ensuring you provide a trait impl for CausalLM.
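
As a rough illustration of the shape such an implementation takes (everything below is schematic; the real CausalLM trait in shared/modeling has its own methods and types):

/// Schematic only: a causal LM takes batches of token IDs and produces
/// next-token logits, plus a loss when labels are provided during training.
struct Logits(Vec<Vec<Vec<f32>>>); // [batch, seq, vocab]

trait SchematicCausalLM {
    fn forward(&mut self, token_ids: &[Vec<u32>], labels: Option<&[Vec<u32>]>) -> (Logits, Option<f32>);
}

struct MyNewModel {
    vocab_size: usize,
}

impl SchematicCausalLM for MyNewModel {
    fn forward(&mut self, token_ids: &[Vec<u32>], labels: Option<&[Vec<u32>]>) -> (Logits, Option<f32>) {
        // left out: embeddings, attention blocks, LM head...
        let logits = Logits(
            token_ids
                .iter()
                .map(|seq| seq.iter().map(|_| vec![0.0; self.vocab_size]).collect())
                .collect(),
        );
        let loss = labels.map(|_| 0.0);
        (logits, loss)
    }
}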

You might also need to modify the data provider if your data is structured differently. Since you're implementing the forward pass yourself, you can serve and interpret data passed from the data provider however you need. The data provider currently only supports reading fixed-size batches from input files, so data batches with different sizes will require some additional work.

PRs welcome for any new kinds of dataset loading!

Secrets

We manage secrets in our repo using agenix. These secrets are keyed to specific developers via SSH public keys. Some are used for deployments, and some can be used for development.

You can read more about agenix and how secrets are used in our deployment HERE

What secrets do we store?

# this file contains secrets that we can store encrypted in this repo.
# they can be decrypted by the specified ssh public keys using `agenix`.
let
  keys = import ./nix/keys.nix;
in
{
  ## Local Development
  # a shared devnet wallet
  "secrets/devnet/wallet.age".publicKeys = keys.allDevKeys;

  # RPC url for devnet
  "secrets/devnet/rpc.age".publicKeys = keys.allDevKeys;

  # RPC url for mainnet
  "secrets/mainnet/rpc.age".publicKeys = keys.allDevKeys;

  ## Deployments

  # all RPC urls for our devnet indexer
  "secrets/devnet/backend.age".publicKeys = keys.allKeys;

  # all RPC urls for our mainnet indexer
  "secrets/mainnet/backend.age".publicKeys = keys.allKeys;
}

Editing a secret

Your pubkey must be listed in secrets.nix for a secret if you want to modify it!

Ask someone whose key is in secrets.nix to add you.

To edit the secret whatever.age, run

agenix -e secrets/whatever.age

Building the Psyche Book

That's the document you're reading! :D

Development

Simply run just serve_book to serve the book over http on localhost!

Building

nix build .#psyche-book

The book will be output to result/, which you can preview easily with python -m http.server -d ./result/

CI

Overview

We use Garnix as our CI provider. It:

  • Builds packages in our Nix flakes
  • Runs all Nix checks including formatting, lints, & Rust tests.

Deployment Branches

Some branches are configured for automatic deployment. These branches serve as dedicated testing environments.

Development Environments

These environments are stateful and accessible via SSH for developer troubleshooting. Public keys are listed in this repo.

Source Branch       | Purpose                      | Hostname
test-deploy-devnet  | Indexer/frontend for devnet  | devnet-preview.psyche.network
test-deploy-mainnet | Indexer/frontend for mainnet | mainnet-preview.psyche.network
test-deploy-docs    | Preview docs                 | docs.preview.psyche.network

Production Environment

main automatically deploys the website/indexer to https://mainnet.psyche.network/ and the docs to https://docs.psyche.network/.

This is a stateful deploy, but with no SSH access for security reasons.

Contributing to Psyche

Found a bug?

  • Make sure we're not already aware of it by checking GitHub Issues.

  • If it seems your bug is new, open an issue. Describe the expected & actual behaviour in as much detail as possible, making sure to include system information (CUDA? CPU?) and any relevant command-line params (data parallelism? tensor parallelism? compression ratio?).

Fixed a bug?

  • Submit a GitHub PR with your bugfix.

  • Make sure your PR clearly explains what was broken and how you fixed it. Reference any related issues.

  • Before submitting, check out our guidelines to keep things consistent.

Want to add a cool feature or change something?

  • First, share your idea on the Psyche forum and get some feedback.
  • Feel free to start developing whenever you want, but we generally won't accept a PR unless there's been some discussion and feedback about whether your feature fits Psyche's goals.

Have questions about how things work?

  • Post your questions on the Psyche forum - that's the best place to get answers!

Want to improve our docs?

  • We'd love that. Feel free to open PRs!

Thank you for your contributions to Psyche :heart:

PR guidelines

We prefer PRs to be made and merged using rebase, not merge commits. It's not a deal-breaker, but rebase makes us happy <3

Clean Linear History

Rebasing creates a linear commit history without merges going back and forth, making it much easier to identify where a change was made. Fixups in merge commits that introduce bugs are no longer associated with the original code, whereas with rebase you'd find the bug as part of its original commit.

Merge commits add extra noise to the history without adding meaningful content about what changed.

Better Bisect Experience

A linear history makes git bisect more effective for finding bugs, as each commit represents a coherent, working state of the codebase.

Preserving Meaningful Commits

While we advocate for rebase, we do not advocate for squashing all commits. Each commit should:

  1. Document a single logical step in your development process
  2. Be independently revertible if needed
  3. Separate concerns such as:
    • Refactoring (changing structure but not behavior)
    • Feature additions (changing behavior)
    • Bug fixes
    • Documentation updates
  4. Build & pass all checks if checked out individually.

What to Avoid

  • Don't squash meaningful commits together - this buries important changes in large diffs and loses the step-by-step narrative
  • Don't use merge commits within feature branches
  • Don't include "fix up" or "oops" commits in your final PR - these are fine to have during development, but before opening your PR, use git commit --amend or interactive rebase to clean these up. A typical rebase workflow is explained in this blog post. git absorb is also very useful for small fixups.