Welcome to Psyche

Psyche is a system that enables distributed training of transformer-based AI models over the internet, aiming to foster collaboration between untrusted parties to create state-of-the-art machine learning models. It leverages a peer-to-peer distributed network for communication and data sharing.

This documentation provides a comprehensive guide to understanding, using, and developing with Psyche, whether you're an end-user looking to participate in a training run, a developer interested in contributing to the project, or just curious about how it all works.

Introduction

How does it work?

At its core, Psyche is a protocol that coordinates multiple independent clients to train a single machine learning model together. Rather than running on a centralized server farm with high-speed interconnects between every accelerator (GPUs, usually), Psyche distributes the training workload across many independent computers, each contributing a small piece to the overall training process.

Psyche is built to maintain training integrity without requiring participants to trust each other. Through a combination of consensus mechanisms, game theory, and careful protocol design, Psyche will ensure that the trained model remains coherent and consistent despite being trained across disparate machines.

Client Quickstart

If you are a client wanting to join an existing training run, you can refer to the joining a run documentation.

End-user configuration

Joining a run

Learn how to use your existing compute to join a run

You may also want to read the client FAQ

Creating a run

To train your own models on Psyche, you should familiarize yourself with the creating a run guide and the run configuration documentation.

Quickstart: Providing Compute to NousNet

This guide walks you through the complete process of setting up your machine to provide compute to a NousNet training run. It assumes you have been provided the run-manager binary by the run administrator.

Prerequisites Checklist

Before starting, ensure you have:

  • Linux operating system (Ubuntu recommended)
  • NVIDIA GPU with sufficient VRAM for the model being trained
  • The run-manager binary
  • Run ID from the run administrator

Step 1: Verify NVIDIA Drivers

NousNet requires an NVIDIA CUDA-capable GPU. Verify your drivers are installed:

nvidia-smi

You should see output showing your GPU model, driver version, and CUDA version. If this command fails, install NVIDIA drivers following the NVIDIA driver installation guide.


Step 2: Install Docker

Install Docker Engine following the official Docker installation guide for your Linux distribution.

After installation, verify Docker is working:

docker --version

Docker Post-Installation Steps

Important: You must add your user to the docker group to run Docker without sudo:

sudo usermod -aG docker $USER

Then log out and back in (or reboot) for the group change to take effect.
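
Alternatively, you can apply the new group membership in your current session without logging out; note that this only affects that one shell:

newgrp docker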

Verify the change worked:

docker run hello-world

If this runs without requiring sudo, you're set.

For more details, see the Docker post-installation guide.


Step 3: Install NVIDIA Container Toolkit

The NVIDIA Container Toolkit enables GPU access inside Docker containers. This is required for NousNet to use your GPU for training.

Follow the NVIDIA Container Toolkit installation guide for your distribution.

After installation, verify GPU access works inside Docker:

docker run --rm --gpus all nvidia/cuda:12.2.2-devel-ubuntu22.04 nvidia-smi

You should see the same GPU information as running nvidia-smi directly.

Troubleshooting: If you see an error like could not select device driver "" with capabilities: [[gpu]], the NVIDIA Container Toolkit is not installed correctly. Revisit the installation guide.


Step 4: Install Solana CLI and Create Wallet

Install Solana CLI

sh -c "$(curl -sSfL https://release.anza.xyz/stable/install)"

After installation, add Solana to your PATH by adding this line to your ~/.bashrc or ~/.zshrc:

export PATH="$HOME/.local/share/solana/install/active_release/bin:$PATH"
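
For example, to append it to ~/.bashrc in one step:

echo 'export PATH="$HOME/.local/share/solana/install/active_release/bin:$PATH"' >> ~/.bashrc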

Then reload your shell:

source ~/.bashrc  # or source ~/.zshrc

Verify the installation:

solana --version

For more details, see the Solana installation docs.

Generate a Keypair

Create a new Solana keypair for your node:

solana-keygen new --outfile ~/.config/solana/psyche-node.json

You'll be prompted to set an optional passphrase. The keypair file will be created at the specified path.

Important: Back up this keypair file securely. If you lose it, you lose access to any rewards earned.
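
For example, you could copy it to an external or encrypted drive of your choice (the destination path below is just a placeholder):

cp ~/.config/solana/psyche-node.json /path/to/secure/backup/psyche-node.json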

Get your public key (you'll need this):

solana-keygen pubkey ~/.config/solana/psyche-node.json

Step 5: Get Authorization to Join the Run

NousNet runs are permissioned. To join, you need the run administrator to authorize your wallet.

  1. Send your public key to the run administrator (the output from solana-keygen pubkey above)
  2. The administrator will create an authorization for your key
  3. Once authorized, you can proceed to join the run

Step 6: Fund Your Wallet (Devnet)

Your wallet needs SOL to pay for transaction fees when communicating with the Solana blockchain.

First, configure Solana CLI to use devnet:

solana config set --url https://api.devnet.solana.com

Then request an airdrop from the devnet faucet:

solana airdrop 2 ~/.config/solana/psyche-node.json

Verify your balance:

solana balance ~/.config/solana/psyche-node.json

Note: If the airdrop fails due to rate limiting, wait a few minutes and try again, or use the Solana Faucet web interface.


Step 7: Create the Environment File

Create a .env file with your configuration. This file tells the run-manager how to connect and authenticate.

# Create the config directory and env file
mkdir -p ~/.config/psyche
cat > ~/.config/psyche/run.env << 'EOF'
# Path to your Solana keypair
WALLET_PRIVATE_KEY_PATH=/home/YOUR_USERNAME/.config/solana/psyche-node.json

# Solana RPC endpoints (devnet)
RPC=https://api.devnet.solana.com
WS_RPC=wss://api.devnet.solana.com

# The run you're joining (provided by run administrator)
RUN_ID=your_run_id_here

# Your public key (the one authorized by the run admin)
AUTHORIZER=YOUR_PUBLIC_KEY_HERE

# Required for GPU access in container
NVIDIA_DRIVER_CAPABILITIES=all
EOF

Replace the following values:

Variable               Replace With
YOUR_USERNAME          Your Linux username
your_run_id_here       The run ID from your administrator
YOUR_PUBLIC_KEY_HERE   Your wallet's public key
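
For illustration, a completed file might look like the following; the username, run ID, and public key shown are placeholders, not real values:

# Example values only; substitute your own
WALLET_PRIVATE_KEY_PATH=/home/alice/.config/solana/psyche-node.json
RPC=https://api.devnet.solana.com
WS_RPC=wss://api.devnet.solana.com
RUN_ID=example-run
AUTHORIZER=ExampleBase58PublicKeyGoesHere111111111111
NVIDIA_DRIVER_CAPABILITIES=all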

Optional Configuration

You can add these optional variables to tune performance; ask the run administrator for help choosing values:

# Number of GPUs to use for data parallelism (default: 1)
DATA_PARALLELISM=1

# Number of GPUs to distribute model across (default: 1)
TENSOR_PARALLELISM=1

# Samples per GPU per training step (tune based on VRAM)
MICRO_BATCH_SIZE=4
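
For example, on a hypothetical machine with 2 GPUs that can each hold the full model, you might use data parallelism only; confirm suitable values with the run administrator:

DATA_PARALLELISM=2
TENSOR_PARALLELISM=1
MICRO_BATCH_SIZE=4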

Step 8: Run the Manager

Make the binary executable if needed:

chmod +x ./run-manager

Start a tmux session so the client keeps running if your terminal disconnects:

tmux
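
To detach from the tmux session later and leave the client running, press Ctrl+B then D; you can reattach with:

tmux attach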

Start providing compute to the network:

./run-manager --env-file ~/.config/psyche/run.env

The run-manager will:

  1. Connect to the Solana coordinator
  2. Pull the appropriate Docker image for the run
  3. Start the training container
  4. Stream logs to your terminal

Step 9: Verify It's Working

After starting, you should see:

  1. Image pull progress - Docker downloading the NousNet client image
  2. Container startup - The training container initializing
  3. Connection logs - Your client connecting to the coordinator
  4. Training logs - Progress updates as training proceeds

A healthy startup looks something like:

INFO run_manager: Docker tag for run 'your_run': nousresearch/psyche-client:v0.x.x
INFO run_manager: Pulling image from registry: nousresearch/psyche-client:v0.x.x
INFO run_manager: Starting container...
INFO run_manager: Started container: abc123...
[+] Starting to train in run your_run...

To stop the client gracefully, press Ctrl+C.


Troubleshooting

GPU Not Detected in Container

Error: could not select device driver "" with capabilities: [[gpu]]

Solution: The NVIDIA Container Toolkit is not installed or configured correctly. Revisit Step 3 and ensure you can run docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi successfully.

Docker Permission Denied

Error: permission denied while trying to connect to the Docker daemon socket

Solution: Your user isn't in the docker group. Run:

sudo usermod -aG docker $USER

Then log out and back in.

Wallet Not Found

Error: Failed to read wallet file from: /path/to/keypair.json

Solution: Verify the WALLET_PRIVATE_KEY_PATH in your .env file points to an existing file:

ls -l ~/.config/solana/psyche-node.json

RPC Connection Failures

Error: RPC error: failed to get account or connection timeouts

Solution:

  • Verify your RPC endpoints are correct in the .env file
  • For devnet, use https://api.devnet.solana.com and wss://api.devnet.solana.com
  • The public devnet RPC has rate limits; if issues persist, consider using a dedicated RPC provider

Not Authorized to Join

Error: Authorization or permission errors when trying to join

Solution: Confirm with the run administrator that your public key has been authorized. You can verify your authorization status:

./run-manager can-join \
    --rpc https://api.devnet.solana.com \
    --run-id YOUR_RUN_ID \
    --authorizer YOUR_PUBLIC_KEY \
    --address YOUR_PUBLIC_KEY

Container Keeps Restarting

Symptom: Container restarts repeatedly with "version mismatch"

Solution: This usually indicates a Docker image pull issue:

  • Check your internet connection
  • Verify Docker Hub is accessible: docker pull hello-world
  • Check disk space: df -h

Running Multiple Machines

If you want to provide compute from multiple machines, each machine must use a different keypair. Running the same keypair on multiple machines simultaneously will cause issues.

NousNet uses a delegation system for this:

  1. Your main keypair (the one authorized by the run admin) acts as your master key
  2. You generate additional delegate keys for each machine
  3. You register those delegates under your master key
  4. Each machine uses its own delegate key

Setup for Multiple Machines

On your first machine (where your master key is):

  1. Generate a delegate keypair for each additional machine:
solana-keygen new --outfile ~/.config/solana/psyche-delegate-1.json
solana-keygen new --outfile ~/.config/solana/psyche-delegate-2.json
# ... etc
  2. Get the public keys:
solana-keygen pubkey ~/.config/solana/psyche-delegate-1.json
solana-keygen pubkey ~/.config/solana/psyche-delegate-2.json
  3. Register the delegates under your master key (requires the run admin's join authority pubkey):
run-manager join-authorization-delegate \
    --rpc [RPC] \
    --wallet-private-key-path [USER_MASTER_KEYPAIR_FILE] \
    --join-authority [JOIN_AUTHORITY_PUBKEY] \
    --delegates-clear [true/false] \
    --delegates-added [USER_DELEGATES_PUBKEYS]

Note: Ask the run administrator for the JOIN_AUTHORITY_PUBKEY. The --delegates-clear flag optionally removes previously set delegates, and --delegates-added accepts multiple pubkeys.

  4. Copy each delegate keypair file to its respective machine.

  5. Fund each delegate wallet with SOL for transaction fees.
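
On devnet, for example, you can fund a delegate the same way as your main wallet:

solana airdrop 2 ~/.config/solana/psyche-delegate-1.json
solana balance ~/.config/solana/psyche-delegate-1.json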

On each additional machine:

Configure the .env file to use that machine's delegate keypair:

WALLET_PRIVATE_KEY_PATH=/path/to/psyche-delegate-N.json
AUTHORIZER=YOUR_MASTER_PUBLIC_KEY

The AUTHORIZER should be your master key's public key (the one authorized by the run admin), not the delegate's public key.


Claiming

  • Claiming Rewards: After participating in training, you can claim rewards using:
    ./run-manager treasurer-claim-rewards \
        --rpc https://api.devnet.solana.com \
        --run-id YOUR_RUN_ID \
        --wallet-private-key-path ~/.config/solana/psyche-node.json
    

Quick Reference

Command                                                         Purpose
nvidia-smi                                                      Verify GPU and drivers
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi     Verify GPU access in Docker
solana-keygen pubkey ~/.config/solana/psyche-node.json          Get your public key
solana balance ~/.config/solana/psyche-node.json                Check wallet balance
./run-manager --env-file ~/.config/psyche/run.env               Start providing compute
Ctrl+C                                                          Stop the client gracefully

Joining a training run

This guide is for end-users who wish to participate in a training run. It assumes you have a pre-distributed run-manager binary. If you are looking for more in-depth documentation or want to run from source, refer to the development documentation.

Prerequisites

Before joining a run, make sure you meet a few prerequisites:

Linux Operating System

The Psyche client currently only runs on modern Linux distributions.

NVIDIA GPU and Drivers

Psyche requires an NVIDIA CUDA-capable GPU for model training. Your system must have NVIDIA drivers installed.

To check if you have NVIDIA drivers:

nvidia-smi

If this command doesn't work or shows an error, you need to install NVIDIA drivers. Follow NVIDIA's installation guide for your Linux distribution.

Docker

The Psyche client runs inside a Docker container. You need Docker Engine installed on your system.

To check if Docker is installed:

docker --version

If Docker isn't installed, follow the Docker Engine installation guide for your Linux distribution.

NVIDIA Container Toolkit

The NVIDIA Container Toolkit is required to enable GPU access inside Docker containers, which Psyche uses for model training.

To install the NVIDIA Container Toolkit, follow the NVIDIA Container Toolkit installation guide for your Linux distribution.

Solana Wallet/Keypair

You need a Solana keypair (wallet) to participate in training. This keypair identifies your client on the blockchain.

If you need to create a new keypair, you can use the Solana CLI, specifying where you want to create it:

solana-keygen new --outfile <path/to/keypair/file.json>

Quick Start

The recommended way to run a Psyche client is through the run-manager, which should have been distributed to you. The run manager will handle downloading the correct Docker image, starting your client, and keeping it updated automatically. Before running it, you should create an environment file with some needed variables.

The .env file should have at least this defined:

WALLET_PATH=/path/to/your/keypair.json

# Required: Solana RPC Endpoints
RPC=https://your-primary-rpc-provider.com
WS_RPC=wss://your-primary-rpc-provider.com

# Required: Which run id to join
RUN_ID=your_run_id_here

# Recommended: Fallback RPC Endpoints (for reliability)
RPC_2=https://your-backup-rpc-provider.com
WS_RPC_2=wss://your-backup-rpc-provider.com

Then, you can start training by running the run manager:

./run-manager --env-file /path/to/your/.env

After the initial setup, you'll see the Psyche client logs streaming in real-time. These logs show training progress, network status, and other important information.

To stop the client, press Ctrl+C in the terminal.

RPC Hosts

We recommend using a dedicated RPC service such as Helius, QuickNode, Triton, or self-hosting your own Solana RPC node.

Additional config variables

In general it's not necessary to change these variables to join a run since we provide sensible defaults, though you might need to in some cases.

NVIDIA_DRIVER_CAPABILITIES - An environment variable that the NVIDIA Container Toolkit uses to determine which compute capabilities should be provided to your container. It is recommended to set it to 'all', e.g. NVIDIA_DRIVER_CAPABILITIES=all.

DATA_PARALLELISM - Number of GPUs to distribute training data across.

  • If you have multiple GPUs, you can set this to 2, 4, etc. to speed up training
  • If you have 1 GPU, set this to 1

TENSOR_PARALLELISM - Number of GPUs to distribute the model across; this lets you train a model that doesn't fit on a single GPU.

  • If you have 1 GPU, set this to 1
  • If you have n GPUs, you can distribute the model across all of them by setting it to n.

MICRO_BATCH_SIZE - Number of samples processed per GPU per training step

  • Set as high as your GPU memory allows

AUTHORIZER - The Solana address that authorized your wallet to join this run

Testing Authorization

Before joining a run, you can verify that your client is authorized by using the run-manager command:

run-manager can-join --run-id <RUN_ID> --authorizer <AUTHORIZER> --address <PUBKEY>

Where:

  • <RUN_ID> is the run ID you want to join (from your .env file)
  • <AUTHORIZER> is the Solana authorizer address (from your .env file)
  • <PUBKEY> is your wallet's public key

You can find your wallet's public key by running:

solana address

This command will return successfully if your wallet is authorized to join the run. This helps debug authorization issues before attempting to join.

Troubleshooting

Docker Not Found

Error: Failed to execute docker command. Is Docker installed and accessible?

Solution: Install Docker using the Docker installation guide. Make sure your user is in the docker group:

sudo usermod -aG docker $USER

Then log out and back in for the group change to take effect.

NVIDIA Drivers Not Working

Error: Container starts but crashes immediately, or you see GPU-related errors in logs

Solution:

  • Verify drivers are installed: nvidia-smi

RPC Connection Failures

Error: RPC error: failed to get account or connection timeouts

Solution:

  • Verify your RPC endpoints are correct in your .env file
  • Check that your RPC provider API key is valid, if present
  • Try your backup RPC endpoints (RPC_2, WS_RPC_2)

Wallet/Keypair Not Found

Error: Failed to read wallet file from: /path/to/keypair.json

Solution:

  • Verify the file exists: ls -l /path/to/keypair.json
  • Check file permissions: chmod 600 /path/to/keypair.json
  • If using default location, ensure ~/.config/solana/id.json exists
  • Verify the path in your .env file matches the actual file location

Container Fails to Start

Error: Docker run failed: ... or container exits immediately

Solution:

  • Check Docker logs for more details
  • Ensure all required variables are in your .env file
  • Verify GPU access: docker run --rm --gpus all ubuntu nvidia-smi
  • Check disk space: df -h
  • Verify you have enough VRAM for your MICRO_BATCH_SIZE setting (try reducing it)

Process Appears Stuck

Symptom: No new logs appearing; the process seems frozen or stuck with error messages.

Solution:

  • The run manager will attempt to restart the client, but sometimes this can fail and hang.
  • Press Ctrl+C to stop run-manager and wait a few seconds.
  • If for some reason this fails to stop it, you can check the running containers with docker ps and force stop the container manually with docker stop.
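
For example (the container ID below is whatever docker ps shows for the Psyche client):

docker ps                  # find the running container's ID
docker stop CONTAINER_ID   # stop it manually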

Version Mismatch Loop

Symptom: Container keeps restarting every few seconds with "version mismatch"

Solution:

  • This usually means there's an issue with pulling the new Docker image
  • Check your internet connection
  • Verify Docker is working: docker --version
  • Verify Docker Hub is accessible: docker pull hello-world
  • Check disk space for Docker images: docker system df

Checking Container Logs Manually

If run-manager exits but you want to see what happened, you can view Docker logs:

# List recent containers (including stopped ones)
docker ps -a

# View logs for a specific container
docker logs CONTAINER_ID

Claiming Rewards

After participating in training and accumulating rewards, you can claim them using the run-manager command:

run-manager treasurer-claim-rewards \
    --rpc <RPC> \
    --run-id <RUN_ID> \
    --wallet-private-key-path <JSON_PRIVATE_KEY_PATH>

Where:

  • <RPC> is your Solana RPC endpoint (same as in your .env file)
  • <RUN_ID> is the run ID you participated in
  • <JSON_PRIVATE_KEY_PATH> is the path to your wallet keypair file (e.g., ~/.config/solana/id.json)

This command will claim any rewards you've earned from contributing to the training run.

Building from source

If you wish to run the run-manager from source, first make sure you have followed the development setup and are inside the nix environment, then run just run-manager path/to/.env.file

Creating a run

To create a new training run and make it available for nodes to join, you’ll need to create it, configure it, and unpause it. By default, every new run starts in a paused state until it is explicitly unpaused by the owner, and it can be paused again at any time.

Setting up the Run

First, create the run on-chain. You’ll need to provide:

  • The RPC and WebSocket RPC URLs so the client can communicate with an RPC node.
  • A unique run ID — just a few characters to uniquely identify your run.

For all of the following commands, you can either use the Psyche client Docker image or clone the Psyche repository and run the package directly using cargo run --bin psyche-solana-client -- ....

Setting up Join Authorizations

Before getting started, we need to decide who will be able to join the run. You can read more about this in the authorization section.

We’ll need a keypair file that manages join permissions. This can be the default one created by Solana when running solana-keygen new, located at ~/.config/solana/id.json, or any other keypair you want to control join permissions.

Join Authority for Public Runs

If we're looking to make a permissionless run (anyone can join), we'll need to create an authorization that's valid for everyone. Setting the authorizer to 11111111111111111111111111111111 in the following command makes the authorization valid for everyone, so any client can join the created run without additional restrictions.

run-manager join-authorization-create \
    --rpc [RPC] \
    --wallet-private-key-path [SOLANA_KEY_FILE] \
    --authorizer 11111111111111111111111111111111

In this case the SOLANA_KEY_FILE needs to be the path to a Solana keypair that is creating the authorization for this specific run.

Join Authority for Private Runs

If you only want certain users to join a run, you’ll need to create one authorization per user (each user can later set multiple delegate keys).

For example, imagine you have a keypair for the run creator at ~/.config/solana/owner.json, which is also the account that grants authorization, and another keypair at ~/.config/solana/joiner.json for the client that is being authorized by the owner and wants to join and train in the run.

First, create the authorization with the following parameters:

run-manager join-authorization-create \
    --rpc [RPC] \
    --wallet-private-key-path ~/.config/solana/owner.json \
    --authorizer $(solana-keygen pubkey ~/.config/solana/joiner.json)

This command uses the public key of the user you want to allow to join and the keypair of the run owner to create the appropriate authorization. The solana-keygen pubkey command just gives you the public key derived from the keypair file.

Now all that’s left is for the joiner to use their public key, now associated with the newly created authorization, when joining the run using the --authorization flag in the train command. More details can be found in the joining a run section.

Creating the run

Run creation accepts a variety of parameters. We’ll start with the fundamentals and then cover the remaining options. At a minimum, a run needs:

  • The Solana RPC endpoint corresponding to the validator you want to use.
  • A unique identifier, known as the run ID.
  • A join authority, which is the public key that manages access to the run (by default, this is the key that creates the run).
  • The private key of the wallet used to create the run.

For a standard run without a token incentive distribution layer (see rewards for more details):

run-manager create-run \
    --rpc [RPC] \
    --run-id [RUN_ID] \
    --join-authority [JOIN_AUTHORITY_PUBKEY] \
    --wallet-private-key-path [JSON_PRIVATE_KEY_PATH] \
    --client-version "latest"

For a run that distributes tokens as rewards to training participants, you must specify the public key of the token created on the Solana blockchain. This will be used as the mint for the collateral token to be distributed:

run-manager create-run \
    --rpc [RPC] \
    --run-id [RUN_ID] \
    --join-authority [JOIN_AUTHORITY_PUBKEY] \
    --treasurer-collateral-mint [TOKEN_PUBKEY] \
    --wallet-private-key-path [JSON_PRIVATE_KEY_PATH] \
    --client-version "latest"

At this point, your run has been successfully created.

Initializing configuration

Initially, the run will not have any configuration defined and will remain paused, so no clients can join yet.

To set the run configuration, you’ll need to provide mostly the same parameters as when creating the run, along with the path to a config.toml file that follows the run config schema.

run-manager update-config \
    --rpc [RPC] \
    --run-id [RUN_ID] \
    --config-path [CONFIG_FILE_PATH] \
    --wallet-private-key-path [JSON_PRIVATE_KEY_PATH]

Unpausing the run

At this point, your run is ready to go. You can now set its state to unpaused, allowing clients to join and begin training your model.

run-manager set-paused \
    --rpc [RPC] \
    --run-id [RUN_ID] \
    --resume \
    --wallet-private-key-path [JSON_PRIVATE_KEY_PATH]

Congratulations! As soon as your first client joins, your model will start training.

Configuring training rewards

If you created a run with rewards enabled, you can configure how many points each client earns or loses per training epoch.

run-manager set-future-epoch-rates \
    --rpc [RPC] \
    --run-id [RUN_ID] \
    --earning-rate-total-shared [EARNING_RATE] \
    --slashing-rate-per-client [SLASHING_RATE] \
    --wallet-private-key-path [JSON_PRIVATE_KEY_PATH]

To distribute collateral to users, you must periodically top up the run’s treasury so that points earned during computation can be claimed.

run-manager treasurer-top-up-rewards \
    --rpc [RPC] \
    --run-id [RUN_ID] \
    --collateral-amount [COLLATERAL_AMOUNT] \
    --wallet-private-key-path [JSON_PRIVATE_KEY_PATH]

Getting information about a run

Optionally, you can retrieve detailed technical information about a previously created run for troubleshooting purposes.

run-manager json-dump-run \
    --rpc [RPC] \
    --run-id [RUN_ID]

For more information about a specific user within a run, you can also use:

run-manager json-dump-user \
    --rpc [RPC] \
    --run-id [RUN_ID] \
    --address [PUBLIC_KEY]

Run configuration

A training run on Psyche is described using a Run Configuration file. It's a .toml file with information about the model shape, size, checkpoints, optimizer settings, run witnessing settings, and more.

There are two top-level values in a run configuration: a config and a model.

While some examples are described below, you can find the full range of options for the coordinator here and for the model here.

Config

Here's a sample config with some of its options documented.

[config]
# maximum time, in seconds, to let nodes download the model from a checkpoint / other nodes
warmup_time = 30

# time, in seconds, to let nodes bring the model from the GPU to disk, and to opt to join the next round.
cooldown_time = 30

# time, in seconds, that an epoch will last.
epoch_time = 60

# maximum time, in seconds, to allow nodes to train in one round.
# this will limit the types of GPUs your model can be trained on,
# since setting it low will prevent slower hardware from completing
# training in time.
max_round_train_time = 30

# time, in seconds, to allow witnesses to publish their messages before next round
round_witness_time = 1

# number of clients that need to be active for an epoch to continue on.
# if the number of clients goes below this number, we initiate a Cooldown and then back to WaitingForMembers.
# this should be adjusted alongside max_round_train_time, because one client will train a lot slower
# than 100.
min_clients = 1

# minimum number of clients required before we transition from WaitingForMembers to Warmup.
# must be equal to or greater than min_clients.
init_min_clients = 1

# what percent of nodes are dedicated to verifying correctness. always set to 0 for now.
verification_percent = 0

# how many nodes are selected each round to publish witness proofs
# Can be set to 0 to select all nodes as witnesses.
witness_nodes = 1

# the total number of training data batches per-step. this also determines your maximum number of clients.
# the batch size will linearly increase from global_batch_size_start to global_batch_size_end over
# global_batch_size_warmup_tokens tokens
global_batch_size_start = 8
global_batch_size_end = 8
global_batch_size_warmup_tokens = 0

# the total number of training steps to partake in. this is used for the LR schedule in the model section too.
total_steps = 25000

Model

# so far only LLMs are supported.
[model.LLM]
# Architecture of the model to train; can be HfLlama or HfDeepseek for now.
# If running with Python sidecars this must be set to HfAuto.
architecture = "HfLlama"
data_type = "Pretraining"
max_seq_len = 2048

[model.LLM.checkpoint.Hub]
# Repo on HuggingFace where the model is located; used to download the model at the beginning of training.
repo_id = "emozilla/llama2-20m-init"

[model.LLM.data_location.Http]
# Token size in bytes, can be "TwoBytes" or "FourBytes"
token_size_in_bytes = "TwoBytes"
# Whether to shuffle tokens for training; can be "Seeded" with a seed value or "DontShuffle"
shuffle = "DontShuffle"

# Data location to train on
[model.LLM.data_location.Http.location.Gcp]
bucket_name = "nous-pretraining-public-us"
filter_directory = "fineweb-edu-tokenized-llama2"

[model.LLM.lr_schedule.Cosine]
base_lr = 4.0e-4
warmup_steps = 250
warmup_init_lr = 0.0
total_steps = 25000
final_lr = 4.0e-5

# only the DisTrO optimizer is supported when training models on Psyche.
[model.LLM.optimizer.Distro]
clip_grad_norm = 1.0
compression_decay = 0.999
compression_chunk = 64
compression_topk = 8
quantize_1bit = true

Authentication and Keys

When clients participate in a decentralized training run, a set of Solana Keypairs is used to authenticate each type of user.

User Roles

A different set of keys will be used for each role within the training flow.

The following roles will be important:

  • The Run's main_authority is the private key that creates and owns the run; it is the only key allowed to modify the run's configuration.

  • The Run's join_authority is the private key that is responsible for allowing or disallowing clients' keys to join a training run. It is set by the main_authority during the creation of the Run.

  • A client's authorizer key is the user's master key (for compute providers). It can then set delegate keys that join the run on its behalf.

  • A client's delegate key is an ephemeral key that can be allowed to join a run's training on behalf of a user's master key.

A training run can be configured to be restricted to a set of whitelisted keys; this kind of run is considered "Permissioned", as opposed to a "Permissionless" run, which is open to anyone without per-user authorization.

Permissionless Runs

Permissionless runs are open to anyone without any authorization required. The owner of the run can set this for a run when creating it. This type of authorization can be made by creating an authorization with a special authorizer valid for everyone: 11111111111111111111111111111111

A CLI is provided for this:

run-manager join-authorization-create \
    --rpc [RPC] \
    --wallet-private-key-path [JOIN_AUTHORITY_KEYPAIR_FILE] \
    --authorizer 11111111111111111111111111111111

Permissioned Runs

In order to be able to join a permissioned run, a user must first be whitelisted through a dedicated authorization.

This is done through the following steps:

  1. The join_authority issues an authorization for an authorizer (the user master key)
  2. The authorizer (the user master key) sets a list of delegate keys that can join the run on its behalf
  3. The delegate (a temporary user key) can then join the run

Keys Management

For the join_authority to issue new authorizations, a CLI is provided:

run-manager join-authorization-create \
    --rpc [RPC] \
    --wallet-private-key-path [JOIN_AUTHORITY_KEYPAIR_FILE] \
    --authorizer [USER_MASTER_PUBKEY]

For the authorizer to then set a list of delegates, the following CLI is provided:

run-manager join-authorization-delegate \
    --rpc [RPC] \
    --wallet-private-key-path [USER_MASTER_KEYPAIR_FILE] \
    --join-authority [JOIN_AUTHORITY_PUBKEY] \
    --delegates-clear [true/false] \
    --delegates-added [USER_DELEGATES_PUBKEYS]

The --delegates-clear flag optionally removes previously set delegates, and --delegates-added accepts multiple pubkeys.

Removing the authorization is also possible through CLI:

run-manager join-authorization-delete \
    --rpc [RPC] \
    --wallet-private-key-path [JOIN_AUTHORITY_KEYPAIR_FILE] \
    --authorizer [USER_MASTER_PUBKEY]

Further information

To see how authorization creation fits into the configuration of a real run, see the authorization section in the create run guide.

The source code for the authorizer smart contract used by the Psyche's coordinator can be found here with its readme: https://github.com/PsycheFoundation/psyche/tree/main/architectures/decentralized/solana-authorizer

Client FAQ

  • Which operating systems are supported?
    • We officially support modern Linux versions. macOS is supported for development purposes only (not for production training).
  • What are the hardware requirements to run a client?
    • Linux: You need a CUDA-compatible GPU. As for the exact specs this will depend on the size of the model being trained.
    • macOS (development only): Apple Silicon Macs can use Metal Performance Shaders for GPU acceleration.
    • Support for AMD ROCm is planned for the future.
  • Can I join a run at any moment?
    • Yes! You will remain as a pending client and start training in the next epoch.
  • Can I leave a run at any moment? How do I leave a run?
    • Yes, you may leave a run by closing the container with the client, either with Ctrl+C in the terminal or by manually stopping the container if it's running detached. However, take into account that once rewards are implemented, you will lose all rewards for that epoch.
  • What happens if my connection drops or my client crashes?
    • This is similar to closing the client: just make sure the Docker container is correctly stopped, then re-run the client.
  • How do I update the client to the latest version?
    • You can force Docker to pull the latest image by running docker pull nousresearch/psyche-client:latest before running the client.
  • Do I need a Solana wallet to train? Does it need to have funds?
    • Yes, even if you want to join a run that does not track rewards you will need a Solana wallet with funds to pay for the transactions to the coordinator.
  • Are the client and coordinator open-source? Can I report bugs?

Psyche In Depth

The core system is composed of three main actors:

  • Coordinator: Serves as a source of truth for global state available to all clients in a given training run. Each run has one coordinator that oversees the entire process. The coordinator is implemented as both a program running on the Solana Blockchain and as a regular TCP server.

  • Client: A user participating in a training run. Clients receive the model to be trained and a specific dataset for that run. They send information to the coordinator to progress the training run and use a peer-to-peer network to share their results at each training step with other clients.

  • Data Provider: Each run requires training data. This data could be served by the Psyche Data Provider server, over HTTP, or loaded from local copies of a dataset.

Psyche provides two different implementations of the network: one for decentralized runs, where the coordinator runs on the Solana blockchain, and another for centralized runs, where the Coordinator runs as a regular TCP server; the latter is mostly used for testing local runs and as a development tool.

Sample topologies

---
title: Decentralized Run, training data provided over HTTP.
---
flowchart TB
subgraph "Solana Blockchain"
    C(["Coordinator State"])
end
C <-- Solana RPC & TXs --> C1(("Client")) & C2(("Client")) & C3(("Client"))
C1 <-. p2p gossip .-> C2
C3 <-. p2p gossip .-> C2 & C1
DT["`
Hosted training data
and model snapshots
`"] --> C1 & C2 & C3
---
title: Centralized Run, training data provided through a TCP data server.
---
flowchart TB
subgraph "Coordinator Server"
    CC(["Coordinator State"])
end
CC <-- TCP --> C11(("Client")) & C22(("Client")) & C33(("Client"))
C11 <-. p2p gossip .-> C22
C33 <-. p2p gossip .-> C22 & C11
DTT["`
Hosted training data
and model snapshots
`"] --> C11 & C22 & C33

What constitutes a training run?

The training process for a given model is divided into small steps that incrementally train the model in a coordinated manner. A training run is divided into epochs, where clients can join and leave the run; epochs are further divided into rounds, which are in turn divided into steps, where the model is incrementally trained.

During a training run, clients primarily perform three tasks:

  • Training: Train the model using an assigned subset of the data.
  • Witnessing: Verify the liveness and correctness of other participants.
  • Verifying: Recompute and compare training results to identify and punish malicious participants.

These three phases constitute a round of training and loop until the epoch is completed.

Waiting for Members & Warmup

At the start of an epoch, all clients have a window of time to join the run by requesting to be added by the coordinator and then connecting to the other participating clients. This state is known as the Waiting for Members phase.

Once a minimum threshold of clients has been met, the run transitions to the Warmup phase and begins a countdown to allow connected clients to update their copy of the model. To obtain a copy of the model, the Coordinator either directs clients to a checkpoint uploaded somewhere like HuggingFace, which they then download, or directs them to download the model from other clients via the p2p network. In the first epoch, all clients download the model from HuggingFace; in every subsequent epoch, clients download the model from other clients via the p2p network.

After the Warmup phase ends, the run enters the Training phase.

Training

At the beginning of a round, either after the Warmup or Witness phase ends, clients are assigned specific tasks that require them to train the model on a portion of the data.

The coordinator contains information that uniquely assigns pieces of training data to clients based on the current round.

If clients have already been training (i.e., it is not the first round of the epoch), they will apply the results from the previous round, then retrieve the data sample they need for the current round.

After completing the training on their assigned data, each client emits a p2p broadcast to all other clients containing their training results and a cryptographic commitment that binds them to those results.

As training results are received from other clients, they are downloaded and later incorporated into each client's current copy of the model.

Witnessing

At the start of each round, one or more clients are randomly selected as witnesses. The number of witnesses can be configured. Witnesses train the model as usual, but also build bloom filters that track which nodes they have received training results from, signifying that they are actively participating and providing valid results.

These bloom filters are sent to the coordinator, which then combines them into a provable consensus of which results to apply to the model.

Once a witness quorum is reached, the coordinator advances the run: all clients get a brief window to download every training result of the previous round, clients are assigned new data, and the process repeats. After a fixed amount of time, a Cooldown round occurs, marking the end of an epoch. This time is configurable when the run is created, as explored in the other sections.

The witness/train loop visualized

Here's a high-level overview of the process.

There are additional implementation details, but this captures the overall flow of a single Round in an Epoch:

sequenceDiagram
    participant Client1
    participant Client2
    participant Coordinator
    participant Data Hosting
    loop Every round
      Client1 ->> Data Hosting: get_data
      Client2 ->> Data Hosting: get_data
      Coordinator ->> Client2: witness
      Note over Client1: Train
      Note over Client2: Train
      Client1 ->> Client2: Send results
      Client2 ->> Client1: Send results
      Note over Client1: Download results
      Note over Client2: Download results
      Client2 ->> Coordinator: Send witness
      Note over Coordinator: Quorum reached
      Note over Coordinator: Starting Witness phase
      Note over Coordinator: Starting Training phase
    end

Glossary

For a list of common terms within the project along with definitions, please refer to the glossary

General Workflow

Client

A client is an active participant responsible for executing the training tasks within a run. It handles assigned data batches for training, generates commitments, and participates in the witness process when elected to validate the work of its peers. Each client maintains its own state synchronized with the Coordinator.

Coordinator

The Coordinator stores metadata about the training run's state and a list of participants.

It handles the transition between each Phase of a Round, and provides a random seed that's used to determine data assignments, witnesses, and more.

It's responsible for providing a point of synchronization for all clients within a run.

Ticks (State Transitions)

The coordinator behaves like a state machine, moving from one state to another, with each state transition having specific requirements.

When certain events occur or time-based conditions are met, the Coordinator can be "ticked" forwards to transition from one Phase to another Phase.

sequenceDiagram
    loop
        Note over Backend, Coordinator: Wait for a timeout or backend state
        Backend->>Coordinator: Tick
        Coordinator->>Backend: New state produced
        Backend->Client1: New coordinator state consumed by Client
        Backend->Client2: New coordinator state consumed by Client
    end

In this case the backend is just the layer of the Clients that communicates with the Coordinator; depending on the nature of the run, it communicates with the Solana blockchain or directly with the coordinator over TCP.

Beginning an Epoch (state: WaitingForMembers)

The Coordinator begins in the WaitingForMembers phase, with no clients connected.

Whatever backend you're running the Coordinator in should accept pending clients to be added to upcoming Epochs.

When inside the WaitingForMembers phase, your backend will pass new clients to the Coordinator until a configured min_clients threshold is met, at which point the coordinator's tick will transition it to the Warmup phase.

sequenceDiagram
    Note over Coordinator: min_clients = 2
    Client1->>Coordinator: Join
    Client2->>Coordinator: Join
    Note over Coordinator: Entering Warmup
    Client1->>Client2: Connect
    Client2->>Client1: Connect

Model Loading (state: Warmup)

This phase is designed to let all clients download the model & load it onto their GPUs.

If a client drops whilst waiting for the warmup time to elapse, the Backend removes the client from the Coordinator's clients list; if the number of clients falls below min_clients, the Coordinator goes back to the WaitingForMembers phase and waits for more clients to join.

There are two different ways the coordinator will transition to the RoundTrain phase:

  • If all the participating clients have finished loading the model and are ready to start training, they each send a specific message to the Coordinator; once the Coordinator receives this message from all clients, it transitions to the RoundTrain phase early.
  • If the maximum Warmup time passes, the Coordinator transitions to the RoundTrain phase even if not all clients have finished loading the model. This max time is configurable and can be set in the configuration file.

The Backend will watch for the state transition and all clients will be notified of this new Training Coordinator state.

Training (state: RoundTrain)

In this phase, the Coordinator provides a random seed. Each client uses this seed, alongside the current round index and epoch index, to determine which indices of the whole training batch it will train on. In effect, every client trains on a different subset of the training data.

As clients complete their training, they send their results to all other clients, including the Witnesses. The witnesses will each send a witness proof to the Coordinator, building towards a witness quorum.

A witness proof contains a bloom filter describing which pieces of data the witness received training results for, and which clients did that work. Elected witnesses are responsible for creating these witness proofs and sending them to the Coordinator.

The witnesses for each round are chosen randomly from all the clients, using the same random seed as for data assignments. A witness will attempt to send an opportunistic witness message once it has received a training result for every single batch in the current round. That message lets the Coordinator know that it can transition to the Witness phase without waiting for the full training time.

The Coordinator advances the run from the Training phase to the Witness phase in one of two ways:

  • If enough witnesses observe all results and reach a witness quorum for the round, they notify the Coordinator that it is safe to advance. This process, named opportunistic witnessing, accelerates the transition to the Witness phase, rather than having to wait a fixed time for training results.
  • If witnesses do not receive all required results from other clients before the maximum time specified for the Training phase, the Coordinator will nonetheless transition to the Witness phase after the maximum Training time elapses.

The Backend will watch for the state transition and all clients will be notified of this new Witness Coordinator state.

Witness phase (state: RoundWitness)

This phase exists to give the witnesses an opportunity to send their proofs to the Coordinator in the event that they did not receive enough training results from other clients to reach quorum and send their proofs opportunistically.

There is also a brief slack period for non-witness nodes to catch up by downloading any remaining results they might not have received.

When the Witness phase finishes, after the maximum witness time elapses, the Coordinator usually transitions from Witness back to the Training phase. It only transitions to a new state known as Cooldown in the following three cases:

  • If we are in the last round of the epoch.
  • If the clients have dropped to less than the minimum required by the config.
  • If the number of witnesses for the round is less than the quorum specified by the config.

Any clients that have failed health checks will also be removed from the current epoch.

Cooldown phase (state: Cooldown)

The Cooldown phase is the last phase of an epoch, during which the Coordinator waits for the Cooldown period to elapse. At this point the clients begin a new checkpoint of the model, that is, saving the state of the model at that time to external storage, such as Hugging Face.

When the Cooldown phase begins, the Coordinator also resets the current model checkpoint state to Checkpoint::P2P, indicating that new joiners should download the latest copy of the model from the other participants and not from the usual checkpoint.

Upon exiting the Cooldown phase, the Coordinator transitions to the next epoch, saving the previous epoch state and moving back to the WaitingForMembers phase. All the clients that participated in the previous epoch automatically join the new epoch unless they exit manually.

It all comes together

Here's an overview of how the state of the run can change depending on the situation:

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'35px'}}}%%
flowchart LR
    WFM((Waiting For Members))
    W((Warmup))
    T((Training))
    WI((Witness))
    CD((Cooldown))
    a{Are enough clients to start}
    b{All clients loaded the model}
    c{Max warmup time passed}
    d{Witness quorum reached}
    e{Max training time passed}
    f{End of the epoch reached}

    WFM --> a
    a -->|Yes| W
    a -->|No| WFM
    b -->|Yes| T
    b -->|No| c
    W --> b
    c -->|Yes| T
    c -->|No| W
    T --> d
    d -->|Yes| WI
    d -->|No| e
    e -->|Yes| WI
    WI --> f
    f -->|Yes| CD
    f -->|No| T
    CD --> WFM

And this is how it fits with the real clients and how they interact at each of the stages. The Committee in this case is the structure that contains all the witness data for the round.

sequenceDiagram
    Backend->>Coordinator: tick
    Coordinator->>Backend: Change state to `RoundTrain`
    Backend->>Client1: New state
    Backend->>Client2: New state
    par Start training
        Client1->>Client1: Start training
        Client2->>Client2: Start training
    end
    Client1->>Committee: get_witness
    Client2->>Committee: get_witness
    Committee->>Client1: false
    Committee->>Client2: true
    Note over Client1: Train
    Note over Client2: Train
    Note over Client2: Fill bloom filters
    Client2->>Backend: send opportunistic witness
    Backend->>Coordinator: Witness message
    Note over Coordinator: Enough witnesses for round
    Coordinator->>Coordinator: Update state to RoundWitness
    Note over Coordinator: Timeout round witness time
    alt step > total steps
        Coordinator->>Coordinator: Update state to Finished
    else current_epoch_time == max_time_per_epoch
        Coordinator->>Coordinator: Update state to Cooldown
    else
        Coordinator->>Coordinator: Update state to RoundTrain with step + 1
    end

Health checks

Each client should repeatedly send health checks to the coordinator. Clients are assigned a score determined by the Coordinator using the trainer_healthy_score_by_witnesses method. This score increases as a client sends the required data to be added to the participants' bloom filters, allowing the Coordinator to confirm that the client is actively participating in the training.

A client also sends a list of other clients it considers unhealthy to the server using the HealthCheck message. The Coordinator processes this information to determine whether those clients are healthy. Clients deemed inactive or non-participatory are marked for removal in the next round.

Centralized Backend

In this Backend, the Coordinator is owned and ticked forwards by a Server that communicates with clients over TCP.

The Server's Coordinator is initially configured in the server's main file. It's loaded from a specific configuration file, state.toml.

flowchart LR
    S[Server] --run--> A[App]
    S --new--> C[Coordinator]
    C --Run config--> A

The Server uses some parts of the Coordinator configuration, like the data server configuration, if enabled, to boot up all the functionality it needs.

When a new client joins the run, it has to communicate the run_id that it wants to join, to ensure it's joining the correct run. After processing the join message, the Server adds the client to a pending clients list and runs the Coordinator's tick function to potentially add the client into the run.

When a tick condition is met, the Server ticks the Coordinator forwards, then broadcasts the Coordinator's new state to all connected clients.

Decentralized Backend

In this Backend, the Coordinator is an account associated with a Solana Program, and ticked forwards by a tick method that can be called by anyone.

A training run can be created by calling the init_coordinator method in the Coordinator program, and subsequently information about the model to be trained can be set by calling the update method.

For a new client to join the run, it must call the join_run method in the Coordinator program and pass the run_id for the run it intends to join. After the Solana Program processes the join message, the client is added to a pending clients list, and the Program runs the Coordinator's tick function to potentially add the client into the run.

When a tick condition is met, anybody using Solana can tick the Coordinator forwards by calling the tick method (clients in a Run will do this automatically). This new state is then read via an RPC subscription on each Client, progressing through the regular state machine.

flowchart LR
    T["Psyche Team"] -- deploy Solana Program --> P["Solana Program"]
    R["Run Creator"] -- init_coordinator with run_id --> A["Account for this training run"]
    R["Run Creator"] -- update with run info --> A
    C[Client] -- "join_run" --> A
    C --tick--> A
    G["A random Solana user"] -- tick --> A

Decentralized training flow

Here's a more detailed diagram including most of the components involved in the Psyche training flow, with a few more implementation details:

flowchart TD
 subgraph sg_solana["Solana"]
    direction TB
        CoordinatorState["Coordinator Program State <br> (Run State, Epoch,<br>Round, Clients)"]
  end
 subgraph sg_distro["DisTrO Optimizer"]
    direction TB
        MomentumUpdate["Update Local Momentum <br> m<sub>t</sub> = βm<sub>t-1</sub> + g<sub>t</sub>"]
        DCTExtract["Extract Fast Components <br> (q<sub>t</sub>) (DCT + TopK)"]
        CompressedUpdate["Compressed Local q<sub>t</sub> <br> (Indices + Amplitudes)"]
        MomentumResidual["Update Local<br>Momentum Residual<br> m<sub>t+1</sub> = m<sub>t</sub> - q<sub>t</sub>"]
  end
 subgraph sg_loop["Local Training"]
    direction TB
        LocalWeights["Model Weights (x<sub>t</sub>)"]
        ApplyAggregatedUpdate["Apply Aggregated Update <br> x<sub>t</sub> = x<sub>t-1</sub> - η Q<sub>t-1</sub>"]
        ReceiveDecode["Receive &amp;<br>Decode/Aggregate <br> Compressed q<sub>t-1</sub><br> from Peers"]
        ForwardBackward["Forward/Backward Pass <br> (Use x<sub>t</sub>, <br>Compute Gradient g<sub>t</sub>)"]
        FetchData["Fetch Assigned Data <br> (Batch<sub>t</sub>)"]
        Gradient["Local Gradient (g<sub>t</sub>)"]
        sg_distro
        P2PNetworkInterface["P2P Network Interface"]
  end
 subgraph sg_client["Client"]
    direction TB
        ClientSM["Client State Machine <br> (Warmup, Train,<br>Witness, Cooldown)"]
        sg_loop
  end
 subgraph sg_p2p["P2P Gossip & Blob Transfer"]
    direction TB
        ClientNode2("Client Node 2")
        ClientNode3("Client Node 3")
        ClientNodeN("Client Node N")
  end
    DataProvider["Data Provider <br> (Local File/HTTP/etc.)"]
    ClientSM -- Manages --> sg_loop
    ClientSM -- Receives State Updates --- CoordinatorState
    ApplyAggregatedUpdate --> LocalWeights
    ReceiveDecode -- "Aggregated Q<sub>t-1</sub>" --> ApplyAggregatedUpdate
    LocalWeights -- Used By --> ForwardBackward
    FetchData -- Provides Data --> ForwardBackward
    ForwardBackward -- Produces Gradient --> Gradient
    Gradient -- Updates --> MomentumUpdate
    MomentumUpdate --> DCTExtract
    DCTExtract -- Produces --> CompressedUpdate
    DCTExtract -- Updates --> MomentumResidual
    CompressedUpdate -- Broadcasts Local Compressed Update --> P2PNetworkInterface
    P2PNetworkInterface -- Receives Compressed Updates --> ReceiveDecode
    DataProvider -- Provides Data --> FetchData
    P2PNetworkInterface <-- Send/Receive Updates -------> sg_p2p
    ClientNode2 <-- Transfer Data Off-chain --> ClientNode3 & ClientNodeN
    ClientNode3 <-- Transfer Data Off-chain --> ClientNodeN
    CoordinatorState -- Assigns Data/Committee --> ClientSM
    ClientSM -- "Submits Transactions (e.g., Join, Tick, Witness)" --> CoordinatorState

Data Provider

When you're training an AI model, you need data to train on! Psyche supports multiple kinds of data providers that will fetch & provide the data your model needs to train.

  • Local data provider: each client already has the training data downloaded locally.
  • HTTP data provider: each client requests individual pieces of data from a webserver as it is assigned that data for training.
  • TCP data provider: each client reaches out to a dedicated server over TCP and requests samples of data.

Overview

When a client starts a round of training, it is assigned an ID or a range of IDs by the coordinator, representing all the "batches" of data that will be used for that round. Each batch contains a specific subsection of the overall training data.

The size of a batch is always the same and can be configured in the run config. The assignment order is deterministic and distributed across the clients, so no piece of data is trained on more than once.

To understand how the data is partitioned for each client, refer to the following diagram:

flowchart TD
    C((Coordinator))
    C1[Client A]
    C2[Client B]
    C -- Assigned Batch IDs 1, 2, 3 --> C1
    C -- Assigned Batch IDs 4, 5, 6 --> C2
    subgraph Data Provider
        B1["Batch 1"]
        B2["Batch 2"]
        B3["Batch 3"]
        B4["Batch 4"]
        B5["Batch 5"]
        B6["Batch 6"]
        B4 ~~~ B1
        B5 ~~~ B2
        B6 ~~~ B3
    end
    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C2
    B5 --> C2
    B6 --> C2
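
To make the partitioning above concrete, the assignment can be thought of as slicing the round's batch IDs into contiguous, non-overlapping ranges, one per client. The helper below is purely illustrative (it is not Psyche's actual assignment code) and assumes each client receives an equal, contiguous share, with the last client taking any remainder.

// Purely illustrative: not Psyche's actual assignment logic.
// Splits the batch IDs of one round into contiguous, non-overlapping ranges,
// one per client, so no batch is assigned (or trained on) twice.
fn assign_batches(first_batch_id: u64, batches_in_round: u64, num_clients: u64) -> Vec<std::ops::Range<u64>> {
    let per_client = batches_in_round / num_clients;
    (0..num_clients)
        .map(|i| {
            let start = first_batch_id + i * per_client;
            let end = if i == num_clients - 1 {
                first_batch_id + batches_in_round // last client also picks up any remainder
            } else {
                start + per_client
            };
            start..end
        })
        .collect()
}

// With 6 batches and 2 clients, assign_batches(1, 6, 2) yields 1..4 and 4..7:
// Client A trains on batches 1-3 and Client B on batches 4-6, as in the diagram.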

Provider configuration

Inside the run config, the key [model.LLM.data_location] specifies whether the data will be hosted on a TCP server, accessed via HTTP, or stored in a local folder. We also support loading data from GCP as a variant of the HTTP data provider.

The required configuration depends on the data provider implementation being used (a rough sketch of the corresponding fields appears after this list):

  1. TCP Server:

    • If the data provider is configured as a TCP server, an additional file named data.toml is required.
    • This file contains the configuration required for the TCP server, including:
      • Data location
      • Token size
      • Sequence length
      • A seed to shuffle the data if necessary
    • Example data.toml files can be found in psyche/config within the various initial state examples.
  2. HTTP Provider:

    • For the HTTP data provider, no additional configuration file is needed.
    • The required fields for this setup include:
      • The URL (or a set of URLs) from which the data will be fetched - or, if you're loading data from GCP, a GCP bucket and an optional subdirectory.
      • Token size (in bytes)
      • A shuffle seed, if data shuffling is desired.
  3. Local Provider:

    • Simply point to the folder where the data should be loaded from.
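
As a rough visualization of the fields listed above, here is what the TCP and HTTP configurations might deserialize into on the Rust side. All field names are hypothetical and shown only for illustration; the authoritative schemas are the example data.toml files under psyche/config and the run config itself.

// Hypothetical field names, for illustration only; consult the example files in
// psyche/config and the run config for the real schema.
use serde::Deserialize;

#[derive(Deserialize)]
struct TcpDataServerConfig {
    data_location: String,      // where the token data lives
    token_size_in_bytes: u32,   // size of a single token
    sequence_length: u32,       // tokens per training sequence
    shuffle_seed: Option<u64>,  // seed to shuffle the data, if desired
}

#[derive(Deserialize)]
struct HttpDataConfig {
    urls: Vec<String>,          // URL(s) to fetch from, or a GCP bucket + optional subdirectory
    token_size_in_bytes: u32,
    shuffle_seed: Option<u64>,
}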

Model sharing

When an epoch starts, all clients must have an identical model to train with.

At the beginning of a run, all clients must download the model parameters, tokenizer configuration, and model configuration from HuggingFace, where the model must have been previously uploaded or updated.

Each client will then modify their copy of the model by receiving new training results from other clients and applying them. This keeps everyone's copy of the model identical within an epoch without an additional full synchronization step.

When a new client joins a run that has already progressed past its first epoch, it would not be correct for the client to download the original model from HuggingFace, as the model parameters would have already been updated during training. Instead, the new client must acquire a copy of the model from the peers who have been actively training it.

This synchronization process occurs during the Warmup phase, while the coordinator waits to begin the next Training phase.

To address this, we checkpoint the model at the end of an epoch, where clients save and share the entire model for new peers to join. There are two checkpointing variants: HuggingFace-based and P2P-based.

HuggingFace checkpoint

In this approach, a client or a set of clients can optionally run as checkpointers if they declare a checkpoint URL when joining the run. These clients upload their copy of the updated model to HuggingFace after each epoch, and send the URL for this checkpoint to the coordinator. When a new client joins the run, it retrieves the checkpoint URL from the coordinator, and connects to HuggingFace to download the latest copy of the model parameters and configuration files.

P2P checkpoint

In the peer-to-peer (P2P) approach, a new client synchronizes by obtaining the latest model directly from other peers. It receives the model information and parameters from any available peer, requesting a set of parameters for each layer from different clients. This process allows the client to assemble the latest model state and participate in the training without an explicit upload step to a central server occurring.
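
Conceptually, the joining client fans its requests out across the peers it knows about, one layer at a time, and stitches the responses back into a complete model. The outline below is only conceptual; the Peer trait and fetch_layer method are hypothetical placeholders, not Psyche's real networking API.

// Conceptual outline only: `Peer` and `fetch_layer` are hypothetical placeholders.
trait Peer {
    fn fetch_layer(&self, layer_index: usize) -> Vec<f32>;
}

// Request each layer from a different available peer (round-robin here) and
// assemble the latest model state layer by layer.
fn sync_model_from_peers(peers: &[Box<dyn Peer>], num_layers: usize) -> Vec<Vec<f32>> {
    (0..num_layers)
        .map(|layer| peers[layer % peers.len()].fetch_layer(layer))
        .collect()
}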

Here's an example of a P2P model sharing interaction:

flowchart TB
   C((Coordinator))
   C1[Client 1]
   C2[Client 2]
   C3[Client 3]
   C4[Joining Client]
   C --warmup---> C1
   C --warmup---> C2
   C --warmup---> C3
   C2 --Model config--> C4
   C4 -.Join.-> C
   C1 -.Layer 1 weights.-> C4
   C2 -.Layer 2 weights.-> C4
   C3 -.Layer 3 weights.-> C4

Training Rewards

When clients participate in a training run, the Coordinator keeps track of the compute contributions.

Each client is rewarded at the end of an epoch if it successfully completed the whole epoch. A pool of reward "points" is shared equally among all the finishing clients of a given epoch, and the reward is accounted through a counter of earned "points" for each client. The points can then be used as proof of contribution in reward mechanisms such as the Treasurer (see below).
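
As a minimal illustration of this accounting (not the Coordinator's actual code), the epoch's point pool is divided equally among the clients that finished the epoch, and each client's running counter is incremented:

// Minimal illustration of equal reward-point sharing; not the Coordinator's actual code.
use std::collections::HashMap;

fn distribute_epoch_points(
    earned_points: &mut HashMap<String, u64>, // running counter of points per client
    finishing_clients: &[String],             // clients that completed the whole epoch
    epoch_point_pool: u64,                    // pool of points for this epoch
) {
    if finishing_clients.is_empty() {
        return;
    }
    let per_client = epoch_point_pool / finishing_clients.len() as u64;
    for client in finishing_clients {
        *earned_points.entry(client.clone()).or_insert(0) += per_client;
    }
}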

Run Treasurer, Compute Incentives

A training run can be created through a Treasurer escrow smart contract. In this case, the run's authority will be the Treasurer smart contract itself.

An arbitrary token can then be distributed through the Treasurer's token holdings: every time a client earns a point on the run's coordinator, the Treasurer allows claiming a fixed amount of the reward token for that point.

The source code for the treasurer smart contract can be found here: https://github.com/PsycheFoundation/psyche/tree/main/architectures/decentralized/solana-treasurer.

Mining Pool, Pooling funds

Participating in a run can be expensive — a powerful GPU may be required to train a particular model. Users can pool resources together through a Mining Pool smart contract. The source code used can be found here: https://github.com/PsycheFoundation/psyche/tree/main/architectures/decentralized/solana-mining-pool.

Each user contributing to a Mining Pool delegates their funds so they can be used by the Mining Pool authority and owner to purchase compute power. The Mining Pool authority can then redistribute any tokens received as a result of the training equitably through the Mining Pool.

Psyche Glossary

ActiveStep The state machine phases a Client goes through during a training Round or Epoch, synchronized with the Coordinator's RunState. Includes Warmup, Training, Witness, and Cooldown.

AMD ROCm An alternative GPU compute platform to NVIDIA's CUDA. Support for ROCm is planned for Psyche clients in the future.

Authorizer A Solana program that issues authorizations to specific users.

Authorization A specific role (scope) assigned to a single user (grantee) by a specific authority (grantor). The grantee can then delegate authorization to other keys (the delegates) that can act on its behalf. In practice, this is useful for managing permissions to nodes in data center clusters easily.

Batch A subset of the training data processed by clients in a single step within a Round. Identified by a BatchId.

BatchId A unique identifier for a specific Batch of training data.

Bloom Filter A probabilistic data structure used for efficient set membership testing (e.g., checking if a client's commitment has been witnessed). Used in WitnessBloom. Has a small chance of false positives.

BLOOM_FALSE_RATE The target false positive rate (1% in this case) for the Bloom Filters used in the witness protocol.

Checkpoint A saved state of the LLM being trained. Psyche uses checkpoints to allow runs to be paused, resumed, or recovered after interruptions. Checkpoints can be stored in a central HubRepo or shared between clients via P2P.

Checkpointers Designated, trusted participants responsible for saving the model Checkpoint during the Cooldown phase.

Client The software participants run on their own hardware (typically with a GPU) to contribute to the distributed training process. Clients perform computations, submit results (Commitments), and participate in Witnessing.

ClientState The status of a Client as tracked by the Coordinator. Key states include Healthy, Dropped, Withdrawn, and Ejected.

Commitment A cryptographic hash (SHA-256) of a client's computational results for a given Batch. Submitting commitments allows the Coordinator and Witnesses to verify work was done without transferring the full results initially.

Committee The particular role of a client in a given round. Can be one of Trainer, Verifier or TieBreaker.

Cooldown A phase (RunState and ActiveStep) at the end of an Epoch where model Checkpoints are saved and the system prepares for the next epoch.

Coordinator The central orchestrator of the Psyche training system, implemented as a Solana program. It manages the training life cycle (RunState), client participation (ClientState), data batch assignment, and Witnessing.

CoordinatorConfig The set of parameters defining how a specific training run operates (e.g., warmup_time, witness_quorum, rounds_per_epoch).

CUDA NVIDIA's parallel computing platform and programming model, required for running the Psyche client on NVIDIA GPUs.

Data Provider Component responsible for supplying the training data in organized Batches.

Desync An error state (StepError::Desync) occurring when a Client's ActiveStep falls out of synchronization with the Coordinator's RunState.

Docker A platform used to build, ship, and run applications in Containers. Psyche uses Docker to distribute and run the client software.

Dropped A ClientState indicating a client has become unresponsive or disconnected unexpectedly.

Ejected A ClientState indicating a client has been forcibly removed from the training run, typically due to failing health checks or malicious behavior. Ejected clients may be subject to Slashing.

Epoch A major cycle in the training process, composed of multiple Rounds. An Epoch starts with the WaitingForMembers and Warmup phases and ends with a Cooldown phase.

Exited Clients A buffer on the Coordinator holding records of clients that have recently left the run (Dropped, Withdrawn, Ejected).

Finished A RunState indicating that the training run has completed its configured total_steps.

Garnix CI (Continuous Integration) service based on Nix, used by Psyche.

Health Check A verification procedure (health_check()) initiated by designated witness clients. Its purpose is to monitor peer clients and confirm they are actively processing their assigned training batches. When a witness client detects a peer that appears unresponsive or failing (unhealthy), it notifies the central coordinator. The coordinator independently verifies the status of the reported peer by running its own health check. If that check confirms the report, the peer is marked as unhealthy and is kicked.

Healthy The desired ClientState, indicating the client is connected, responsive, and participating correctly in the training process. Only Healthy clients typically receive Rewards.

HubRepo A centralized repository location (e.g., Hugging Face, S3 bucket) where the model Checkpoint can be stored, particularly when initializing or if P2P storage is unavailable.

Iroh A P2P library that Psyche uses for data-sharing between the clients.

Lightweight Hashing Using efficient hashing algorithms like SHA-256 for Commitments to allow for fast verification by the Coordinator and Witnesses.

Metal Apple's graphics and compute API. A future backend target for running the Psyche client on Mac hardware.

min_clients The minimum number of Healthy clients required for a training run to progress beyond the WaitingForMembers state.

Mining Pool A Solana program that implements a basic "mining" or lending pool mechanism where users (lenders) can deposit collateral into a pool to delegate funds to other participants with more compute power and eventually claim redeemable tokens proportionate to their share of the total deposited collateral.

NUM_STORED_ROUNDS A constant defining how many past rounds' states are kept in the Coordinator's history buffer (e.g., 4 rounds).

Nix Tool for declarative and reproducible builds used by Psyche.

Opportunistic Witnessing A feature that allows progressing early from the RoundTrain phase to the Witness phase, given that the witness quorum is reached.

Paused A RunState where the training process is temporarily stopped by manual intervention. Can be resumed later.

P2P Peer-to-Peer, meaning a client acts both as a client and as a server, sharing data with its peers. This is the intended way of data-sharing during a stable run.

Psyche Nous Research's set of systems that enable distributed training of transformer-based AI models over the internet.

Round A smaller cycle within an Epoch. Involves a training phase (RoundTrain) and a validation phase (RoundWitness).

RoundTrain The phase (RunState and ActiveStep) where clients download assigned data Batches, perform training computations (e.g., calculate gradients), and submit Commitments.

RoundWitness The phase (RunState and ActiveStep) where clients act as Witnesses to validate the Commitments submitted by other clients during RoundTrain. Requires a witness_quorum to succeed.

rounds_per_epoch A configuration parameter (CoordinatorConfig) specifying how many Rounds make up one Epoch.

RunState The overall state of the training run as managed by the Coordinator. Examples include Uninitialized, WaitingForMembers, Warmup, RoundTrain, RoundWitness, Cooldown, Paused, Finished.

SHA-256 The specific cryptographic hash function used to create Commitments in Psyche.

Solana The blockchain platform on which the Psyche Coordinator program runs.

StepError A category of errors related to the Client's ActiveStep progression, such as Desync.

tick() A function periodically called on the Coordinator program to drive the state machine transitions (advancing RunState based on time limits, client counts, and submitted results). Specific versions exist for different states (e.g., tick_waiting_for_members, tick_round_witness).

total_steps A configuration parameter defining the total number of training steps or batches the run aims to complete before entering the Finished state.

Training The ActiveStep where the client actively computes gradients or other training operations on its assigned data Batch.

Treasurer A Solana program that runs on top of Psyche's Coordinator, managing the distribution of rewards to the clients and keeping track of the points earned by each client in the training process.

Uninitialized The default starting RunState of the Coordinator before a training run is configured and started.

WaitingForMembers The RunState where the Coordinator waits for the minimum number of clients (min_clients) to connect and become Healthy before starting the training process.

Warmup The initial phase (RunState and ActiveStep) of a training run where clients download the model Checkpoint and initialize their training environment.

Witness A Client selected to validate other clients' work.

WitnessBloom The specific Bloom Filter used on the Coordinator to track which client Commitments have been successfully witnessed.

Witness Quorum The minimum number of clients that must successfully act as Witnesses and agree on the validity of results for a Round to be considered successful.

Withdrawn A ClientState indicating that a client has exited the run.

Psyche Development

As the Psyche project is large & complex, we'll walk you through some of the processes we use in development.

Setup & Useful Commands

Installation and Setup

Psyche uses Nix + flakes to install every dependency and development tool needed to build, run, and develop Psyche. This is the preferred way of working on Psyche, as it guarantees a consistent development and build process regardless of your machine's specific configuration.

If you can't / don't want to use Nix, it's also possible to manually install all the required deps for Psyche.

Linux & macOS via Nix

Installing Nix

To install Nix, simply run the ./setup-nix.sh script. This will install Nix and configure it appropriately.

Binary cache

If you already have Nix installed, or are installing it manually, we recommend enabling the binary cache from garnix, our CI provider, to speed up your builds and your local dev shell.

In order to use the cache that garnix provides, change your nix.conf, adding https://cache.garnix.io to substituters, and cache.garnix.io:CTFPyKSLcx5RMJKfLo5EEPUObbA78b0YQ2DTCJXqr9g= to trusted-public-keys.

If you've just installed Nix via the Determinate Systems installer above, you can do this by adding these lines to /etc/nix/nix.conf:

extra-substituters = https://cache.garnix.io
extra-trusted-public-keys = cache.garnix.io:CTFPyKSLcx5RMJKfLo5EEPUObbA78b0YQ2DTCJXqr9g=

Setup

Each time you open a new shell in the Psyche directory, run nix develop to enter the Psyche development shell with all the necessary dependencies.

Setup Using direnv

You can optionally use direnv to automatically enter a Nix environment when you cd into the Psyche folder.

  1. Install direnv from your system's package manager:
  • sudo apt install direnv on Debian-based systems
  • brew install direnv on macOS
  2. Run nix profile install nixpkgs#nix-direnv to install the direnv plugin in Nix.
  3. Run echo "source ~/.nix-profile/share/nix-direnv/direnvrc" > ~/.direnvrc to enable the plugin.
  4. Add the eval "$(direnv hook bash)" line to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc).
  5. Run direnv allow in the Psyche directory once; your terminal will then automatically enter a development shell whenever you cd into the Psyche directory.

Platform Differences

  • Linux: Uses CUDA for NVIDIA GPUs. Full distributed training support with NCCL.
  • macOS: Uses Metal Performance Shaders for Apple Silicon GPUs. Single-GPU only (parallelism features disabled).

Ubuntu

The following instructions are needed for a server with a fresh Ubuntu installation.

1. Install drivers (if not already installed)

sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install

2. Create and enter a Python virtual env

sudo apt install -y python3-pip python3-venv
python3 -m venv .venv
source .venv/bin/activate

3. Install Torch 2.7.0 CUDA 12.8

pip3 install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128

4. Libtorch environment variables

Add the following section to .cargo/config.toml. Adjust LD_LIBRARY_PATH for your <repo_directory> and specific version of Python (3.10 shown here). NOTE: Don't commit these changes!

[env]
LIBTORCH_USE_PYTORCH = "1"
LD_LIBRARY_PATH = "<repo_directory>/.venv/lib/python3.10/site-packages/torch/lib"

5. Download & install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

6. (optional) Install just

sudo snap install just --edge --classic

7. (optional) Install Solana and Anchor

Install Solana

sh -c "$(curl -sSfL https://release.anza.xyz/beta/install)"

After installation, follow the instructions to add the Solana tools to PATH.

Install Anchor

cargo install --git https://github.com/coral-xyz/anchor --rev a7a23eea308440a9fa9cb79cee7bddd30ab163d5 anchor-cli

This may require

sudo apt install pkg-config libudev-dev libssl-dev libfontconfig-dev

Windows (outdated)

  1. Install CUDA libraries: https://developer.nvidia.com/cuda-12-4-1-download-archive?target_os=Windows&target_arch=x86_64&target_version=11

  2. Download libtorch & extract: https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu124.zip

  3. Download OpenSSL: https://slproweb.com/download/Win64OpenSSL-3_3_3.exe

  4. Install Perl: https://github.com/StrawberryPerl/Perl-Dist-Strawberry/releases/download/SP_53822_64bit/strawberry-perl-5.38.2.2-64bit.msi

  5. Create a .cargo/config.toml file to set environment variables

NOTE: Building may take several minutes the first time as openssl-sys takes a long time (for some reason)

[env]
LIBTORCH = "<path_to_libtorch>"
OPENSSL_LIB_DIR = "<path_to_openssl>/lib/VC/x64/MT"
OPENSSL_INCLUDE_DIR = "<path_to_openssl>/include"

Useful commands

Psyche uses just to run some common tasks. You can run just to see the whole list of available commands!

Running checks

just check-client

This runs the psyche-solana-client package with the --help flag. If you see a list of commands, it means the environment compiles and can run the basic commands.

Formatting

requires Nix!

nix fmt

Format all the project files.

Running Psyche on-chain

To build the Solana programs, you’ll need a handful of Solana tools installed. See the setup if you’re not using Nix. If you’re using Nix, make sure you are in the development environment by running nix develop.

To start, you’ll need to create a Solana wallet to fund your transactions.

solana-keygen new

By default, the keypair will be generated in ~/.config/solana/id.json.

Run on a local validator (localnet)

To quickly test decentralized training, you can spin up a Solana validator locally and fund your Solana wallet with fake tokens to make transactions. To set up a new training run with this tool, in a new terminal run the following command:

just dev setup-solana-localnet-test-run run_id=<RUN_ID>

This will:

  • Set up a solana-test-validator
  • Deploy all the required programs (Coordinator and Authorizer)
  • Create a local run with the name <RUN_ID>. If no run name is provided, the name test will be used by default. The run ID should not exceed 32 characters; it will be truncated if it exceeds this limit.

Then, in another terminal, run a client to train the test model and join the run with the name <RUN_ID>.

just dev start-training-localnet-client run_id=<RUN_ID>

This will start a run to train a 1.1B parameter model with all the parallelism features enabled. This Psyche client will use a temporary private key, which will be generated and deleted automatically when running the command above. If you want to inspect these keys, they will be stored in ~/.config/solana/solana-keys. To run it with a specific private key, you can run the same command while adding the WALLET_FILE environment variable:

WALLET_FILE=/path/to/wallet.json just dev start-training-localnet-client run_id=<RUN_ID>

For a more lightweight run to avoid OOM errors, or just to use less of your hardware (we see you, 8 GB VRAM cards!), there’s also:

just dev setup-solana-localnet-light-test-run
just dev start-training-localnet-light-client

This will train a 12M parameter model, which should fit on most GPUs.

To spin up another client and join the run, you can run the same command as before:

just dev start-training-localnet-client run_id=<RUN_ID>

or

just dev start-training-localnet-light-client run_id=<RUN_ID>

This will create a new temporary Solana keypair in ~/.config/solana/solana-keys, which will be removed when the client is stopped, so you can spawn as many clients as you want.

Run on Solana Devnet

You’ll need to fund your wallet to make transactions on Devnet. You can request an airdrop from the Solana Foundation of up to 10 devnet SOL every 8 hours. To get your public key, run:

solana-keygen pubkey <PATH_TO_KEYPAIR>

If no path to a keypair is provided, it will use the default keypair located at ~/.config/solana/id.json. Paste the resulting key into the airdrop website to receive tokens.

You can then follow the same steps for deploying the programs, creating a run, and training as on localnet, but using the following just commands:

just dev setup-solana-devnet-test-run
just dev start-training-devnet-client

along with the -light variants:

just dev setup-solana-devnet-light-test-run
just dev start-training-devnet-light-client

Remember to set the WALLET_FILE environment variable to the path of your Solana keypair file when running the training commands, since this will be the wallet holding the devnet funds.

These commands work almost the same as the localnet ones, but they use the public Solana Devnet RPC endpoint (https://api.devnet.solana.com). Also, for all programs (Coordinator, Authorizer, and Treasurer), we need to generate new program IDs—basically the “addresses” where the contracts will be deployed—since the current IDs are used by the Psyche team for development and can’t be overridden. More details on how to update program IDs can be found in the changing contracts section.

Creating a permissioned run

All the commands and setups above use a permissionless run by default. In most testing cases, this is fine, since you can join any number of clients without restrictions. If a permissioned run is needed, you’ll have to take a few extra steps.

You have the same variants as before, but with the permissioned option enabled. These commands won’t create the permissionless authorization and will instead allow you to create the required authorizations manually. The commands are:

# Localnet
just dev setup-solana-localnet-permissioned-test-run
just dev setup-solana-localnet-permissioned-light-test-run
just dev setup-solana-localnet-permissioned-test-run-treasurer
just dev setup-solana-localnet-permissioned-light-test-run-treasurer

# Devnet
just dev setup-solana-devnet-permissioned-test-run
just dev setup-solana-devnet-permissioned-light-test-run
just dev setup-solana-devnet-permissioned-test-run-treasurer
just dev setup-solana-devnet-permissioned-light-test-run-treasurer

depending on your needs.

You can then create an authorization manually by specifying who grants the authorization (the run owner) and who receives it. Run:

cargo run --release --bin psyche-solana-client -- \
    join-authorization-create \
    --rpc [RPC] \
    --wallet-private-key-path [JOIN_AUTHORITY_KEYPAIR_FILE] \
    --authorizer [USER_MASTER_PUBKEY]

Here, the --wallet-private-key-path is the path to the Solana KeyPair that will handle authorization to join and the --authorizer is the pubkey of the account that will receive the authorization. To get the pubkey of a KeyPair file you can use the solana-keygen pubkey <FILE> command.

You can then join any authorized client by running the training commands described above, adding the authorized key as an environment variable, for example:

AUTHORIZER=<GRANTEE_PUBKEY> just dev start-training-localnet-light-client

Running a run with rewards

There’s another program that adds a new layer to the Psyche run called the Treasurer. When this program is deployed, it adds a rewards layer on top of the Coordinator, calculating how much of a specific token each client receives for their training time. This contract isn’t required to test a run, but it adds reward functionality if you want to test it. You can find a more in-depth explanation in the rewards section.

To test this, all the commands mentioned above also have variants that include the Treasurer, such as:

# Localnet
just dev setup-solana-localnet-test-run-treasurer
just dev setup-solana-localnet-light-test-run-treasurer

# Devnet
just dev setup-solana-devnet-test-run-treasurer
just dev setup-solana-devnet-light-test-run-treasurer

These commands deploy the Treasurer alongside the other contracts, create a new test token using the SPL Token tool on the selected network, and top up the run with rewards and collateral for clients that train for more than one epoch.

All these commands also have permissioned variants.

Recovering dev tokens

Most devnet tokens are used to deploy the various contracts. You can reclaim these tokens once you’ve finished testing, which is useful since the Solana devnet faucet is limited. Run:

just dev close-dev-programs

This will close all deployed accounts on devnet and return the tokens to the wallet used for deployment. Be aware that this is an irreversible action: once a program is closed, you can’t reuse the same program ID and must generate a new one.

Psyche decentralized client reference

All the commands above use the same psyche-solana-client package with specific parameters for quick local testing, but it supports many different configurations to test various scenarios.

Here’s a summary of the available commands and options:

Command-line options # Command-Line Help for `psyche-solana-client`

This document contains the help content for the psyche-solana-client command-line program.

Command Overview:

psyche-solana-client

Usage: psyche-solana-client <COMMAND>

Subcommands:
  • show-static-p2p-identity
  • create-static-p2p-identity
  • train
  • predownload

psyche-solana-client show-static-p2p-identity

Usage: psyche-solana-client show-static-p2p-identity [IDENTITY_SECRET_KEY_PATH]

Arguments:
  • <IDENTITY_SECRET_KEY_PATH>

psyche-solana-client create-static-p2p-identity

Usage: psyche-solana-client create-static-p2p-identity <SAVE_PATH>

Arguments:
  • <SAVE_PATH>

psyche-solana-client train

Usage: psyche-solana-client train [OPTIONS] --run-id <RUN_ID>

Options:
  • --rpc <RPC>

    Default value: http://127.0.0.1:8899

  • --ws-rpc <WS_RPC>

    Default value: ws://127.0.0.1:8900

  • -w, --wallet-private-key-path <WALLET_PRIVATE_KEY_PATH>

  • -i, --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the clients secret key. Create a new random one running openssl rand 32 > secret.key. If not provided a random one will be generated

  • --bind-p2p-port <BIND_P2P_PORT> — Sets the port for the client's P2P network participation. If not provided, a random port will be chosen

  • --bind-p2p-interface <BIND_P2P_INTERFACE> — Sets the network interface for the client's P2P network participation. If not provided, will bind to all interfaces

  • --iroh-relay <IROH_RELAY> — What relays to use - public n0 or the private Psyche ones

    Default value: psyche

  • --iroh-discovery <IROH_DISCOVERY> — What discovery to use - public n0 or local

    Default value: n0

  • --logs <LOGS> — Sets clients logs interface tui: Enables a terminal-based graphical interface for monitoring analytics. console: standard logs json: standard logs with json format

    Default value: tui

    Possible values: tui, console, json, none

  • --oltp-auth-header <OLTP_AUTH_HEADER> — An auth header string for an opentelemetry endpoint. Used for both logging and metrics

  • --oltp-metrics-url <OLTP_METRICS_URL> — A URL for sending opentelemetry metrics. probably ends in /v1/metrics

  • --oltp-tracing-url <OLTP_TRACING_URL> — A URL for sending opentelemetry traces. probably ends in /v1/traces

  • --oltp-logs-url <OLTP_LOGS_URL> — A URL for sending opentelemetry logs. probably ends in /v1/logs

  • --oltp-report-interval <OLTP_REPORT_INTERVAL> — how often to report metrics thru opentelemetry

    Default value: 60.0

  • --metrics-local-port <METRICS_LOCAL_PORT> — If present, output some metrics & stats via this TCP port in JSON format. Useful for debugging or local integration

  • --run-id <RUN_ID> — A unique identifier for the training run. This ID allows the client to join a specific active run

  • --data-parallelism <DATA_PARALLELISM>

    Default value: 1

  • --tensor-parallelism <TENSOR_PARALLELISM>

    Default value: 1

  • --micro-batch-size <MICRO_BATCH_SIZE>

    Default value: 1

  • --write-gradients-dir <WRITE_GRADIENTS_DIR> — If provided, every shared gradient this client sees will be written to this directory

  • --eval-tasks <EVAL_TASKS>

  • --eval-seed <EVAL_SEED>

    Default value: 42

  • --eval-task-max-docs <EVAL_TASK_MAX_DOCS>

  • --prompt-task

  • --checkpoint-dir <CHECKPOINT_DIR> — If provided, every model parameters update will be save in this directory after each epoch

  • --hub-repo <HUB_REPO> — Path to the Hugging Face repository containing model data and configuration

  • --hub-max-concurrent-downloads <HUB_MAX_CONCURRENT_DOWNLOADS>

    Default value: 3

  • --wandb-project <WANDB_PROJECT>

  • --wandb-run <WANDB_RUN>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --write-log <WRITE_LOG>

  • --optim-stats-steps <OPTIM_STATS_STEPS>

  • --grad-accum-in-fp32

    Default value: false

  • --dummy-training-delay-secs <DUMMY_TRAINING_DELAY_SECS>

  • --max-concurrent-parameter-requests <MAX_CONCURRENT_PARAMETER_REQUESTS>

    Default value: 4

  • --max-concurrent-downloads <MAX_CONCURRENT_DOWNLOADS>

    Default value: 4

  • --device <DEVICE> — Device(s) to use: auto, cpu, mps, cuda, cuda:N, cuda:X,Y,Z

    Default value: auto

  • --sidecar-port <SIDECAR_PORT>

  • --delete-old-steps

    Default value: true

  • --keep-steps <KEEP_STEPS>

    Default value: 3

  • --rpc-2 <RPC_2>

    Default value: ``

  • --ws-rpc-2 <WS_RPC_2>

    Default value: ``

  • --rpc-3 <RPC_3>

    Default value: ``

  • --ws-rpc-3 <WS_RPC_3>

    Default value: ``

  • --authorizer <AUTHORIZER>

psyche-solana-client predownload

Usage: psyche-solana-client predownload [OPTIONS] --run-id <RUN_ID>

Options:
  • --rpc <RPC>

    Default value: http://127.0.0.1:8899

  • --ws-rpc <WS_RPC>

    Default value: ws://127.0.0.1:8900

  • -r, --run-id <RUN_ID>

  • --model

  • --eval-tasks <EVAL_TASKS>

  • --hub-max-concurrent-downloads <HUB_MAX_CONCURRENT_DOWNLOADS>

    Default value: 3


This document was generated automatically by clap-markdown.

Changing contracts

Psyche uses two main accounts deployed to Solana—the Coordinator and the Authorizer—and one optional account, the Treasurer. If you’re developing changes that modify the on-chain account layout, deploying an updated Coordinator program will likely break existing runs that already have coordinator accounts instantiated.

Because of this, changes to on-chain data structures require deploying a new Coordinator program under a new Program ID to avoid breaking existing runs.

To do this yourself, you’ll need to generate a new Program ID (and keypair).

To deploy a program to devnet or localnet with a new program keypair, regenerate its devnet/localnet keypair file (which is checked into the repo).

For the Solana Coordinator, run:

solana-keygen new -o architectures/decentralized/solana-coordinator/target/deploy/psyche_solana_coordinator-keypair.json -f

You can view the newly generated program ID with:

solana-keygen pubkey architectures/decentralized/solana-coordinator/target/deploy/psyche_solana_coordinator-keypair.json

Make sure to update the declare_id value with the new key before deploying the updated contracts, either manually or using anchor keys sync in the appropriate project folder.
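
For reference, the declare_id value sits at the top of each program's lib.rs. After generating a new keypair, it should point at the newly generated pubkey; a hedged example follows (the pubkey below is only a placeholder, replace it with the output of the solana-keygen pubkey command above):

// In the coordinator program's lib.rs (placeholder pubkey shown; replace it with
// the output of `solana-keygen pubkey .../psyche_solana_coordinator-keypair.json`).
use anchor_lang::prelude::*;

declare_id!("11111111111111111111111111111111");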

If you want to push these changes to the repo, you’ll need to use git add -f, since these files are normally .gitignored.

Running Psyche offchain

When developing for Psyche, you might not want to spin up all the Solana infrastructure if you're working on a feature like the distributed networking or the training code.

To that end, we maintain "centralized" client & server packages that simply communicate over TCP instead of dealing with code deployed to a Solana network.

There's a server package and a client package. To develop with them, you'd spin up one server with whatever run config you want, then connect one or more clients to it.

Local Testnet

The local testnet is a helper application designed to easily spin up a Server and multiple clients. It's useful for doing sample runs on your own hardware, and for development.

Pre-requisites

Since we want to run many clients plus the server, we'll need several terminal windows to monitor them. The tool uses tmux to create them.

If you're using the Nix devShell, tmux is already included.

Running

Since the local-testnet example uses a local server to provide the data for the clients to train on, you'll need to download the data first. The best way to do this is to install the HuggingFace CLI tool by running curl -LsSf https://hf.co/cli/install.sh | bash. Once it's installed, run the following command to get some sample data and place it in the correct location for the local server to use:

hf download emozilla/fineweb-10bt-tokenized-datatrove-llama2 --repo-type dataset --local-dir ./data/fineweb-10bt

A sample invocation that fires up 3 clients to train on a 20m model might look like this:

just local-testnet \
    --num-clients 3 \
    --config-path ./config/consilience-match-llama2-20m-fineweb-pretrain-dev/

This will run a server locally that acts as the coordinator, plus 3 clients that connect to the server and start training on the downloaded data. We'll talk about the configuration of the run later on, but this example uses the config located at ./config/consilience-match-llama2-20m-fineweb-pretrain-dev/state.toml, where you can get a glimpse of the configuration options.

There are a lot of options to configure the local testnet. Check them out below to configure runs as you see fit:

Command-line options # Command-Line Help for `psyche-centralized-local-testnet`

This document contains the help content for the psyche-centralized-local-testnet command-line program.

Command Overview:

psyche-centralized-local-testnet

Usage: psyche-centralized-local-testnet <COMMAND>

Subcommands:
  • start — Starts the local-testnet running each part of the system in a separate terminal pane

psyche-centralized-local-testnet start

Starts the local-testnet running each part of the system in a separate terminal pane

Usage: psyche-centralized-local-testnet start [OPTIONS] --num-clients <NUM_CLIENTS> --config-path <CONFIG_PATH>

Options:
  • --num-clients <NUM_CLIENTS> — Number of clients to start

  • --config-path <CONFIG_PATH> — File path to the configuration that the coordinator will need to start

  • --write-distro-data <WRITE_DISTRO_DATA> — If provided, write DisTrO data to disk in this path

  • --server-port <SERVER_PORT> — Port where the server for this testnet will be listen it to (this is the one that clients must use when connecting)

    Default value: 20000

  • --tui <TUI> — Enables a terminal-based graphical interface for monitoring analytics

    Default value: true

    Possible values: true, false

  • --random-kill-num <RANDOM_KILL_NUM> — Kill N clients randomly every <RANDOM_KILL_INTERVAL> seconds

  • --allowed-to-kill <ALLOWED_TO_KILL> — Which clients we're allowed to kill randomly

  • --random-kill-interval <RANDOM_KILL_INTERVAL> — Kill <RANDOM_KILL_NUM> clients randomly every N seconds

    Default value: 120

  • --log <LOG> — Sets the level of the logging for more granular information

    Default value: warn,psyche=debug

  • --first-client-checkpoint <FIRST_CLIENT_CHECKPOINT> — HF repo where the first client could get the model and the configuration to use

  • --hf-token <HF_TOKEN>

  • --write-log

    Default value: false

  • --wandb-project <WANDB_PROJECT>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --optim-stats <OPTIM_STATS>

  • --eval-tasks <EVAL_TASKS>


This document was generated automatically by clap-markdown.

Server & Client

Both of these applications can be spun up individually at your discretion instead of using the local testnet. We include all their command-line options for your reading pleasure:

Client # Command-Line Help for `psyche-centralized-client`

This document contains the help content for the psyche-centralized-client command-line program.

Command Overview:

psyche-centralized-client

Usage: psyche-centralized-client <COMMAND>

Subcommands:
  • show-identity — Displays the client's unique identifier, used to participate in training runs
  • train — Allows the client to join a training run and contribute to the model's training process

psyche-centralized-client show-identity

Displays the client's unique identifier, used to participate in training runs

Usage: psyche-centralized-client show-identity [OPTIONS]

Options:
  • --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the clients secret key. Create a new random one running openssl rand 32 > secret.key or use the RAW_IDENTITY_SECRET_KEY environment variable

psyche-centralized-client train

Allows the client to join a training run and contribute to the model's training process

Usage: psyche-centralized-client train [OPTIONS] --run-id <RUN_ID> --server-addr <SERVER_ADDR>

Options:
  • -i, --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the clients secret key. Create a new random one running openssl rand 32 > secret.key. If not provided a random one will be generated

  • --bind-p2p-port <BIND_P2P_PORT> — Sets the port for the client's P2P network participation. If not provided, a random port will be chosen

  • --bind-p2p-interface <BIND_P2P_INTERFACE> — Sets the network interface for the client's P2P network participation. If not provided, will bind to all interfaces

  • --iroh-relay <IROH_RELAY> — What relays to use - public n0 or the private Psyche ones

    Default value: psyche

  • --iroh-discovery <IROH_DISCOVERY> — What discovery to use - public n0 or local

    Default value: n0

  • --logs <LOGS> — Sets clients logs interface tui: Enables a terminal-based graphical interface for monitoring analytics. console: standard logs json: standard logs with json format

    Default value: tui

    Possible values: tui, console, json, none

  • --oltp-auth-header <OLTP_AUTH_HEADER> — An auth header string for an opentelemetry endpoint. Used for both logging and metrics

  • --oltp-metrics-url <OLTP_METRICS_URL> — A URL for sending opentelemetry metrics. probably ends in /v1/metrics

  • --oltp-tracing-url <OLTP_TRACING_URL> — A URL for sending opentelemetry traces. probably ends in /v1/traces

  • --oltp-logs-url <OLTP_LOGS_URL> — A URL for sending opentelemetry logs. probably ends in /v1/logs

  • --oltp-report-interval <OLTP_REPORT_INTERVAL> — how often to report metrics thru opentelemetry

    Default value: 60.0

  • --metrics-local-port <METRICS_LOCAL_PORT> — If present, output some metrics & stats via this TCP port in JSON format. Useful for debugging or local integration

  • --run-id <RUN_ID> — A unique identifier for the training run. This ID allows the client to join a specific active run

  • --data-parallelism <DATA_PARALLELISM>

    Default value: 1

  • --tensor-parallelism <TENSOR_PARALLELISM>

    Default value: 1

  • --micro-batch-size <MICRO_BATCH_SIZE>

    Default value: 1

  • --write-gradients-dir <WRITE_GRADIENTS_DIR> — If provided, every shared gradient this client sees will be written to this directory

  • --eval-tasks <EVAL_TASKS>

  • --eval-seed <EVAL_SEED>

    Default value: 42

  • --eval-task-max-docs <EVAL_TASK_MAX_DOCS>

  • --prompt-task

  • --checkpoint-dir <CHECKPOINT_DIR> — If provided, every model parameters update will be save in this directory after each epoch

  • --hub-repo <HUB_REPO> — Path to the Hugging Face repository containing model data and configuration

  • --hub-max-concurrent-downloads <HUB_MAX_CONCURRENT_DOWNLOADS>

    Default value: 3

  • --wandb-project <WANDB_PROJECT>

  • --wandb-run <WANDB_RUN>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --write-log <WRITE_LOG>

  • --optim-stats-steps <OPTIM_STATS_STEPS>

  • --grad-accum-in-fp32

    Default value: false

  • --dummy-training-delay-secs <DUMMY_TRAINING_DELAY_SECS>

  • --max-concurrent-parameter-requests <MAX_CONCURRENT_PARAMETER_REQUESTS>

    Default value: 4

  • --max-concurrent-downloads <MAX_CONCURRENT_DOWNLOADS>

    Default value: 4

  • --device <DEVICE> — Device(s) to use: auto, cpu, mps, cuda, cuda:N, cuda:X,Y,Z

    Default value: auto

  • --sidecar-port <SIDECAR_PORT>

  • --delete-old-steps

    Default value: true

  • --keep-steps <KEEP_STEPS>

    Default value: 3

  • --server-addr <SERVER_ADDR>


This document was generated automatically by clap-markdown.

Server # Command-Line Help for `psyche-centralized-server`

This document contains the help content for the psyche-centralized-server command-line program.

Command Overview:

psyche-centralized-server

Usage: psyche-centralized-server <COMMAND>

Subcommands:
  • validate-config — Checks that the configuration declared in the state.toml file is valid
  • run — Starts the server and launches the coordinator with the declared configuration

psyche-centralized-server validate-config

Checks that the configuration declared in the state.toml file is valid

Usage: psyche-centralized-server validate-config [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to the state.toml file to validate
  • --data-config <DATA_CONFIG> — Path to data.toml file to validate. If no provided then it will not be checked

psyche-centralized-server run

Starts the server and launches the coordinator with the declared configuration

Usage: psyche-centralized-server run [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to TOML of Coordinator state

  • -s, --server-port <SERVER_PORT> — Port for the server, which clients will use to connect. if not specified, a random free port will be chosen

  • --tui <TUI>

    Default value: true

    Possible values: true, false

  • --data-config <DATA_CONFIG> — Path to TOML of data server config

  • --save-state-dir <SAVE_STATE_DIR> — Path to save the server and coordinator state

  • --init-warmup-time <INIT_WARMUP_TIME> — Sets the warmup time for the run. This overrides the warmup_time declared in the state file

  • --withdraw-on-disconnect <WITHDRAW_ON_DISCONNECT> — Automatically withdraw clients that disconnect from the server

    Default value: true

    Possible values: true, false

  • --oltp-auth-header <OLTP_AUTH_HEADER> — An auth header string for an opentelemetry endpoint. Used for both logging and metrics

  • --oltp-metrics-url <OLTP_METRICS_URL> — A URL for sending opentelemetry metrics. probably ends in /v1/metrics

  • --oltp-tracing-url <OLTP_TRACING_URL> — A URL for sending opentelemetry traces. probably ends in /v1/traces

  • --oltp-logs-url <OLTP_LOGS_URL> — A URL for sending opentelemetry logs. probably ends in /v1/logs


This document was generated automatically by clap-markdown.

Implementing models

This codebase includes a set of sample programs that let you design, implement, and test model architectures without spinning up the whole Psyche p2p training architecture.

We currently only implement Llama and Deepseek (see shared/modeling/src/models/), but PRs are very welcome to add more architectures and model types.

The train example, documented below, is useful to test how your model trains using AdamW vs DisTrO.

Running

cargo run --example train -- --help

You'll need a pre-tokenized dataset downloaded to your disk for training.

A PR is welcome to add an option to the trainer to use the HTTP data provider! You can refer to the http example in the data-provider crate for a sample implementation.

For a Llama 2 model, a pre-tokenized dataset to test with is available at https://huggingface.co/datasets/emozilla/fineweb-10bt-tokenized-datatrove-llama2/. Psyche only needs the .ds files, and will load any/all .ds files in the specified folder - you can download just one for smaller tests.

If you've downloaded part or all of the above dataset into a folder data/fineweb-10bt inside the Psyche repo, you can start a simple training run on a 20m parameter Llama 2 model:

cargo run --example train -- \
    --model emozilla/llama2-20m-init \
    --data-path ./data/fineweb-10bt/ \
    --total-batch 2 \
    --micro-batch 1

Adding a new model type

The train example currently assumes your model is a Llama or Deepseek v2/v3 model, and instantiates it via (LlamaForCausalLM|DeepseekForCausalLM)::from_pretrained.

We currently only support causal language models - to implement a new one, you can create a file similar to llama_for_causal_lm and implement your model, ensuring you provide a trait impl for CausalLM.

There's alpha-level support for models written in Python. See the Python docs for more information.

You might also need to modify the data provider if your data is structured differently. Since you're implementing the forward pass yourself, you can serve and interpret data passed from the data provider however you need. The data provider currently only supports reading fixed-size batches from input files, so data batches with different sizes will require some additional work.

PRs welcome for any new kinds of dataset loading!

Python Integration

[!WARNING] Python support is still under development and not production-ready. The APIs used to write it are not documented because they are still subject to large amounts of change.

Overview

Psyche provides a Python integration that allows you to write modeling code in Python using libraries like Hugging Face Transformers while leveraging Psyche's Rust core for training orchestration. This integration is designed for research where you want the flexibility of Python modeling with Psyche's training infrastructure, and production-scale training where you want to take advantage of highly optimized training frameworks already built in Python.

The Python integration works through a "sidecar" process that Psyche spawns and communicates with during training.

Development Setup

To develop with the Python integration, we have a Nix development shell with Python available.

This shell provides:

  • The psyche Python module (built from Rust using PyO3)
  • PyTorch
  • Transformers library
  • Other required Python dependencies via pyproject.toml / uv.lock

Development Workflow

You can use uv pip to install arbitrary packages. Dependencies are tracked via uv.lock, so if you don't have direnv set up, you must exit and re-enter the development shell with nix develop .#dev-python.

When you enter the dev shell, it compiles the Rust extension that provides the psyche Python module. If you modify any Rust code in the Python extension or its dependencies, you must exit and re-enter the dev shell to recompile the extension.

We recommend running commands directly through the dev shell without entering it, which will recompile the extension as needed.

For example, to run the train program using python:

nix develop .#dev-python --command just train-model-python \
  --model emozilla/llama2-20m-init \
  --data-path ./data/fineweb-10bt/ \
  --total-batch 2 \
  --micro-batch 1 \
  --python

Alternatively, you could enter the shell and run the commands with:

nix develop .#dev-python

but this is likely to be a footgun as it's easy to forget to exit and re-enter the shell.

Architecture

The Python integration uses a sidecar architecture:

  1. Psyche Core (Rust): Handles data loading, distributed training coordination, and spawns Python processes
  2. Python Sidecar: Runs the modeling code using PyTorch and Transformers or any other Python code.

When you use the --python flag, Psyche automatically spawns Python sidecar processes using:

python -m psyche.sidecar --parent-pid <pid> --backend <backend> --init-method <method> --world-size <size> --rank <rank>

By default, only one sidecar using one GPU will be spawned; the number changes depending on two arguments, --data-parallelism and --tensor-parallelism. The former spawns one entire copy of the model per GPU, while the latter splits the model across multiple GPUs. The number of sidecars spawned is the product of these two arguments, so you will need tensor_parallelism * data_parallelism GPUs to run them. For example, --data-parallelism 2 --tensor-parallelism 2 spawns four sidecars and requires four GPUs.

Here's an overview of the different options that the psyche-sidecar provides in case you want to test sidecars with different configurations.

Command-line options # Command-Line Help for `psyche-sidecar`

This document contains the help content for the psyche-sidecar command-line program.

Command Overview:

psyche-sidecar

Multi-node sidecar for Psyche distributed training

Usage: psyche-sidecar <COMMAND>

Subcommands:
  • python
  • rust — Run Rust sidecar process (TODO: implement)

psyche-sidecar python

Usage: psyche-sidecar python [OPTIONS] --main-host <MAIN_HOST> --world-size <WORLD_SIZE> --start-rank <START_RANK>

Options:
  • --main-host <MAIN_HOST> — Address of the main node

  • --port <PORT> — Port for coordination

    Default value: 34567

  • --world-size <WORLD_SIZE> — World size for distributed training

  • --start-rank <START_RANK> — Start rank for distributed training

  • --start-device <START_DEVICE>

  • --num-local-ranks <NUM_LOCAL_RANKS>

  • --backend <BACKEND> — Backend for torch.distributed (default: nccl)

    Default value: nccl

psyche-sidecar rust

Run Rust sidecar process (TODO: implement)

Usage: psyche-sidecar rust


This document was generated automatically by clap-markdown.

Testing Your Changes

To test modifications to the Python integration:

  1. Modify the sidecar code in the Python extension
  2. Run the training example with the same just train-model-python command we outlined earlier.

How It Works

  1. Initialization: Psyche spawns Python sidecar processes for each rank
  2. Model Creation: The sidecar receives model architecture and source information via the distributed store
  3. Training Loop: Psyche coordinates training by sending operations (train, optimize, extract) to the sidecar
  4. Data Flow: Training data is broadcast to all processes, and gradients/parameters are communicated back through PyTorch's distributed primitives

The sidecar handles three main operations:

  • Train: Forward/backward pass with gradient accumulation
  • Optimize: Apply DisTrO results to the model being trained
  • Extract: Model state extraction for checkpointing

This architecture allows you to write complex modeling code in Python while integrating with Psyche's distributed training network.

Secrets

We manage secrets in our repo using agenix. These secrets are keyed to specific developers via SSH public keys. Some are used for deployments, and some can be used for development.

You can read more about agenix and how secrets are used in our deployment HERE

What secrets do we store?

{{#include ../../../secrets.nix}}

Editing a secret

You must have your pubkey listed in secrets.nix for a secret if you want to modify it!

Ask someone whose key is in secrets.nix to add you.

To edit the secret whatever.age, run

agenix -e secrets/whatever.age

Building the Psyche Book

That's the document you're reading! :D

Development

Simply run just serve_book to serve the book over http on localhost! This will also automatically rebuild the book when changes are made.

Building

nix build .#psyche-book

The book will be output to result/, which you can preview easily with python -m http.server -d ./result/

CI

Overview

We use Garnix as our CI provider. It:

  • Builds packages in our Nix flakes
  • Runs all Nix checks including formatting, lints, & Rust tests.

Deployment Branches

Some branches are configured for automatic deployment. These branches serve as dedicated testing environments.

Development Environments

These environments are stateful and accessible via SSH for developer troubleshooting. Public keys are listed in this repo.

Source Branch       | Purpose                      | Hostname
test-deploy-devnet  | Indexer/frontend for devnet  | devnet-preview.psyche.network
test-deploy-mainnet | Indexer/frontend for mainnet | mainnet-preview.psyche.network
test-deploy-docs    | Preview docs                 | docs.preview.psyche.network

Production Environment

main automatically deploys the website/indexer to https://mainnet.psyche.network/ and the docs to https://docs.psyche.network/.

This is a stateful deploy, but with no SSH access for security reasons.

Contributing to Psyche

Found a bug?

  • Make sure we're not already aware of it by checking GitHub Issues.

  • If it seems your bug is new, open an issue. Describe the expected & actual behaviour in as much detail as possible, ensuring to include system information (CUDA? CPU?) and any relevant command-line params (data parallelism? tensor parallelism? compression ratio?).

Fixed a bug?

  • Submit a GitHub PR with your bugfix.

  • Make sure your PR clearly explains what was broken and how you fixed it. Reference any related issues.

  • Before submitting, check out our guidelines to keep things consistent.

Want to add a cool feature or change something?

  • First, share your idea on the Psyche forum and get some feedback.
  • Feel free to start developing whenever you want, but we generally won't accept a PR unless there's been some discussion and feedback about whether your feature fits Psyche's goals.

Have questions about how things work?

  • Post your questions on the Psyche forum - that's the best place to get answers!

Want to improve our docs?

  • We'd love that. Feel free to open PRs!

Thank you for your contributions to Psyche :heart:

PR guidelines

We prefer PRs to be made and merged using rebase, not merge commits. It's not a deal-breaker, but rebase makes us happy <3

Clean Linear History

Rebasing creates a linear commit history without merges going back and forth, making it much easier to identify where a change was made. Fix-ups in merge commits that introduce bugs are no longer associated with the original code, whereas with rebase you'd find the bug as part of its original commit.

Merge commits add extra noise to the history without adding meaningful content about what changed.

Better Bisect Experience

A linear history makes git bisect more effective for finding bugs, as each commit represents a coherent, working state of the codebase.

Preserving Meaningful Commits

While we advocate for rebase, we do not advocate for squashing all commits. Each commit should:

  1. Document a single logical step in your development process
  2. Be independently revertible if needed
  3. Separate concerns such as:
    • Refactoring (changing structure but not behavior)
    • Feature additions (changing behavior)
    • Bug fixes
    • Documentation updates
  4. Build & pass all checks if checked out individually.

What to Avoid

  • Don't squash meaningful commits together - this buries important changes in large diffs and loses the step-by-step narrative
  • Don't use merge commits within feature branches
  • Don't include "fix up" or "oops" commits in your final PR - these are fine to have during development, but before opening your PR, use git commit --amend or interactive rebase to clean these up. A typical rebase workflow is explained in this blog post. git absorb is also very useful for small fix-ups.