Overview

Psyche is a system that empowers strangers to collaboratively train a machine learning model in a decentralized and trustless manner.

Read the Psyche announcement here.

The Psyche code is available on GitHub at PsycheFoundation/psyche.

The system is composed of three main actors:

  • Coordinator: Serves as a source of truth for global state available to all clients in a given training run. Each run has one coordinator that oversees the entire process. The coordinator is implemented as a program running on the Solana Blockchain.
  • Client: A user participating in a training run. Clients receive the model to be trained and a specific dataset for that run. They send information to the coordinator to progress the training run and use a peer-to-peer network to share their results at each training step with other clients.
  • Data Provider: An optional server that stores the data used for model training and serves it to clients. A run can use the data provider, an HTTP location for the data, or require clients to bring their own copy of the dataset.
flowchart TB
    subgraph run id: test_model_2
        direction TB
        subgraph Solana
            C(("Coordinator"))
        end
        C <--> C1(("Client")) & C2(("Client")) & C3(("Client"))
        C1 <-.-> C2
        C3 <-.-> C2 & C1
        DT["Data hosted on HTTP"] --> C1 & C2 & C3
    end
    subgraph run id: test_model_1
        direction TB
        subgraph Solana2["Solana"]
            CC(("Coordinator"))
        end
        CC <--> C11(("Client")) & C22(("Client")) & C33(("Client"))
        C11 <-.-> C22
        C33 <-.-> C22 & C11
        DTT["Data server"] --> C11 & C22 & C33
    end

What does the training process look like?

The training process for a given model is divided into small steps that incrementally train the model in a coordinated manner. A training run is divided into epochs, where clients can join and leave the run, and epochs are further divided into steps, where the model is incrementally trained.

During a training run, clients primarily perform three tasks:

  • Training: Train the model using an assigned subset of the data.
  • Witnessing: Verify the liveness and correctness of other participants.
  • Verifying: Recompute and compare results to identify and mitigate malicious participants.

Waiting for Clients & Warmup

At the start of an epoch, all clients have a window of time to join the run by requesting to be added by the coordinator and then connecting to the other participating clients.

Once a minimum threshold of clients has been met, the run transitions to the Warmup phase and begins a countdown that gives connected clients time to update their copy of the model; when the countdown ends, the run enters the Training phase.

Training

At the beginning of an epoch, after the Warmup phase ends, clients are assigned specific tasks that require them to train the model on a portion of the data.

The coordinator contains information that uniquely assigns pieces of training data to clients based on the current round.

If clients have already been training (i.e., it is not the first round of the epoch), they will apply the results from the previous round, then retrieve the data sample they need for the current round.

After completing the training on their assigned data, each client emits a p2p broadcast to all other clients containing their training results and a cryptographic commitment that binds them to those results.

As training results are received from other clients, they are downloaded and later incorporated into the current model.
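
The commitment can be thought of as a hash over the serialized results, which later lets any peer check that what a client reveals matches what it committed to. Below is a minimal sketch of this idea in Rust; the struct, field names, and exact hash construction are illustrative and not taken from the Psyche codebase.

use sha2::{Digest, Sha256};

/// Hypothetical training-result payload broadcast to peers.
struct TrainingResult {
    round: u32,
    payload: Vec<u8>, // serialized training results (e.g. compressed gradients)
}

/// Commitment that binds a client to its broadcast results:
/// here, a SHA-256 hash over the round number and the payload bytes.
fn commit(result: &TrainingResult) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(result.round.to_le_bytes());
    hasher.update(&result.payload);
    let digest = hasher.finalize();
    let mut out = [0u8; 32];
    out.copy_from_slice(&digest);
    out
}

/// Later, any peer can check a revealed result against the commitment.
fn verify(result: &TrainingResult, commitment: &[u8; 32]) -> bool {
    &commit(result) == commitment
}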

Witnessing

At the start of each round, one or more clients are randomly selected as witnesses; the number of witnesses is configurable. Witnesses train the model as usual, but also build bloom filters that track which nodes they have received training results from, signifying that those nodes are actively participating and providing valid results.

These bloom filters are sent to the coordinator, which then combines them into a provable consensus of which results to apply to the model.

Once a witness quorum is reached, the coordinator advances to the Witness phase, giving all clients a brief window to download every training result.

Once the Witness phase concludes, the coordinator returns to the Training phase. Clients are assigned new data, and the process repeats. After a predefined number of rounds, a Cooldown round occurs, marking the end of an epoch.

The witness/train loop visualized

Here's a high-level overview of the process. Additional details exist, but this captures the overall flow:

sequenceDiagram
    participant Client1
    participant Client2
    participant Coordinator
    participant DataServer
    Client1->>DataServer: get_data
    Client2->>DataServer: get_data
    Coordinator->>Client2: witness
    Note over Client1: Train
    Note over Client2: Train
    Client1->>Client2: Send results
    Client2->>Client1: Send results
    Note over Client1: Download results
    Note over Client2: Download results
    Client2->>Coordinator: Send witness
    Note over Coordinator: Quorum reached
    Note over Coordinator: Starting Witness phase

Psyche In Depth

This section provides a detailed explanation of the various components of Psyche, their behavior, and their implementation.

Coordinator

The Coordinator stores metadata about the run and a list of participants. It handles the transition between each Phase of a Round, and provides a random seed that's used to determine data assignments, witnesses, and more.

It's responsible for providing a point of synchronization for all clients within a run.

Ticks

When certain events occur or time-based conditions are met, the Coordinator can be "ticked" forwards to transition from one Phase to the next.

sequenceDiagram
    loop
        Note over Backend, Coordinator: Wait for a timeout or backend state
        Backend->>Coordinator: Tick
        Coordinator->>Backend: New state produced
        Backend->Client1: New coordinator state consumed by Client
        Backend->Client2: New coordinator state consumed by Client
    end
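
Conceptually, the Coordinator is a small state machine whose phase only changes when it is ticked. The sketch below captures that shape; the phase names follow this document, but the fields, constants, and method signatures are illustrative rather than the actual Psyche types.

/// Illustrative phases of a run, in the order this document describes them.
enum Phase {
    WaitingForMembers,
    Warmup { ends_at: u64 },
    RoundTrain { round: u32 },
    RoundWitness { round: u32 },
    Cooldown,
}

type ClientId = [u8; 32];

struct Coordinator {
    phase: Phase,
    min_clients: usize,
    clients: Vec<ClientId>,
    random_seed: u64,
}

const WARMUP_SECS: u64 = 60; // illustrative warmup length

impl Coordinator {
    /// Called by the backend when a timeout elapses or an event
    /// (client join, witness quorum, ...) arrives; may advance the phase.
    fn tick(&mut self, now: u64) {
        match self.phase {
            Phase::WaitingForMembers if self.clients.len() >= self.min_clients => {
                self.phase = Phase::Warmup { ends_at: now + WARMUP_SECS };
            }
            Phase::Warmup { ends_at } if now >= ends_at => {
                self.phase = Phase::RoundTrain { round: 0 };
            }
            _ => { /* remaining transitions elided in this sketch */ }
        }
    }
}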

Beginning an Epoch

The Coordinator begins in the WaitingForMembers phase, with no clients connected.

Whatever backend you're running the Coordinator in should accept pending clients to be added to upcoming Epochs.

When inside the WaitingForMembers phase, your backend will pass new clients to the Coordinator until a configured min_clients threshold is met, at which point the coordinator's tick will transition it to the Warmup phase.

sequenceDiagram
    Note over Coordinator: min_clients = 2
    Client1->>Coordinator: Join
    Client2->>Coordinator: Join
    Note over Coordinator: Entering Warmup
    Client1->>Client2: Connect
    Client2->>Client1: Connect
    Note over Coordinator: The Warmup countdown elapses
    Note over Coordinator: Entering Training

Warmup

This phase is designed to let all clients download the model & load it onto their GPUs.

If a client drops while waiting for the Warmup time to elapse, the Backend removes that client from the Coordinator's clients list. If the number of clients falls below min_clients, the Coordinator goes back to the WaitingForMembers phase.

Once the Warmup time passes, the Coordinator loads all the information for the next training round and changes its phase to RoundTrain. The Server then broadcasts this Training Coordinator state to all clients.

Training

In this phase, the Coordinator provides a random seed. Each client can use this seed, alongside the current round index and epoch index, to determine which indices of the training data to use.
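
Because every client derives its assignments from the same seed, round index, and epoch index, no extra coordination message is needed: each client can compute the same mapping locally. The following is a rough sketch of such a deterministic assignment, not the exact algorithm Psyche uses; it assumes the rand and rand_chacha crates.

use rand::seq::SliceRandom;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

/// Deterministically assign batch indices to clients for one round.
/// Every client runs this with the same inputs and gets the same answer.
fn assign_batches(
    seed: u64,
    epoch: u32,
    round: u32,
    num_batches: u64,
    num_clients: usize,
) -> Vec<Vec<u64>> {
    // Mix the epoch and round into the seed so each round gets a fresh shuffle.
    let mut rng = ChaCha8Rng::seed_from_u64(seed ^ ((epoch as u64) << 32) ^ round as u64);
    let mut batches: Vec<u64> = (0..num_batches).collect();
    batches.shuffle(&mut rng);

    // Deal the shuffled batches out round-robin to the clients.
    let mut assignments = vec![Vec::new(); num_clients];
    for (i, b) in batches.into_iter().enumerate() {
        assignments[i % num_clients].push(b);
    }
    assignments
}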

Witnessing

As clients complete their training, they send their results to all other clients, including the Witnesses. The witnesses will each send a witness proof to the Coordinator, building towards a witness quorum.

A witness proof contains a bloom filter describing which pieces of data the witness received training results for, and which clients did that work. Elected witnesses are responsible for creating these witness proofs and sending them to the Coordinator.

The witnesses for each round are chosen randomly from all the clients, using the same random seed as for data assignments. A witness will attempt to send an opportunistic witness message once it has received a training result for every single batch in the current round.
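
A bloom filter keeps the proof compact: the witness inserts one entry per (client, batch) result it has received, and membership can later be checked with no false negatives and a tunable false-positive rate. Here is a hand-rolled sketch of the idea; Psyche's actual filter layout, sizing, and hashing differ.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny bloom filter over (client, batch) pairs.
struct WitnessFilter {
    bits: Vec<u64>,
    num_hashes: u32,
}

impl WitnessFilter {
    fn new(num_bits: usize, num_hashes: u32) -> Self {
        Self { bits: vec![0; (num_bits + 63) / 64], num_hashes }
    }

    /// The bit positions a given (client, batch) pair maps to.
    fn bit_indices(&self, client: [u8; 32], batch: u64) -> Vec<usize> {
        let num_bits = self.bits.len() * 64;
        (0..self.num_hashes)
            .map(|i| {
                let mut h = DefaultHasher::new();
                (i, client, batch).hash(&mut h);
                (h.finish() as usize) % num_bits
            })
            .collect()
    }

    /// Record that we saw a training result for `batch` from `client`.
    fn insert(&mut self, client: [u8; 32], batch: u64) {
        for idx in self.bit_indices(client, batch) {
            self.bits[idx / 64] |= 1 << (idx % 64);
        }
    }

    /// May return a false positive, but never a false negative.
    fn contains(&self, client: [u8; 32], batch: u64) -> bool {
        self.bit_indices(client, batch)
            .iter()
            .all(|&idx| (self.bits[idx / 64] & (1 << (idx % 64))) != 0)
    }
}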

Witness Quorum

The Coordinator advances the run from the Training phase to the Witness phase in one of two ways:

  • If enough witnesses observe all results and reach a witness quorum for the round, they notify the Coordinator that it is safe to advance. This process, called opportunistic witnessing, accelerates the transition to the Witness phase rather than waiting a fixed time for training results.
  • If witnesses do not receive all required results from other clients before the maximum time specified for the Training phase, the Coordinator nonetheless transitions to the Witness phase once that maximum time elapses (see the sketch after this list).
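
In other words, the phase change is driven by whichever happens first: a quorum of witness proofs or the Training-phase deadline. A compact sketch of that decision, with illustrative names:

/// Decide whether the Coordinator should leave the Training phase.
/// `witnesses_reporting_complete` counts witnesses that have seen a result
/// for every batch in the round (the opportunistic path).
fn should_enter_witness_phase(
    witnesses_reporting_complete: usize,
    witness_quorum: usize,
    now: u64,
    training_deadline: u64,
) -> bool {
    let opportunistic = witnesses_reporting_complete >= witness_quorum;
    let timed_out = now >= training_deadline;
    opportunistic || timed_out
}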

Witness phase

This phase exists to give the witnesses an opportunity to send their proofs to the Coordinator in case they have not received enough training results from other clients to reach the quorum and send their proofs opportunistically.

There is also a brief slack period for non-witness nodes to catch up by downloading any remaining results they have not yet received.

When the Witness phase finishes via timeout, the Coordinator transitions from Witness to the Cooldown phase in three cases:

  • If we are in the last round of the epoch.
  • If the number of clients has dropped below the minimum required by the config.
  • If the number of witnesses for the round is less than the quorum specified by the config.

Any clients that have failed health checks will also be removed from the current epoch.
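
Putting those three conditions together, the end-of-Witness decision looks roughly like the following sketch; the real Coordinator also performs the health-check removals and other bookkeeping described above, and all names here are illustrative.

enum NextPhase {
    Cooldown,
    RoundTrain, // start the next round of the epoch
}

/// Decide what follows the Witness phase once its timeout elapses.
fn after_witness_phase(
    round: u32,
    rounds_per_epoch: u32,
    clients: usize,
    min_clients: usize,
    witnesses_this_round: usize,
    witness_quorum: usize,
) -> NextPhase {
    let last_round = round + 1 >= rounds_per_epoch;
    let too_few_clients = clients < min_clients;
    let too_few_witnesses = witnesses_this_round < witness_quorum;
    if last_round || too_few_clients || too_few_witnesses {
        NextPhase::Cooldown
    } else {
        NextPhase::RoundTrain
    }
}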

Cooldown

The Cooldown phase is the last phase of an epoch, during which the Coordinator waits for either the Cooldown period to elapse or a checkpoint to occur.

When the Cooldown phase begins, the Coordinator resets the current model checkpoint state to Checkpoint::P2P, signifying that new joiners should download the latest copy of the model from the other participants.

Upon exiting the Cooldown phase, the Coordinator transitions to the next epoch, saving the previous epoch state, and moving back to the WaitingForMembers phase.

It all comes together!

sequenceDiagram
    Backend->>Coordinator: tick
    Coordinator->>Backend: Change state to `RoundTrain`
    Backend->>Client1: New state
    Backend->>Client2: New state
    par Start training
        Client1->>Client1: Start training
        Client2->>Client2: Start training
    end
    Client1->>Committee: get_witness
    Client2->>Committee: get_witness
    Committee->>Client1: false
    Committee->>Client2: true
    Note over Client1: Train
    Note over Client2: Train
    Note over Client2: Fill bloom filters
    Client2->>Backend: try send opportunistic witness
    Backend->>Coordinator: Witness message
    Note over Coordinator: Enough witnesses for round
    Coordinator->>Coordinator: Update state to RoundWitness
    Note over Coordinator: Timeout round witness time
    alt step > total steps
        Coordinator->>Coordinator: Update state to WaitingForMembers
    else height == rounds per epoch 
        Coordinator->>Coordinator: Update state to Cooldown
    else
        Coordinator->>Coordinator: Update state to RoundTrain with step + 1
    end

Centralized Backend

In this Backend, the Coordinator is owned and ticked forwards by a Server that communicates with clients over TCP.

The Server's Coordinator is initially configured in main.rs. It's loaded using the configuration file state.toml.

flowchart LR
    S[Server] --run--> A[App]
    S --new--> C[Coordinator]
    C --"run_id, init warmup, min clients, model"--> A

The Server uses some parts of the Coordinator configuration, like the data server configuration, if enabled, to boot up all the functionality it needs.

When a new client joins the run, it has to communicate the run_id it wants to join, ensuring it connects to the correct run. After processing the join message, the Server adds the client to a pending clients list and runs the Coordinator's tick function to potentially add the client to the run.

When a tick condition is met, the Server ticks the Coordinator forwards, then broadcasts the Coordinator's new state to all connected clients.
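
The Server's loop is essentially: accept join messages, hold the clients as pending, tick the Coordinator when a condition is met, and broadcast the resulting state. The sketch below shows that shape in simplified form; the real Server is asynchronous, speaks TCP, and uses its own message and method names.

type ClientId = [u8; 32];

struct Server {
    run_id: String,
    pending_clients: Vec<ClientId>,
    // coordinator handle, client connections, ... elided
}

enum Message {
    Join { client: ClientId, run_id: String },
    // witness proofs, health checks, ... elided in this sketch
}

impl Server {
    /// Handle one message from a client, then try to tick the Coordinator.
    fn handle_message(&mut self, msg: Message) {
        match msg {
            Message::Join { client, run_id } => {
                // Only accept clients that name the run this Server hosts.
                if run_id == self.run_id {
                    self.pending_clients.push(client);
                }
            }
        }
        // If a tick condition is met (e.g. min_clients reached or a phase
        // timeout elapsed), tick the Coordinator and broadcast its new state:
        //     self.coordinator.tick(now);
        //     self.broadcast(self.coordinator.state());
    }
}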

Health checks

In the start function, the client spawns a new task to repeatedly send health checks to the server. Nodes, also known as trainers in this context, are assigned a score determined by the Coordinator using the trainer_healthy_score_by_witnesses method. This score increases as a client sends the required data to be added to the participants' bloom filters, allowing the Coordinator to confirm that the client is actively participating in the training.

A node also sends a list of other nodes it considers unhealthy to the server using the HealthCheck message. The Coordinator processes this information to determine whether those nodes are healthy. Nodes deemed inactive or non-participatory are marked for removal in the next round.
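
One way to picture the scoring is as a per-trainer count of how many witnesses included that trainer's results in their bloom filters, with trainers falling below a threshold flagged for removal. This is a rough sketch of that idea, not the actual trainer_healthy_score_by_witnesses implementation.

use std::collections::HashMap;

type ClientId = [u8; 32];

/// For each trainer, count how many witnesses saw its results this round.
/// A trainer seen by fewer witnesses than `min_score` is a removal candidate.
fn unhealthy_trainers(
    trainers: &[ClientId],
    witness_sightings: &HashMap<ClientId, usize>, // trainer -> witnesses that saw it
    min_score: usize,
) -> Vec<ClientId> {
    trainers
        .iter()
        .copied()
        .filter(|id| witness_sightings.get(id).copied().unwrap_or(0) < min_score)
        .collect()
}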

Decentralized Backend

TODO

Data Provider

The data provider is the interface implemented by the different structures responsible for parsing the training data and creating samples for clients to use when training a model. There are three ways to declare a data provider, depending on where the data is hosted and how clients request it (a sketch of the shared interface follows the list):

  • Local data provider: the client already has a copy of the data to use for training.
  • HTTP data provider: clients receive a list of URLs where the training data is hosted and fetch it over HTTP.
  • TCP data provider: a data server, separate from the client, that clients communicate with over TCP to get the training data.
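
Conceptually, all three variants expose the same narrow interface to the rest of the client: given a set of sample IDs, return the corresponding training data. The sketch below shows that shape; the trait and type names are illustrative, though a get_samples call is also referenced later in this section.

/// One tokenized training sample.
type Sample = Vec<i32>;

/// The common interface all three data-provider variants implement (sketch).
trait DataProvider {
    fn get_samples(&mut self, ids: &[u64]) -> Vec<Sample>;
}

/// Local: samples are already on disk next to the client.
struct LocalDataProvider { /* path to local token files, ... */ }

/// HTTP: samples are fetched from one or more URLs.
struct HttpDataProvider { /* urls, token size, shuffle seed, ... */ }

/// TCP: samples are requested from a separate data server.
struct TcpDataProvider { /* server address, ... */ }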

Overview

The data provider acts as a server that can be accessed via TCP by clients to obtain the data they need for training.

When a client starts a round of training, it receives an ID or a range of IDs from the coordinator, representing all the batches that will be used for that round. Each batch contains a specific range of the overall data. The client can then call the data provider with the assigned IDs for the run and fetch the corresponding data to begin training.

To better understand how the data is partitioned for each client, refer to the following diagram:

flowchart TD
    C((Coordinator))
    C1[Client]
    C2[Client]
    C --Batch IDs--> C1
    C --Batch IDs--> C2
    subgraph Data Provider
        B1["Batch
            1. Data
            2. Data
            3. Data
            "]
        B2["Batch
            1. Data
            2. Data
            3. Data
            "]
        B3["Batch
            1. Data
            2. Data
            3. Data
            "]
        B4["Batch
            1. Data
            2. Data
            3. Data
        "]
        B5["Batch
            1. Data
            2. Data
            3. Data
        "]
        B6["Batch
            1. Data
            2. Data
            3. Data
        "]
        B4 ~~~ B1
        B5 ~~~ B2
        B6 ~~~ B3
    end
    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C2
    B5 --> C2
    B6 --> C2

The number of batches used for training in a run, as well as the indexes of data that each batch contains, can be configured.

Deep Dive

For the coordinator's initial state, the state.toml file contains configuration details for the entire run. A key section to consider is [model.LLM.data_location], which specifies whether the data will be hosted on a TCP server, accessed via HTTP, or stored in a local folder.

When loading a model, the required configuration depends on the data provider implementation being used:

  1. TCP Server:

    • If the data provider is configured as a TCP server, an additional file named data.toml is required.
    • This file contains configurations for local training, including:
      • Data location
      • Token size
      • Sequence length
      • A seed to shuffle the data if necessary
    • An example data.toml file can be found in psyche/config within the various initial state examples.
  2. HTTP Provider:

    • For the HTTP data provider, no additional configuration file is needed.
    • The required fields for this setup include:
      • The URL (or a set of URLs) from which the data will be fetched
      • Token size (in bytes)
      • A shuffle seed, if data shuffling is desired.
  3. Client Hosting the Data:

    • In this case, the client must simply provide the URL where the data is hosted.

The init_run function initializes the data provider using the configuration and creates a DataFetcher, the structure responsible for managing the data fetching process. The data fetcher is part of the TrainingStepMetadata, which holds the internal data for the training step within the StepStateMachine, along with other metadata—one for each step.

Once the data provider is created and included in the state machine, it will be used at the start of the epoch and during every training step. The client monitors changes in the coordinator's state, and upon detecting a step transition, it calls the apply_state function for the RunManager. This, in turn, calls the apply_state function for the StepStateMachine. If the state indicates that a training round is starting, the start function for the TrainingStepMetadata is invoked.

The start function initiates the actual training process on the client side. Its first task is to fetch the data required for training using the assign_data_for_state function. This function determines the number of batches for the round and the indices of data within each batch. The client is then assigned an interval of batch IDs, called data_assignments, which it fetches from the data provider using the fetch_data function of the DataFetcher.

The fetch_data function parses the batch IDs using the data indices per batch to calculate the actual intervals of data to use. It creates a channel to send and receive batches. Once the data intervals are calculated, the client calls the get_samples function on the data provider to retrieve the raw data for those IDs. This process repeats in a loop until all batch IDs are requested and sent through the channel.

On the other end, the receiver is used in the train function. It continuously receives data from the channel and uses it for training until all data is consumed.
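
The channel decouples fetching from training: one side pulls raw samples for each assigned batch ID and pushes them into the channel, while the training loop consumes batches as they arrive. Below is a simplified, synchronous sketch of that pipeline using a standard-library channel; the real DataFetcher is asynchronous and its get_samples call talks to an actual data provider.

use std::sync::mpsc;
use std::thread;

type Sample = Vec<i32>;

/// Stand-in for the data provider's get_samples call described above.
fn get_samples(batch_id: u64, samples_per_batch: u64) -> Vec<Sample> {
    let start = batch_id * samples_per_batch;
    (start..start + samples_per_batch).map(|i| vec![i as i32]).collect()
}

fn main() {
    let assigned_batches: Vec<u64> = vec![3, 7, 11]; // this client's data_assignments
    let samples_per_batch = 4;

    let (tx, rx) = mpsc::channel();

    // Fetcher side: request each assigned batch and push it into the channel.
    let fetcher = thread::spawn(move || {
        for batch_id in assigned_batches {
            let batch = get_samples(batch_id, samples_per_batch);
            if tx.send(batch).is_err() {
                break; // training side hung up
            }
        }
    });

    // Training side: consume batches until the fetcher is done.
    for batch in rx {
        println!("training on a batch of {} samples", batch.len());
    }
    fetcher.join().unwrap();
}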

Model sharing

When an epoch starts, all clients must have an identical model to train with.

At the beginning of a run, all clients must download the model parameters, tokenizer configuration, and model configuration from HuggingFace, where the model must have been previously uploaded (TODO: add more details on uploading a model). Each client will then modify their copy of the model by receiving new training results from other clients and applying them. This keeps everyone's copy of model identical within an epoch without an additional full synchronization step.

When a new client joins a run that has already progressed past its first epoch, it would not be correct for the client to download the original model from HuggingFace, as the model parameters would have already been updated during training. Instead, the new client must acquire a copy of the model from the peers who have been actively training it.

This synchronization process occurs during the Warmup phase, while the coordinator waits to begin the next Training phase.

To address this, the model is checkpointed at the end of each epoch: clients save and share the entire model so that new peers can join. There are two checkpointing variants (a rough sketch of how the coordinator can represent them follows the P2P diagram below):

  1. HuggingFace checkpoint:
    In this approach, a client or a set of clients is designated as the checkpointers for the run. These clients upload their copy of the updated model to HuggingFace after each epoch and send the URL for this checkpoint to the coordinator. When a new client joins the run, it retrieves the checkpoint URL from the coordinator and connects to HuggingFace to download the latest copy of the model parameters and configuration files.

  2. P2P checkpoint:
    In the peer-to-peer (P2P) approach, a new client synchronizes by obtaining the latest model directly from other peers. It receives the model information and parameters from any available peer, requesting a set of parameters for each layer from different clients. This process allows the client to assemble the latest model state and participate in the training without an explicit upload to a central server occurring.

    Here's an example of a P2P model sharing interaction:

flowchart TB
        C((Coordinator))
        C1[Client]
        C2[Client]
        C3[Client]
        C4[Client]
        HF[/Hugging Face\]
        C --warmup---> C1
        C --warmup---> C2
        C --warmup---> C3
        HF --Get model config--> C4
        C4 -.Join.-> C
        C1 -.Layer 1 weights.-> C4
        C2 -.Layer 2 weights.-> C4
        C3 -.Layer 3 weights.-> C4

Psyche Tooling

This section aims to introduce developers and users of Psyche to the various tools available for testing the network and gaining a better understanding of how a real run operates, all within a local setup.

Psyche Centralized Client

The Psyche Centralized Client is responsible for joining and participating in a training run, contributing to the model's training process, and sharing results with other peers. It is a CLI application with various configurable options.

Command-Line Help for psyche-centralized-client

This document contains the help content for the psyche-centralized-client command-line program.

Command Overview:

psyche-centralized-client

Usage: psyche-centralized-client <COMMAND>

Subcommands:
  • show-identity — Displays the client's unique identifier, used to participate in training runs
  • train — Allows the client to join a training run and contribute to the model's training process

psyche-centralized-client show-identity

Displays the client's unique identifier, used to participate in training runs

Usage: psyche-centralized-client show-identity [OPTIONS]

Options:
  • --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key, or use the RAW_IDENTITY_SECRET_KEY environment variable

psyche-centralized-client train

Allows the client to join a training run and contribute to the model's training process

Usage: psyche-centralized-client train [OPTIONS] --run-id <RUN_ID> --server-addr <SERVER_ADDR>

Options:
  • -i, --identity-secret-key-path <IDENTITY_SECRET_KEY_PATH> — Path to the client's secret key. Create a new random one by running openssl rand 32 > secret.key. If not provided, a random one will be generated

  • -b, --bind-p2p-port <BIND_P2P_PORT> — Sets the port for the client's P2P network participation. If not provided, a random port will be chosen

  • --tui <TUI> — Enables a terminal-based graphical interface for monitoring analytics

    Default value: true

    Possible values: true, false

  • --run-id <RUN_ID> — A unique identifier for the training run. This ID allows the client to join a specific active run

  • --data-parallelism <DATA_PARALLELISM>

    Default value: 1

  • --tensor-parallelism <TENSOR_PARALLELISM>

    Default value: 1

  • --micro-batch-size <MICRO_BATCH_SIZE>

  • --write-gradients-dir <WRITE_GRADIENTS_DIR> — If provided, every shared gradient this client sees will be written to this directory

  • --eval-tasks <EVAL_TASKS>

  • --eval-fewshot <EVAL_FEWSHOT>

    Default value: 0

  • --eval-seed <EVAL_SEED>

    Default value: 42

  • --eval-task-max-docs <EVAL_TASK_MAX_DOCS>

  • --checkpoint-dir <CHECKPOINT_DIR> — If provided, the updated model parameters will be saved in this directory after each epoch

  • --hub-repo <HUB_REPO> — Path to the Hugging Face repository containing model data and configuration

  • --wandb-project <WANDB_PROJECT>

  • --wandb-run <WANDB_RUN>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --write-log <WRITE_LOG>

  • --optim-stats-steps <OPTIM_STATS_STEPS>

  • --grad-accum-in-fp32

    Default value: false

  • --dummy-training-delay-secs <DUMMY_TRAINING_DELAY_SECS>

  • --max-concurrent-parameter-requests <MAX_CONCURRENT_PARAMETER_REQUESTS>

    Default value: 10

  • --server-addr <SERVER_ADDR>


This document was generated automatically by clap-markdown.

Psyche Centralized Server

The Psyche Centralized Server is responsible for hosting the coordinator and a data provider locally to enable testing the network and training a test model. The server requires a configuration file named state.toml to load the initial settings for the coordinator.

Command-Line Help for psyche-centralized-server

This document contains the help content for the psyche-centralized-server command-line program.

Command Overview:

psyche-centralized-server

Usage: psyche-centralized-server <COMMAND>

Subcommands:
  • validate-config — Checks that the configuration declared in the state.toml file is valid
  • run — Starts the server and launches the coordinator with the declared configuration

psyche-centralized-server validate-config

Checks that the configuration declared in the state.toml file is valid

Usage: psyche-centralized-server validate-config [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to the state.toml file to validate
  • --data-config <DATA_CONFIG> — Path to the data.toml file to validate. If not provided, it will not be checked

psyche-centralized-server run

Starts the server and launches the coordinator with the declared configuration

Usage: psyche-centralized-server run [OPTIONS] --state <STATE>

Options:
  • --state <STATE> — Path to TOML of Coordinator state

  • -s, --server-port <SERVER_PORT> — Port for the server, which clients will use to connect. If not specified, a random free port will be chosen

  • --tui <TUI>

    Default value: true

    Possible values: true, false

  • --data-config <DATA_CONFIG> — Path to TOML of data server config

  • --save-state-dir <SAVE_STATE_DIR> — Path to save the server and coordinator state

  • --init-warmup-time <INIT_WARMUP_TIME> — Sets the warmup time for the run. This overrides the warmup_time declared in the state file

  • --init-min-clients <INIT_MIN_CLIENTS> — Sets the minimum number of clients required to start a run. This overrides the min_clients declared in the state file

  • --withdraw-on-disconnect <WITHDRAW_ON_DISCONNECT> — Allows clients to withdraw if they need to disconnect from the run (this option has no effect in the centralized version)

    Default value: true

    Possible values: true, false


This document was generated automatically by clap-markdown.

Running a Local Testnet

The local testnet is a helper application designed to easily spin up a coordinator and multiple clients. It's useful for doing sample runs on your own hardware, and for development.

Pre-requisites

Since we want to run many clients plus the coordinator, we'll need several terminal windows to monitor them. The tool uses tmux to create them.

If you're using the Nix flake, tmux is already included.

Command-Line Help for psyche-centralized-local-testnet

This document contains the help content for the psyche-centralized-local-testnet command-line program.

Command Overview:

psyche-centralized-local-testnet

Usage: psyche-centralized-local-testnet <COMMAND>

Subcommands:
  • start — Starts the local-testnet running each part of the system in a separate terminal pane

psyche-centralized-local-testnet start

Starts the local-testnet running each part of the system in a separate terminal pane

Usage: psyche-centralized-local-testnet start [OPTIONS] --num-clients <NUM_CLIENTS> --config-path <CONFIG_PATH>

Options:
  • --num-clients <NUM_CLIENTS> — Number of clients to start

  • --config-path <CONFIG_PATH> — File path to the configuration that the coordinator will need to start

  • --write-distro-data <WRITE_DISTRO_DATA> — If provided, write DisTrO data to disk in this path

  • --server-port <SERVER_PORT> — Port where the server for this testnet will listen (this is the port that clients must use when connecting)

    Default value: 20000

  • --tui <TUI> — Enables a terminal-based graphical interface for monitoring analytics

    Default value: true

    Possible values: true, false

  • --random-kill-num <RANDOM_KILL_NUM> — Kill N clients randomly every <RANDOM_KILL_INTERVAL> seconds

  • --allowed-to-kill <ALLOWED_TO_KILL> — Which clients we're allowed to kill randomly

  • --random-kill-interval <RANDOM_KILL_INTERVAL> — Kill <RANDOM_KILL_NUM> clients randomly every N seconds

    Default value: 120

  • --log <LOG> — Sets the level of the logging for more granular information

    Default value: info,psyche=debug

  • --first-client-checkpoint <FIRST_CLIENT_CHECKPOINT> — HF repo from which the first client can get the model and configuration to use

  • --hf-token <HF_TOKEN>

  • --write-log

    Default value: false

  • --wandb-project <WANDB_PROJECT>

  • --wandb-group <WANDB_GROUP>

  • --wandb-entity <WANDB_ENTITY>

  • --optim-stats <OPTIM_STATS>

  • --eval-tasks <EVAL_TASKS>


This document was generated automatically by clap-markdown.