Overview
Psyche is a system that empowers strangers to collaboratively train a machine learning model in a decentralized and trustless manner.
Read the Psyche announcement here.
The Psyche code is available on GitHub at PsycheFoundation/psyche.
The system is composed of three main actors:
- Coordinator: Serves as a source of truth for global state available to all clients in a given training run. Each run has one coordinator that oversees the entire process. The coordinator is implemented as a program running on the Solana Blockchain.
- Client: A user participating in a training run. Clients receive the model to be trained and a specific dataset for that run. They send information to the coordinator to progress the training run and use a peer-to-peer network to share their results at each training step with other clients.
- Data Provider: An optional server that stores the data used for model training and serves it to clients. A run can use the data provider, an HTTP location for the data, or require clients to bring their own copy of the dataset.
```mermaid
flowchart TB
    subgraph "run id: test_model_2"
        direction TB
        subgraph Solana
            C(("Coordinator"))
        end
        C <--> C1(("Client")) & C2(("Client")) & C3(("Client"))
        C1 <-.-> C2
        C3 <-.-> C2 & C1
        DT["Data hosted on HTTP"] --> C1 & C2 & C3
    end
    subgraph "run id: test_model_1"
        direction TB
        subgraph Solana2["Solana"]
            CC(("Coordinator"))
        end
        CC <--> C11(("Client")) & C22(("Client")) & C33(("Client"))
        C11 <-.-> C22
        C33 <-.-> C22 & C11
        DTT["Data server"] --> C11 & C22 & C33
    end
```
What does the training process look like?
The training process for a given model is divided into small steps that incrementally train the model in a coordinated manner. A training run is divided into epochs, where clients can join and leave the run, and epochs are further divided into steps, where the model is incrementally trained.
During a training run, clients primarily perform three tasks:
- Training: Train the model using an assigned subset of the data.
- Witnessing: Verify the liveness and correctness of other participants.
- Verifying: Recompute and compare results to identify and mitigate malicious participants.
Waiting for Clients & Warmup
At the start of an epoch, all clients have a window of time to join the run by requesting to be added by the coordinator, and then connecting to the other participating clients.
Once a minimum threshold of clients has been met, the run transitions to the Warmup phase and begins a countdown that allows connected clients to update their copy of the model; when the countdown ends, the run enters the Training phase.
Training
At the beginning of an epoch, after the Warmup phase ends, clients are assigned specific tasks that require them to train the model on a portion of the data.
The coordinator contains information that uniquely assigns pieces of training data to clients based on the current round.
If clients have already been training (i.e., it is not the first round of the epoch), they will apply the results from the previous round, then retrieve the data sample they need for the current round.
After completing the training on their assigned data, each client emits a p2p broadcast to all other clients containing their training results and a cryptographic commitment that binds them to those results.
As training results are received from other clients, they are downloaded so they can later be incorporated into the current model.
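To illustrate the idea of a cryptographic commitment, here is a minimal sketch in Rust that hashes a result into a digest the sender can later be held to. The `TrainingResult` type, its fields, and the use of SHA-256 via the `sha2` crate are illustrative assumptions, not Psyche's actual wire format:

```rust
// Hypothetical sketch: binding a training result to a commitment before broadcast.
// Assumes the `sha2` crate; the real hash scheme and message layout may differ.
use sha2::{Digest, Sha256};

struct TrainingResult {
    round: u32,
    batch_id: u64,
    payload: Vec<u8>, // serialized training results
}

/// Produce a commitment that other clients (and witnesses) can later
/// check against the full payload they download.
fn commit(result: &TrainingResult, client_id: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(client_id);
    hasher.update(result.round.to_le_bytes());
    hasher.update(result.batch_id.to_le_bytes());
    hasher.update(&result.payload);
    hasher.finalize().into()
}

fn main() {
    let result = TrainingResult { round: 3, batch_id: 42, payload: vec![1, 2, 3] };
    let commitment = commit(&result, b"client-pubkey");
    println!("commitment starts with {:x?}", &commitment[..4]);
}
```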
Witnessing
At the start of each round, one or more clients are randomly selected as witnesses. The number of witnesses can be configured. Witnesses train the model as usual, but also build bloom filters that track which nodes they have received training results from, signifying that those nodes are actively participating and providing valid results.
These bloom filters are sent to the coordinator, which then combines them into a provable consensus of which results to apply to the model.
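As a rough illustration of what a witness tracks, here is a toy Bloom filter built only on the standard library; Psyche's real filter layout, hash functions, and proof format will differ:

```rust
// Toy Bloom filter: a witness marks each (client, batch) pair whose result it saw.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; bits], hashes }
    }
    fn index(&self, item: &impl Hash, i: u64) -> usize {
        let mut h = DefaultHasher::new();
        i.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }
    fn insert(&mut self, item: &impl Hash) {
        for i in 0..self.hashes {
            let idx = self.index(item, i);
            self.bits[idx] = true;
        }
    }
    fn contains(&self, item: &impl Hash) -> bool {
        (0..self.hashes).all(|i| self.bits[self.index(item, i)])
    }
}

fn main() {
    let mut seen = BloomFilter::new(1024, 3);
    seen.insert(&("client-a", 7u64));
    assert!(seen.contains(&("client-a", 7u64)));
    assert!(!seen.contains(&("client-b", 7u64)));
    // The filled filter is what a witness sends to the coordinator as part of its proof.
}
```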
Once a witness quorum is reached, the coordinator advances to the Witness phase, giving all clients a brief window to finish downloading every training result.
Once the Witness phase concludes, the coordinator returns to the Training phase. Clients are assigned new data, and the process repeats. After a predefined number of rounds, a Cooldown round occurs, marking the end of an epoch.
The witness/train loop visualized
Here's a high-level overview of the process. Additional details exist, but this captures the overall flow:
```mermaid
sequenceDiagram
    participant Client1
    participant Client2
    participant Coordinator
    participant DataServer
    Client1->>DataServer: get_data
    Client2->>DataServer: get_data
    Coordinator->>Client2: witness
    Note over Client1: Train
    Note over Client2: Train
    Client1->>Client2: Send results
    Client2->>Client1: Send results
    Note over Client1: Download results
    Note over Client2: Download results
    Client2->>Coordinator: Send witness
    Note over Coordinator: Quorum reached
    Note over Coordinator: Starting Witness phase
```
Psyche In Depth
This section provides a detailed explanation of the various components of Psyche, their behavior, and their implementation.
Coordinator
The Coordinator stores metadata about the run and a list of participants. It handles the transition between each Phase of a Round, and provides a random seed that's used to determine data assignments, witnesses, and more.
It's responsible for providing a point of synchronization for all clients within a run.
Ticks
When certain events occur or time-based conditions are met, the Coordinator can be "ticked" forwards to transition from one Phase to another.
```mermaid
sequenceDiagram
    loop
        Note over Backend, Coordinator: Wait for a timeout or backend state
        Backend->>Coordinator: Tick
        Coordinator->>Backend: New state produced
        Backend->Client1: New coordinator state consumed by Client
        Backend->Client2: New coordinator state consumed by Client
    end
```
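The following sketch shows the general shape of phase transitions driven by a tick. The phase names follow the ones used in this document, but the conditions are simplified stand-ins for the real Coordinator logic:

```rust
// Illustrative sketch of tick-driven phase transitions; conditions are simplified.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    WaitingForMembers,
    Warmup,
    RoundTrain,
    RoundWitness,
    Cooldown,
}

struct Coordinator {
    phase: Phase,
    clients: usize,
    min_clients: usize,
    warmup_elapsed: bool,
    witness_quorum_reached: bool,
    witness_timeout: bool,
}

impl Coordinator {
    /// Called by the backend when a timeout fires or a relevant event arrives.
    fn tick(&mut self) {
        self.phase = match self.phase {
            Phase::WaitingForMembers if self.clients >= self.min_clients => Phase::Warmup,
            Phase::Warmup if self.warmup_elapsed => Phase::RoundTrain,
            Phase::RoundTrain if self.witness_quorum_reached => Phase::RoundWitness,
            // Simplified: mid-epoch the Coordinator returns to RoundTrain instead.
            Phase::RoundWitness if self.witness_timeout => Phase::Cooldown,
            other => other, // no condition met: stay in the current phase
        };
    }
}

fn main() {
    let mut c = Coordinator {
        phase: Phase::WaitingForMembers,
        clients: 2,
        min_clients: 2,
        warmup_elapsed: false,
        witness_quorum_reached: false,
        witness_timeout: false,
    };
    c.tick();
    assert_eq!(c.phase, Phase::Warmup);
}
```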
Beginning an Epoch
The Coordinator begins in the `WaitingForMembers` phase, with no clients connected.

Whatever backend you're running the Coordinator in should accept pending clients to be added to upcoming Epochs.

When inside the `WaitingForMembers` phase, your backend will pass new clients to the Coordinator until a configured `min_clients` threshold is met, at which point the coordinator's `tick` will transition it to the `Warmup` phase.
```mermaid
sequenceDiagram
    Note over Coordinator: min_clients = 2
    Client1->>Coordinator: Join
    Client2->>Coordinator: Join
    Note over Coordinator: Entering Warmup
    Client1->>Client2: Connect
    Client2->>Client1: Connect
    Note over Coordinator: The Warmup countdown elapses
    Note over Coordinator: Entering Training
```
Warmup
This phase is designed to let all clients download the model & load it onto their GPUs.
If a client drops while waiting for the Warmup time to elapse, the Backend removes it from the Coordinator's client list. If the number of clients falls below `min_clients`, the Coordinator goes back to the `WaitingForMembers` phase.
Once the `Warmup` time passes, the Coordinator loads all the information for the next training round and changes its phase to `RoundTrain`. The Server will then broadcast this Training-phase Coordinator state to all clients.
Training
In this phase, the Coordinator provides a random seed. Each client can use this seed, alongside the current round and epoch indices, to determine which indices of the training data to use.
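A minimal sketch of the idea, assuming a hash-based derivation (the actual assignment algorithm in Psyche may differ): every client computes the same mapping from the shared seed, epoch, and round, so they all agree on who trains which batch without further communication:

```rust
// Deterministic data assignment sketch: same inputs -> same mapping on every client.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn assigned_client(seed: u64, epoch: u32, round: u32, batch_id: u64, num_clients: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (seed, epoch, round, batch_id).hash(&mut h);
    h.finish() % num_clients // index of the client responsible for this batch
}

fn main() {
    let (seed, epoch, round, num_clients) = (0xC0FFEE, 1, 7, 4);
    for batch_id in 0..8u64 {
        println!(
            "batch {batch_id} -> client {}",
            assigned_client(seed, epoch, round, batch_id, num_clients)
        );
    }
}
```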
Witnessing
As clients complete their training, they send their results to all other clients, including the Witnesses. The witnesses will each send a witness proof to the Coordinator, building towards a witness quorum.
A witness proof contains a bloom filter describing which pieces of data the witness received training results for, and which clients did that work. Elected witnesses are responsible for creating these witness proofs and sending them to the Coordinator.
The witnesses for each round are chosen randomly from all the clients, using the same random seed as for data assignments. A witness will attempt to send an opportunistic witness message once it has received a training result for every batch in the current round.
Witness Quorum
The Coordinator advances the run from the Training phase to the Witness phase in one of two ways:
- If enough witnesses observe all results and reach a witness quorum for the round, they notify the Coordinator that it is safe to advance. This process, called opportunistic witnessing, speeds up the transition to the Witness phase instead of waiting a fixed amount of time for training results.
- If witnesses do not receive all required results from other clients before the maximum time specified for the Training phase, the Coordinator will nonetheless transition to the Witness phase once that maximum Training time elapses.
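A rough sketch of this decision, with made-up field and type names for the round bookkeeping:

```rust
// Sketch of the two ways the Training phase can end. Names are illustrative.
struct RoundStatus {
    witness_proofs_received: usize,
    witness_quorum: usize,
    elapsed_secs: u64,
    max_training_secs: u64,
}

enum Advance {
    Opportunistic, // quorum reached before the deadline
    Timeout,       // deadline hit; advance with whatever proofs arrived
    Wait,          // keep training
}

fn should_advance(s: &RoundStatus) -> Advance {
    if s.witness_proofs_received >= s.witness_quorum {
        Advance::Opportunistic
    } else if s.elapsed_secs >= s.max_training_secs {
        Advance::Timeout
    } else {
        Advance::Wait
    }
}

fn main() {
    let status = RoundStatus {
        witness_proofs_received: 2,
        witness_quorum: 2,
        elapsed_secs: 30,
        max_training_secs: 60,
    };
    assert!(matches!(should_advance(&status), Advance::Opportunistic));
}
```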
Witness phase
This phase gives witnesses that have not received enough training results to reach quorum (and therefore could not send their proofs opportunistically) a chance to send their proofs to the Coordinator.
There is also a brief slack period for non-witness nodes to catch up by downloading any remaining results they might not have received.
When the Witness phase finishes via timeout, the Coordinator transitions from Witness to the Cooldown phase in three cases:
- If we are in the last round of the epoch.
- If the number of clients has dropped below the minimum required by the config.
- If the number of witnesses for the round is less than the quorum specified by the config.
Any clients that have failed health checks will also be removed from the current epoch.
Cooldown
The Cooldown phase is the last phase of an epoch, during which the Coordinator waits for either the Cooldown period to elapse or a checkpoint to occur.
When the Cooldown phase begins, the Coordinator resets the current model checkpoint state to `Checkpoint::P2P`, signifying that new joiners should download the latest copy of the model from the other participants.
Upon exiting the Cooldown phase, the Coordinator transitions to the next epoch, saving the previous epoch state, and moving back to the WaitingForMembers phase.
It all comes together!
```mermaid
sequenceDiagram
    Backend->>Coordinator: tick
    Coordinator->>Backend: Change state to `RoundTrain`
    Backend->>Client1: New state
    Backend->>Client2: New state
    par Start training
        Client1->>Client1: Start training
        Client2->>Client2: Start training
    end
    Client1->>Committee: get_witness
    Client2->>Committee: get_witness
    Committee->>Client1: false
    Committee->>Client2: true
    Note over Client1: Train
    Note over Client2: Train
    Note over Client2: Fill bloom filters
    Client2->>Backend: try send optimistic witness
    Backend->>Coordinator: Witness message
    Note over Coordinator: Enough witnesses for round
    Coordinator->>Coordinator: Update state to RoundWitness
    Note over Coordinator: Timeout round witness time
    alt step > total steps
        Coordinator->>Coordinator: Update state to Waitingformembers
    else height == rounds per epoch
        Coordinator->>Coordinator: Update state to Cooldown
    else
        Coordinator->>Coordinator: Update state to RoundTrain with step + 1
    end
```
Centralized Backend
In this Backend, the Coordinator is owned and ticked forwards by a Server that communicates with clients over TCP.
The Server's Coordinator is initially configured in `main.rs`. It's loaded using the configuration file `state.toml`.
```mermaid
flowchart LR
    S[Server] --run--> A[App]
    S --new--> C[Coordinator]
    C --run_id init warmup min clients model--> A
```
The Server uses some parts of the Coordinator configuration, like the data server configuration, if enabled, to boot up all the functionality it needs.
When a new client joins the run, it has to communicate the `run_id` that it wants to join, to ensure the client is joining the correct run. After processing the join message, the Server adds the client to a pending clients list and runs the Coordinator's tick function to potentially add the client into the run.
When a tick condition is met, the Server ticks the Coordinator forwards, then broadcasts the Coordinator's new state to all connected clients.
Health checks
In the `start` function, the client spawns a new task to repeatedly send health checks to the server. Nodes, also known as trainers in this state, are assigned a score determined by the Coordinator using the `trainer_healthy_score_by_witnesses` method. This score increases as a client sends the required data to be added to the participants' bloom filters, allowing the Coordinator to confirm that the client is actively participating in the training.
A node also sends a list of other nodes it considers unhealthy to the server using the `HealthCheck` message. The Coordinator processes this information to determine whether those nodes are healthy. Nodes deemed inactive or non-participatory are marked for removal in the next round.
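As an illustration only (the real `trainer_healthy_score_by_witnesses` logic lives in the Coordinator and may weigh things differently), a health score could be derived from witness observations roughly like this:

```rust
// Illustrative health scoring: count how many witnesses saw a client's results.
struct WitnessProof {
    // Clients whose results this witness saw this round (stand-in for the bloom filter).
    saw_results_from: Vec<u64>,
}

/// Number of witnesses that observed results from `client_id` this round.
fn healthy_score(client_id: u64, proofs: &[WitnessProof]) -> usize {
    proofs
        .iter()
        .filter(|p| p.saw_results_from.contains(&client_id))
        .count()
}

fn main() {
    let proofs = vec![
        WitnessProof { saw_results_from: vec![1, 2, 3] },
        WitnessProof { saw_results_from: vec![2, 3] },
    ];
    // Client 1 was seen by one witness, client 2 by both.
    assert_eq!(healthy_score(1, &proofs), 1);
    assert_eq!(healthy_score(2, &proofs), 2);
    // A score below some threshold could mark the client for removal next round.
}
```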
Decentralized Backend
TODO
Data Provider
The data provider is the interface implemented by the structures responsible for parsing the training data and creating samples for clients to use when training a model. There are three ways to declare a data provider, depending on how the data is hosted and how clients request it (a sketch of the interface follows this list):
- Local data provider: the client already has the data to be used for training.
- HTTP data provider: clients receive a list of URLs to request, where all the necessary training data is hosted.
- TCP data provider: a server separate from the client, which clients communicate with over TCP to get the training data.
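A sketch of what this interface might look like, with simplified names that are assumptions rather than the real trait in the Psyche codebase:

```rust
// Sketch of the data-provider interface and its three variants.
trait DataProvider {
    /// Return the raw tokenized data for a batch ID assigned by the coordinator.
    fn get_samples(&self, batch_id: u64) -> Vec<u8>;
}

/// Data already present on the client's disk.
struct LocalData { dir: std::path::PathBuf }

/// Data fetched from a list of HTTP URLs.
struct HttpData { urls: Vec<String> }

/// Data served by a separate server reachable over TCP.
struct TcpData { addr: std::net::SocketAddr }

impl DataProvider for LocalData {
    fn get_samples(&self, batch_id: u64) -> Vec<u8> {
        // e.g. read a pre-tokenized shard covering this batch from disk
        std::fs::read(self.dir.join(format!("batch-{batch_id}.bin"))).unwrap_or_default()
    }
}
// HttpData and TcpData would implement the same trait with an HTTP request
// or a TCP round-trip instead of a local file read.

fn main() {
    let provider = LocalData { dir: std::path::PathBuf::from("./data") };
    println!("batch 0: {} bytes", provider.get_samples(0).len());
}
```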
Overview
The data provider acts as a server that can be accessed via TCP by clients to obtain the data they need for training.
When a client starts a round of training, it receives an ID or a range of IDs from the coordinator, representing all the batches that will be used for that round. Each batch contains a specific range of the overall data. The client can then call the data provider with the assigned IDs for the run and fetch the corresponding data to begin training.
To better understand how the data is partitioned for each client, refer to the following diagram:
```mermaid
flowchart TD
    C((Coordinator))
    C1[Client]
    C2[Client]
    C --Batch IDs--> C1
    C --Batch IDs--> C2
    subgraph Data Provider
        B1["Batch 1. Data 2. Data 3. Data "]
        B2["Batch 1. Data 2. Data 3. Data "]
        B3["Batch 1. Data 2. Data 3. Data "]
        B4["Batch 1. Data 2. Data 3. Data "]
        B5["Batch 1. Data 2. Data 3. Data "]
        B6["Batch 1. Data 2. Data 3. Data "]
        B4 ~~~ B1
        B5 ~~~ B2
        B6 ~~~ B3
    end
    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C2
    B5 --> C2
    B6 --> C2
```
The number of batches used for training in a run, as well as the indexes of data that each batch contains, can be configured.
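Assuming fixed-size batches for the sake of illustration, expanding an assigned batch ID into concrete sample indices might look like this (the real layout is driven by the run configuration):

```rust
// Sketch: turn an assigned batch ID into the range of sample indices it covers.
struct DataLayout {
    samples_per_batch: u64,
}

impl DataLayout {
    /// The half-open range of sample indices covered by `batch_id`.
    fn samples_for_batch(&self, batch_id: u64) -> std::ops::Range<u64> {
        let start = batch_id * self.samples_per_batch;
        start..start + self.samples_per_batch
    }
}

fn main() {
    let layout = DataLayout { samples_per_batch: 3 };
    // The coordinator assigns batch IDs; each client expands them locally.
    assert_eq!(layout.samples_for_batch(0), 0..3);
    assert_eq!(layout.samples_for_batch(2), 6..9);
}
```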
Deep Dive
For the coordinator's initial state, the `state.toml` file contains configuration details for the entire run. A key section to consider is `[model.LLM.data_location]`, which specifies whether the data will be hosted on a TCP server, accessed via HTTP, or stored in a local folder.
When loading a model, the required configuration depends on the data provider implementation being used:
- TCP Server:
  - If the data provider is configured as a TCP server, an additional file named `data.toml` is required.
  - This file contains configurations for local training, including:
    - Data location
    - Token size
    - Sequence length
    - A seed to shuffle the data if necessary
  - An example `data.toml` file can be found in `psyche/config` within the various initial state examples.
- HTTP Provider:
  - For the HTTP data provider, no additional configuration file is needed.
  - The required fields for this setup include:
    - The URL (or a set of URLs) from which the data will be fetched
    - Token size (in bytes)
    - A shuffle seed, if data shuffling is desired.
- Client Hosting the Data:
  - In this case, the client must simply provide the URL where the data is hosted.
The `init_run` function initializes the data provider using the configuration and creates a `DataFetcher`, the structure responsible for managing the data fetching process. The data fetcher is part of the `TrainingStepMetadata`, which holds the internal data for the training step within the `StepStateMachine`, along with other metadata, one for each step.
Once the data provider is created and included in the state machine, it will be used at the start of the epoch and during every training step. The client monitors changes in the coordinator's state, and upon detecting a step transition, it calls the `apply_state` function for the `RunManager`. This, in turn, calls the `apply_state` function for the `StepStateMachine`. If the state indicates that a training round is starting, the `start` function for the `TrainingStepMetadata` is invoked.
The `start` function initiates the actual training process on the client side. Its first task is to fetch the data required for training using the `assign_data_for_state` function. This function determines the number of batches for the round and the indices of data within each batch. The client is then assigned an interval of batch IDs, called `data_assignments`, which it fetches from the data provider using the `fetch_data` function of the `DataFetcher`.
The `fetch_data` function parses the batch IDs using the data indices per batch to calculate the actual intervals of data to use. It creates a channel to send and receive batches. Once the data intervals are calculated, the client calls the `get_samples` function on the data provider to retrieve the raw data for those IDs. This process repeats in a loop until all batch IDs are requested and sent through the channel.
On the other end, the receiver is used in the `train` function. It continuously receives data from the channel and uses it for training until all data is consumed.
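A condensed sketch of this fetch/train pipeline, using std threads and channels for brevity (the real client is asynchronous, and `get_samples` here is a stand-in for the data provider call):

```rust
// Sketch: one task resolves batch IDs to data and pushes it down a channel,
// while the training loop consumes batches until the channel is drained.
use std::sync::mpsc;
use std::thread;

fn get_samples(batch_id: u64) -> Vec<u8> {
    // stand-in for the data provider call
    vec![batch_id as u8; 4]
}

fn main() {
    let assignments: Vec<u64> = vec![10, 11, 12]; // batch IDs assigned to this client
    let (tx, rx) = mpsc::channel();

    // "fetch_data": request each assigned batch and send it through the channel.
    let fetcher = thread::spawn(move || {
        for batch_id in assignments {
            let data = get_samples(batch_id);
            if tx.send((batch_id, data)).is_err() {
                break; // the training loop has shut down
            }
        }
    });

    // "train": consume batches until the sender is dropped.
    for (batch_id, data) in rx {
        println!("training on batch {batch_id} ({} bytes)", data.len());
    }
    fetcher.join().unwrap();
}
```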
Model sharing
When an epoch starts, all clients must have an identical model to train with.
At the beginning of a run, all clients must download the model parameters, tokenizer configuration, and model configuration from HuggingFace, where the model must have been previously uploaded (TODO: add more details on uploading a model). Each client will then modify their copy of the model by receiving new training results from other clients and applying them. This keeps everyone's copy of the model identical within an epoch without an additional full synchronization step.
When a new client joins a run that has already progressed past its first epoch, it would not be correct for the client to download the original model from HuggingFace, as the model parameters would have already been updated during training. Instead, the new client must acquire a copy of the model from the peers who have been actively training it.
This synchronization process occurs during the Warmup phase, while the coordinator waits to begin the next Training phase.
To address this, we checkpoint the model at the end of an epoch, where clients save and share the entire model for new peers to join. There are two checkpointing variants:
- HuggingFace checkpoint:
  In this approach, a client or a set of clients is designated as the checkpointers for the run. These clients upload their copy of the updated model to HuggingFace after each epoch and send the URL for this checkpoint to the coordinator. When a new client joins the run, it retrieves the checkpoint URL from the coordinator and connects to HuggingFace to download the latest copy of the model parameters and configuration files.
- P2P checkpoint:
  In the peer-to-peer (P2P) approach, a new client synchronizes by obtaining the latest model directly from other peers. It receives the model information and parameters from any available peer, requesting a set of parameters for each layer from different clients. This process allows the client to assemble the latest model state and participate in the training without an explicit upload to a central server.

Here's an example of a P2P model sharing interaction:
```mermaid
flowchart TB
    C((Coordinator))
    C1[Client]
    C2[Client]
    C3[Client]
    C4[Client]
    HF[/Hugging Face\]
    C --warmup---> C1
    C --warmup---> C2
    C --warmup---> C3
    HF --Get model config--> C4
    C4 -.Join.-> C
    C1 -.Layer 1 weights.-> C4
    C2 -.Layer 2 weights.-> C4
    C3 -.Layer 3 weights.-> C4
```
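A sketch of the same interaction in code, with an invented `request_layer_from_peer` helper standing in for the real networking layer:

```rust
// Sketch of P2P model sync for a late joiner: request each layer's parameters
// from a different peer and assemble a full copy of the model.
use std::collections::HashMap;

type LayerName = String;
type Weights = Vec<f32>;

/// Stand-in for "ask peer `peer_id` for the weights of `layer`".
fn request_layer_from_peer(peer_id: usize, layer: &str) -> Weights {
    println!("requesting {layer} from peer {peer_id}");
    vec![0.0; 8]
}

fn main() {
    let layers = ["embed", "layer.0", "layer.1", "lm_head"];
    let num_peers = 3;

    // Spread the layer requests across peers round-robin.
    let mut model: HashMap<LayerName, Weights> = HashMap::new();
    for (i, layer) in layers.iter().enumerate() {
        let peer = i % num_peers;
        model.insert(layer.to_string(), request_layer_from_peer(peer, layer));
    }
    assert_eq!(model.len(), layers.len());
    // With every layer assembled, the new client can join training at the next epoch.
}
```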
Psyche Tooling
This section aims to introduce developers and users of Psyche to the various tools available for testing the network and gaining a better understanding of how a real run operates, all within a local setup.
Psyche Centralized Client
The Psyche Centralized Client is responsible for joining and participating in a training run, contributing to the model's training process, and sharing results with other peers. It is a CLI application with various configurable options.
Command-Line Help for psyche-centralized-client
This document contains the help content for the `psyche-centralized-client` command-line program.
Command Overview:
- `psyche-centralized-client`
- `psyche-centralized-client show-identity`
- `psyche-centralized-client train`
psyche-centralized-client
Usage: psyche-centralized-client <COMMAND>
Subcommands:
- `show-identity` — Displays the client's unique identifier, used to participate in training runs
- `train` — Allows the client to join a training run and contribute to the model's training process
psyche-centralized-client show-identity
Displays the client's unique identifier, used to participate in training runs
Usage: psyche-centralized-client show-identity [OPTIONS]
Options:
- `--identity-secret-key-path <IDENTITY_SECRET_KEY_PATH>` — Path to the client's secret key. Create a new random one by running `openssl rand 32 > secret.key`, or use the `RAW_IDENTITY_SECRET_KEY` environment variable
psyche-centralized-client train
Allows the client to join a training run and contribute to the model's training process
Usage: psyche-centralized-client train [OPTIONS] --run-id <RUN_ID> --server-addr <SERVER_ADDR>
Options:
- `-i`, `--identity-secret-key-path <IDENTITY_SECRET_KEY_PATH>` — Path to the client's secret key. Create a new random one by running `openssl rand 32 > secret.key`. If not provided, a random one will be generated
- `-b`, `--bind-p2p-port <BIND_P2P_PORT>` — Sets the port for the client's P2P network participation. If not provided, a random port will be chosen
- `--tui <TUI>` — Enables a terminal-based graphical interface for monitoring analytics

  Default value: `true`

  Possible values: `true`, `false`

- `--run-id <RUN_ID>` — A unique identifier for the training run. This ID allows the client to join a specific active run
- `--data-parallelism <DATA_PARALLELISM>`

  Default value: `1`

- `--tensor-parallelism <TENSOR_PARALLELISM>`

  Default value: `1`

- `--micro-batch-size <MICRO_BATCH_SIZE>`
- `--write-gradients-dir <WRITE_GRADIENTS_DIR>` — If provided, every shared gradient this client sees will be written to this directory
- `--eval-tasks <EVAL_TASKS>`
- `--eval-fewshot <EVAL_FEWSHOT>`

  Default value: `0`

- `--eval-seed <EVAL_SEED>`

  Default value: `42`

- `--eval-task-max-docs <EVAL_TASK_MAX_DOCS>`
- `--checkpoint-dir <CHECKPOINT_DIR>` — If provided, every model parameter update will be saved in this directory after each epoch
- `--hub-repo <HUB_REPO>` — Path to the Hugging Face repository containing model data and configuration
- `--wandb-project <WANDB_PROJECT>`
- `--wandb-run <WANDB_RUN>`
- `--wandb-group <WANDB_GROUP>`
- `--wandb-entity <WANDB_ENTITY>`
- `--write-log <WRITE_LOG>`
- `--optim-stats-steps <OPTIM_STATS_STEPS>`
- `--grad-accum-in-fp32`

  Default value: `false`

- `--dummy-training-delay-secs <DUMMY_TRAINING_DELAY_SECS>`
- `--max-concurrent-parameter-requests <MAX_CONCURRENT_PARAMETER_REQUESTS>`

  Default value: `10`

- `--server-addr <SERVER_ADDR>`
This document was generated automatically by `clap-markdown`.
Psyche Centralized Server
The Psyche Centralized Server is responsible for hosting the coordinator and a data provider locally to enable testing the network and training a test model. The server requires a configuration file named `state.toml` to load the initial settings for the coordinator.
Command-Line Help for psyche-centralized-server
This document contains the help content for the `psyche-centralized-server` command-line program.
Command Overview:
- `psyche-centralized-server`
- `psyche-centralized-server validate-config`
- `psyche-centralized-server run`
psyche-centralized-server
Usage: psyche-centralized-server <COMMAND>
Subcommands:
- `validate-config` — Checks that the configuration declared in the `state.toml` file is valid
- `run` — Starts the server and launches the coordinator with the declared configuration
psyche-centralized-server validate-config
Checks that the configuration declared in the `state.toml` file is valid
Usage: psyche-centralized-server validate-config [OPTIONS] --state <STATE>
Options:
- `--state <STATE>` — Path to the `state.toml` file to validate
- `--data-config <DATA_CONFIG>` — Path to the `data.toml` file to validate. If not provided, it will not be checked
psyche-centralized-server run
Starts the server and launches the coordinator with the declared configuration
Usage: psyche-centralized-server run [OPTIONS] --state <STATE>
Options:
- `--state <STATE>` — Path to TOML of Coordinator state
- `-s`, `--server-port <SERVER_PORT>` — Port for the server, which clients will use to connect. If not specified, a random free port will be chosen
- `--tui <TUI>`

  Default value: `true`

  Possible values: `true`, `false`

- `--data-config <DATA_CONFIG>` — Path to TOML of data server config
- `--save-state-dir <SAVE_STATE_DIR>` — Path to save the server and coordinator state
- `--init-warmup-time <INIT_WARMUP_TIME>` — Sets the warmup time for the run. This overrides the `warmup_time` declared in the state file
- `--init-min-clients <INIT_MIN_CLIENTS>` — Sets the minimum number of clients required to start a run. This overrides the `min_clients` declared in the state file
- `--withdraw-on-disconnect <WITHDRAW_ON_DISCONNECT>` — Allows clients to withdraw if they need to disconnect from the run (this option has no effect in the centralized version)

  Default value: `true`

  Possible values: `true`, `false`
This document was generated automatically by `clap-markdown`.
Running a Local Testnet
The local testnet is a helper application designed to easily spin up a coordinator and multiple clients. It's useful for doing sample runs on your own hardware, and for development.
Pre-requisites
Since we want to run many clients and the coordinator, we'll need several terminal windows to monitor them. The tool uses tmux to create them.
If you're using the Nix flake, tmux is already included.
Command-Line Help for psyche-centralized-local-testnet
This document contains the help content for the `psyche-centralized-local-testnet` command-line program.
Command Overview:
psyche-centralized-local-testnet
Usage: psyche-centralized-local-testnet <COMMAND>
Subcommands:
- `start` — Starts the local testnet, running each part of the system in a separate terminal pane
psyche-centralized-local-testnet start
Starts the local-testnet running each part of the system in a separate terminal pane
Usage: psyche-centralized-local-testnet start [OPTIONS] --num-clients <NUM_CLIENTS> --config-path <CONFIG_PATH>
Options:
- `--num-clients <NUM_CLIENTS>` — Number of clients to start
- `--config-path <CONFIG_PATH>` — File path to the configuration that the coordinator will need to start
- `--write-distro-data <WRITE_DISTRO_DATA>` — If provided, write DisTrO data to disk in this path
- `--server-port <SERVER_PORT>` — Port where the server for this testnet will listen (this is the one that clients must use when connecting)

  Default value: `20000`

- `--tui <TUI>` — Enables a terminal-based graphical interface for monitoring analytics

  Default value: `true`

  Possible values: `true`, `false`

- `--random-kill-num <RANDOM_KILL_NUM>` — Kill N clients randomly every `<RANDOM_KILL_INTERVAL>` seconds
- `--allowed-to-kill <ALLOWED_TO_KILL>` — Which clients we're allowed to kill randomly
- `--random-kill-interval <RANDOM_KILL_INTERVAL>` — Kill `<RANDOM_KILL_NUM>` clients randomly every N seconds

  Default value: `120`

- `--log <LOG>` — Sets the level of logging for more granular information

  Default value: `info,psyche=debug`

- `--first-client-checkpoint <FIRST_CLIENT_CHECKPOINT>` — HF repo where the first client can get the model and the configuration to use
- `--hf-token <HF_TOKEN>`
- `--write-log`

  Default value: `false`

- `--wandb-project <WANDB_PROJECT>`
- `--wandb-group <WANDB_GROUP>`
- `--wandb-entity <WANDB_ENTITY>`
- `--optim-stats <OPTIM_STATS>`
- `--eval-tasks <EVAL_TASKS>`
This document was generated automatically by `clap-markdown`.