Overview

Psyche is a system that empowers strangers to collaboratively train a machine learning model in a decentralized and trustless manner.

Read the Psyche announcement here.

The Psyche code is available on GitHub at PsycheFoundation/psyche.

The system is composed of three main actors:

  • Coordinator: Serves as a source of truth for global state available to all clients in a given training run. Each run has one coordinator that oversees the entire process. The coordinator is implemented as a program running on the Solana Blockchain.
  • Client: A user participating in a training run. Clients receive the model to be trained and a specific dataset for that run. They send information to the coordinator to progress the training run and use a peer-to-peer network to share their results at each training step with other clients.
  • Data Provider: An optional server that stores the training data and serves it to clients. A run can use a data provider, an HTTP location for the data, or require clients to bring their own copy of the dataset.
flowchart TB
    subgraph run id: test_model_2
        direction TB
        subgraph Solana
            C(("Coordinator"))
        end
        C <--> C1(("Client")) & C2(("Client")) & C3(("Client"))
        C1 <-.-> C2
        C3 <-.-> C2 & C1
        DT["Data hosted on HTTP"] --> C1 & C2 & C3
    end
    subgraph run id: test_model_1
        direction TB
        subgraph Solana2["Solana"]
            CC(("Coordinator"))
        end
        CC <--> C11(("Client")) & C22(("Client")) & C33(("Client"))
        C11 <-.-> C22
        C33 <-.-> C22 & C11
        DTT["Data server"] --> C11 & C22 & C33
    end

What does the training process look like?

The training process for a given model is coordinated across all clients. A training run is divided into epochs, where clients can join and leave the run, and each epoch is further divided into small steps that incrementally train the model.

During a training run, clients primarily perform three tasks:

  • Training: Train the model using an assigned subset of the data.
  • Witnessing: Verify the liveness and correctness of other participants.
  • Verifying: Recompute and compare results to identify and mitigate malicious participants.

Waiting for Clients & Warmup

At the start of an epoch, all clients have a window of time to join the run by asking the coordinator to add them and then connecting to the other participating clients.

Once a minimum threshold of clients has been met, the run transitions to the Warmup phase and begins a countdown that gives connected clients time to update their copy of the model. When the countdown ends, the run enters the Training phase.

Training

At the beginning of an epoch, after the Warmup phase ends, clients are assigned specific tasks that require them to train the model on a portion of the data.

The coordinator contains information that uniquely assigns pieces of training data to clients based on the current round.
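One way to picture this kind of assignment is that every client derives the same mapping from shared public state, so no extra messages are needed to agree on who trains on what. The sketch below is illustrative only: the function name `assign_batches`, the `run_seed` parameter, and the shuffle-then-deal scheme are assumptions made for this example, not Psyche's actual algorithm.

```python
import hashlib
import random


def assign_batches(run_seed: bytes, round_index: int, client_ids: list[str],
                   batches_per_round: int) -> dict[str, list[int]]:
    """Deterministically map one round's batch indices to clients.

    Hypothetical scheme: shuffle the clients with a seed derived from shared
    state, then deal that round's batch indices out round-robin.
    """
    # Every client computes the same seed from the same public inputs.
    digest = hashlib.sha256(run_seed + round_index.to_bytes(8, "little")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))

    order = sorted(client_ids)  # canonical order before shuffling
    rng.shuffle(order)

    first_batch = round_index * batches_per_round
    assignment: dict[str, list[int]] = {client: [] for client in order}
    for offset in range(batches_per_round):
        client = order[offset % len(order)]
        assignment[client].append(first_batch + offset)
    return assignment
```

Because the mapping depends only on the seed, the round index, and the set of clients, two clients that list their peers in a different order still compute the same assignment.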

If clients have already been training (i.e., it is not the first round of the epoch), they will apply the results from the previous round, then retrieve the data sample they need for the current round.

After completing the training on their assigned data, each client emits a p2p broadcast to all other clients containing their training results and a cryptographic commitment that binds them to those results.
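To illustrate what "binds them to those results" can mean, here is a minimal hash-based commitment sketch. It assumes SHA-256 over the client id, the round number, and the serialized result; the function names `commit` and `verify` and the field layout are hypothetical, and Psyche's real commitment format may differ.

```python
import hashlib


def commit(client_id: str, round_index: int, result: bytes) -> str:
    """Return a hex digest binding a client and round to a result payload."""
    id_bytes = client_id.encode("utf-8")
    h = hashlib.sha256()
    # Length-prefix the id so the fields cannot run into each other.
    h.update(len(id_bytes).to_bytes(4, "little"))
    h.update(id_bytes)
    h.update(round_index.to_bytes(8, "little"))
    h.update(result)
    return h.hexdigest()


def verify(commitment: str, client_id: str, round_index: int, result: bytes) -> bool:
    """Check that a downloaded result matches the commitment broadcast earlier."""
    return commitment == commit(client_id, round_index, result)
```

A peer that downloads a result can recompute the digest and reject the payload if it does not match the commitment it received in the broadcast.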

As training results are received from other clients, they are downloaded so that they can later be incorporated into the current model.

Witnessing

At the start of each round, one or more clients are randomly selected as witnesses. The number of witnesses is configurable. Witnesses train the model as usual, but also build bloom filters that track which nodes they have received training results from, signifying that those nodes are actively participating and providing valid results.
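The following sketch shows the two ideas in this paragraph in miniature: picking witnesses with a shared seed, and recording "I received a result from this node" in a bloom filter. The class, the function `select_witnesses`, and the parameters (`num_bits`, `num_hashes`, `round_seed`) are assumptions for illustration and are not taken from the Psyche codebase.

```python
import hashlib
import random


class BloomFilter:
    """A small bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def select_witnesses(client_ids: list[str], round_seed: int, count: int) -> list[str]:
    """Pick `count` witnesses with a seeded RNG so every node agrees on the choice."""
    rng = random.Random(round_seed)
    return rng.sample(sorted(client_ids), count)
```

A witness would call `add` with the id of each node whose result it received. The filter is compact, at the cost of a small chance of reporting a node that was never added.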

These bloom filters are sent to the coordinator, which then combines them into a provable consensus of which results to apply to the model.
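As a rough picture of how several witness reports could be combined, the sketch below accepts a client's result when at least `quorum` witnesses report having seen it. This voting rule is an assumption chosen for illustration; the source does not specify how the coordinator derives its consensus. Plain Python sets stand in for bloom filters, since only membership tests are needed.

```python
def results_to_apply(reports: dict[str, set[str]], candidates: list[str],
                     quorum: int) -> list[str]:
    """Return the clients whose results at least `quorum` witnesses saw.

    `reports` maps a witness id to the set of clients it received results from.
    Sets stand in for bloom filters here; only `in` checks are used.
    """
    accepted = []
    for client in sorted(candidates):
        votes = sum(1 for seen in reports.values() if client in seen)
        if votes >= quorum:
            accepted.append(client)
    return accepted
```

For example, with three witnesses and a quorum of two, a client seen by only one witness would be left out of the round's update.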

Once a witness quorum is reached, the coordinator advances to the Witness phase to allow all clients a brief window to download every training result.

Once the Witness phase concludes, the coordinator returns to the Training phase. Clients are assigned new data, and the process repeats. After a predefined number of rounds, a Cooldown round occurs, marking the end of an epoch.
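The phases described in this section can be summarized as a small state machine. This is a simplified sketch: the enum names, the boolean condition flags, and the transition from Cooldown back to waiting for clients are assumptions made for the example, not the coordinator's actual implementation.

```python
from enum import Enum, auto


class Phase(Enum):
    WAITING_FOR_CLIENTS = auto()
    WARMUP = auto()
    TRAINING = auto()
    WITNESS = auto()
    COOLDOWN = auto()


def next_phase(phase: Phase, *, enough_clients: bool = False,
               timer_expired: bool = False, quorum_reached: bool = False,
               last_round: bool = False) -> Phase:
    """Return the phase that follows `phase` given the current conditions."""
    if phase is Phase.WAITING_FOR_CLIENTS:
        return Phase.WARMUP if enough_clients else phase
    if phase is Phase.WARMUP:
        return Phase.TRAINING if timer_expired else phase
    if phase is Phase.TRAINING:
        return Phase.WITNESS if quorum_reached else phase
    if phase is Phase.WITNESS:
        if not timer_expired:
            return phase
        # After the final round of the epoch, cool down instead of training again.
        return Phase.COOLDOWN if last_round else Phase.TRAINING
    # Assumed: Cooldown ends the epoch and the next epoch waits for clients again.
    return Phase.WAITING_FOR_CLIENTS if timer_expired else phase
```

Calling `next_phase` with no flags set leaves the phase unchanged, which mirrors the coordinator staying in a phase until its exit condition is met.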

The witness/train loop visualized

Here's a high-level overview of the process. Additional details exist, but this captures the overall flow:

sequenceDiagram
    participant Client1
    participant Client2
    participant Coordinator
    participant DataServer
    Client1->>DataServer: get_data
    Client2->>DataServer: get_data
    Coordinator->>Client2: witness
    Note over Client1: Train
    Note over Client2: Train
    Client1->>Client2: Send results
    Client2->>Client1: Send results
    Note over Client1: Download results
    Note over Client2: Download results
    Client2->>Coordinator: Send witness
    Note over Coordinator: Quorum reached
    Note over Coordinator: Starting Witness phase