Coordinator

The Coordinator stores metadata about the run and a list of participants. It handles the transition between each Phase of a Round, and provides a random seed that's used to determine data assignments, witnesses, and more.

It's responsible for providing a point of synchronization for all clients within a run.

Ticks

When certain events occur or time-based conditions are met, the Coordinator can be "ticked" forwards to transition from one Phase to another Phase.

sequenceDiagram
    loop
        Note over Backend, Coordinator: Wait for a timeout or backend state
        Backend->>Coordinator: Tick
        Coordinator->>Backend: New state produced
        Backend->Client1: New coordinator state consumed by Client
        Backend->Client2: New coordinator state consumed by Client
    end

Beginning an Epoch

The Coordinator begins in the WaitingForMembers phase, with no clients connected.

Whatever backend you're running the Coordinator in should accept pending clients to be added to upcoming Epochs.

When inside the WaitingForMembers phase, your backend will pass new clients to the Coordinator until a configured min_clients threshold is met, at which point the coordinator's tick will transition it to the Warmup phase.

sequenceDiagram
    Note over Coordinator: min_clients = 2
    Client1->>Coordinator: Join
    Client2->>Coordinator: Join
    Note over Coordinator: Entering Warmup
    Client1->>Client2: Connect
    Client2->>Client1: Connect
    Note over Coordinator: The Warmup countdown elapses
    Note over Coordinator: Entering Training

Warmup

This phase is designed to let all clients download the model & load it onto their GPUs.

If a client has dropped whilst waiting for the warmup time, the Backend then removes the client from the Coordinator's clients list. If the number of clients falls below min_clients, the Coordinator goes back to the WaitingForMembers phase.

Once the Warmup time passes, the Coordinator loads all the information for the next training round and change its phase to RoundTrain. The Server will broadcast this Training Coordinator state to all clients.

Training

In this phase, the Coordinator provides a random seed. Each client can use this seed, alongside the current round index and epoch index to determine which indicies of the training data to use.

Witnessing

As clients complete their training, they send their results to all other clients, including the Witnesses. The witnesses will each send a witness proof to the Coordinator, building towards a witness quorum.

A witness proof contains a bloom filter describing which pieces of data the witness recieved training results for, and which clients did that work. Elected witnesses are responsible for creating these witness proofs and and sending them to the Coordinator.

The witnesses for each round are chosen randomly from all the clients, using the same random seed as for data assignments. A witness will attempt to send an opportunistic witness message once it's seen a recieved a training result for every single batch in the current round.

Witness Quorum

The Coordinator advances the run from the Training phase to the Witness phase in one of two ways:

  • If enough witnesses observe all results and reach a witness quorum for the round, they notify the Coordinator that it is safe to advance. This process, named opportunistic witnessing, accelerates the transition to the Witness phase, rather than having to wait a fixed time for training results.
  • If witnesses do not receive all required results from other clients before the maximum time specified for the Training phase, the Coordinator will nontheless transition to the Witness phase after the maximum Training time elapses.

Witness phase

This phase exists to give the witnesses an opportunity to send their proofs to the Coordinator in the event that they have not received enough training results from other clients to have reached the quorum and send their proofs opportunistically.

There is also brief slack period for non-witness nodes to catch up by downloading any remaining results they might have not recieved.

When the Witness phase finishes via timeout, the Coordinator transitions from Witness to the Cooldown phase in three cases:

  • If we are in the last round of the epoch.
  • If the clients have dropped to less than the minimum required by the config.
  • If the number of witnesses for the round is less than the quorum specified by the config.

Any clients that have failed health checks will also be removed from the current epoch.

Cooldown

The Cooldown phase is the last phase of an epoch, during which the Cooordinator waits for either the Cooldown period to elapse, or a checkpoint to have happened.

When the Cooldown phase begins, the Coordinator resets the current model checkpoint state to Checkpoint::P2P, signifying that new joiners should download the latest copy of the model from the other participants.

Upon exiting the Cooldown phase, the Coordinator transitions to the next epoch, saving the previous epoch state, and moving back to the WaitingForMembers phase.

It all comes together!

sequenceDiagram
    Backend->>Coordinator: tick
    Coordinator->>Backend: Change state to `RoundTrain`
    Backend->>Client1: New state
    Backend->>Client2: New state
    par Start training
        Client1->>Client1: Start training
        Client2->>Client2: Start training
    end
    Client1->>Committee: get_witness
    Client2->>Committee: get_witness
    Committee->>Client1: false
    Committee->>Client2: true
    Note over Client1: Train
    Note over Client2: Train
    Note over Client2: Fill bloom filters
    Client2->>Backend: try send optimistic witness
    Backend->>Coordinator: Witness message
    Note over Coordinator: Enough witnesses for round
    Coordinator->>Coordinator: Update state to RoundWitness
    Note over Coordinator: Timeout round witness time
    alt step > total steps
        Coordinator->>Coordinator: Update state to Waitingformembers
    else height == rounds per epoch 
        Coordinator->>Coordinator: Update state to Cooldown
    else
        Coordinator->>Coordinator: Update state to RoundTrain with step + 1
    end

Centralized Backend

In this Backend, the Coordinator is owned and ticked forwards by a Server that communicates via clients over TCP.

The Server's Coordinator is initially configured in main.rs. It's loaded using the configuration file state.toml.

flowchart LR
    S[Server] --run--> A[App]
    S --new--> C[Coordinator]
    C --run_id
        init warmup
        min clients
        model--> A

The Server uses some parts of the Coordinator configuration, like the data server configuration, if enabled, to boot up all the functionality it needs.

When a new client joins the run it has to communicate the run_id that it wants to join, to ensure the client's joining the correct run. After processing the join message, the client is added to a pending clients list, and runs the Coordinator's tick function to potentially add the client into the run.

When a tick condition is met, the Server ticks the Coordinator forwards, then broadcasts the Coordinator's new state to all connected clients.

Health checks

In the start function, the client spawns a new task to repeatedly send health checks to the server. Nodes, also known as trainers in this state, are assigned a score determined by the Coordinator using the trainer_healthy_score_by_witnesses method. This score increases as a client sends the required data to be added to the participants' bloom filters, allowing the Coordinator to confirm that the client is actively participating in the training.

A node also sends a list of other nodes it considers unhealthy to the server using the HealthCheck message. The Coordinator processes this information to determine whether those nodes are healthy. Nodes deemed inactive or non-participatory are marked for removal in the next round.

Decentralized Backend

TODO