Joining a training run
This guide is for end-users who wish to participate in a training run, it assumes you have a predistributed run-manager binary. If you are looking for more in-depth documentation or how to run from source you can refer to the development documentation
Prerequisites
Before joining a run you need to make sure you meet a few requisites:
Linux Operating System
The Psyche client currently only runs on modern Linux distributions.
NVIDIA GPU and Drivers
Psyche requires an NVIDIA CUDA-capable GPU for model training. Your system must have NVIDIA drivers installed.
To check if you have NVIDIA drivers:
nvidia-smi
If this command doesn't work or shows an error, you need to install NVIDIA drivers. Follow NVIDIA's installation guide for your Linux distribution.
Docker
The Psyche client runs inside a Docker container. You need Docker Engine installed on your system.
To check if Docker is installed:
docker --version
If Docker isn't installed, follow the Docker Engine installation guide for your Linux distribution.
NVIDIA Container Toolkit
The NVIDIA Container Toolkit is required to enable GPU access inside Docker containers, which Psyche uses for model training.
To install the NVIDIA Container Toolkit, follow the NVIDIA Container Toolkit installation guide for your Linux distribution.
Solana Wallet/Keypair
You need a Solana keypair (wallet) to participate in training. This keypair identifies your client on the blockchain.
If you need to create a new keypair, you can use the Solana CLI, specifying where you want to create it
solana-keygen new --outfile <path/to/keypair/file.json>
Quick Start
The recommended way to run a Psyche client is through the run-manager, which should have been distributed to you. The run manager will handle downloading the correct Docker image, starting your client, and keeping it updated automatically.
Before running it, you should create an environment file with some needed variables.
The .env file should have at least this defined:
WALLET_PATH=/path/to/your/keypair.json
# Required: Solana RPC Endpoints
RPC=https://your-primary-rpc-provider.com
WS_RPC=wss://your-primary-rpc-provider.com
# Required: Which run id to join
RUN_ID=your_run_id_here
# Recommended: Fallback RPC Endpoints (for reliability)
RPC_2=https://your-backup-rpc-provider.com
WS_RPC_2=wss://your-backup-rpc-provider.com
Then, you can start training through the run manager running:
./run-manager --env-file /path/to/your/.env
After the initial setup, you'll see the Psyche client logs streaming in real-time. These logs show training progress, network status, and other important information.
To stop the client, press Ctrl+C in the terminal.
RPC Hosts
We recommend using a dedicated RPC service such as Helius, QuickNode, Triton, or self-hosting your own Solana RPC node.
Additional config variables
In general it's not neccesary to change these variables to join a run since we provide sensible defaults, though you might need to.
NVIDIA_DRIVER_CAPABILITIES - An environment variable that the NVIDIA Container Toolkit uses to determine which compute capabilities should be provided to your container. It is recommended to set it to 'all', e.g. NVIDIA_DRIVER_CAPABILITIES=all.
DATA_PARALLELISM - Number of GPUs to distribute training data across.
- If you have multiple GPUs, you can set this to 2, 4, etc. to speed up training
- If you have 1 GPU, set this to
1
TENSOR_PARALLELISM - Number of GPUs to distribute the model across, this lets you train a model you can't fit on one single GPU.
- If you have 1 GPU, set this to
1 - If your have
nGPUs you can distribute the model across all of them by setting it ton.
MICRO_BATCH_SIZE - Number of samples processed per GPU per training step
- Set as high as your GPU memory allows
AUTHORIZER - The Solana address that authorized your wallet to join this run
- See Authentication for more details
Testing Authorization
Before joining a run, you can verify that your client is authorized by using the run-manager command:
run-manager can-join --run-id <RUN_ID> --authorizer <AUTHORIZER> --address <PUBKEY>
Where:
<RUN_ID>is the run ID you want to join (from your.envfile)<AUTHORIZER>is the Solana authorizer address (from your.envfile)<PUBKEY>is your wallet's public key
You can find your wallet's public key by running:
solana address
This command will return successfully if your wallet is authorized to join the run. This helps debug authorization issues before attempting to join.
Troubleshooting
Docker Not Found
Error: Failed to execute docker command. Is Docker installed and accessible?
Solution: Install Docker using the Docker installation guide. Make sure your user is in the docker group:
sudo usermod -aG docker $USER
Then log out and back in for the group change to take effect.
NVIDIA Drivers Not Working
Error: Container starts but crashes immediately, or you see GPU-related errors in logs
Solution:
- Verify drivers are installed:
nvidia-smi
RPC Connection Failures
Error: RPC error: failed to get account or connection timeouts
Solution:
- Verify your RPC endpoints are correct in your
.envfile - Check that your RPC provider API key is valid, if present
- Try your backup RPC endpoints (
RPC_2,WS_RPC_2)
Wallet/Keypair Not Found
Error: Failed to read wallet file from: /path/to/keypair.json
Solution:
- Verify the file exists:
ls -l /path/to/keypair.json - Check file permissions:
chmod 600 /path/to/keypair.json - If using default location, ensure
~/.config/solana/id.jsonexists - Verify the path in your
.envfile matches the actual file location
Container Fails to Start
Error: Docker run failed: ... or container exits immediately
Solution:
- Check Docker logs for more details
- Ensure all required variables are in your
.envfile - Verify GPU access:
docker run --rm --gpus all ubuntu nvidia-smi - Check disk space:
df -h - Verify you have enough VRAM for your
MICRO_BATCH_SIZEsetting (try reducing it)
Process Appears Stuck
Error: No new logs appearing, process seems frozen, stuck with error messages.
Solution:
- The run manager will attempt to restart the client, but sometimes this can fail and hang.
- Press
Ctrl+Cto stop run-manager and wait a few seconds. - If for some reason this fails to stop it, you can check the running containers with
docker psand force stop the container manually withdocker stop.
Version Mismatch Loop
Symptom: Container keeps restarting every few seconds with "version mismatch"
Solution:
- This usually means there's an issue with pulling the new Docker image
- Check your internet connection
- Verify Docker can be run
docker --version - Verify Docker Hub is accessible:
docker pull hello-world - Check disk space for Docker images:
docker system df
Checking Container Logs Manually
If run-manager exits but you want to see what happened, you can view Docker logs:
# List recent containers (including stopped ones)
docker ps -a
# View logs for a specific container
docker logs CONTAINER_ID
Claiming Rewards
After participating in training and accumulating rewards, you can claim them using the run-manager command:
run-manager treasurer-claim-rewards \
--rpc <RPC> \
--run-id <RUN_ID> \
--wallet-private-key-path <JSON_PRIVATE_KEY_PATH>
Where:
<RPC>is your Solana RPC endpoint (same as in your.envfile)<RUN_ID>is the run ID you participated in<JSON_PRIVATE_KEY_PATH>is the path to your wallet keypair file (e.g.,~/.config/solana/id.json)
This command will claim any rewards you've earned from contributing to the training run.
Building from source
If you wish to run the run-manager from source, first make sure that you have followed the development setup, are inside the nix environment, and run just run-manager path/to/.env.file