Distributed Proving

Generating a ZisK proof can be computationally intensive, especially for large programs. The distributed proving system lets you split the workload across multiple machines, reducing proof generation time by parallelizing the work.

This chapter covers how to set up and run a distributed proving cluster, from launching a coordinator to connecting workers and submitting proof requests.

How It Works

A distributed proving cluster consists of two roles:

  • A Coordinator that receives proof requests and orchestrates the work.
  • One or more Workers that execute the actual proof computation.

When you submit a proof request, the process unfolds in three phases:

  1. Partial Contributions — The coordinator assigns segments of the work to available workers based on their compute capacity. Each worker computes its partial challenges independently.
  2. Prove — Workers compute the global challenge and generate their respective partial proofs.
  3. Aggregation — The first worker to finish is selected as the aggregator. It collects all partial proofs and produces the final proof.

The coordinator returns the final proof to the client once aggregation completes.

Workers report their compute capacity when they register. The coordinator selects workers sequentially from the available pool until the requested capacity is met. While assigned to a job, a worker is marked as busy and won't receive new tasks.
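The selection rule above can be sketched as a simple greedy pass. This is an illustrative Python sketch, not the coordinator's actual code; the function and field names are ours.

```python
# Hypothetical sketch of the coordinator's worker selection: walk the pool
# in order, skipping busy workers, until the requested capacity is covered.

def select_workers(workers, requested_capacity):
    """Pick available workers in registration order until capacity is met."""
    selected, covered = [], 0
    for w in workers:
        if w["busy"]:
            continue  # assigned to another job; not eligible
        selected.append(w["id"])
        covered += w["compute_units"]
        if covered >= requested_capacity:
            return selected  # capacity met
    return None  # not enough free capacity; the request is rejected

workers = [
    {"id": "w1", "compute_units": 4, "busy": False},
    {"id": "w2", "compute_units": 4, "busy": True},
    {"id": "w3", "compute_units": 8, "busy": False},
]
print(select_workers(workers, 10))  # ['w1', 'w3']
```

Note that with this rule a request can fail even when total registered capacity is sufficient, if too many workers are currently busy.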

Getting Started

Building

From the project root, build both binaries:

cargo build --release --bin zisk-coordinator --bin zisk-worker

Running Locally

1. Start the coordinator:

cargo run --release --bin zisk-coordinator

2. Start a worker (in a separate terminal):

cargo run --release --bin zisk-worker -- --elf <elf-file-path> --inputs-folder <inputs-folder>

3. Submit a proof request (in a separate terminal):

cargo run --release --bin zisk-coordinator -- prove --inputs-uri <input-filename> --compute-capacity 10

The --compute-capacity flag specifies how many compute units the proof requires. The coordinator assigns workers until this capacity is covered.

Docker Deployment

For multi-machine setups, Docker simplifies deployment:

# Build the image (CPU-only)
docker build -t zisk-distributed:latest -f distributed/Dockerfile .

# For GPU support
docker build --build-arg GPU=true -t zisk-distributed:gpu -f distributed/Dockerfile .

# Create a network for container DNS resolution
docker network create zisk-net || true

Start the coordinator:

LOGS_DIR="<logs-folder>"
docker run -d --rm --name zisk-coordinator \
  --network zisk-net \
  -v "$LOGS_DIR:/var/log/distributed" \
  -e RUST_LOG=info \
  zisk-distributed:latest \
  zisk-coordinator --config /app/config/coordinator/dev.toml

Start a worker:

LOGS_DIR="<logs-folder>"
PROVING_KEY_DIR="<provingKey-folder>"
ELF_DIR="<elf-folder>"
INPUTS_DIR="<inputs-folder>"
docker run -d --rm --name zisk-worker-1 \
  --network zisk-net --shm-size=20g \
  -v "$LOGS_DIR:/var/log/distributed" \
  -v "$HOME/.zisk/cache:/app/.zisk/cache:ro" \
  -v "$PROVING_KEY_DIR:/app/proving-keys:ro" \
  -v "$ELF_DIR:/app/elf:ro" \
  -v "$INPUTS_DIR:/app/inputs:ro" \
  -e RUST_LOG=info \
  zisk-distributed:latest zisk-worker --coordinator-url http://zisk-coordinator:50051 \
    --elf /app/elf/zec.elf --proving-key /app/proving-keys --inputs-folder /app/inputs

Submit a proof:

docker exec -it zisk-coordinator \
  zisk-coordinator prove --inputs-uri <input-filename> --compute-capacity 10

Note: Use the filename only when submitting proofs, not the full path. Workers resolve files relative to their --inputs-folder.

Container paths reference:

| Path | Purpose |
|------|---------|
| /app/config/{coordinator,worker}/ | Configuration files |
| /app/bin/ | Binaries |
| /app/.zisk/cache/ | Cache (mount from host $HOME/.zisk/cache) |
| /var/log/distributed/ | Log files |

Coordinator

The coordinator is responsible for managing the distributed proof generation process. It receives proof requests from clients and assigns work to available workers.

To start a coordinator instance with default settings:

cargo run --release --bin zisk-coordinator

Coordinator Configuration

The coordinator can be configured using either a TOML configuration file or command-line arguments. If no configuration file is explicitly provided, the system falls back to the ZISK_COORDINATOR_CONFIG_PATH environment variable to locate one. If neither the CLI argument nor environment variable is set, built-in defaults are used.

Example:

# You can specify the configuration file path using a command line argument:
cargo run --release --bin zisk-coordinator -- --config /path/to/my-config.toml

# You can specify the configuration file path using an environment variable:
export ZISK_COORDINATOR_CONFIG_PATH="/path/to/my-config.toml"
cargo run --release --bin zisk-coordinator
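The lookup order can be summarized in a few lines. This is an illustrative sketch (the function name is ours, not the coordinator's code): an explicit --config path wins, then the environment variable, and if neither is set the built-in defaults apply.

```python
import os

def resolve_config_path(cli_config=None):
    """Return the config file path, or None to fall back to built-in defaults."""
    if cli_config:
        return cli_config  # explicit --config argument has priority
    # Otherwise fall back to the environment variable, if set
    return os.environ.get("ZISK_COORDINATOR_CONFIG_PATH")

print(resolve_config_path("custom.toml"))  # custom.toml
```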

The table below lists the available configuration options for the Coordinator:

| TOML Key | CLI Argument | Environment Variable | Type | Default | Description |
|---|---|---|---|---|---|
| service.name | - | - | String | ZisK Distributed Coordinator | Service name |
| service.environment | - | - | String | development | Service environment (development, staging, production) |
| server.host | - | - | String | 0.0.0.0 | Server host |
| server.port | --port | - | Number | 50051 | Server port |
| server.proofs_dir | --proofs-dir | - | String | proofs | Directory to save generated proofs (conflicts with --no-save-proofs) |
| - | --no-save-proofs | - | Boolean | false | Disable saving proofs (conflicts with --proofs-dir) |
| - | -c, --compressed-proofs | - | Boolean | false | Generate compressed proofs |
| server.shutdown_timeout_seconds | - | - | Number | 30 | Graceful shutdown timeout in seconds |
| logging.level | - | RUST_LOG | String | debug | Logging level (error, warn, info, debug, trace) |
| logging.format | - | - | String | pretty | Logging format (pretty, json, compact) |
| logging.file_path | - | - | String | - | Optional. Log file path (enables file logging) |
| coordinator.max_workers_per_job | - | - | Number | 10 | Maximum workers per proof job |
| coordinator.max_total_workers | - | - | Number | 1000 | Maximum total registered workers |
| coordinator.phase1_timeout_seconds | - | - | Number | 300 | Phase 1 timeout in seconds |
| coordinator.phase2_timeout_seconds | - | - | Number | 600 | Phase 2 timeout in seconds |
| coordinator.webhook_url | --webhook-url | - | String | - | Optional. Webhook URL to notify on job completion |

Configuration File Examples

Example development configuration file:

[service]
name = "ZisK Distributed Coordinator"
environment = "development"

[logging]
level = "debug"
format = "pretty"

Example production configuration file:

[service]
name = "ZisK Distributed Coordinator"  
environment = "production"

[server]
host = "0.0.0.0"
port = 50051
proofs_dir = "proofs"

[logging]
level = "info"
format = "json"
file_path = "/var/log/distributed/coordinator.log"

[coordinator]
max_workers_per_job = 20      # Maximum workers per proof job
max_total_workers = 5000      # Maximum total registered workers  
phase1_timeout_seconds = 600  # 10 minutes for phase 1
phase2_timeout_seconds = 1200 # 20 minutes for phase 2
webhook_url = "http://webhook.example.com/notify?job_id={$job_id}"

Webhook URL

The Coordinator can notify an external service when a job finishes by sending a request to a configured webhook URL. The placeholder {$job_id} can be included in the URL and will be replaced with the finished job’s ID. If no placeholder is provided, the Coordinator automatically appends /{job_id} to the end of the URL.
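The substitution rule reads as follows in a short sketch; the helper name is hypothetical, not the coordinator's actual code.

```python
def build_webhook_url(template, job_id):
    """Expand the {$job_id} placeholder, or append /{job_id} if it is absent."""
    if "{$job_id}" in template:
        return template.replace("{$job_id}", job_id)
    return template + "/" + job_id  # no placeholder: append the job ID

print(build_webhook_url("http://webhook.example.com/notify?job_id={$job_id}", "job_12345"))
# http://webhook.example.com/notify?job_id=job_12345
print(build_webhook_url("http://webhook.example.com/notify", "job_12345"))
# http://webhook.example.com/notify/job_12345
```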

All webhook notifications are sent as JSON POST requests with the following structure:

{
  "job_id": "job_12345",
  "success": true,
  "duration_ms": 45000,
  "proof": <array of u64...>,
  "timestamp": "2025-10-03T14:30:00Z",
  "error": null
}

Field Descriptions

| Field | Type | Description |
|---|---|---|
| job_id | string | Unique identifier for the proof generation job |
| success | boolean | true if proof generation completed successfully, false if it failed |
| duration_ms | number | Total execution time in milliseconds from job start to completion |
| proof | array<u64> \| null | Final proof data as an array of integers (only present on success) |
| timestamp | string | ISO 8601 timestamp when the notification was sent |
| error | object \| null | Error details (only present on failure) |

Error Object Structure

When success is false, the error field contains:

{
  "code": "WORKER_FAILURE",
  "message": "Worker node-003 failed during proof generation: Out of memory"
}

Successful Proof Generation Example:

{
  "job_id": "job_abc123",
  "success": true,
  "duration_ms": 32500,
  "proof": [1234567890, 9876543210, 1357924680, ...],
  "timestamp": "2025-10-03T14:30:25Z",
  "error": null
}

Failed Job Example:

{
  "job_id": "job_def456",
  "success": false,
  "duration_ms": 15000,
  "proof": null,
  "timestamp": "2025-10-03T14:31:10Z",
  "error": {
    "code": "WORKER_ERROR",
    "message": "Memory exhaustion during proof generation"
  }
}

Webhook Implementation Guidelines

HTTP Requirements:

  • Method: POST
  • Content-Type: application/json
  • Timeout: 10 seconds (configurable)
  • Retry: Currently no automatic retries (implement idempotency)

Recommended Response:

Your webhook endpoint should respond with:

  • Success: HTTP 200-299 status code
  • Body: Any valid response (ignored by coordinator)

Example response:

HTTP/1.1 200 OK
Content-Type: application/json

{"received": true, "job_id": "job_abc123"}

If your webhook endpoint is unavailable or returns an error:

  • The coordinator logs the failure but continues operation
  • No automatic retries are performed
  • Consider implementing your own retry mechanism or message queue
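A minimal receiver for the payload described above could look like the following. This is an illustrative Python sketch (the handler and helper names are ours); it tracks seen job IDs so that a redelivered notification is handled idempotently, since the coordinator performs no retries itself.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_jobs = set()  # job_ids already processed, so duplicates are no-ops

def process_notification(payload):
    """Summarize one webhook payload; repeated job_ids are ignored."""
    job_id = payload["job_id"]
    if job_id in seen_jobs:
        return "duplicate"
    seen_jobs.add(job_id)
    if payload["success"]:
        return "%s ok in %d ms" % (job_id, payload["duration_ms"])
    return "%s failed: %s" % (job_id, payload["error"]["message"])

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        print(process_notification(json.loads(self.rfile.read(length))))
        self.send_response(200)  # any 2xx acknowledges receipt
        self.end_headers()
        self.wfile.write(b'{"received": true}')

# To serve: HTTPServer(("", 8080), WebhookHandler).serve_forever()
```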

Command Line Arguments

# Show help
cargo run --release --bin zisk-coordinator -- --help

# Run coordinator with custom port
cargo run --release --bin zisk-coordinator -- --port 50051

# Run with specific configuration
cargo run --release --bin zisk-coordinator -- --config production.toml

# Run with webhook URL  
cargo run --release --bin zisk-coordinator -- --webhook-url http://webhook.example.com/notify --port 50051

Worker

The worker is responsible for executing proof generation tasks assigned by the coordinator. It registers with the coordinator, reports its compute capacity, and waits for tasks to be assigned.

To start a worker instance with default settings:

cargo run --release --bin zisk-worker -- --elf <elf-file-path> --inputs-folder <inputs-folder>

Worker Configuration

The worker can be configured using either a TOML configuration file or command-line arguments. If no configuration file is explicitly provided, the system falls back to the ZISK_WORKER_CONFIG_PATH environment variable to locate one. If neither the CLI argument nor environment variable is set, built-in defaults are used.

Example:

# You can specify the configuration file path using a command line argument:
cargo run --release --bin zisk-worker -- --config /path/to/my-config.toml

# You can specify the configuration file path using an environment variable:
export ZISK_WORKER_CONFIG_PATH="/path/to/my-config.toml"
cargo run --release --bin zisk-worker

Input Files Handling

Workers need to know where to find input files for proof generation. The --inputs-folder parameter specifies the base directory where input files are stored:

  • Default: Current working directory (.) if not specified
  • Usage: When the coordinator sends a prove command with an input filename, the worker combines --inputs-folder + filename to locate the file
  • Benefits: Allows input files to be organized in a dedicated directory, separate from the worker executable

Example:

# Worker with inputs in specific folder
cargo run --release --bin zisk-worker -- --elf program.elf --inputs-folder /data/inputs/

# Coordinator requests proof for "input.bin" -> Worker looks for "/data/inputs/input.bin"
cargo run --release --bin zisk-coordinator -- prove --inputs-uri input.bin --compute-capacity 10
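The resolution rule above amounts to a simple path join. This is an illustrative sketch (the function name is ours, not the worker's code):

```python
from pathlib import Path

def resolve_input(inputs_folder, inputs_uri):
    """Join the worker's --inputs-folder with the filename sent by the coordinator."""
    return str(Path(inputs_folder) / inputs_uri)

print(resolve_input("/data/inputs", "input.bin"))  # /data/inputs/input.bin
```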

The table below lists the available configuration options for the Worker:

| TOML Key | CLI Argument | Environment Variable | Type | Default | Description |
|---|---|---|---|---|---|
| worker.worker_id | --worker-id | - | String | Auto-generated UUID | Unique worker identifier |
| worker.compute_capacity.compute_units | --compute-capacity | - | Number | 10 | Worker compute capacity (in compute units) |
| worker.environment | - | - | String | development | Service environment (development, staging, production) |
| worker.inputs_folder | --inputs-folder | - | String | . | Path to folder containing input files |
| coordinator.url | --coordinator-url | - | String | http://127.0.0.1:50051 | Coordinator server URL |
| connection.reconnect_interval_seconds | - | - | Number | 5 | Reconnection interval in seconds |
| connection.heartbeat_timeout_seconds | - | - | Number | 30 | Heartbeat timeout in seconds |
| logging.level | - | RUST_LOG | String | debug | Logging level (error, warn, info, debug, trace) |
| logging.format | - | - | String | pretty | Logging format (pretty, json, compact) |
| logging.file_path | - | - | String | - | Optional. Log file path (enables file logging) |
| - | --proving-key | - | String | ~/.zisk/provingKey | Path to setup folder |
| - | --elf | - | String | - | Path to ELF file |
| - | --asm | - | String | ~/.zisk/cache | Path to ASM file (mutually exclusive with --emulator) |
| - | --emulator | - | Boolean | false | Use prebuilt emulator (mutually exclusive with --asm) |
| - | --asm-port | - | Number | 23115 | Base port for Assembly microservices |
| - | --shared-tables | - | Boolean | false | Whether to share tables when worker is running in a cluster |
| - | -v, -vv, -vvv, ... | - | Number | 0 | Verbosity level (0=error, 1=warn, 2=info, 3=debug, 4=trace) |
| - | -d, --debug | - | String | - | Enable debug mode with optional component filter |
| - | --verify-constraints | - | Boolean | false | Whether to verify constraints |
| - | --unlock-mapped-memory | - | Boolean | false | Unlock memory map for the ROM file (mutually exclusive with --emulator) |
| - | --hints | - | Boolean | false | Enable precompile hints processing |
| - | -m, --minimal-memory | - | Boolean | false | Use minimal memory mode |
| - | -r, --rma | - | Boolean | false | Enable RMA mode |
| - | -z, --preallocate | - | Boolean | false | GPU preallocation flag |
| - | -t, --max-streams | - | Number | - | Maximum number of GPU streams |
| - | -n, --number-threads-witness | - | Number | - | Number of threads for witness computation |
| - | -x, --max-witness-stored | - | Number | - | Maximum number of witnesses to store in memory |

Configuration File Examples

Example development configuration file:

[worker]
compute_capacity.compute_units = 10
environment = "development"

[logging]
level = "debug"
format = "pretty"

Example production configuration file:

[worker]
worker_id = "my-worker-001"
compute_capacity.compute_units = 10
environment = "production"
inputs_folder = "/app/inputs"

[coordinator]
url = "http://127.0.0.1:50051"

[connection]
reconnect_interval_seconds = 5
heartbeat_timeout_seconds = 30

[logging]
level = "info"
format = "pretty"
file_path = "/var/log/distributed/worker-001.log"

Launching a Proof

To launch a proof generation request, use the prove subcommand of the zisk-coordinator binary. This sends an RPC request to a running coordinator instance.

cargo run --release --bin zisk-coordinator -- prove --inputs-uri <input_filename> --compute-capacity 10

The --compute-capacity flag indicates the total compute units required to generate a proof. The coordinator will assign one or more workers to meet this capacity, distributing the workload if multiple workers are needed. Requests exceeding the combined capacity of available workers will not be processed and an error will be returned.

Prove Subcommand Arguments

| CLI Argument | Short | Type | Default | Description |
|---|---|---|---|---|
| --inputs-uri | - | String | - | Path to the input file for proof generation |
| --compute-capacity | -c | Number | required | Total compute units required for the proof |
| --coordinator-url | - | String | http://127.0.0.1:50051 | URL of the coordinator to send the request to |
| --data-id | - | String | Auto (from filename or UUID) | Custom identifier for the proof job |
| --hints-uri | - | String | - | Path/URI to the precompile hints source |
| --stream-hints | - | Boolean | false | Stream hints from the coordinator to workers via gRPC (see Hints Stream) |
| --direct-inputs | -x | Boolean | false | Send input data inline via gRPC instead of as a file path |
| --minimal-compute-capacity | -m | Number | Same as --compute-capacity | Minimum acceptable compute capacity (allows partial worker allocation) |
| --simulated-node | - | Number | - | Simulated node ID (for testing) |

Input and Hints Modes

The prove subcommand supports two modes for delivering inputs and hints to workers:

Input modes (controlled by --inputs-uri and --direct-inputs):

  • Path mode (default): The coordinator sends the input file path to workers. Workers must have access to the file at the specified path.
  • Data mode (--direct-inputs): The coordinator reads the input file and sends its contents inline via gRPC. Workers do not need local access to the file.

Hints modes (controlled by --hints-uri and --stream-hints):

  • Path mode (default): The coordinator sends the hints URI to workers. Each worker loads hints from the specified path independently.
  • Streaming mode (--stream-hints): The coordinator reads hints from the URI and broadcasts them to all workers in real-time via gRPC. See the Hints Stream documentation for details.

Examples:

# Basic proof with file path inputs
zisk-coordinator prove --inputs-uri /data/inputs/my_input.bin --compute-capacity 10

# Send input data directly (workers don't need local file access)
zisk-coordinator prove --inputs-uri /data/inputs/my_input.bin -x --compute-capacity 10

# With precompile hints in path mode (workers load hints locally)
zisk-coordinator prove --inputs-uri input.bin --hints-uri /data/hints/hints.bin --compute-capacity 10

# With precompile hints in streaming mode (coordinator broadcasts to workers)
zisk-coordinator prove --inputs-uri input.bin --hints-uri unix:///tmp/hints.sock --stream-hints --compute-capacity 10
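The difference between the two input modes can be sketched as follows. This is a hypothetical illustration of how the request payload could be shaped; the field names are ours, not the actual gRPC schema.

```python
from pathlib import Path

def build_inputs_payload(inputs_uri, direct_inputs=False):
    """Shape the inputs part of a prove request for path mode or data mode."""
    if direct_inputs:
        # Data mode (-x): embed the file contents; workers need no local copy
        return {"inputs_data": Path(inputs_uri).read_bytes()}
    # Path mode (default): send only the path; each worker resolves it locally
    return {"inputs_path": inputs_uri}
```

Data mode trades network bandwidth for deployment simplicity: the input file only has to exist on the machine where the prove command runs.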

Administrative Operations

Health Checks and Monitoring

The coordinator exposes administrative endpoints for monitoring:

# Basic health check
grpcurl -plaintext 127.0.0.1:50051 zisk.distributed.api.v1.ZiskDistributedApi/HealthCheck

# System status
grpcurl -plaintext 127.0.0.1:50051 zisk.distributed.api.v1.ZiskDistributedApi/SystemStatus

# List active jobs
grpcurl -plaintext -d '{"active_only": true}' \
  127.0.0.1:50051 zisk.distributed.api.v1.ZiskDistributedApi/JobsList

# List connected workers
grpcurl -plaintext -d '{"available_only": true}' \
  127.0.0.1:50051 zisk.distributed.api.v1.ZiskDistributedApi/WorkersList
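For periodic monitoring, the grpcurl health check above can be wrapped in a small script. This sketch shells out to grpcurl (which must be installed); the `run` parameter is there only so the call can be stubbed out in tests.

```python
import json
import subprocess

def health_check(addr="127.0.0.1:50051", run=subprocess.run):
    """Invoke the HealthCheck RPC via grpcurl and return the decoded response."""
    result = run(
        ["grpcurl", "-plaintext", addr,
         "zisk.distributed.api.v1.ZiskDistributedApi/HealthCheck"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```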

Troubleshooting

Common Issues

Worker can't connect to coordinator:

  • Verify coordinator is running and accessible on the specified port
  • Check firewall settings if coordinator and worker are on different machines
  • Ensure correct URL format: http://host:port (not https:// for default setup)

Configuration not loading:

  • Verify TOML syntax with a TOML validator
  • Check file permissions on configuration files
  • Use CLI overrides to test specific values

Worker not receiving tasks:

  • Check worker registration in coordinator logs
  • Verify compute capacity is appropriate for available tasks
  • Ensure worker ID is unique if running multiple workers
  • Confirm coordinator has active jobs to distribute

Input file not found errors:

  • Verify the input file exists in the worker's --inputs-folder directory
  • Check file permissions - worker needs read access to input files
  • Ensure you're using the filename only (not full path) when launching proofs
  • Confirm --inputs-folder path is correct and accessible

Port conflicts:

  • Use --port flag or update configuration file to change ports
  • Check for other services using the same ports

Debug Mode

Enable detailed logging for troubleshooting by modifying configuration files or using CLI arguments:

# Coordinator with debug logging (via config file)
cargo run --release --bin zisk-coordinator -- --config debug-coordinator.toml

# Worker with debug logging (via config file)
cargo run --release --bin zisk-worker -- --config debug-worker.toml

Where debug-coordinator.toml or debug-worker.toml contains:

[logging]
level = "debug"
format = "pretty"

Log Files

When file logging is enabled, logs are written to the path specified in the configuration file. Ensure the application has write permission to that path.

[logging]
file_path = "/var/log/distributed/coordinator.log"