Introduction

A distributed prover is one coordinator process and a fleet of worker processes. The coordinator accepts proof requests from clients, fans the work out across the workers, and returns a single aggregated proof. This page walks through the moving parts before you stand any of them up.

Overview

Generating a ZisK proof means proving the full execution trace of a guest program. For real workloads that trace is too large to prove on a single machine in a reasonable time. A ZisK cluster solves this by splitting the trace into segments, proving each segment in parallel across a pool of workers, and aggregating the partial proofs back into a single final proof. Throughput rises and latency falls as you add machines. From the client's point of view the cluster is still a single gRPC endpoint that takes an ELF and an input and gives back a proof. Internally, that endpoint is backed by a small distributed system you operate yourself. Two binaries make up the system:

Binary            Role
zisk-coordinator  Orchestrator. Exactly one per cluster.
zisk-worker       Proving process. N per cluster.

The client speaks to the coordinator over the public gRPC API. The coordinator hands work to the worker fleet over a separate bidirectional gRPC stream and exposes a Prometheus endpoint for monitoring. The rest of this page explains how each piece works and what state lives where.


Components

Coordinator

The coordinator is the only stateful process in the cluster. It owns the job table, the worker pool, and a cache of proving setups derived from each uploaded guest ELF, so subsequent jobs for the same program skip the expensive setup step and reuse the cached keys. It splits each incoming job into segments and assigns them to workers, but performs no cryptographic work itself, so its CPU and memory needs are modest. What it does need is network headroom, because every byte of input and every partial proof flows through it.
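The setup cache can be pictured with a minimal sketch. Everything here is illustrative: the `SetupCache` class, the `fake_setup` placeholder, and the choice of a SHA-256 digest of the ELF as the cache key are assumptions, not the coordinator's actual implementation.

```python
import hashlib
from typing import Callable, Dict


class SetupCache:
    """Toy cache of proving setups, keyed by a digest of the guest ELF (assumed key)."""

    def __init__(self, build_setup: Callable[[bytes], bytes]) -> None:
        self._build_setup = build_setup      # the expensive step (hypothetical)
        self._setups: Dict[str, bytes] = {}  # ELF digest -> cached setup
        self.builds = 0                      # how many times the setup really ran

    def get(self, elf: bytes) -> bytes:
        key = hashlib.sha256(elf).hexdigest()
        if key not in self._setups:
            # First job for this program: pay the setup cost once.
            self._setups[key] = self._build_setup(elf)
            self.builds += 1
        return self._setups[key]


def fake_setup(elf: bytes) -> bytes:
    # Placeholder for real key generation; only here to make the sketch runnable.
    return b"setup-for-" + hashlib.sha256(elf).digest()[:4]


cache = SetupCache(fake_setup)
first = cache.get(b"\x7fELF guest program A")
second = cache.get(b"\x7fELF guest program A")  # same program: cache hit
other = cache.get(b"\x7fELF guest program B")   # different program: new setup
```

In this sketch the second request for program A returns the cached value without calling the setup function again, which is the behaviour the paragraph above describes.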

Worker

A worker is a proving process. It connects outbound to the coordinator on the cluster port, registers itself, advertises its compute capacity, and waits for assignments. Workers are stateless across jobs: they hold only the segments they are currently proving, and you can add, remove, or restart them without touching coordinator state. The first worker to deliver its partial proof for a given job is promoted to aggregator and assembles the final proof on the coordinator's behalf.

Client

The client talks to the coordinator's public gRPC API via cargo-zisk or the SDK. You use it to register a guest ELF, generate its setup, submit a proving job, and pull the resulting proof.

Metrics

The coordinator exposes a plain HTTP server alongside its gRPC ports for observability. GET /metrics returns a Prometheus text payload with cluster, job, and worker counters; GET /health returns 200 OK while the coordinator is live and is the canonical liveness probe. Point your Prometheus scraper at this endpoint and keep it reachable only from your monitoring stack.
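A liveness check and a small metrics reader might look like the sketch below. The base URL you pass in is deployment-specific and not given on this page, and the metric names in the sample payload are made up for illustration; only the /health and /metrics paths come from the description above.

```python
import urllib.request


def is_live(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are both OSError subclasses).
        return False


def parse_metrics(text: str) -> dict:
    """Parse simple samples from a Prometheus text payload.

    Comment lines (# HELP / # TYPE) are skipped. Label values that contain
    spaces are not handled; this is a sketch, not a full parser.
    """
    samples = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        samples[parts[0]] = float(parts[1])
    return samples


# Hypothetical payload: these metric names are invented for the example.
sample = "# TYPE example_jobs_total counter\nexample_jobs_total 3\n"
parsed = parse_metrics(sample)
```

In practice you would call `is_live` with the address of the coordinator's HTTP port from inside your monitoring network, and let Prometheus itself scrape /metrics rather than parsing the payload by hand.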


Job lifecycle

A job is always initiated by a client request: a JobRequest RPC sent to the coordinator's public gRPC API. The coordinator owns the job from that point on, and the job moves through a small state machine until it terminates.

When a JobRequest arrives, the coordinator places it in Queued, picks workers from the idle pool until the requested compute_capacity is satisfied, and transitions the job to Running. The running phase has three internal sub-phases:

Phase                  What happens
Partial contributions  Each assigned worker processes its segments and streams back its partial challenges; the coordinator collects them and derives a single global challenge.
Prove                  The coordinator broadcasts the global challenge to all workers. Each worker computes the STARK proof for its segments and returns the partial proofs.
Aggregate              The first worker to deliver its partial proof is promoted to aggregator and builds a binary aggregation tree, folding in the remaining partial proofs as they land.
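Before these phases begin, the coordinator has to move the job from Queued to Running by picking idle workers. The sketch below is one way to picture that step. The `Job` and `schedule` names, the use of registration order, the treatment of compute_capacity as abstract integer units, and the choice to leave a job Queued when the idle pool is too small are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class JobState(Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETED = "Completed"


@dataclass
class Job:
    job_id: str
    compute_capacity: int                      # capacity requested by the client
    state: JobState = JobState.QUEUED
    workers: List[str] = field(default_factory=list)


def schedule(job: Job, idle: Dict[str, int]) -> bool:
    """Take idle workers until the requested capacity is covered.

    `idle` maps worker id -> advertised capacity. Returns False and leaves
    the job Queued if the idle pool cannot cover the request (assumed policy).
    """
    picked, total = [], 0
    for worker_id, capacity in idle.items():   # dict order = registration order here
        if total >= job.compute_capacity:
            break
        picked.append(worker_id)
        total += capacity
    if total < job.compute_capacity:
        return False
    for worker_id in picked:
        del idle[worker_id]                    # picked workers leave the idle pool
    job.workers = picked
    job.state = JobState.RUNNING
    return True


idle = {"w1": 4, "w2": 4, "w3": 8}
job = Job("job-1", compute_capacity=6)
started = schedule(job, idle)                  # picks w1 and w2, leaves w3 idle
```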

When aggregation finishes, the coordinator returns the final Proof to the client and the job moves to Completed. The exchange across the three phases looks like this:

The aggregator isn't pre-assigned: whichever worker is first to finish and deliver its partial proof to the coordinator is the one promoted to aggregator for that job. Picking the fastest responder means the machine driving the aggregation tree is, by construction, the one with the most idle capacity left. The rest of the fleet keeps streaming its partial proofs to it while it folds them in.
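First-arrival promotion and incremental folding can be sketched as follows. This is a toy model: `combine` is a SHA-256 hash standing in for real proof aggregation, and the `Aggregator` class and its stack-based merge are assumptions about one way to build a binary tree as items arrive, not the actual ZisK aggregation code.

```python
import hashlib
from typing import List, Optional, Tuple


def combine(left: bytes, right: bytes) -> bytes:
    # Stand-in for aggregating two proofs; NOT real proof recursion.
    return hashlib.sha256(left + right).digest()


class Aggregator:
    """Folds partial proofs into a binary tree in arrival order."""

    def __init__(self) -> None:
        self.worker_id: Optional[str] = None
        self._stack: List[Tuple[int, bytes]] = []  # (subtree height, node)

    def deliver(self, worker_id: str, partial: bytes) -> None:
        if self.worker_id is None:
            self.worker_id = worker_id             # first to deliver is promoted
        height, node = 0, partial
        # Merge subtrees of equal height, like carrying in a binary counter.
        while self._stack and self._stack[-1][0] == height:
            _, left = self._stack.pop()
            node = combine(left, node)
            height += 1
        self._stack.append((height, node))

    def finalize(self) -> bytes:
        if not self._stack:
            raise ValueError("no partial proofs delivered")
        # Fold any leftover subtrees, right to left, into a single root.
        _, node = self._stack.pop()
        while self._stack:
            _, left = self._stack.pop()
            node = combine(left, node)
        return node


agg = Aggregator()
for worker_id, partial in [("w2", b"p2"), ("w1", b"p1"), ("w3", b"p3")]:
    agg.deliver(worker_id, partial)
final = agg.finalize()
```

Here w2 delivers first and becomes the aggregator, and each later partial proof is merged as soon as it has a sibling of the same height, so no delivery has to wait for the whole fleet to finish. In this toy the final value depends on arrival order, which is an artifact of using a hash as the combine step.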