Introduction
A distributed prover is one coordinator process and a fleet of worker processes. The coordinator accepts proof requests from clients, fans the work out across the workers, and returns a single aggregated proof. This page walks through the moving parts before you stand any of them up.
Overview
Generating a ZisK proof means proving the full execution trace of a guest program. For real workloads that trace is too large to prove on a single machine in a reasonable time. A ZisK cluster solves this by splitting the trace into segments, proving each segment in parallel across a pool of workers, and aggregating the partial proofs back into a single final proof. Throughput rises and latency falls as you add machines. From the client's point of view the cluster is still a single gRPC endpoint that takes an ELF and an input and gives back a proof. Internally, that endpoint is backed by a small distributed system you operate yourself. Two binaries make up the system:
| Binary | Role |
|---|---|
| zisk-coordinator | Orchestrator. Exactly one per cluster. |
| zisk-worker | Proving process. N per cluster. |
The client speaks to the coordinator over the public gRPC API. The coordinator hands work to the worker fleet over a separate bidirectional gRPC stream and exposes a Prometheus endpoint for monitoring. The rest of this page explains how each piece works and what state lives where.
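The split, prove, and aggregate flow described above can be illustrated with a toy sketch. Everything here is a stand-in chosen for illustration: `split_trace`, `prove_segment`, and `aggregate` are hypothetical names, the "proof" is just a SHA-256 digest, and a thread pool plays the role of the worker fleet. It shows the fan-out and fan-in shape only, not how ZisK actually proves or aggregates.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def split_trace(trace: bytes, segment_size: int) -> list[bytes]:
    """Cut the trace into fixed-size segments (the last one may be shorter)."""
    return [trace[i:i + segment_size] for i in range(0, len(trace), segment_size)]


def prove_segment(segment: bytes) -> bytes:
    """Stand-in for real proving: a digest of the segment (assumption)."""
    return hashlib.sha256(segment).digest()


def aggregate(partials: list[bytes]) -> bytes:
    """Stand-in for aggregation: hash the ordered partial proofs together."""
    h = hashlib.sha256()
    for partial in partials:
        h.update(partial)
    return h.digest()


def prove_distributed(trace: bytes, segment_size: int, workers: int) -> bytes:
    """Fan segments out to a pool, then fan the partial proofs back in."""
    segments = split_trace(trace, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, whatever order they finish in.
        partials = list(pool.map(prove_segment, segments))
    return aggregate(partials)
```

The result does not depend on how many workers run the segments, which is the property that lets the cluster look like a single endpoint to the client.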
Components
Coordinator
The coordinator is the only stateful process in the cluster. It owns the job table, the worker pool, and a cache of proving setups derived from each uploaded guest ELF, so subsequent jobs for the same program skip the expensive setup step and reuse the cached keys. It splits each incoming job into segments and assigns them to workers, but performs no cryptographic work itself; its CPU and memory needs are modest. What it does need is network headroom, because every byte of input and every partial proof flows through it.
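The setup cache can be pictured as a map from a program identity to its proving setup. The sketch below is a minimal illustration under stated assumptions: `SetupCache` and `build_setup` are hypothetical names, and keying the cache by the SHA-256 of the ELF bytes is an assumption, not a description of how the coordinator identifies programs.

```python
import hashlib
from typing import Callable


class SetupCache:
    """Cache proving setups so each distinct ELF is set up only once."""

    def __init__(self, build_setup: Callable[[bytes], object]) -> None:
        self._build_setup = build_setup       # the expensive step (hypothetical)
        self._setups: dict[str, object] = {}  # ELF digest -> cached setup
        self.builds = 0                       # how many times setup actually ran

    def get(self, elf: bytes) -> object:
        # Assumption: a program is identified by the SHA-256 of its ELF bytes.
        key = hashlib.sha256(elf).hexdigest()
        if key not in self._setups:
            self._setups[key] = self._build_setup(elf)
            self.builds += 1
        return self._setups[key]
```

A second job for the same ELF returns the cached object without calling `build_setup` again.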
Worker
A worker is a proving process. It connects outbound to the coordinator on the cluster port, registers itself, advertises its compute capacity, and waits for assignments. Workers are stateless across jobs: they hold only the segments they are currently proving, and you can add, remove, or restart them without touching coordinator state. The first worker to deliver its partial proof for a given job is promoted to aggregator and assembles the final proof on the coordinator's behalf.
Client
The client talks to the coordinator's public gRPC API via cargo-zisk or the SDK. You use it to register a guest ELF, generate its setup, submit a proving job, and pull the resulting proof.
Metrics
The coordinator exposes a plain HTTP server alongside its gRPC ports
for observability. GET /metrics returns a Prometheus text payload
with cluster, job, and worker counters; GET /health returns 200 OK
while the coordinator is live and is the canonical liveness probe.
Point your Prometheus scraper at this endpoint and keep it reachable
only from your monitoring stack.
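If you want to inspect the /metrics payload outside Prometheus, a few lines are enough to turn the text format into a dictionary. This is a minimal sketch that assumes samples carry no trailing timestamp; the metric names in the usage note are hypothetical and are not taken from the coordinator.

```python
def parse_metrics(payload: str) -> dict[str, float]:
    """Parse a Prometheus text payload into {sample name (with labels): value}.

    Assumption: each sample line is `<name>[{labels}] <value>` with no timestamp.
    """
    samples: dict[str, float] = {}
    for line in payload.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces stay intact.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```

For example, a payload containing the hypothetical line `example_jobs_total 3` would parse to `{"example_jobs_total": 3.0}`.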
Job lifecycle
A job is always initiated by a client request: a JobRequest
RPC sent to the coordinator's public gRPC API. The
coordinator owns the job from that point on, and the job moves
through a small state machine until it terminates.
When a JobRequest arrives the coordinator places it in Queued,
picks workers from the idle pool until the requested compute_capacity
is satisfied, and transitions the job to Running. The running phase
has three internal sub-phases:
| Phase | What happens |
|---|---|
| Partial contributions | Each assigned worker processes its segments and streams back its partial challenges; the coordinator collects them and derives a single global challenge. |
| Prove | The coordinator broadcasts the global challenge to all workers. Each worker computes the STARK proof for its segments and returns the partial proofs. |
| Aggregate | The first worker to deliver its partial proof is promoted to aggregator and builds a binary aggregation tree, folding in the remaining partial proofs as they land. |
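The worker selection step that moves a job from Queued to Running can be sketched as a greedy pick over the idle pool. The `Worker` type and `pick_workers` function are hypothetical, and returning `None` when the pool cannot cover the request (so the job would stay Queued) is an assumption about behavior the text does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    worker_id: str
    compute_capacity: int  # capacity the worker advertised at registration


def pick_workers(idle: list[Worker], required: int) -> Optional[list[Worker]]:
    """Take idle workers in order until their capacity covers `required`.

    Returns None if the idle pool is too small (assumption: the job waits).
    """
    picked: list[Worker] = []
    total = 0
    for worker in idle:
        if total >= required:
            break
        picked.append(worker)
        total += worker.compute_capacity
    return picked if total >= required else None
```

A real scheduler might prefer larger workers or balance load differently; this version only shows the "pick until compute_capacity is satisfied" rule.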
When aggregation finishes the coordinator returns the final Proof
to the client and the job moves to Completed. The exchange across the three
phases looks like this:
The aggregator isn't pre-assigned: whichever worker is first to finish and deliver its partial proof to the coordinator is promoted to aggregator for that job. Picking the fastest responder means the machine driving the aggregation tree is, by construction, the one with the most idle capacity left. The rest of the fleet keeps streaming its partial proofs to it while it folds them in.
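One way to fold partial proofs into a binary tree as they arrive, without waiting for the full set, is to keep a short stack of subtree results and merge two subtrees whenever they reach the same height. The sketch below illustrates that general technique; `TreeAggregator` and the `combine` callback are hypothetical, and the actual tree shape the aggregator worker builds may differ.

```python
from typing import Callable, Generic, TypeVar

P = TypeVar("P")


class TreeAggregator(Generic[P]):
    """Incrementally fold proofs into a binary tree as they land."""

    def __init__(self, combine: Callable[[P, P], P]) -> None:
        self._combine = combine
        # Stack of (height, subtree proof); heights strictly decrease toward the top.
        self._pending: list[tuple[int, P]] = []

    def add(self, proof: P) -> None:
        height = 0
        # Merge while the top of the stack is a subtree of the same height.
        while self._pending and self._pending[-1][0] == height:
            _, left = self._pending.pop()
            proof = self._combine(left, proof)
            height += 1
        self._pending.append((height, proof))

    def finish(self) -> P:
        """Fold any leftover subtrees (uneven counts) into the final proof."""
        if not self._pending:
            raise ValueError("no partial proofs were added")
        _, acc = self._pending.pop()
        while self._pending:
            _, left = self._pending.pop()
            acc = self._combine(left, acc)
        return acc
```

With a string-building `combine` such as `lambda a, b: f"({a}+{b})"`, four proofs added in order fold to `((p0+p1)+(p2+p3))`, and at most a logarithmic number of subtrees is held at any time.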