Troubleshooting
A flat catalogue of the failure modes operators hit most often. Each entry starts from a symptom that an alert or user report would surface, followed by the first command to run and the likely causes, each paired with a concrete fix, in roughly the order to suspect them.
How to use this page
Sections are independent. Skip to the heading that matches what you are seeing. Each one leads with the cheapest first check, then a small set of likely causes paired with the fix.
A worker cannot connect to the coordinator
Symptom. The worker logs repeated reconnect attempts; the
coordinator's coordinator_workers_connected gauge stays below
the expected worker count.
First check:
sudo journalctl -u zisk-worker -p warn -f
nc -zv <coordinator-host> 50051
nc -zv confirms TCP reachability to the worker-facing gRPC
port. If TCP fails, no application-level fix will help.
Likely causes:
- Wrong coordinator.url on the worker. The worker dials coordinator.url from its worker.toml (default http://127.0.0.1:50051). On a multi-host deployment this must point at the coordinator host's reachable address, with the cluster port (default 50051, 50052 on the Compose stack), not the public API port 7000.
- Firewall or security group dropping the cluster port. Open port 50051 (or 50052 for Compose) from worker hosts to the coordinator host. The public 7000 is for the client API; do not confuse it with the worker port.
- Coordinator not listening. Verify the coordinator is up and bound to 0.0.0.0 (not 127.0.0.1) if workers live off the coordinator host. On the coordinator host:
  sudo ss -tlnp | grep 50051
- Wrong scheme. The worker accepts http://host:port. A bare host:port without a scheme silently fails to parse on some builds.
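The URL pitfalls above (missing scheme, missing port, pointing at the public API port) can be caught before the worker is restarted. The following is a minimal sketch, not part of ZisK: check_coordinator_url is a hypothetical helper, and it only encodes the rules stated in this section.

```shell
# check_coordinator_url URL
# Hypothetical helper (not shipped with ZisK). Prints "ok" or a short reason
# and returns non-zero when the URL looks wrong for a worker.
check_coordinator_url() {
  url=$1
  case "$url" in
    http://*) ;;                                   # the only scheme this page documents
    *) echo "missing http:// scheme"; return 1 ;;
  esac
  hostport=${url#http://}                          # strip the scheme
  hostport=${hostport%%/*}                         # drop any trailing path
  case "$hostport" in
    *:*) port=${hostport##*:} ;;                   # text after the last colon
    *) echo "missing port"; return 1 ;;
  esac
  case "$port" in
    ''|*[!0-9]*) echo "port is not numeric"; return 1 ;;
    7000) echo "port 7000 is the public API port, not the cluster port"; return 1 ;;
  esac
  echo "ok"
}
```

Call it with the value you are about to put in worker.toml, for example check_coordinator_url "http://10.0.0.5:50051". It does not test reachability; nc -zv remains the check for that.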
The coordinator fails to start
Symptom. The systemd unit is failed or stuck in
activating; docker compose up exits the coordinator
container; the foreground binary exits before printing the
"listening on" lines.
First check:
sudo systemctl status zisk-coordinator
sudo journalctl -u zisk-coordinator -n 50
If the logs are not informative, restart with debug logging:
sudo systemctl stop zisk-coordinator
sudo RUST_LOG=debug zisk-coordinator \
--config /etc/zisk/coordinator.toml \
--log-level debug
Likely causes:
- Port conflict on 7000, 9090, or 50051. Another process owns one of the three ports. Identify it:
  sudo ss -tlnp | grep -E ':(7000|9090|50051)'
  Free the port or change the coordinator's binding via --api-port, --metrics-port, --cluster-port, or the [server] port / [metrics] port / [coordinator] port fields in coordinator.toml.
- TOML parse error. The log includes the offending line and column. Configs shipped by install.sh are valid by default; errors usually appear after a manual edit.
- Wrong backend.mode. The two valid values are coordinator and mock. Anything else fails validation at start.
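For the port-conflict case, the grep above matches the port digits anywhere on the line, including in the peer column. A slightly stricter sketch is shown below: it reads ss -tln output and reports only the default coordinator ports that appear as a local listening port. conflicting_ports is a hypothetical helper name, and it assumes the plain ss -tln column layout (local address in column 4).

```shell
# conflicting_ports
# Hypothetical helper. Reads `ss -tln` output on stdin and prints each default
# coordinator port (7000, 9090, 50051) that already has a listener.
conflicting_ports() {
  awk '{
    n = split($4, parts, ":")   # column 4 is Local Address:Port
    port = parts[n]             # the port follows the last colon (works for [::]:7000 too)
    if (port == "7000" || port == "9090" || port == "50051") print port
  }' | sort -un
}
```

Typical use is sudo ss -tln | conflicting_ports. Empty output means none of the three default ports is taken.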
Workers are connected but not receiving tasks
Symptom. coordinator_workers_connected is non-zero, but
coordinator_active_jobs is also non-zero and never drains; the
worker is idle in top.
First check:
sudo journalctl -u zisk-coordinator | grep -E 'register|assign'
The coordinator logs every worker registration with the
worker's reported compute_capacity and a one-time confirmation
when a task is assigned.
Likely causes:
- Duplicate worker_id. Every worker must have a unique ID within the cluster. If worker_id is unset the worker auto-generates a UUID, which is safe. Hand-set IDs that collide cause the coordinator to reject the second registrant silently.
- Insufficient compute_capacity. The coordinator picks workers until the requested compute_units for a job are satisfied. A worker with compute_capacity.compute_units = 1 may never be picked alongside larger peers. Tune [worker.compute_capacity] compute_units in worker.toml.
- Job needs more workers than the pool offers. Check max_workers_per_job in the coordinator core config (default 10, 20 in prod.toml). A job will sit in Queued until the worker pool grows.
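Because a colliding worker_id is rejected silently, it is worth checking hand-set IDs for duplicates directly. The sketch below assumes you have already collected the configured IDs into a list, one per line (how you gather them from each host's worker.toml is up to you); duplicate_worker_ids is a hypothetical helper.

```shell
# duplicate_worker_ids
# Hypothetical helper. Reads one worker_id per line on stdin and prints every
# ID that occurs more than once. Blank lines (workers relying on the
# auto-generated UUID) are ignored.
duplicate_worker_ids() {
  sed '/^[[:space:]]*$/d' | sort | uniq -d
}
```

For example, printf '%s\n' worker-a worker-b worker-a | duplicate_worker_ids prints worker-a. Empty output means the hand-set IDs are unique.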
Jobs fail with a phase timeout
Symptom. The job moves from Queued to Running and then
to Failed after a fixed duration; the coordinator logs
phase timeout and cancels every participating worker.
Cause. Three independent timeouts gate the job state machine:
| Phase | Default | Field |
|---|---|---|
| Contributions (challenges) | 300s | phase1_timeout_seconds |
| Prove | 600s | phase2_timeout_seconds |
| Aggregate | 100s | phase3_timeout_seconds |
When a phase exceeds its timeout, the coordinator fails the job and cancels all workers participating in it.
Fix. Raise the limits in the coordinator core TOML
(coordinator.config_file):
[coordinator]
phase1_timeout_seconds = 600
phase2_timeout_seconds = 1200
phase3_timeout_seconds = 300
Raising the timeouts hides slowness rather than fixing it. If a job that used to fit comfortably is now hitting timeouts, the root cause is usually a slow worker (memory pressure, hardware variance) or a slow shared resource (storage, network). Raising the timeout is the safety valve; the lasting fix is to find the underperforming component.
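To spot a phase that is drifting toward its limit before it starts failing jobs, you can express an observed phase duration as a share of its timeout. This is a minimal sketch: it assumes you already have the duration in seconds from logs or metrics, and both helper names and the 80% default threshold are illustrative choices, not ZisK settings.

```shell
# phase_budget_used OBSERVED_SECONDS TIMEOUT_SECONDS
# Prints the share of the timeout consumed, as an integer percentage (rounded down).
phase_budget_used() {
  echo $(( $1 * 100 / $2 ))
}

# phase_at_risk OBSERVED_SECONDS TIMEOUT_SECONDS [THRESHOLD_PERCENT]
# Succeeds when the phase used at least THRESHOLD_PERCENT of its timeout
# (default 80, an arbitrary illustrative value).
phase_at_risk() {
  threshold=${3:-80}
  [ "$(phase_budget_used "$1" "$2")" -ge "$threshold" ]
}
```

With the default Prove timeout of 600s, a phase that took 540s gives phase_budget_used 540 600 = 90, which phase_at_risk flags. A worker or shared resource that pushes this number up over time is the component to investigate.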
A worker is being declared dead but seems healthy
Symptom. The coordinator logs that a worker dropped, while the worker's own logs show no outage, only an immediate reconnect.
Cause. The coordinator declares a worker dead when it
misses heartbeat_max_missed heartbeats in a row. At defaults
this gives 30s × 3 = 90s of grace. A separate
reconnect_grace_period_ms (default 500ms) masks transient
drops in the middle of a job so a flapping TCP connection does
not fail the job.
| Field | Default | Effect |
|---|---|---|
| heartbeat_interval_seconds | 30 | How often heartbeats fire |
| heartbeat_max_missed | 3 | How many can be missed before declaring dead |
| reconnect_grace_period_ms | 500 | Mid-job tolerance for transient drops |
Fix. Raise the heartbeat budget if your network has high-latency periods (cross-AZ deployment, oversubscribed links):
[coordinator]
heartbeat_interval_seconds = 30
heartbeat_max_missed = 5 # 30s * 5 = 150s of grace
reconnect_grace_period_ms = 2000 # tolerate 2s of TCP drop
If heartbeats are timing out but nc -zv to the cluster port
succeeds in tight loops, the bottleneck is usually CPU on the
coordinator itself, not the network. Heartbeat processing
shares a thread pool with job state transitions; a saturated
coordinator drops heartbeats first.
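The grace window is simply the interval multiplied by the allowed misses, so the arithmetic is easy to script when sizing these values. A small sketch follows; both function names are hypothetical and only restate the calculation used above.

```shell
# heartbeat_grace_seconds INTERVAL_SECONDS MAX_MISSED
# Grace before a worker is declared dead: interval * max_missed.
heartbeat_grace_seconds() {
  echo $(( $1 * $2 ))
}

# min_missed_for_grace DESIRED_GRACE_SECONDS INTERVAL_SECONDS
# Smallest heartbeat_max_missed that yields at least the desired grace
# (integer ceiling of desired / interval).
min_missed_for_grace() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

heartbeat_grace_seconds 30 3 gives 90 and heartbeat_grace_seconds 30 5 gives 150, matching the defaults and the raised configuration. min_missed_for_grace 100 30 gives 4, the smallest setting that covers a 100s stall at a 30s interval.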
A client RPC returned an error code
The public coordinator API returns gRPC status codes with a small set of well-defined application errors. The three you will see most are:
| Code | When | Action |
|---|---|---|
| JOB_NOT_FOUND | The supplied job_id is not known to the coordinator. Either it was never created or the coordinator restarted and lost it. | Re-submit the job. If the coordinator restarted, expect this for every in-flight job from before the restart. |
| WORKER_UNAVAILABLE | The job cannot be scheduled because no worker meets the requirements. | Add workers, raise compute_capacity, or lower min_compute_units. |
| SYSTEM_UNAVAILABLE | The coordinator is shutting down or starting up. | Retry with backoff. |
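A client wrapper can encode the table above as a simple dispatch, plus a capped exponential delay for the retry case. The sketch below is illustrative only: the function names, the action labels, and the 1s base / 60s cap are assumptions, and how your client obtains the error code string depends on the gRPC tooling you use.

```shell
# action_for_code CODE
# Hypothetical helper. Maps an application error code to the action from the table.
action_for_code() {
  case "$1" in
    JOB_NOT_FOUND)      echo "resubmit" ;;
    WORKER_UNAVAILABLE) echo "add-capacity" ;;
    SYSTEM_UNAVAILABLE) echo "retry-with-backoff" ;;
    *)                  echo "unknown" ;;
  esac
}

# backoff_delay ATTEMPT [BASE_SECONDS] [CAP_SECONDS]
# Capped exponential backoff: base * 2^attempt, never more than the cap.
# The defaults (1s base, 60s cap) are illustrative, not ZisK settings.
backoff_delay() {
  attempt=$1
  base=${2:-1}
  cap=${3:-60}
  delay=$base
  i=0
  # Double once per attempt, stopping early once the cap is reached.
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
}
```

A retry loop would sleep "$(backoff_delay "$attempt")" between attempts only when action_for_code returns retry-with-backoff. Retrying WORKER_UNAVAILABLE in a tight loop does not help until capacity changes.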
The coordinator restarted and all jobs are lost
Symptom. After a coordinator restart (manual, crash, or
upgrade) every previously-submitted job returns JOB_NOT_FOUND.
Cause. The coordinator does not currently persist job state across restarts and has no HA story in-tree. A restart drops every in-flight job. Workers reconnect automatically, but the work they were doing is abandoned.
Mitigation.
- Plan restarts during quiet windows. The shutdown_timeout_seconds (default 30) gives the coordinator a grace window to let in-flight RPCs drain, but it does not preserve jobs.
- Make clients retry-aware. Treat JOB_NOT_FOUND after a coordinator restart as "re-submit", not "fatal".
- Roll workers, not the coordinator. Worker restarts are safe (reconnect_grace_period_ms covers transient drops); the coordinator restart is the destructive one.
There is no built-in coordinator HA. A second coordinator is not a hot standby; running two against the same worker pool will produce undefined behaviour. Plan capacity for the restart-loss window accordingly.
A GPU worker is not using the GPU
Symptom. Worker started with --gpu but proving is
CPU-bound; nvidia-smi shows no process.
First check:
nvidia-smi
sudo journalctl -u zisk-worker | grep -iE 'cuda|gpu|nvidia'
Likely causes:
- Worker built with the cpu-only cargo feature. The default build is GPU-capable; the cpu-only feature compiles GPU support out entirely and --gpu becomes a no-op. Rebuild without --features cpu-only.
- Driver/CUDA mismatch with the binary. The worker is dynamically linked against CUDA runtime libs. A driver older than the CUDA runtime fails at load time. Update the driver to match the CUDA version the binary was built against.
- Kubernetes pod has no GPU resource limit. The Helm chart requires resources.limits."nvidia.com/gpu": 1 on the pod and a GPU-tagged image. Without the limit, the device plugin does not mount the GPU into the pod even on GPU-enabled nodes.
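To confirm the "no process" symptom without reading the full nvidia-smi table, you can filter its CSV listing of compute processes. This sketch assumes the worker's process name contains zisk-worker (taken from the systemd unit name, so treat it as a guess for your build); gpu_pids_for is a hypothetical helper.

```shell
# gpu_pids_for NAME
# Hypothetical helper. Reads the output of
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader
# on stdin and prints the PID of every GPU compute process whose name contains NAME.
gpu_pids_for() {
  awk -F', ' -v name="$1" 'index($2, name) { print $1 }'
}
```

Typical use is nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | gpu_pids_for zisk-worker. Empty output while a job is in the Prove phase suggests the worker is not touching the GPU.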
A worker errors out reading the proving key
Symptom. Worker logs an error mentioning provingKey,
fails to start, or fails on the first job.
Cause. The worker expects the bundle layout at
--proving-key (default ~/.zisk/provingKey):
/opt/zisk/
├── bin/
├── zisk/
├── provingKey/
└── provingKeySnark/ # only with --proving-key-snark
Mismatches break the worker.
Likely causes:
- Wrong --proving-key path. The worker looks for the key at the path you pass. Confirm the bundle exists at that location and the files inside match a successful host.
- SNARK key missing on a SNARK-enabled worker. If you pass --proving-key-snark, the SNARK key must exist at the specified path. The default is alongside the STARK key in the bundle.
- Kubernetes PVC not mounted, wrong UID, or wrong mode. The Helm chart runs as UID 998 / GID 999. The PVC bundle ownership must match, and the PVC must be ReadOnlyMany so multiple pods can mount it concurrently. If provingKey.strategy is download, the init container needs enough emptyDir space (default budget 80Gi).
- Version mismatch between proving key and prover binary. Proving keys are versioned alongside the prover. A key produced by ziskup from one release used against a binary from a different release fails. Regenerate the bundle with the matching ziskup --system --prefix ... invocation.
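A quick structural check of the bundle rules out the path-related causes before you look at versions or permissions. The sketch below only tests for the directories shown in the layout above, under a bundle root you pass in; missing_bundle_dirs is a hypothetical helper, and it does not validate file contents or key versions.

```shell
# missing_bundle_dirs ROOT [snark]
# Hypothetical helper. Prints each expected bundle directory that is missing
# under ROOT. Pass "snark" as the second argument to also require
# provingKeySnark/ (only needed with --proving-key-snark).
missing_bundle_dirs() {
  root=$1
  dirs="bin zisk provingKey"
  if [ "${2:-}" = "snark" ]; then
    dirs="$dirs provingKeySnark"
  fi
  for d in $dirs; do
    [ -d "$root/$d" ] || echo "$d"
  done
}
```

For example, missing_bundle_dirs /opt/zisk snark prints nothing on a complete SNARK-enabled bundle. Run it as the user the worker runs as, so that permission problems on a mounted volume show up as missing directories.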
Where to go next
This is the end of the monitoring track. From here:
- Revisit Metrics and alerts for the Prometheus catalogue used by every section above.
- Revisit Logs for the filters that turn a vague alert into a specific failing job.
- If you are still stuck after walking this page, capture the coordinator and worker logs around the failure window, the coordinator config, and the failing client request, and open an issue on the ZisK repository with all three.