Troubleshooting

A flat catalogue of the failure modes operators hit most often. Each entry starts from a symptom that an alert or user report would surface, then gives the first command to run, the likely causes in roughly the order to suspect them, and the concrete fix for each.

How to use this page

Sections are independent. Skip to the heading that matches what you are seeing. Each one leads with the cheapest first check, then a small set of likely causes paired with the fix.


A worker cannot connect to the coordinator

Symptom. The worker logs repeated reconnect attempts; the coordinator's coordinator_workers_connected gauge stays below the expected worker count.

First check:

worker host
sudo journalctl -u zisk-worker -p warn -f
nc -zv <coordinator-host> 50051

nc -zv confirms TCP reachability to the worker-facing gRPC port. If TCP fails, no application-level fix will help.

Likely causes:

  • Wrong coordinator.url on the worker. The worker dials coordinator.url from its worker.toml (default http://127.0.0.1:50051). On a multi-host deployment this must point at the coordinator host's reachable address, with the cluster port (default 50051, 50052 on the Compose stack), not the public API port 7000.

  • Firewall or security group dropping the cluster port. Open port 50051 (or 50052 for Compose) from worker hosts to the coordinator host. The public 7000 is for the client API; do not confuse it with the worker port.

  • Coordinator not listening. Verify the coordinator is up and bound to 0.0.0.0 (not 127.0.0.1) if workers live off the coordinator host:

    coordinator host
    sudo ss -tlnp | grep 50051

  • Wrong scheme. The worker accepts http://host:port. A bare host:port without a scheme silently fails to parse on some builds.
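
The URL rules in the list above can be checked mechanically before editing worker.toml. The following is a minimal sketch and not part of ZisK: check_coordinator_url is a hypothetical helper, and it assumes that http://host:port is the only accepted form, since that is the only one this page documents.

```shell
# Hypothetical helper (not shipped with ZisK): sanity-check a coordinator.url value.
# Assumption: only the http://host:port form is valid.
check_coordinator_url() {
  url=$1
  case "$url" in
    http://*) ;;                                  # scheme present
    *) echo "missing-scheme"; return 1 ;;         # bare host:port
  esac
  hostport=${url#http://}                         # strip the scheme
  hostport=${hostport%%/*}                        # strip any trailing path
  port=${hostport##*:}                            # text after the last colon
  if [ "$port" = "$hostport" ]; then
    echo "missing-port"; return 1                 # no colon, so no port
  fi
  if [ "$port" = "7000" ]; then
    echo "api-port"; return 1                     # public API port, not the cluster port
  fi
  echo "ok"
}
```

For example, check_coordinator_url http://10.0.0.5:50051 prints ok, a bare 10.0.0.5:50051 prints missing-scheme, and http://10.0.0.5:7000 prints api-port. The address 10.0.0.5 is a placeholder.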


The coordinator fails to start

Symptom. The systemd unit is failed or stuck in activating; docker compose up exits the coordinator container; the foreground binary exits before printing the "listening on" lines.

First check:

coordinator host
sudo systemctl status zisk-coordinator
sudo journalctl -u zisk-coordinator -n 50

If the logs are not informative, restart with debug logging:

coordinator host
sudo systemctl stop zisk-coordinator
sudo RUST_LOG=debug zisk-coordinator \
--config /etc/zisk/coordinator.toml \
--log-level debug

Likely causes:

  • Port conflict on 7000, 9090, or 50051. Another process owns one of the three ports. Identify it:

    coordinator host
    sudo ss -tlnp | grep -E ':(7000|9090|50051)'

    Free the port or change the coordinator's binding via --api-port, --metrics-port, --cluster-port, or the [server] port / [metrics] port / [coordinator] port fields in coordinator.toml.

  • TOML parse error. The log includes the offending line and column. Configs shipped by install.sh are valid by default; errors usually appear after a manual edit.

  • Wrong backend.mode. The two valid values are coordinator and mock. Anything else fails validation at start.
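
The port-conflict check above can be wrapped so that it prints only the conflicting ports. This is a sketch under stated assumptions: ports_in_use is a hypothetical helper, it covers only the three default ports, and it assumes the local address:port is the fourth column of the ss -tln output.

```shell
# Hypothetical helper: read `ss -tln` output on stdin and print which of the
# default coordinator ports (7000, 9090, 50051) already have a listener.
# Assumption: column 4 is "Local Address:Port".
ports_in_use() {
  awk '{
    n = split($4, parts, ":")          # the port is the text after the last colon
    p = parts[n]
    if (p == "7000" || p == "9090" || p == "50051") print p
  }' | sort -un                        # numeric order, one line per port
}
```

Run it as sudo ss -tln | ports_in_use on the coordinator host while the coordinator is stopped. Any port it prints is held by another process.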


Workers are connected but not receiving tasks

Symptom. coordinator_workers_connected is non-zero, but coordinator_active_jobs is also non-zero and never drains; the worker is idle in top.

First check:

coordinator host
sudo journalctl -u zisk-coordinator | grep -E 'register|assign'

The coordinator logs every worker registration with the worker's reported compute_capacity and a one-time confirmation when a task is assigned.

Likely causes:

  • Duplicate worker_id. Every worker must have a unique ID within the cluster. If worker_id is unset the worker auto-generates a UUID, which is safe. Hand-set IDs that collide cause the coordinator to reject the second registrant silently.

  • Insufficient compute_capacity. The coordinator picks workers until the requested compute_units for a job are satisfied. A worker with compute_capacity.compute_units = 1 may never be picked alongside larger peers. Tune [worker.compute_capacity] compute_units in worker.toml.

  • Job needs more workers than the pool offers. Check max_workers_per_job in the coordinator core config (default 10, 20 in prod.toml). A job will sit in Queued until the worker pool grows.
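
Because a colliding worker_id is rejected silently, it is worth checking hand-set IDs before rollout. The sketch below is a hypothetical helper, not a ZisK tool. It assumes you have already collected the IDs one per line; how you collect them depends on your deployment.

```shell
# Hypothetical helper: read one worker ID per line on stdin and print every ID
# that appears more than once. No output means the IDs are unique.
duplicate_worker_ids() {
  sort | uniq -d
}
```

With the input w1, w2, w1, w3 (one per line) it prints w1. Workers that leave worker_id unset do not need to be included, since their auto-generated UUIDs are safe.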


Jobs fail with a phase timeout

Symptom. The job moves from Queued to Running and then to Failed after a fixed duration; the coordinator logs a phase timeout and cancels every participating worker.

Cause. Three independent timeouts gate the job state machine:

| Phase | Default | Field |
|---|---|---|
| Contributions (challenges) | 300s | phase1_timeout_seconds |
| Prove | 600s | phase2_timeout_seconds |
| Aggregate | 100s | phase3_timeout_seconds |

When a phase exceeds its timeout, the coordinator fails the job and cancels all workers participating in it.
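
To compare an observed phase duration against the defaults, the table can be encoded as a small lookup. This is an illustrative sketch: the function names and the phase1/phase2/phase3 labels are hypothetical, and treating "exceeds" as strictly greater than the limit is an assumption.

```shell
# Hypothetical helpers encoding the default phase timeouts from the table above.
phase_timeout_default() {
  case "$1" in
    phase1) echo 300 ;;   # Contributions (challenges)
    phase2) echo 600 ;;   # Prove
    phase3) echo 100 ;;   # Aggregate
    *) return 1 ;;        # unknown phase label
  esac
}

# phase_exceeded PHASE ELAPSED_SECONDS
# Returns 0 if ELAPSED_SECONDS is over the default limit, 1 if not, 2 for an unknown phase.
# Assumption: the boundary is "strictly greater than".
phase_exceeded() {
  limit=$(phase_timeout_default "$1") || return 2
  [ "$2" -gt "$limit" ]
}
```

For example, phase_exceeded phase2 601 succeeds, while phase_exceeded phase2 600 does not. If you have overridden the timeouts in your config, substitute your own values.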

Fix. Raise the limits in the coordinator core TOML (coordinator.config_file):

coordinator core TOML
[coordinator]
phase1_timeout_seconds = 600
phase2_timeout_seconds = 1200
phase3_timeout_seconds = 300

Warning. Raising the timeouts hides slowness rather than fixing it. If a job that used to fit comfortably is now hitting timeouts, the root cause is usually a slow worker (memory pressure, hardware variance) or a slow shared resource (storage, network). Raising the timeout is the safety valve; the lasting fix is to find the underperforming component.


A worker is being declared dead but seems healthy

Symptom. The coordinator logs that a worker dropped, but the worker's logs show no outage on its side, only an immediate reconnect.

Cause. The coordinator declares a worker dead when it misses heartbeat_max_missed consecutive heartbeats. At the defaults this gives heartbeat_interval_seconds × heartbeat_max_missed = 30s × 3 = 90s of grace. A separate reconnect_grace_period_ms (default 500ms) masks transient drops in the middle of a job so that a flapping TCP connection does not fail the job.

| Field | Default | Effect |
|---|---|---|
| heartbeat_interval_seconds | 30 | How often heartbeats fire |
| heartbeat_max_missed | 3 | How many can be missed before declaring dead |
| reconnect_grace_period_ms | 500 | Mid-job tolerance for transient drops |
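
The grace window is the product of the first two fields. A one-line helper makes it easy to compare candidate settings; heartbeat_grace is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: heartbeat_grace INTERVAL_SECONDS MAX_MISSED
# Prints the number of seconds a worker can stay silent before it is declared dead.
heartbeat_grace() {
  echo $(( $1 * $2 ))
}
```

heartbeat_grace 30 3 prints 90 (the defaults), and heartbeat_grace 30 5 prints 150.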

Fix. Raise the heartbeat budget if your network has high-latency periods (cross-AZ deployment, oversubscribed links):

coordinator core TOML
[coordinator]
heartbeat_interval_seconds = 30
heartbeat_max_missed = 5 # 30s * 5 = 150s of grace
reconnect_grace_period_ms = 2000 # tolerate 2s of TCP drop

Tip. If heartbeats are timing out but nc -zv to the cluster port succeeds in tight loops, the bottleneck is usually CPU on the coordinator itself, not the network. Heartbeat processing shares a thread pool with job state transitions; a saturated coordinator drops heartbeats first.


A client RPC returned an error code

The public coordinator API returns gRPC status codes with a small set of well-defined application errors. The three you will see most are:

| Code | When | Action |
|---|---|---|
| JOB_NOT_FOUND | The supplied job_id is not known to the coordinator. Either it was never created or the coordinator restarted and lost it. | Re-submit the job. If the coordinator restarted, expect this for every in-flight job from before the restart. |
| WORKER_UNAVAILABLE | The job cannot be scheduled because no worker meets the requirements. | Add workers, raise compute_capacity, or lower min_compute_units. |
| SYSTEM_UNAVAILABLE | The coordinator is shutting down or starting up. | Retry with backoff. |
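
A client wrapper script can encode the table as a dispatch on the error code. This is a sketch only: client_action and backoff_delay are hypothetical helpers, the action labels are invented for illustration, and the base and cap values for the backoff are assumptions rather than anything the coordinator prescribes.

```shell
# Hypothetical helper: map an application error code to the action from the table above.
client_action() {
  case "$1" in
    JOB_NOT_FOUND)      echo "resubmit" ;;
    WORKER_UNAVAILABLE) echo "add-capacity" ;;
    SYSTEM_UNAVAILABLE) echo "retry-with-backoff" ;;
    *)                  echo "unknown"; return 1 ;;
  esac
}

# Hypothetical helper: backoff_delay ATTEMPT BASE_SECONDS CAP_SECONDS
# Doubles BASE_SECONDS once per attempt (ATTEMPT is 0-based) and caps the result.
backoff_delay() {
  delay=$2
  i=0
  while [ "$i" -lt "$1" ] && [ "$delay" -lt "$3" ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt "$3" ]; then delay=$3; fi
  echo "$delay"
}
```

With a base of 1 second and a cap of 30 seconds, backoff_delay 0 1 30 prints 1, backoff_delay 3 1 30 prints 8, and backoff_delay 10 1 30 prints 30.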

The coordinator restarted and all jobs are lost

Symptom. After a coordinator restart (manual, crash, or upgrade) every previously-submitted job returns JOB_NOT_FOUND.

Cause. The coordinator does not currently persist job state across restarts, and the repository ships no high-availability (HA) setup. A restart drops every in-flight job. Workers reconnect automatically, but the work they were doing is abandoned.

Mitigation.

  • Plan restarts during quiet windows. The shutdown_timeout_seconds (default 30) gives the coordinator a grace window to let in-flight RPCs drain, but it does not preserve jobs.
  • Make clients retry-aware. Treat JOB_NOT_FOUND after a coordinator restart as "re-submit", not "fatal".
  • Roll workers, not the coordinator. Worker restarts are safe (reconnect_grace_period_ms covers transient drops); the coordinator restart is the destructive one.

Warning. There is no built-in coordinator HA. A second coordinator is not a hot standby; running two against the same worker pool will produce undefined behaviour. Plan capacity for the restart-loss window accordingly.


A GPU worker is not using the GPU

Symptom. The worker was started with --gpu, but proving is CPU-bound and nvidia-smi shows no worker process.

First check:

worker host
nvidia-smi
sudo journalctl -u zisk-worker | grep -iE 'cuda|gpu|nvidia'

Likely causes:

  • Worker built with the cpu-only cargo feature. The default build is GPU-capable; the cpu-only feature compiles GPU support out entirely and --gpu becomes a no-op. Rebuild without --features cpu-only.

  • Driver/CUDA mismatch with the binary. The worker is dynamically linked against CUDA runtime libs. A driver older than the CUDA runtime fails at load time. Update the driver to match the CUDA version the binary was built against.

  • Kubernetes pod has no GPU resource limit. The Helm chart requires resources.limits."nvidia.com/gpu": 1 on the pod and a GPU-tagged image. Without the limit, the device plugin does not mount the GPU into the pod even on GPU-enabled nodes.
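
Checking whether the worker actually holds a GPU context can be scripted against nvidia-smi's query mode. The sketch below is a hypothetical helper; it assumes the worker process name contains zisk-worker (inferred from the systemd unit name used on this page, not confirmed) and lets you pass a different name as the first argument.

```shell
# Hypothetical helper: succeed if the worker appears among GPU compute processes.
# stdin: output of
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader
# $1: process name to look for (assumption: defaults to "zisk-worker").
worker_on_gpu() {
  grep -qF -- "${1:-zisk-worker}"
}
```

Pipe the nvidia-smi query shown in the comment into worker_on_gpu while a job is proving. A non-zero exit status means no matching process holds the GPU, which points back to the causes listed above.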


A worker errors out reading the proving key

Symptom. Worker logs an error mentioning provingKey, fails to start, or fails on the first job.

Cause. The worker expects the bundle layout at --proving-key (default ~/.zisk/provingKey):

/opt/zisk/
├── bin/
├── zisk/
├── provingKey/
└── provingKeySnark/ # only with --proving-key-snark

If the directory on disk does not match this layout, the worker fails at startup or on the first job.

Likely causes:

  • Wrong --proving-key path. The worker looks for the key at the path you pass. Confirm that the bundle exists at that location and that the files inside match those on a host where the worker runs successfully.

  • SNARK key missing on a SNARK-enabled worker. If you pass --proving-key-snark, the SNARK key must exist at the specified path. The default is alongside the STARK key in the bundle.

  • Kubernetes PVC not mounted, wrong UID, or wrong mode. The Helm chart runs as UID 998 / GID 999. The PVC bundle ownership must match, and the PVC must be ReadOnlyMany so multiple pods can mount it concurrently. If provingKey.strategy is download, the init container needs enough emptyDir space (default budget 80Gi).

  • Version mismatch between proving key and prover binary. Proving keys are versioned alongside the prover. A key produced by ziskup from one release used against a binary from a different release fails. Regenerate the bundle with the matching ziskup --system --prefix ... invocation.
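
A quick structural check of the bundle catches the path-related causes above before the worker starts. This is a minimal sketch: check_bundle is a hypothetical helper, the subdirectory names come from the layout shown earlier in this section, and it only checks that the directories exist. It does not validate file contents, ownership, or key versions.

```shell
# Hypothetical helper: check_bundle PREFIX [snark]
# Verifies that the expected subdirectories exist under PREFIX.
# Pass "snark" as the second argument to also require provingKeySnark/.
check_bundle() {
  prefix=$1
  required="bin zisk provingKey"
  if [ "${2:-}" = "snark" ]; then
    required="$required provingKeySnark"
  fi
  missing=""
  for sub in $required; do
    [ -d "$prefix/$sub" ] || missing="$missing $sub"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

For example, check_bundle /opt/zisk prints ok when bin/, zisk/, and provingKey/ are all present, and check_bundle /opt/zisk snark prints missing: provingKeySnark when only the SNARK key directory is absent.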


Where to go next

This is the end of the monitoring track. From here:

  • Revisit Metrics and alerts for the Prometheus catalogue used by every section above.
  • Revisit Logs for the filters that turn a vague alert into a specific failing job.
  • If you are still stuck after walking this page, capture the coordinator and worker logs around the failure window, the coordinator config, and the failing client request, and open an issue on the ZisK repository with all three.