Monitoring

A running ZisK cluster exposes two observability surfaces: a Prometheus `/metrics` endpoint on the coordinator and structured logs from every binary. This section walks through both, shows the scrape config and dashboards shipped with the repo, and ends with a runbook for the failure modes operators hit most often.

Metrics and alerts

Scrape the coordinator's Prometheus endpoint, walk the full metric catalogue, load the bundled Grafana dashboard, and bootstrap alerting rules from a starter set of PromQL queries.
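For a quick check before any Prometheus server is wired up, the endpoint can be read by hand. The sketch below is a minimal illustration, not part of the ZisK tooling: it parses the standard Prometheus text exposition format into a dictionary. The metric names in `SAMPLE` and the helper names `parse_metrics` and `fetch_metrics` are hypothetical, chosen only for the example; the real names are the ones listed in the metric catalogue.

```python
from urllib.request import urlopen

# Hypothetical sample payload; real metric names come from the coordinator.
SAMPLE = """\
# HELP example_jobs_total Jobs seen by the coordinator.
# TYPE example_jobs_total counter
example_jobs_total{phase="prove"} 7 1700000000000
example_workers_connected 3
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {series: value}.

    Blank lines and '#' lines (HELP/TYPE/comments) are skipped, and an
    optional trailing timestamp is ignored.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled series: the key runs up to the last closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            # Unlabelled series: the key is the first whitespace-separated token.
            parts = line.split(None, 1)
            series, rest = parts[0], parts[1] if len(parts) > 1 else ""
        fields = rest.split()
        if fields:
            samples[series] = float(fields[0])
    return samples


def fetch_metrics(url: str, timeout: float = 5.0) -> dict[str, float]:
    """Fetch a /metrics URL and parse it (the URL is supplied by the caller)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


for series, value in parse_metrics(SAMPLE).items():
    print(series, value)
```

To run it against a live cluster, call `fetch_metrics` with the coordinator's `/metrics` URL from your own deployment; the host and port are deliberately not assumed here.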

Logs

Read coordinator and worker logs on every deployment shape (systemd, launchd, Docker Compose, Kubernetes), switch to JSON for production aggregation, and filter by level, phase, or job ID.

Troubleshooting

Concrete diagnoses and fixes for the failure modes operators hit most: stuck workers, port conflicts, phase timeouts, heartbeat drops, mismatched proving keys, and state lost when the coordinator restarts.