Metrics and alerts
The coordinator exposes a Prometheus /metrics endpoint that
captures everything an operator needs to keep an eye on a running
ZisK cluster: API traffic, job throughput, worker fleet health, and
end-to-end proving latency. This guide walks through scraping the
endpoint, the meaning of every series the coordinator publishes,
loading the shipped Grafana dashboard, and a starter set of
alerting rules you can paste into Prometheus.
The metrics endpoint
The coordinator exposes Prometheus text-format metrics on a small
HTTP server, separate from the public gRPC API and the worker
cluster port. The default address is 0.0.0.0:9090, which means
/metrics on port 9090 of the coordinator host.
| Deployment | Address | Override |
|---|---|---|
| Bare-metal / systemd / launchd | :9090/metrics | --metrics-port flag, [metrics] port in coordinator.toml, or ZISK_COORDINATOR_METRICS_PORT |
| Docker Compose | :9091/metrics | The shipped compose.yaml remaps the metrics port to avoid clashing with the Prometheus container that scrapes it |
A quick smoke test from the same host:
curl -s http://127.0.0.1:9090/metrics | head -40
You should see standard Prometheus output: # HELP and # TYPE
lines followed by coordinator_* series. The same server also
serves a GET /health liveness probe that returns 200 OK when
the process is up.
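The same smoke test can be scripted. The following is a minimal sketch, not part of the coordinator tooling: `parse_metrics` and `fetch_metrics` are hypothetical helper names, and the URL assumes the bare-metal default port. It reads the Prometheus text format and returns each sample as a (name, labels, value) tuple, skipping the # HELP and # TYPE lines.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
# A trailing timestamp, if present, is ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces. Escape sequences are kept as-is.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank, # HELP, # TYPE
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://127.0.0.1:9090/metrics", timeout=5):
    """Fetch the raw exposition text (assumed default address)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `parse_metrics(fetch_metrics())` on the coordinator host and checking that at least one name starts with `coordinator_` gives a scriptable version of the curl check. Use port 9091 for the Compose deployment.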
Workers do not currently expose a /metrics endpoint. Every
series listed below originates on the coordinator. The
coordinator does, however, publish per-worker series labelled by
worker_id, so you still get worker-level visibility, just from
one scrape target instead of N.
Metrics catalogue
The coordinator publishes eight series. Memorise the labels: most
useful alert and dashboard queries are slices of these by method,
outcome, or worker_id.
Request-level series
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| coordinator_requests_total | counter | method, status | Count of inbound public gRPC calls. method is the RPC name (JobRequest, WaitJobResult, etc.); status is the gRPC status code returned to the client. |
| coordinator_request_duration_seconds | histogram | method | Wall-clock latency of each public API call, including time spent waiting on workers for long-poll RPCs. |
Job-level series
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| coordinator_active_jobs | gauge | none | Number of jobs currently in Running (any phase) on the coordinator. The natural dashboard headline. |
| coordinator_jobs_total | counter | kind, outcome | Lifetime job count, partitioned by kind (prove, execute) and outcome (completed, failed, cancelled). |
| coordinator_job_duration_seconds | histogram | outcome | End-to-end job latency from JobRequest to terminal state. Buckets: [1, 2, 5, 10, 20, 60, 300] seconds. |
| coordinator_registered_programs_total | gauge | none | Number of guest programs currently registered via RegisterGuestProgram. |
Worker-fleet series
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| coordinator_workers_connected | gauge | none | Workers with an open WorkerStream to the coordinator. The single most important signal: if it drops, no proving is happening. |
| coordinator_worker_jobs_total | counter | worker_id, outcome | Per-worker job count, partitioned by outcome. Useful for spotting one bad worker dragging the cluster down. |
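To illustrate how coordinator_worker_jobs_total can single out a bad worker, here is a small sketch that computes each worker's lifetime failure ratio. The function name `worker_failure_ratios` and the `min_jobs` cut-off are hypothetical. The sketch also assumes the outcome label takes the same values as on coordinator_jobs_total (completed, failed, cancelled), which the table above does not state explicitly.

```python
from collections import defaultdict


def worker_failure_ratios(samples, min_jobs=10):
    """Return {worker_id: failed / total} for workers with enough jobs.

    samples: iterable of (labels, value) pairs taken from
    coordinator_worker_jobs_total, where labels is a dict holding
    "worker_id" and "outcome".
    """
    totals = defaultdict(float)
    failed = defaultdict(float)
    for labels, value in samples:
        worker = labels["worker_id"]
        totals[worker] += value
        if labels["outcome"] == "failed":  # assumed label value
            failed[worker] += value
    # Skip workers with too few jobs to give a meaningful ratio.
    return {
        worker: failed[worker] / total
        for worker, total in totals.items()
        if total >= min_jobs
    }
```

Because the inputs are lifetime counters, the ratio covers everything since the coordinator started. For recent behaviour, the equivalent calculation over a rate() window in Prometheus is usually more informative.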
The histogram buckets on coordinator_job_duration_seconds
([1, 2, 5, 10, 20, 60, 300]) are tuned for typical STARK
proving jobs. If your workload sits in the long tail (very large
inputs or constrained CPU workers), most observations land above
the 300 s top bucket, and histogram_quantile can then report no
more than 300 s. In that case you will need a custom buckets
config rather than reading quantiles at that ceiling literally.
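The saturation effect is easier to see with numbers. The sketch below is a simplified re-implementation of the linear interpolation that PromQL's histogram_quantile applies to classic histograms, written only to illustrate the 300 s ceiling. It is not the coordinator's code, and the bucket counts in the usage note are made up.

```python
import math

# Upper bounds of the shipped buckets plus the implicit +Inf bucket.
BOUNDS = [1, 2, 5, 10, 20, 60, 300, math.inf]


def histogram_quantile(q, cumulative_counts, bounds=BOUNDS):
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if math.isinf(bounds[idx]):
        # The quantile lies in +Inf: only the highest finite bound is known.
        return bounds[idx - 1]
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative_counts[idx - 1] if idx > 0 else 0
    in_bucket = cumulative_counts[idx] - below
    # Linear interpolation inside the selected bucket.
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket
```

With cumulative counts `[0, 0, 10, 50, 90, 100, 100, 100]` the p99 estimate falls inside the 20 to 60 s bucket. With `[0, 0, 0, 0, 0, 10, 50, 100]`, where half of the jobs exceeded 300 s, the estimate is pinned at 300 regardless of how long those jobs really took.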
Scrape with Prometheus
The Docker Compose deployment ships a Prometheus container that scrapes the coordinator automatically. For every other deployment shape, point your existing Prometheus at the coordinator's metrics port:
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: zisk-coordinator
    static_configs:
      - targets: ['coordinator:9091'] # 9090 on systemd/launchd/k8s
The 5-second interval matches the value the in-repo Compose stack
uses (distributed/deploy/prometheus/prometheus.yml). It is a
good default: job state evolves on the order of seconds, and the
endpoint is cheap to serve.
The example targets coordinator:9091, the Compose port. On
the bare-metal install paths and Kubernetes, the metrics port is
9090. Update the target accordingly when you copy this
snippet.
Once Prometheus is scraping, sanity-check the target from the Prometheus side:
up{job="zisk-coordinator"}
A value of 1 means the scrape succeeded. 0 means the scrape
failed: the coordinator is down, it is up but unreachable from
Prometheus, or the metrics server itself crashed.
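The same check can be automated against the Prometheus HTTP API, for example in a deployment smoke test. This is a sketch under assumptions: the Prometheus base URL is a placeholder for wherever your Prometheus listens, and `scrape_status` and `query_up` are hypothetical helper names. It relies only on the standard `/api/v1/query` instant-query endpoint.

```python
import json
import urllib.parse
import urllib.request


def scrape_status(response_body):
    """Map each target's instance label to True when up == 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Instant-vector samples arrive as [timestamp, "value-as-string"].
    return {
        item["metric"].get("instance", ""): item["value"][1] == "1"
        for item in payload["data"]["result"]
    }


def query_up(prometheus_url="http://localhost:9090", job="zisk-coordinator"):
    """Run up{job="..."} against Prometheus (URL is an assumed placeholder)."""
    query = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    url = f"{prometheus_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return scrape_status(resp.read())
```

An empty result means Prometheus has no target with that job label at all, which usually points at the scrape config rather than at the coordinator.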
Dashboards
A Grafana dashboard ships with the repo at
distributed/deploy/grafana/dashboards/zisk-overview.json, along
with datasource provisioning under the same deploy/grafana/
directory. The Compose stack mounts both automatically, so a
docker compose up brings up Grafana on http://localhost:3000
with the dashboard already loaded.
To load it into an existing Grafana instance:
- In Grafana, go to Dashboards → Import.
- Upload distributed/deploy/grafana/dashboards/zisk-overview.json.
- Pick the Prometheus datasource scraping the coordinator.
The dashboard surfaces the headline series: active jobs over time,
job outcome breakdown, connected worker count, p50/p95/p99 of
coordinator_job_duration_seconds, and per-method API latency.
Starter alert rules
Five PromQL rules cover the failures that almost always warrant a page or an investigation. Each one is rare in normal operation and unambiguous when it fires.
| Alert | Expression | For | Severity |
|---|---|---|---|
| Coordinator unreachable | up{job="zisk-coordinator"} == 0 | 2m | Page |
| No workers connected | coordinator_workers_connected == 0 | 5m | Page |
| Worker fleet degraded | coordinator_workers_connected < <expected_replicas> | 10m | Investigate |
| Job failures spiking | rate(coordinator_jobs_total{outcome="failed"}[5m]) > 0.1 | 10m | Investigate |
| p99 latency over budget | histogram_quantile(0.99, sum by (le) (rate(coordinator_job_duration_seconds_bucket[10m]))) > <budget_seconds> | 15m | Investigate |
Translated to a Prometheus rule file:
groups:
  - name: zisk-coordinator
    rules:
      - alert: ZiskCoordinatorUnreachable
        expr: up{job="zisk-coordinator"} == 0
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Coordinator scrape failed for 2m"
      - alert: ZiskNoWorkers
        expr: coordinator_workers_connected == 0
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "No workers connected to the coordinator"
      - alert: ZiskWorkerFleetDegraded
        expr: coordinator_workers_connected < 4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Worker count below expected baseline"
      - alert: ZiskJobFailuresSpiking
        expr: rate(coordinator_jobs_total{outcome="failed"}[5m]) > 0.1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Job failure rate above 0.1/s"
      - alert: ZiskJobP99Slow
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(coordinator_job_duration_seconds_bucket[10m]))) > 300
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "p99 job duration above 300s"
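Replace < 4 and the 300-second p99 threshold with values that
match your deployment's expected worker count and proving SLO.
The thresholds are illustrative, not universal. Note that with the
default buckets, histogram_quantile never returns more than 300
(the highest finite bucket bound), so > 300 cannot fire as written;
choose a budget below 300 or configure larger buckets first.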
coordinator_jobs_total{outcome="failed"} includes jobs that
failed because the client asked for something invalid (e.g.
an unknown hash_id), not just infrastructure problems. If your
client traffic is noisy, slice the alert by kind or correlate
with coordinator_requests_total{status="..."} before paging.
Next steps
Metrics tell you the cluster is unhealthy. To find out why a specific job failed, continue to Logs and then to Troubleshooting for the runbook that maps common symptoms to fixes.