Metrics and alerts

The coordinator exposes a Prometheus /metrics endpoint that captures everything an operator needs to keep an eye on a running ZisK cluster: API traffic, job throughput, worker fleet health, and end-to-end proving latency. This guide walks through scraping the endpoint, the meaning of every series the coordinator publishes, loading the shipped Grafana dashboard, and a starter set of alerting rules you can paste into Prometheus.

The metrics endpoint

The coordinator serves Prometheus text-format metrics from a small HTTP server, separate from the public gRPC API and the worker cluster port. The default bind address is 0.0.0.0:9090, so the endpoint is reachable at /metrics on port 9090 of every network interface on the coordinator host.

| Deployment | Address | Override |
| --- | --- | --- |
| Bare-metal / systemd / launchd | :9090/metrics | --metrics-port flag, [metrics] port in coordinator.toml, or ZISK_COORDINATOR_METRICS_PORT |
| Docker Compose | :9091/metrics | The shipped compose.yaml remaps to avoid clashing with the Prometheus container that scrapes it |

A quick smoke test from the same host:

```bash
curl -s http://127.0.0.1:9090/metrics | head -40
```

You should see standard Prometheus output: # HELP and # TYPE lines followed by coordinator_* series. The same server also serves a GET /health liveness probe that returns 200 OK when the process is up.
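The same smoke test can be scripted. The sketch below is a minimal illustration, not part of the ZisK tooling: `metric_types` and `fetch_metrics` are hypothetical helper names, and the URL is the bare-metal default quoted above. It reads the `# TYPE` lines of the exposition text and reports which metric families were found.

```python
from urllib.request import urlopen


def metric_types(exposition: str) -> dict[str, str]:
    """Map metric family name -> type, read from '# TYPE <name> <type>' lines."""
    types: dict[str, str] = {}
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


def fetch_metrics(url: str = "http://127.0.0.1:9090/metrics", timeout: float = 5.0) -> str:
    """Download the exposition text (assumed default address; use 9091 under Compose)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def coordinator_families(exposition: str) -> dict[str, str]:
    """Keep only the families whose name starts with 'coordinator_'."""
    return {n: t for n, t in metric_types(exposition).items() if n.startswith("coordinator_")}
```

Calling `coordinator_families(fetch_metrics())` on the coordinator host should return a non-empty mapping once the process is up; an empty result suggests the request reached a different service on that port.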

note

Workers do not currently expose a /metrics endpoint. Every series listed below originates on the coordinator. The coordinator does, however, publish per-worker series labelled by worker_id, so you still get worker-level visibility, just from one scrape target instead of N.
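As an illustration of worker-level visibility from a single scrape, the sketch below computes each worker's lifetime failure share from the coordinator_worker_jobs_total samples described in the catalogue below. It is a simplified parser, not a full implementation of the exposition format. The helper name `worker_failure_share` is hypothetical, and the outcome value `failed` is assumed to match the one documented for coordinator_jobs_total.

```python
import re
from collections import defaultdict

# Matches one sample line such as:
#   coordinator_worker_jobs_total{worker_id="w1",outcome="failed"} 4
SAMPLE_RE = re.compile(r'^coordinator_worker_jobs_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def worker_failure_share(exposition: str, failed_outcome: str = "failed") -> dict[str, float]:
    """Return {worker_id: failed_jobs / all_jobs} from raw /metrics text."""
    totals: dict[str, float] = defaultdict(float)
    failed: dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # comment line or a different metric
        labels = dict(LABEL_RE.findall(match.group("labels")))
        value = float(match.group("value"))
        worker = labels.get("worker_id", "unknown")
        totals[worker] += value
        if labels.get("outcome") == failed_outcome:
            failed[worker] += value
    # Workers with no recorded jobs are skipped to avoid dividing by zero.
    return {worker: failed[worker] / total for worker, total in totals.items() if total > 0}
```

These are lifetime counters, so the ratio reflects the whole uptime of the coordinator process. For recent behaviour, a rate over a time window computed in Prometheus is the better tool.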


Metrics catalogue

The coordinator publishes eight metrics; each histogram additionally expands into _bucket, _sum, and _count series. Memorise the labels: most useful alert and dashboard queries are slices of these metrics by method, outcome, or worker_id.

Request-level series

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| coordinator_requests_total | counter | method, status | Count of inbound public gRPC calls. method is the RPC name (JobRequest, WaitJobResult, etc.); status is the gRPC status code returned to the client. |
| coordinator_request_duration_seconds | histogram | method | Wall-clock latency of each public API call, including time spent waiting on workers for long-poll RPCs. |

Job-level series

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| coordinator_active_jobs | gauge | none | Number of jobs currently in Running (any phase) on the coordinator. The natural dashboard headline. |
| coordinator_jobs_total | counter | kind, outcome | Lifetime job count, partitioned by kind (prove, execute) and outcome (completed, failed, cancelled). |
| coordinator_job_duration_seconds | histogram | outcome | End-to-end job latency from JobRequest to terminal state. Buckets: [1, 2, 5, 10, 20, 60, 300] seconds. |
| coordinator_registered_programs_total | gauge | none | Number of guest programs currently registered via RegisterGuestProgram. |

Worker-fleet series

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| coordinator_workers_connected | gauge | none | Workers with an open WorkerStream to the coordinator. The single most important signal: if it drops, no proving is happening. |
| coordinator_worker_jobs_total | counter | worker_id, outcome | Per-worker job count, partitioned by outcome. Useful for spotting one bad worker dragging the cluster down. |

tip

The histogram buckets on coordinator_job_duration_seconds ([1, 2, 5, 10, 20, 60, 300]) are tuned for typical STARK proving jobs. If your workload sits in the long tail (very large inputs or constrained CPU workers), every job slower than 300 s lands in the implicit +Inf bucket, where histogram_quantile cannot interpolate and reports 300 s at most. In that case you will need a custom buckets config rather than reading quantiles at that ceiling literally.
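To see why the top bucket saturates, the sketch below re-implements the documented behaviour of histogram_quantile for classic histograms in simplified form. It is an illustration, not Prometheus's own code: `estimate_quantile` is a hypothetical helper and the bucket counts in the usage lines are made up. The estimator interpolates linearly inside the bucket that contains the requested rank, and falls back to the largest finite bound when that bucket is +Inf.

```python
import math

# Upper bounds from the catalogue, plus the implicit +Inf bucket.
BOUNDS = [1, 2, 5, 10, 20, 60, 300, math.inf]


def estimate_quantile(q: float, cumulative: list[float]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, count in enumerate(cumulative) if count >= rank)
    if idx == len(BOUNDS) - 1:
        # Rank falls in +Inf: no upper edge to interpolate towards,
        # so the largest finite bound is returned.
        return float(BOUNDS[-2])
    lower = BOUNDS[idx - 1] if idx > 0 else 0.0
    below = cumulative[idx - 1] if idx > 0 else 0.0
    in_bucket = cumulative[idx] - below
    return lower + (BOUNDS[idx] - lower) * (rank - below) / in_bucket


# Hypothetical workload: 90 jobs finished in 20-60 s, 10 jobs took over 300 s.
counts = [0, 0, 0, 0, 0, 90, 90, 100]
p50 = estimate_quantile(0.50, counts)  # interpolated inside the 20-60 s bucket
p99 = estimate_quantile(0.99, counts)  # capped at 300.0, however slow the jobs were
```

In this example the ten slow jobs could have taken five minutes or five hours and the reported p99 would be the same, which is why wider buckets are needed before quantiles near the ceiling can be trusted.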


Scrape with Prometheus

The Docker Compose deployment ships a Prometheus container that scrapes the coordinator automatically. For every other deployment shape, point your existing Prometheus at the coordinator's metrics port:

prometheus.yml
```yaml
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: zisk-coordinator
    static_configs:
      - targets: ['coordinator:9091'] # 9090 on systemd/launchd/k8s
```

The 5-second interval matches the value the in-repo Compose stack uses (distributed/deploy/prometheus/prometheus.yml). It is a good default: job state evolves on the order of seconds, and the endpoint is cheap to serve.

note

The example targets coordinator:9091, the Compose port. On the bare-metal install paths and Kubernetes, the metrics port is 9090. Update the target accordingly when you copy this snippet.

Once Prometheus is scraping, sanity-check the target from the Prometheus side:

```
up{job="zisk-coordinator"}
```

A value of 1 means the last scrape succeeded. A value of 0 means the scrape failed: the coordinator process is down, it is up but unreachable from Prometheus, or the metrics server itself has crashed.
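The same check can be run from a script through the Prometheus HTTP API, which serves instant queries at /api/v1/query. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, and the Prometheus address `http://localhost:9090` is a guess that you should replace with wherever your Prometheus listens.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

UP_QUERY = 'up{job="zisk-coordinator"}'


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def scrape_states(payload: dict) -> dict[str, bool]:
    """Turn an instant-query response for `up` into {instance: scrape_ok}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("instance", "unknown"): series["value"][1] == "1"
        for series in payload["data"]["result"]
    }


def check_coordinator(base_url: str = "http://localhost:9090") -> dict[str, bool]:
    """Query Prometheus (assumed address) and report per-target scrape health."""
    with urlopen(instant_query_url(base_url, UP_QUERY), timeout=5) as resp:
        return scrape_states(json.load(resp))
```

An empty result from `check_coordinator()` means Prometheus knows no target with that job label, which usually points at a scrape config that was not loaded rather than at the coordinator.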


Dashboards

A Grafana dashboard ships with the repo at distributed/deploy/grafana/dashboards/zisk-overview.json, along with datasource provisioning under the same deploy/grafana/ directory. The Compose stack mounts both automatically, so a docker compose up brings up Grafana on http://localhost:3000 with the dashboard already loaded.

To load it into an existing Grafana instance:

  1. In Grafana, go to Dashboards → Import.
  2. Upload distributed/deploy/grafana/dashboards/zisk-overview.json.
  3. Pick the Prometheus datasource scraping the coordinator.

The dashboard surfaces the headline series: active jobs over time, job outcome breakdown, connected worker count, p50/p95/p99 of coordinator_job_duration_seconds, and per-method API latency.


Starter alert rules

Five PromQL rules cover the failures that almost always warrant a page or an investigation. Each one is rare in normal operation and unambiguous when it fires.

| Alert | Expression | For | Severity |
| --- | --- | --- | --- |
| Coordinator unreachable | up{job="zisk-coordinator"} == 0 | 2m | Page |
| No workers connected | coordinator_workers_connected == 0 | 5m | Page |
| Worker fleet degraded | coordinator_workers_connected < <expected_replicas> | 10m | Investigate |
| Job failures spiking | rate(coordinator_jobs_total{outcome="failed"}[5m]) > 0.1 | 10m | Investigate |
| p99 latency over budget | histogram_quantile(0.99, sum by (le) (rate(coordinator_job_duration_seconds_bucket[10m]))) > <budget_seconds> | 15m | Investigate |

Translated to a Prometheus rule file:

zisk-alerts.yml
```yaml
groups:
  - name: zisk-coordinator
    rules:
      - alert: ZiskCoordinatorUnreachable
        expr: up{job="zisk-coordinator"} == 0
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Coordinator scrape failed for 2m"

      - alert: ZiskNoWorkers
        expr: coordinator_workers_connected == 0
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "No workers connected to the coordinator"

      - alert: ZiskWorkerFleetDegraded
        expr: coordinator_workers_connected < 4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Worker count below expected baseline"

      - alert: ZiskJobFailuresSpiking
        expr: rate(coordinator_jobs_total{outcome="failed"}[5m]) > 0.1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Job failure rate above 0.1/s"

      - alert: ZiskJobP99Slow
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(coordinator_job_duration_seconds_bucket[10m]))) > 300
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "p99 job duration above 300s"
```

tip

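Replace < 4 and the 300 second p99 threshold with values that match your deployment's expected worker count and proving SLO. The thresholds are illustrative, not universal. Note that with the default buckets histogram_quantile never reports more than 300 s, the largest finite bucket bound, so a p99 threshold of > 300 cannot fire: choose a budget below 300 s, or configure wider buckets first.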
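When picking a failure-rate threshold, it helps to translate between the per-second value that rate() returns and a job count you can reason about. The helpers below are a small arithmetic sketch with hypothetical names. They show that the illustrative 0.1 threshold corresponds to about 30 failed jobs in the 5-minute window.

```python
def events_in_window(rate_per_second: float, window_seconds: float) -> float:
    """How many events a sustained per-second rate adds up to over a window."""
    return rate_per_second * window_seconds


def per_second_threshold(events: float, window_seconds: float) -> float:
    """Per-second rate equivalent to a given number of events per window."""
    return events / window_seconds


# The illustrative rule: rate(...[5m]) > 0.1 over a 300-second window.
failures_needed = events_in_window(0.1, 300)

# A quieter cluster might alert on 3 failures per 5 minutes instead.
quiet_threshold = per_second_threshold(3, 300)
```

A low-volume deployment that runs a handful of jobs per hour would therefore never trip the 0.1 threshold, and a value derived with `per_second_threshold` from your own job volume is likely a better starting point.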

warning

coordinator_jobs_total{outcome="failed"} includes jobs that failed because the client asked for something invalid (e.g. an unknown hash_id), not just infrastructure problems. If your client traffic is noisy, slice the alert by kind or correlate with coordinator_requests_total{status="..."} before paging.


Next steps

Metrics tell you the cluster is unhealthy. To find out why a specific job failed, continue to Logs and then to Troubleshooting for the runbook that maps common symptoms to fixes.