Metrics and alerts

The coordinator exposes a Prometheus /metrics endpoint that captures everything an operator needs to keep an eye on a running ZisK cluster: API traffic, job throughput, worker fleet health, and end-to-end proving latency. This guide covers how to scrape the endpoint, what each series the coordinator publishes means, how to load the shipped Grafana dashboard, and a starter set of alerting rules you can paste into Prometheus.

The metrics endpoint

The coordinator serves Prometheus text-format metrics from a small HTTP server, separate from the public gRPC API and the worker cluster port. The default bind address is 0.0.0.0:9090, so metrics are available at /metrics on port 9090 of the coordinator host, on all interfaces.

Deployment	Address	Override
Bare-metal / systemd / launchd	`:9090/metrics`	`--metrics-port` flag, `[metrics] port` in `coordinator.toml`, or `ZISK_COORDINATOR_METRICS_PORT`
Docker Compose	`:9091/metrics`	The shipped `compose.yaml` remaps to avoid clashing with the Prometheus container that scrapes it

A quick smoke test from the same host:

bash
curl -s http://127.0.0.1:9090/metrics | head -40

You should see standard Prometheus output: # HELP and # TYPE lines followed by coordinator_* series. The same server also serves a GET /health liveness probe that returns 200 OK when the process is up.
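If you want to automate this smoke test, the sketch below checks a scraped payload for the eight series names listed in the metrics catalogue of this guide. It uses only the Python standard library. The helper names (declared_series, missing_series, fetch) and the SAMPLE payload, including its HELP text, are illustrative assumptions and not part of ZisK.

```python
import re
import urllib.request

# Series names copied from the metrics catalogue in this guide.
EXPECTED = {
    "coordinator_requests_total",
    "coordinator_request_duration_seconds",
    "coordinator_active_jobs",
    "coordinator_jobs_total",
    "coordinator_job_duration_seconds",
    "coordinator_registered_programs_total",
    "coordinator_workers_connected",
    "coordinator_worker_jobs_total",
}

# Matches exposition-format lines such as "# TYPE coordinator_active_jobs gauge".
TYPE_LINE = re.compile(r"^# TYPE (\S+) (\S+)$")


def declared_series(payload: str) -> dict[str, str]:
    """Map each metric family named on a '# TYPE' line to its declared type."""
    found = {}
    for line in payload.splitlines():
        match = TYPE_LINE.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


def missing_series(payload: str) -> set[str]:
    """Return the catalogued series that the payload does not declare."""
    return EXPECTED - set(declared_series(payload))


def fetch(url: str = "http://127.0.0.1:9090/metrics") -> str:
    """Download the metrics payload (default: the bare-metal address)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Hypothetical payload for offline experimentation; the HELP text is made up.
SAMPLE = """\
# HELP coordinator_active_jobs Jobs currently running.
# TYPE coordinator_active_jobs gauge
coordinator_active_jobs 2
# HELP coordinator_workers_connected Connected workers.
# TYPE coordinator_workers_connected gauge
coordinator_workers_connected 4
"""
```

Against a live coordinator, missing_series(fetch()) should ideally return an empty set. Depending on how the metrics are registered, a labelled family may not appear until its first sample is recorded, so a missing name on a freshly started coordinator is not necessarily a fault.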

note

Workers do not currently expose a /metrics endpoint. Every series listed below originates on the coordinator. The coordinator does, however, publish per-worker series labelled by worker_id, so you still get worker-level visibility, just from one scrape target instead of N.

Metrics catalogue

The coordinator publishes eight series. Memorise the labels: most useful alert and dashboard queries are slices of these by method, outcome, or worker_id.

Request-level series

Metric	Type	Labels	Meaning
`coordinator_requests_total`	counter	`method`, `status`	Count of inbound public gRPC calls. `method` is the RPC name (`JobRequest`, `WaitJobResult`, etc.); `status` is the gRPC status code returned to the client.
`coordinator_request_duration_seconds`	histogram	`method`	Wall-clock latency of each public API call, including time spent waiting on workers for long-poll RPCs.

Job-level series

Metric	Type	Labels	Meaning
`coordinator_active_jobs`	gauge	none	Number of jobs currently in `Running` (any phase) on the coordinator. The natural dashboard headline.
`coordinator_jobs_total`	counter	`kind`, `outcome`	Lifetime job count, partitioned by `kind` (`prove`, `execute`) and `outcome` (`completed`, `failed`, `cancelled`).
`coordinator_job_duration_seconds`	histogram	`outcome`	End-to-end job latency from `JobRequest` to terminal state. Buckets: `[1, 2, 5, 10, 20, 60, 300]` seconds.
`coordinator_registered_programs_total`	gauge	none	Number of guest programs currently registered via `RegisterGuestProgram`.

Worker-fleet series

Metric	Type	Labels	Meaning
`coordinator_workers_connected`	gauge	none	Workers with an open `WorkerStream` to the coordinator. The single most important signal: if it drops to zero, no proving is happening.
`coordinator_worker_jobs_total`	counter	`worker_id`, `outcome`	Per-worker job count, partitioned by outcome. Useful for spotting one bad worker dragging the cluster down.
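As a rough illustration of how the per-worker counter can expose one bad worker, the sketch below compares each worker's failure ratio with the fleet median. It assumes the outcome label on coordinator_worker_jobs_total takes the value failed, as it does on coordinator_jobs_total; the function names, thresholds, and sample numbers are hypothetical. Feed it increases over a recent window rather than lifetime totals, so that old failures do not dominate.

```python
from collections import defaultdict
from statistics import median


def failure_ratios(samples: dict[tuple[str, str], float]) -> dict[str, float]:
    """Return failed / total per worker from {(worker_id, outcome): count}."""
    total = defaultdict(float)
    failed = defaultdict(float)
    for (worker_id, outcome), count in samples.items():
        total[worker_id] += count
        if outcome == "failed":  # assumed label value, see lead-in
            failed[worker_id] += count
    return {w: failed[w] / total[w] for w in total if total[w] > 0}


def suspects(
    samples: dict[tuple[str, str], float],
    factor: float = 3.0,   # hypothetical: how far above the median counts as bad
    floor: float = 0.05,   # hypothetical: ignore workers at or below 5% failures
) -> list[str]:
    """Workers whose failure ratio exceeds both `floor` and `factor` x the median."""
    ratios = failure_ratios(samples)
    if not ratios:
        return []
    typical = median(ratios.values())
    return sorted(
        w for w, r in ratios.items() if r > floor and r > factor * typical
    )


# Made-up window of job counts for three workers.
WINDOW = {
    ("worker-a", "completed"): 98, ("worker-a", "failed"): 2,
    ("worker-b", "completed"): 95, ("worker-b", "failed"): 5,
    ("worker-c", "completed"): 40, ("worker-c", "failed"): 60,
}
```

The median is used instead of the fleet-wide mean because, in a small fleet, a single failing worker can pull the mean up far enough to hide itself.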

tip

The histogram buckets on coordinator_job_duration_seconds ([1, 2, 5, 10, 20, 60, 300]) are tuned for typical STARK proving jobs. If your workload sits at the long tail (very large inputs or constrained CPU workers), the top bucket will saturate: histogram_quantile cannot report a value above the highest finite bucket bound, so every quantile that falls beyond 300 s reads as exactly 300 s. In that case you will need a custom buckets config rather than taking histogram_quantile results at 300 s literally.
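To make the saturation concrete, here is a simplified re-implementation of the linear-interpolation estimate that histogram_quantile applies to classic histograms. It is a sketch for illustration, not Prometheus's actual code, and the bucket counts are made up. It assumes 0 < q <= 1 and at least one observation.

```python
import math

# Bucket upper bounds from this guide, plus the implicit +Inf bucket.
BOUNDS = [1, 2, 5, 10, 20, 60, 300, math.inf]


def histogram_quantile(q: float, cumulative: list[float]) -> float:
    """Estimate the q-quantile from cumulative bucket counts.

    cumulative[i] is the number of observations <= BOUNDS[i], which is how
    the *_bucket series are exposed.
    """
    total = cumulative[-1]
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, count in enumerate(cumulative) if count >= rank)
    if idx == len(BOUNDS) - 1:
        # The quantile lies in the +Inf bucket: the estimate is capped at
        # the highest finite bound.
        return float(BOUNDS[-2])
    lower = 0.0 if idx == 0 else float(BOUNDS[idx - 1])
    below = 0.0 if idx == 0 else cumulative[idx - 1]
    in_bucket = cumulative[idx] - below
    # Assume observations are spread evenly inside the bucket.
    return lower + (BOUNDS[idx] - lower) * (rank - below) / in_bucket


# Hypothetical workloads of 100 jobs each.
FAST = [0, 0, 10, 40, 80, 100, 100, 100]   # every job finishes within 60 s
SLOW = [0, 0, 0, 0, 0, 10, 50, 100]        # half the jobs exceed 300 s
SLOWER = [0, 0, 0, 0, 0, 0, 0, 100]        # every job exceeds 300 s

p99_fast = histogram_quantile(0.99, FAST)
p99_slow = histogram_quantile(0.99, SLOW)
p99_slower = histogram_quantile(0.99, SLOWER)
```

Both slow workloads yield the same estimate because the histogram cannot tell how far beyond 300 s the jobs ran. One consequence is that a latency comparison with a threshold above the top finite bound can never be true.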

Scrape with Prometheus

The Docker Compose deployment ships a Prometheus container that scrapes the coordinator automatically. For every other deployment shape, point your existing Prometheus at the coordinator's metrics port:

prometheus.yml
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: zisk-coordinator
    static_configs:
      - targets: ['coordinator:9091']    # 9090 on systemd/launchd/k8s

The 5-second interval matches the value the in-repo Compose stack uses (distributed/deploy/prometheus/prometheus.yml). It is a good default: job state evolves on the order of seconds, and the endpoint is cheap to serve.

note

The example targets coordinator:9091, the Compose port. On the bare-metal install paths and Kubernetes, the metrics port is 9090. Update the target accordingly when you copy this snippet.

Once Prometheus is scraping, sanity-check the join from the Prometheus side:

up{job="zisk-coordinator"}

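A value of 1 means the scrape succeeded. A value of 0 means the scrape failed: the coordinator process is down, it is up but unreachable from Prometheus, or the metrics server itself crashed. If the query returns no series at all, Prometheus has not loaded the scrape job.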
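The same check can be scripted against the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The sketch below builds the query URL and interprets the JSON response. The Prometheus base URL, the helper names, and the sample response body are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

UP_QUERY = 'up{job="zisk-coordinator"}'


def query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def scrape_states(response_body: str) -> dict[str, bool]:
    """Map each scraped instance to True (up == 1) or False (up == 0)."""
    doc = json.loads(response_body)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error')}")
    # Each vector sample carries its value as [timestamp, "string value"].
    return {
        sample["metric"].get("instance", "?"): sample["value"][1] == "1"
        for sample in doc["data"]["result"]
    }


def check(base: str) -> dict[str, bool]:
    """Run the query against a live Prometheus at `base` (assumed URL)."""
    with urllib.request.urlopen(query_url(base, UP_QUERY), timeout=5) as resp:
        return scrape_states(resp.read().decode("utf-8"))


# Hypothetical response body for a failing scrape of the Compose target.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {"__name__": "up", "instance": "coordinator:9091",
                       "job": "zisk-coordinator"},
            "value": [1700000000.0, "0"],
        }],
    },
})
```

An empty dictionary from scrape_states means Prometheus returned no series for the job, which usually points at the scrape configuration rather than at the coordinator.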

Dashboards

A Grafana dashboard ships with the repo at distributed/deploy/grafana/dashboards/zisk-overview.json, along with datasource provisioning under the same distributed/deploy/grafana/ directory. The Compose stack mounts both automatically, so running docker compose up brings up Grafana on http://localhost:3000 with the dashboard already loaded.

To load it into an existing Grafana instance:

1. In Grafana, go to Dashboards → Import.
2. Upload distributed/deploy/grafana/dashboards/zisk-overview.json.
3. Pick the Prometheus datasource scraping the coordinator.

The dashboard surfaces the headline series: active jobs over time, job outcome breakdown, connected worker count, p50/p95/p99 of coordinator_job_duration_seconds, and per-method API latency.

Starter alert rules

Five PromQL rules cover the failures that almost always warrant a page or an investigation. Each one is rare in normal operation and unambiguous when it fires.

Alert	Expression	For	Severity
Coordinator unreachable	`up{job="zisk-coordinator"} == 0`	`2m`	Page
No workers connected	`coordinator_workers_connected == 0`	`5m`	Page
Worker fleet degraded	`coordinator_workers_connected < <expected_replicas>`	`10m`	Investigate
Job failures spiking	`rate(coordinator_jobs_total{outcome="failed"}[5m]) > 0.1`	`10m`	Investigate
p99 latency over budget	`histogram_quantile(0.99, sum by (le) (rate(coordinator_job_duration_seconds_bucket[10m]))) > <budget_seconds>`	`15m`	Investigate

Translated to a Prometheus rule file:

zisk-alerts.yml
groups:
  - name: zisk-coordinator
    rules:
      - alert: ZiskCoordinatorUnreachable
        expr: up{job="zisk-coordinator"} == 0
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Coordinator scrape failed for 2m"

      - alert: ZiskNoWorkers
        expr: coordinator_workers_connected == 0
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "No workers connected to the coordinator"

      - alert: ZiskWorkerFleetDegraded
        expr: coordinator_workers_connected < 4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Worker count below expected baseline"

      - alert: ZiskJobFailuresSpiking
        expr: rate(coordinator_jobs_total{outcome="failed"}[5m]) > 0.1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Job failure rate above 0.1/s"

      - alert: ZiskJobP99Slow
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(coordinator_job_duration_seconds_bucket[10m]))) > 300
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "p99 job duration above 300s"

tip

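Replace < 4 and the 300-second p99 threshold with values that match your deployment's expected worker count and proving SLO. The thresholds are illustrative, not universal. Keep the latency budget below the top histogram bucket: with the default buckets, histogram_quantile never returns more than 300, so a comparison of the form > 300 cannot fire as written.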

warning

coordinator_jobs_total{outcome="failed"} includes jobs that failed because the client asked for something invalid (e.g. an unknown hash_id), not just infrastructure problems. If your client traffic is noisy, slice the alert by kind or correlate with coordinator_requests_total{status="..."} before paging.

Next steps

Metrics tell you the cluster is unhealthy. To find out why a specific job failed, continue to Logs and then to Troubleshooting for the runbook that maps common symptoms to fixes.
