Logs
Where metrics tell you the cluster is unhappy, logs tell you why a
specific job failed. The coordinator and worker both emit
tracing logs to stdout/stderr, and your deployment routes them
into whatever log surface the host uses. This page walks through
log levels and formats, the three ways to configure them, where
the logs land on each supported deployment, and the filters worth
keeping in muscle memory.
Log levels
Both binaries use the standard five-level tracing hierarchy.
Each level includes everything more severe than itself, so picking
debug also emits info, warn, and error.
| Level | When to use |
|---|---|
| error | Production-quiet operation. Only failures. |
| warn | Production default for noisy environments. |
| info | Recommended production default. One line per significant lifecycle event. |
| debug | Active investigation. Phase transitions, worker assignments, heartbeats. |
| trace | Deep debugging. Very verbose; can noticeably slow hot proving paths. |
The trace and debug levels on a worker generate a lot of output
per proving segment. On a busy cluster this can slow proving and
overwhelm the log aggregator. Raise the level only temporarily, then
drop back to info as soon as you have what you need.
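The inclusion rule can be sketched as a threshold over an ordered list. This is only an illustration of the ordering; the helper name enabled_levels is hypothetical and is not part of either binary.

```shell
# Print every level that is emitted when the configured level is "$1".
# enabled_levels is a hypothetical helper, used here only for illustration.
enabled_levels() {
  out=""
  # Ordered from most severe (quietest) to most verbose.
  for lvl in error warn info debug trace; do
    out="${out:+$out }$lvl"
    if [ "$lvl" = "$1" ]; then
      echo "$out"
      return 0
    fi
  done
  echo "unknown level: $1" >&2
  return 1
}

enabled_levels debug   # prints: error warn info debug
```

Each step down the list adds one more level to the output, which is why trace is the most expensive setting.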
Log formats
The same logs can be rendered three ways:
| Format | Best for | Notes |
|---|---|---|
| pretty | Terminals during development | Default. Colored, multi-line, human-readable. |
| compact | Local non-TTY pipes | Single line per event, no color. |
| json | Production aggregation | Every field stays structured; aggregators index without regex. |
JSON is the right choice in production because it lets your aggregator query by job ID, level, or any other emitted field without a fragile regex pass.
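To make the "query by field" point concrete, here is a minimal sketch that assumes jq is installed and that events use the nested fields layout implied by the .fields.phase filter later on this page. The sample line and its values are invented, and summarize_json_logs is a hypothetical helper.

```shell
# A hypothetical log line; the values are invented for illustration.
sample='{"timestamp":"2025-01-01T00:00:00Z","level":"INFO","fields":{"message":"phase started","job_id":"job-123","phase":"Prove"},"target":"zisk_coordinator"}'

# Print level, job ID, and phase as tab-separated columns, read from stdin.
# Lines that are not JSON objects are skipped instead of aborting the pipeline.
summarize_json_logs() {
  jq -R -r 'fromjson? | objects | [.level, .fields.job_id, .fields.phase] | @tsv'
}
```

Running `printf '%s\n' "$sample" | summarize_json_logs` prints the three columns for the sample line. No regular expression is involved, so the query keeps working if the message text changes.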
Configuring logging
Logging is controlled in three places, in precedence order from lowest to highest:
- The [logging] section in coordinator.toml or worker.toml. This is the persistent default.
- The RUST_LOG environment variable. Standard tracing-subscriber env filter syntax; overrides the TOML level per crate.
- The --log-level CLI flag on the coordinator. Overrides both of the above.
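The precedence order can be sketched as a simple fallback chain. This is a simplification: it treats each source as all-or-nothing and ignores the per-crate merging that RUST_LOG performs. effective_filter is a hypothetical helper; the real resolution happens inside the binaries.

```shell
# Pick the effective filter from the three sources.
# Arguments: <TOML level> <RUST_LOG value, may be empty> <--log-level value, may be empty>
effective_filter() {
  toml_level="$1"
  rust_log="$2"
  cli_level="$3"
  if [ -n "$cli_level" ]; then
    echo "$cli_level"          # CLI flag overrides everything
  elif [ -n "$rust_log" ]; then
    echo "$rust_log"           # env filter overrides the TOML level
  else
    echo "$toml_level"         # persistent default
  fi
}

effective_filter info "" ""                               # prints: info
effective_filter info "info,zisk_coordinator=debug" ""    # prints: info,zisk_coordinator=debug
effective_filter info "info,zisk_coordinator=debug" warn  # prints: warn
```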
Persistent config
Put this block in both coordinator.toml and worker.toml
before promoting them to production:
[logging]
level = "info"
format = "json"
file_path = "" # empty = stdout/stderr
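Before promoting a config, it can help to check the block mechanically. The sketch below is a deliberately small awk reader, not a full TOML parser: it assumes one key = value pair per line and no "#" inside values. Both function names are hypothetical.

```shell
# Print the value of one key from the [logging] table of a TOML file.
# Usage: logging_value <file> <key>
logging_value() {
  awk -v key="$2" '
    /^[ \t]*\[/ { in_logging = ($0 ~ /^[ \t]*\[logging\][ \t]*$/); next }
    in_logging {
      line = $0
      sub(/#.*/, "", line)            # drop trailing comments
      n = index(line, "=")
      if (n == 0) next
      k = substr(line, 1, n - 1)
      v = substr(line, n + 1)
      gsub(/[ \t]/, "", k)
      sub(/^[ \t]*"?/, "", v)         # strip leading space and quote
      sub(/"?[ \t]*$/, "", v)         # strip trailing quote and space
      if (k == key) { print v; exit }
    }
  ' "$1"
}

# Succeed only if the file carries the recommended production values.
check_logging_config() {
  [ "$(logging_value "$1" level)" = "info" ] &&
    [ "$(logging_value "$1" format)" = "json" ]
}
```

For example, `check_logging_config /etc/zisk/coordinator.toml || echo "logging block needs attention"` could run as a pre-deployment step.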
Temporary override with RUST_LOG
RUST_LOG follows the standard tracing-subscriber env-filter
syntax. The most useful pattern is "everything at info, this
one crate at debug":
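# systemd units do not inherit the caller's environment, so set it on the manager
sudo systemctl set-environment RUST_LOG="info,zisk_coordinator=debug"
sudo systemctl restart zisk-coordinator
sudo systemctl unset-environment RUST_LOG   # the running service keeps the value until its next restart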
For a one-shot foreground run during an investigation:
RUST_LOG="info,zisk_worker=debug" \
./target/release/zisk-worker \
--config /etc/zisk/worker.toml
CLI flag (coordinator only)
zisk-coordinator --config /etc/zisk/coordinator.toml --log-level debug
Always restore info after an incident is resolved. Forgetting
to drop the level is the single most common cause of unexpected
log-storage bills and degraded worker throughput.
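One way to make the restore step hard to forget during foreground investigations is to time-box the verbose run. The sketch below assumes the GNU coreutils timeout command is available; with_debug_for is a hypothetical helper, not something the binaries provide.

```shell
# Run a command with a temporary RUST_LOG filter and stop it after a time
# limit, so a forgotten debug session cannot keep flooding the aggregator.
# Usage: with_debug_for <duration> <filter> <command> [args...]
with_debug_for() {
  duration="$1"
  filter="$2"
  shift 2
  # The assignment applies only to this one command and its children.
  RUST_LOG="$filter" timeout "$duration" "$@"
}
```

A 15-minute worker session would look like `with_debug_for 15m "info,zisk_worker=debug" ./target/release/zisk-worker --config /etc/zisk/worker.toml`. timeout exits with status 124 when the limit is reached.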
Where the logs land
The binaries write to stdout/stderr; each deployment path inherits its host's log routing. Use the subsection that matches your deployment to pick the right tail command.
Linux with systemd
The bare-metal install script registers both binaries as systemd units, so logs flow into journald:
sudo journalctl -u zisk-coordinator -f
sudo journalctl -u zisk-worker -f
macOS with launchd
The macOS install path registers launchd plists and pipes logs
to flat files under /var/log/, rotated by newsyslog at
100 MB with 5 rotations kept:
tail -f /var/log/zisk-coordinator/zisk-coordinator.log
tail -f /var/log/zisk-worker/zisk-worker.log
Docker Compose
Logs go to the Docker daemon. Use the Compose subcommand to follow them with service names rather than container IDs:
docker compose logs -f coordinator
docker compose logs -f worker
To follow both at once interleaved:
docker compose logs -f coordinator worker
Kubernetes
The Helm chart ships only the worker; the coordinator is deployed separately. Tail across all worker pods at once with a label selector:
kubectl logs -n zisk -l app.kubernetes.io/name=zisk-worker -f
For a single worker pod or to inspect the previous container after a crash loop:
kubectl logs -n zisk <pod-name> -f
kubectl logs -n zisk <pod-name> --previous
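The four deployment paths can be folded into one lookup. The sketch below only prints the commands already listed in this section; tail_cmd is a hypothetical helper name.

```shell
# Print the follow command for a deployment and service.
# Usage: tail_cmd <systemd|launchd|compose|k8s> <coordinator|worker>
tail_cmd() {
  case "$1" in
    systemd) echo "sudo journalctl -u zisk-$2 -f" ;;
    launchd) echo "tail -f /var/log/zisk-$2/zisk-$2.log" ;;
    compose) echo "docker compose logs -f $2" ;;
    k8s)
      if [ "$2" = "worker" ]; then
        echo "kubectl logs -n zisk -l app.kubernetes.io/name=zisk-worker -f"
      else
        # The Helm chart ships only the worker.
        echo "coordinator is deployed separately from the Helm chart" >&2
        return 1
      fi ;;
    *) echo "unknown deployment: $1" >&2; return 1 ;;
  esac
}

tail_cmd systemd coordinator   # prints: sudo journalctl -u zisk-coordinator -f
```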
Useful filters
journald filters by unit and priority natively; Docker Compose narrows by service and time window and relies on grep or jq for everything else. The recipes below are the ones to remember.
Filter by level
# Errors only on the coordinator
sudo journalctl -u zisk-coordinator -p err -f
# Warnings and errors on a worker
sudo journalctl -u zisk-worker -p warning -f
-p takes standard syslog priority names (err, warning, and so on)
and works for any systemd-managed service on the host.
On Docker Compose, filter by piping through grep (or, with
JSON format, jq):
docker compose logs --since 10m coordinator | grep -E 'ERROR|WARN'
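For the JSON variant, a minimal sketch follows. It assumes jq is installed and that the level appears as an upper-case string in a top-level "level" field, as tracing-subscriber's JSON formatter emits it. warn_and_above is a hypothetical helper name.

```shell
# Keep only WARN and ERROR events from a stream of JSON log lines on stdin.
# Lines that are not JSON objects are skipped rather than treated as errors.
warn_and_above() {
  jq -R -c 'fromjson? | objects | select(.level == "ERROR" or .level == "WARN")'
}
```

On Docker Compose this would be used as `docker compose logs --no-log-prefix --since 10m coordinator | warn_and_above`. The --no-log-prefix flag matters because the service-name prefix Compose adds to every line would otherwise stop the line from parsing as JSON.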
Filter by phase
The coordinator emits phase= fields on every relevant log line
during a job. With JSON logs, the cleanest filter uses jq:
sudo journalctl -u zisk-coordinator -o cat \
| jq 'select(.fields.phase == "Prove")'
Without JSON, fall back to plain grep:
sudo journalctl -u zisk-coordinator | grep 'phase=Prove'
The three phase names emitted by the coordinator are
Contributions, Prove, and Aggregate, matching the job state
machine in the cluster API.
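A quick way to see where a job spent its log volume is to count lines per phase. The sketch below works on plain (non-JSON) output and relies only on the phase=<Name> token; count_phases is a hypothetical helper name.

```shell
# Count log lines per phase from coordinator output on stdin.
# Prints one "phase=<Name> <count>" line per phase, sorted by name.
count_phases() {
  grep -o 'phase=[A-Za-z]*' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example, `sudo journalctl -u zisk-coordinator --since "1 hour ago" | count_phases` summarizes the last hour.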
Filter by job ID
Production investigations usually start from a job ID, not a log line. The recipe is the same on every deployment: grep the coordinator first, then follow the worker IDs it logged to the corresponding worker hosts.
sudo journalctl -u zisk-coordinator | grep <job-id>
The output names which workers received which segments. Switch to the relevant worker's host and grep again for the job ID or for the segment named on that line:
sudo journalctl -u zisk-worker | grep <job-id>
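The hop from coordinator to workers can be partly automated. The sketch below assumes, purely for illustration, that worker IDs appear in coordinator lines as worker_id=<id> tokens; adjust the pattern to whatever your coordinator actually logs. workers_for_job is a hypothetical helper name.

```shell
# List the distinct workers mentioned on coordinator lines for one job.
# Reads log lines on stdin; the job ID is matched as a fixed string.
workers_for_job() {
  grep -F -- "$1" | grep -o 'worker_id=[^ ]*' | sort -u
}
```

Usage would be `sudo journalctl -u zisk-coordinator | workers_for_job <job-id>`, which gives the list of hosts to visit next.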
With JSON logs flowing into an aggregator, the same recipe
collapses to a single query like job_id="<job-id>", returning
the end-to-end trail across every host in one view. This is the
practical reason to insist on JSON in production; without it,
correlation is host-by-host.
Job IDs are minted by the coordinator and returned in the
JobRequest response. Log the returned ID on the client side
alongside whatever business context (request ID, user ID,
batch ID) is meaningful in your environment, because the cluster does
not know that context.
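A minimal sketch of such a client-side record follows. The field names other than job_id are hypothetical, and the values are assumed to contain no characters that need JSON escaping.

```shell
# Emit one JSON line tying a cluster job ID to caller-side context.
# Usage: record_job <job-id> <request-id> <user-id>
record_job() {
  printf '{"job_id":"%s","request_id":"%s","user_id":"%s"}\n' "$1" "$2" "$3"
}

record_job job-123 req-42 alice   # prints: {"job_id":"job-123","request_id":"req-42","user_id":"alice"}
```

Appending these lines to the client's own log means a later search for the job ID in the aggregator returns the business context next to the cluster trail.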
Next steps
Continue to Troubleshooting for the runbook that maps the most common symptoms surfaced by metrics and logs to their concrete fixes.