Profiling in depth
ziskemu is ZisK's advanced emulator and profiler. This tutorial drives it directly to
find exactly what makes a guest expensive to prove: the costly functions, the
hot instructions, and the operations that dominate.
It is the deep companion to Profiling your program,
which uses the friendly cargo-zisk run --profiling wrapper on a merkle-tree
example. The concepts introduced there — the profiling-vs-proving-cost
trade-off, the cost-report categories, basic cost markers, and the Firefox
call-stack view — aren't repeated here; this guide adds the full ziskemu
toolset on top.
You only need a compiled guest ELF and an input. Build one with
cargo-zisk build --release (the ELF lands at
target/elf/riscv64ima-zisk-zkvm-elf/release/<your-guest>), then point -e at
it and -i at your input. Every command below has the same shape, shortened to
ziskemu -e <elf> -i <input> …:
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/<your-guest> -i input.bin -X
The reports shown below come from a larger program — an Ethereum block
validator — so every section has rich data to read. Your own guest prints the
same shapes, just smaller. ziskemu runs the guest and analyzes it; it never
generates a proof, which is what makes it fast to iterate with.
Profiling cost vs. final cost
You optimize against profiling cost: the per-operation cost that moves proportionally with your code, so every change shows up immediately. The final cost — what you actually pay to prove — is measured per instance (operations are batched into fixed-capacity state-machine instances), so it stays flat and then jumps when you cross a boundary. The companion tutorial summarizes this in Understanding profiling importance; two consequences are worth internalizing:
- Instance boundaries. 1 or 5,242 Keccak operations both cost one Keccak instance; the 5,243rd needs a second — doubling that cost for a single extra operation. Profiling cost, in contrast, rises smoothly with each operation.
- Specialized instances. ZisK has a cheap BinaryAdd instance for 64-bit additions versus the general Binary instance (ADD, SUB, AND, OR, XOR, …). So 1M ADDs profile cheaper than 1M ORs, and profiling cost shows that consistently, while final cost only reveals it once the program is large enough to split instances. The planner also picks strategies from your operation mix, and a function's final cost aggregates everything it calls.
Optimize with profiling cost; confirm the savings with final cost.
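The boundary effect is ordinary ceiling division. Here is a minimal sketch, assuming each instance holds a fixed number of operations and using the 5,242 figure from the Keccak example above as an illustrative capacity. The function and constant names are hypothetical, not ZisK APIs.

```rust
/// Number of fixed-capacity instances needed to hold `ops` operations.
/// Zero operations need no instance; any remainder rounds up to one more.
fn instances_needed(ops: u64, capacity: u64) -> u64 {
    assert!(capacity > 0, "capacity must be positive");
    (ops + capacity - 1) / capacity
}

fn main() {
    // Illustrative capacity taken from the Keccak example in the text.
    const KECCAK_CAPACITY: u64 = 5_242;

    for ops in [1, 5_242, 5_243] {
        println!(
            "{ops} ops -> {} instance(s)",
            instances_needed(ops, KECCAK_CAPACITY)
        );
    }
}
```

The instance count stays at 1 across the whole first range and becomes 2 at the 5,243rd operation, while a per-operation measure would have grown at every step.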
What profiling can tell you
Profiling reads the symbols already in your ELF, so it works on optimized release builds with no instrumentation, no special build flags, and no runtime overhead. For the questions it helps you answer — which crate is cheapest to prove, whether a change helped, whether precompiles are applied, and where to patch — and the optimize-then-re-profile workflow, see Understanding profiling importance.
The statistics report (-X)
-X (or --stats) is the first command to run — overall cost plus a
per-operation breakdown:
ziskemu -e <elf> -i <input> -X
REPORT
----------------------------------------
STEPS 92,875,129
COST DISTRIBUTION COST %
------------------------------------------------
BASE 293,601,280 2.57%
MAIN 6,315,508,772 55.22%
OPCODES 1,334,639,984 11.67%
PRECOMPILES 2,565,960,716 22.43%
MEMORY 927,932,629 8.11%
TOTAL 11,437,643,381 100.00%
FROPS 963,440,253 8.42%
RAM USAGE 18,465,008 3.47%
STEPS is the instruction count — how long the run is. COST DISTRIBUTION
is the profiling cost, split across five categories: BASE (fixed overhead like
tables and range checks), MAIN (the processor itself, proportional to STEPS),
OPCODES (simple 64-bit arithmetic and logic), PRECOMPILES (operations too big
for 64 bits — 256-bit math, elliptic curves, Keccak, DMA), and MEMORY (reads,
writes, and the extra cost of non-aligned or sub-8-byte access). Whichever
dominates tells you what kind of work is expensive; the companion tutorial
explains each on a real program in
Cost distribution summary.
Two extra lines round it out. FROPS covers frequent operations, such as adding 1
(loop counters), adding 8 (pointers), or values < 256, which ZisK pre-computes
into BASE; the figure is what you'd pay without that optimization.
RAM USAGE is reported only with the default bump allocator.
Below the summary comes the per-opcode breakdown (excerpt):
COST BY OPCODE COUNT % COST % RANK
-----------------------------------------------------------------------------
OP add 7,086,411 7.63% 177,160,275 1.55% #4
OP and 3,740,044 4.03% 224,402,640 1.96% #3
OP or 7,482,273 8.06% 448,936,380 3.93% #2
OP xor 1,027,290 1.11% 61,637,400 0.54%
OP mul 409,765 0.44% 38,927,675 0.34%
OP keccak 32,650 0.04% 2,466,707,500 21.57% #1
OP secp256k1_add 17,688 0.02% 25,187,712 0.22%
…
FROPS BY OPCODE COUNT HIT COST % RANK
----------------------------------------------------------------------------
FROP sll 8,729,869 84.91% 462,683,057 4.05% #1
FROP eq 3,273,419 85.77% 196,405,140 1.72% #2
FROP or 1,303,629 14.84% 78,217,740 0.68% #3
FROP ltu 942,288 34.78% 56,537,280 0.49% #4
…
Each row shows the COUNT, its share of steps, its total COST and share of cost,
and a #1–#4 RANK on the four most expensive. Rows keep a fixed order across
runs so you can diff two profiles line by line — find the costly ones by the
# markers, not by scanning. The habit that matters most: compare COUNT
against the cost %. Here keccak ran only 32,650 times (0.04% of steps) yet
is 21.57% of cost — rare but expensive, the textbook optimization target. The
FROPS table adds a HIT column: how often the pre-computed pattern matched.
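Dividing COST by COUNT makes the count-versus-cost comparison concrete. The sketch below copies rows from the COST BY OPCODE excerpt into a hypothetical OpRow struct and derives an average cost per execution. The resulting figures come from this one report and should not be read as documented constants.

```rust
/// One row of the COST BY OPCODE table (values copied from the excerpt).
struct OpRow {
    name: &'static str,
    count: u64,
    cost: u64,
}

impl OpRow {
    /// Average profiling cost of a single execution of this opcode.
    /// Assumes `count` is non-zero, which holds for rows that appear in a report.
    fn cost_per_op(&self) -> u64 {
        self.cost / self.count
    }
}

/// Rows in the same order as the excerpt.
fn rows() -> Vec<OpRow> {
    vec![
        OpRow { name: "add", count: 7_086_411, cost: 177_160_275 },
        OpRow { name: "and", count: 3_740_044, cost: 224_402_640 },
        OpRow { name: "or", count: 7_482_273, cost: 448_936_380 },
        OpRow { name: "xor", count: 1_027_290, cost: 61_637_400 },
        OpRow { name: "mul", count: 409_765, cost: 38_927_675 },
        OpRow { name: "keccak", count: 32_650, cost: 2_466_707_500 },
        OpRow { name: "secp256k1_add", count: 17_688, cost: 25_187_712 },
    ]
}

fn main() {
    let mut rows = rows();
    // Most expensive single execution first.
    rows.sort_by_key(|r| std::cmp::Reverse(r.cost_per_op()));
    for r in &rows {
        println!("{:<16}{:>10}", r.name, r.cost_per_op());
    }
}
```

In this report add works out to 25 per execution against 60 for or, which is consistent with the cheaper addition instance described earlier, and keccak to 75,550, which is why so few calls carry so much of the cost.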
Find the expensive functions (-S)
Add -S (or --read-symbols) to attribute cost to functions. ZisK simulates the
call stack and charges each function cumulatively — its own code plus
everything it calls (the _start / main frames are filtered out, since they
would always be 100%):
ziskemu -e <elf> -i <input> -X -S
You get two rankings. By cost — usually the one you act on:
TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
5,255,204,123 45.95% 1 5,255,204,123 <reth_evm::execute::BasicBlockExecutor<&reth_evm
4,997,989,104 43.70% 70 71,399,844 <revm_handler::mainnet_handler::MainnetHandler<r
4,530,507,470 39.61% 41,793 108,403 <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoiz
3,759,934,537 32.87% 10,505 357,918 ziskos::zisklib::lib::keccak256::keccak256
…
…and a parallel TOP STEP FUNCTIONS ranked by cycles instead of cost.
Comparing the two is revealing: keccak256 is 14.77% of steps but 32.87% of
cost. A function whose cost share dwarfs its cycle share is doing expensive
per-cycle work (a precompile) and is worth a close look.
Tune the view as you go: -T / --top-roi N shows more or fewer functions;
-M / --main-name names a non-standard entry point; and --roi-filter "<regex>" marks functions of interest (add --top-roi-filter to show only
matches — ideal for comparing implementations of one subsystem). Rust names get
long, so they're compacted to 160 characters by default (collapsing nested
generics and eliding path segments); adjust with --compact-names=N or
--no-compact-names.
Why a function is expensive (-D)
Once you know which function, -D (or --top-roi-detail) shows why: for each
top function, a cost-by-opcode breakdown scoped to that function, plus its top
callers.
ziskemu -e <elf> -i <input> -X -S -D
DETAIL FUNCTION ziskos::zisklib::lib::keccak256::keccak256
----------------------------------------------------------
STEPS 13,714,388 14.77%
COST 3,759,934,537 32.87%
| COST BY OPCODE COUNT COST % RANK
| OP or 2,489,249 149,354,940 3.97% #2
| OP xor 492,192 29,531,520 0.79% #3
| OP sll 360,008 19,080,424 0.51% #4
| OP keccak 32,650 2,466,707,500 65.61% #1
| TOP STEP CALLERS (calls, steps)
| 3,974 9,749,694 71.09% <zeth_mpt_state::SparseState as stateless::trie::State
| 2,332 2,778,890 20.26% <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoize::Cac
| 1,284 217,150 1.58% revm_interpreter::instructions::system::keccak256::<re
| …
The opcode table here is scoped to this one function: keccak is 65.61% of its
cost, so the hashing itself is the target, not the surrounding code. TOP STEP
CALLERS reverses the view — who calls it and how much each is responsible for.
SparseState accounts for 71% of the function's steps (across 3,974 of its calls), so that's the path to investigate first.
-C / --roi-callers N controls how many callers are listed (default 10).
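The percentage column in TOP STEP CALLERS appears to be a share of the function's steps, which can differ a lot from a caller's share of calls. A quick check with numbers from the excerpts (13,714,388 steps and 10,505 calls for keccak256, and the SparseState caller row); share_pct is a hypothetical helper:

```rust
/// Share of `part` in `whole`, as a percentage.
fn share_pct(part: u64, whole: u64) -> f64 {
    part as f64 / whole as f64 * 100.0
}

fn main() {
    // Totals for keccak256, from the DETAIL FUNCTION and TOP COST FUNCTIONS excerpts.
    let function_steps = 13_714_388;
    let function_calls = 10_505;
    // The SparseState row of TOP STEP CALLERS: (calls, steps).
    let (caller_calls, caller_steps) = (3_974, 9_749_694);

    println!("share of steps: {:.2}%", share_pct(caller_steps, function_steps));
    println!("share of calls: {:.2}%", share_pct(caller_calls, function_calls));
}
```

The steps share reproduces the 71.09% printed in the report, while the same caller makes only about 38% of the calls, so its calls are heavier than average.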
Instruction-level hotspots (-H)
To go below functions — to the exact instructions that run most often — use the
PC histogram, -H (or --histogram). It groups consecutive hot instructions and
attributes each group to its function (with -S):
ziskemu -e <elf> -i <input> -X -S -H 50
TOP PC HISTOGRAM (EXECUTIONS, % EXECUTIONS, PC)
-----------------------------------------------
796,670 0.86% 0x801230b8: lbu r16, 0x0(r14)
796,670 0.86% 0x801230bc: beq r16, r12, 0xffffffd4
1,593,340 1.72% ----------- <revm_bytecode::legacy::raw::LegacyRawBytecode>::into_analyzed
429,174 0.46% 0x800a38ec: ld r10, 0x60(r21)
…
429,174 0.46% 0x800a392c: bne r10, r0, 0xffffffc0
7,295,958 7.86% ----------- <revm_handler::mainnet_handler::MainnetHandler<revm_context::evm::Ev
Each group is a run of instructions (executions, % of steps, address, the RISC-V instruction) followed by a dashed summary line with the group's total and its function. The first group above is a tight 2-instruction byte-scanning loop: 1.72% of the whole program in two instructions. The second is a 17-instruction dispatcher at 7.86%, a prime target. Reach for the histogram after function-level profiling: it shows which instructions inside a hot function form the hot loop, and confirms whether a compiler optimization actually landed.
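The grouping can be pictured as merging entries at adjacent addresses and summing their executions. This is only a sketch of the idea, not ziskemu's actual algorithm: it assumes fixed 4-byte instructions (the riscv64ima target has no compressed extension) and entries already listed in address order. Hit, Group, and group_consecutive are hypothetical names.

```rust
/// One histogram entry: a program counter and how often it executed.
#[derive(Debug, Clone, Copy)]
struct Hit {
    pc: u64,
    executions: u64,
}

/// A run of instructions at consecutive addresses.
#[derive(Debug, PartialEq)]
struct Group {
    start_pc: u64,
    len: usize,
    executions: u64,
}

/// Merge entries whose addresses are exactly one instruction apart.
fn group_consecutive(hits: &[Hit]) -> Vec<Group> {
    const INSN_BYTES: u64 = 4; // assumption: no compressed instructions
    let mut groups: Vec<Group> = Vec::new();
    let mut prev_pc: Option<u64> = None;

    for h in hits {
        let contiguous = prev_pc.map_or(false, |p| h.pc == p + INSN_BYTES);
        if contiguous {
            // A contiguous entry always follows an existing group.
            if let Some(g) = groups.last_mut() {
                g.len += 1;
                g.executions += h.executions;
            }
        } else {
            groups.push(Group { start_pc: h.pc, len: 1, executions: h.executions });
        }
        prev_pc = Some(h.pc);
    }
    groups
}

fn main() {
    // The two-instruction loop from the excerpt, plus the first line of the next group.
    let hits = [
        Hit { pc: 0x8012_30b8, executions: 796_670 },
        Hit { pc: 0x8012_30bc, executions: 796_670 },
        Hit { pc: 0x800a_38ec, executions: 429_174 },
    ];
    for g in group_consecutive(&hits) {
        println!("{:#x}: {} instruction(s), {} executions", g.start_pc, g.len, g.executions);
    }
}
```

The first group sums to 1,593,340 executions, matching the dashed summary line for the byte-scanning loop.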
Measure your own code: profile tags
To measure a region you choose — an algorithm, a loop, a critical section —
wrap it in profile-tag macros from ziskos. The companion tutorial adds a pair
to a guest step by step in
Using inline cost markers;
here is the full set. The macros have zero overhead outside the profiler and
measure either cost or steps, reporting either immediately (printed at
each end!) or as an accumulated end-of-run summary — eight macros in a 2×2:
| Report style | Cost | Steps |
|---|---|---|
| Immediate | profile_start! / profile_end! | profile_steps_start! / profile_steps_end! |
| Accumulated | profile_report_start! / profile_report_end! | profile_report_steps_start! / profile_report_steps_end! |
Use immediate for sections that run a few times, accumulated for sections that run
many. Each macro takes a bare label and must be closed by its matching end! with
the same label; you can nest and mix them freely:
profile_start!(total); // cost of the whole run
for i in 0..100 {
    profile_report_steps_start!(iteration); // accumulated steps per iteration
    expensive_computation(i);
    profile_report_steps_end!(iteration);
}
profile_report_start!(hash_phase); // accumulated cost
for item in items {
    compute_hash(item);
}
profile_report_end!(hash_phase);
profile_end!(total);
See the accumulated (report) tags by adding --profile-tags:
ziskemu -e <elf> -i <input> -X --profile-tags
PROFILE TAGS COST (COST, % COST, CALLS, AVG, MIN, MAX)
-------------------------------------------------------
1,234,567,890 10.79% 100 12,345,678 10,000,000 15,000,000 total
456,789,012 3.99% 50 9,135,780 5,000,000 12,000,000 hash_phase
Each tag reports its total, share of steps or cost, call count, and the average, minimum, and maximum per call. Name tags descriptively and keep one tag per logical section — profile tags answer what is expensive, while the function rankings answer where.
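The CALLS, AVG, MIN, and MAX columns suggest that an accumulated tag keeps running statistics per label. The following is a plain-Rust model of that bookkeeping, not the ziskos implementation; TagStats and record are hypothetical names, and the integer average is an assumption that happens to match the sample figures (1,234,567,890 over 100 calls gives 12,345,678).

```rust
/// Running statistics for one accumulated tag.
#[derive(Debug, Default)]
struct TagStats {
    calls: u64,
    total: u64,
    min: u64,
    max: u64,
}

impl TagStats {
    /// Record the cost (or steps) measured by one start/end pair.
    fn record(&mut self, value: u64) {
        // The first sample initializes the minimum; later ones can only lower it.
        self.min = if self.calls == 0 { value } else { self.min.min(value) };
        self.max = self.max.max(value);
        self.calls += 1;
        self.total += value;
    }

    /// Integer average per call; zero when nothing was recorded.
    fn avg(&self) -> u64 {
        if self.calls == 0 { 0 } else { self.total / self.calls }
    }
}

fn main() {
    let mut iteration = TagStats::default();
    // Made-up per-iteration measurements, for illustration only.
    for value in [10, 30, 20] {
        iteration.record(value);
    }
    println!(
        "total={} calls={} avg={} min={} max={}",
        iteration.total,
        iteration.calls,
        iteration.avg(),
        iteration.min,
        iteration.max
    );
}
```

A wide gap between MIN and MAX in a real report is a hint that the tagged section does very different amounts of work depending on its input.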
Explore visually: the Firefox Profiler
For an interactive call graph — flame graphs, a timeline, search — export with
--profiler-output and open it at profiler.firefox.com
(Load a profile from file). -S is required and -X recommended; in fact
-X -S on its own already writes profile.json.gz automatically. The companion
tutorial produces the same file via cargo-zisk run --profiling complete and
tours the views in
Exploring the call stack.
The output follows the
Firefox Profiler format spec,
so other compatible tools can read it too.
ziskemu -e <elf> -i <input> -X -S --profiler-output=profile.json.gz
More tools and a workflow
A few smaller tools round out the toolset:
- Inspect call arguments. Pair --roi-filter with --track-call-args N (1–8 arguments, the RISC-V a0–a7 registers) to log each call's arguments to a <function>.txt file (one line per call). --track-separator sets the delimiter (default ;) and --track-output-path the directory. It logs the raw register values (the actual value for scalars, but only the address for pointers), so it is best for scalar arguments or address patterns.
  ziskemu -e <elf> -i <input> -S \
    --roi-filter "hash_function" --track-call-args 4 --track-output-path ./traces
- Quick checks. --steps prints only the step count; --with-progress reports progress every 16M steps on long runs; --no-thousands-sep gives machine-readable numbers.
- Everything at once. The flags compose: full statistics, symbols, detailed callers, 30 functions, 15 callers each, 50 instruction groups, a filter, argument tracking, and performance metrics (-m):
  ziskemu -e <elf> -i input.bin -X -S -D -T 30 -C 15 -H 50 \
    --roi-filter "sha256|hash" --track-call-args 6 --track-output-path ./profiling_data -m
A workflow that scales: start with -X, add -S to see the functions, -D
to see why they're expensive, and -H to see the hot instructions — then apply a
precompile or patch (see Optimizing your program)
and re-run to confirm the cost dropped. Chase the big percentages, profile
realistic inputs, and filter with --roi-filter in large codebases.
Worked example: which EVM opcodes cost the most?
Putting it together on a real program: to rank the cost of individual Ethereum opcodes in a stateless block validator, filter to the EVM interpreter's instruction functions — they all live under one namespace — and show only those:
ziskemu -S -X \
-e .../stateless-validator-reth/.../release/zec-reth \
-i .../benchmark_inputs/24654304_30c8b8.bin \
--roi-filter "revm_interpreter::instructions::" --top-roi-filter -T 200
TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
9,433,353,231 10.32% 5,824 1,619,737 revm_interpreter::instructions::contract::call_helpers::load_acc_
8,344,978,788 9.13% 1,695 4,923,291 revm_interpreter::instructions::contract::call::<revm_interpreter
4,599,658,812 5.03% 342,951 13,412 revm_interpreter::instructions::stack::swap::<1, revm_interpreter
2,772,734,752 3.03% 128,956 21,501 revm_interpreter::instructions::memory::mload::<revm_interpreter:
691,435,073 0.76% 5,682 121,688 revm_interpreter::instructions::system::keccak256::<revm_interpre
669,514,638 0.73% 245,798 2,723 revm_interpreter::instructions::arithmetic::add::<revm_interprete
…
From this single view you see the most expensive opcodes (the call family at
the top), the most frequently called ones (swap/push, with huge call counts
but tiny per-call cost), and the best optimization targets. No modification to the
ELF is needed — profiling reads the existing symbols; the only thing to know is
the naming convention of the functions you want to filter.
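To rank a filtered list by something other than total cost, the report rows are easy to post-process. The parser below is a rough sketch that assumes the row layout shown in the excerpt (cost, percentage, calls, cost per call, then the function name, which may contain spaces). FnRow and parse_row are hypothetical names, not part of any ZisK tool.

```rust
/// The fields of a TOP COST FUNCTIONS row that this sketch keeps.
#[derive(Debug, PartialEq)]
struct FnRow {
    cost: u64,
    calls: u64,
    name: String,
}

impl FnRow {
    /// Cost per call, recomputed from the raw columns; None if calls is zero.
    fn cost_per_call(&self) -> Option<u64> {
        self.cost.checked_div(self.calls)
    }
}

/// Parse a number printed with thousands separators, e.g. "342,951".
fn parse_num(s: &str) -> Option<u64> {
    s.replace(',', "").parse().ok()
}

/// Parse one report row; headers, rulers, and elisions yield None.
fn parse_row(line: &str) -> Option<FnRow> {
    let mut fields = line.split_whitespace();
    let cost = parse_num(fields.next()?)?;
    let _pct = fields.next()?; // "% COST" column, not needed here
    let calls = parse_num(fields.next()?)?;
    let _cost_per_call = fields.next()?; // recomputed by cost_per_call()
    let name = fields.collect::<Vec<_>>().join(" ");
    if name.is_empty() {
        return None;
    }
    Some(FnRow { cost, calls, name })
}

fn main() {
    // Rows copied from the excerpt, names truncated as printed.
    let report = "\
9,433,353,231 10.32% 5,824 1,619,737 revm_interpreter::instructions::contract::call_helpers::load_acc_
4,599,658,812 5.03% 342,951 13,412 revm_interpreter::instructions::stack::swap::<1, revm_interpreter
669,514,638 0.73% 245,798 2,723 revm_interpreter::instructions::arithmetic::add::<revm_interprete";

    let mut rows: Vec<FnRow> = report.lines().filter_map(parse_row).collect();
    // Most frequently called first, instead of most expensive first.
    rows.sort_by(|a, b| b.calls.cmp(&a.calls));
    for r in &rows {
        println!("{:>9} calls, {:?} per call, {}", r.calls, r.cost_per_call(), r.name);
    }
}
```

Running ziskemu with --no-thousands-sep would make the comma stripping in parse_num unnecessary.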
Where to go next
- Profiling your program: the hands-on walkthrough with cargo-zisk run --profiling, if you skipped it.
- Optimizing your program: act on what you found by routing expensive work through ZisK precompiles, then re-profile.
- ziskemu CLI reference: the complete flag list.