
Profiling in depth

ziskemu is ZisK's advanced emulator and profiler. This tutorial drives it directly to find exactly what makes a guest expensive to prove: the costly functions, the hot instructions, and the operations that dominate the cost.

It is the deep companion to Profiling your program, which uses the friendly cargo-zisk run --profiling wrapper on a merkle-tree example. The concepts introduced there — the profiling-vs-proving-cost trade-off, the cost-report categories, basic cost markers, and the Firefox call-stack view — aren't repeated here; this guide adds the full ziskemu toolset on top.

You only need a compiled guest ELF and an input. Build one with cargo-zisk build --release (the ELF lands at target/elf/riscv64ima-zisk-zkvm-elf/release/<your-guest>), then point -e at it and -i at your input. Every command below has the same shape, shortened to ziskemu -e <elf> -i <input> …:

bash
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/<your-guest> -i input.bin -X

The reports shown below come from a larger program — an Ethereum block validator — so every section has rich data to read. Your own guest prints the same shapes, just smaller. ziskemu runs the guest and analyzes it; it never generates a proof, which is what makes it fast to iterate with.

Profiling cost vs. final cost

You optimize against profiling cost: the per-operation cost that moves proportionally with your code, so every change shows up immediately. The final cost — what you actually pay to prove — is measured per instance (operations are batched into fixed-capacity state-machine instances), so it stays flat and then jumps when you cross a boundary. The companion tutorial summarizes this in Understanding profiling importance; two consequences are worth internalizing:

  • Instance boundaries. Anything from 1 to 5,242 Keccak operations costs one Keccak instance; the 5,243rd needs a second, doubling that cost for a single extra operation. Profiling cost, in contrast, rises smoothly with each operation.
  • Specialized instances. ZisK has a cheap BinaryAdd instance for 64-bit additions versus the general Binary instance (ADD, SUB, AND, OR, XOR, …). So 1M ADDs profile cheaper than 1M ORs — and profiling cost shows that consistently, while final cost only reveals it once the program is large enough to split instances. The planner also picks strategies from your operation mix, and a function's final cost aggregates everything it calls.
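The instance-boundary behaviour described in the list above is a ceiling division. Here is a minimal Rust sketch of that arithmetic, using the 5,242-operation Keccak capacity quoted in the text. The name `instances_needed` is made up for illustration and is not part of ZisK, and the real planner may weigh more than a single capacity figure.

```rust
/// Number of fixed-capacity instances needed to hold `ops` operations.
/// Hypothetical helper for illustration; not a ZisK API.
fn instances_needed(ops: u64, capacity: u64) -> u64 {
    assert!(capacity > 0, "capacity must be positive");
    // Ceiling division: zero operations need zero instances.
    (ops + capacity - 1) / capacity
}

fn main() {
    // Capacity taken from the Keccak example in the text.
    const KECCAK_CAPACITY: u64 = 5_242;
    for ops in [1, 5_242, 5_243, 10_484, 10_485] {
        println!(
            "{ops:>6} keccak ops -> {} instance(s)",
            instances_needed(ops, KECCAK_CAPACITY)
        );
    }
}
```

The instance count is flat across a whole range of operation counts and then steps up by one, which is why a small code change can leave final cost untouched while profiling cost moves.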

Optimize with profiling cost; confirm the savings with final cost.

What profiling can tell you

Profiling reads the symbols already in your ELF, so it works on optimized release builds with no instrumentation, no special build flags, and no runtime overhead. For the questions it helps you answer — which crate is cheapest to prove, whether a change helped, whether precompiles are applied, and where to patch — and the optimize-then-re-profile workflow, see Understanding profiling importance.

The statistics report (-X)

-X (or --stats) is the first command to run — overall cost plus a per-operation breakdown:

bash
ziskemu -e <elf> -i <input> -X
REPORT
----------------------------------------
STEPS 92,875,129

COST DISTRIBUTION COST %
------------------------------------------------
BASE 293,601,280 2.57%
MAIN 6,315,508,772 55.22%
OPCODES 1,334,639,984 11.67%
PRECOMPILES 2,565,960,716 22.43%
MEMORY 927,932,629 8.11%

TOTAL 11,437,643,381 100.00%

FROPS 963,440,253 8.42%
RAM USAGE 18,465,008 3.47%

STEPS is the instruction count: how long the run is. COST DISTRIBUTION is the profiling cost, split across five categories:

  • BASE: fixed overhead like tables and range checks.
  • MAIN: the processor itself, proportional to STEPS.
  • OPCODES: simple 64-bit arithmetic and logic.
  • PRECOMPILES: operations too big for 64 bits (256-bit math, elliptic curves, Keccak, DMA).
  • MEMORY: reads, writes, and the extra cost of non-aligned or sub-8-byte access.

Whichever dominates tells you what kind of work is expensive; the companion tutorial explains each on a real program in Cost distribution summary. Two extra lines round it out. FROPS covers frequent operations like adding 1 (loop counters), adding 8 (pointers), or values < 256, which ZisK pre-computes into BASE; the figure is what you'd pay without that optimization. RAM USAGE is reported only with the default bump allocator.
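The percentages in the COST DISTRIBUTION table can be re-derived from the raw figures, which is a useful sanity check when comparing two runs by hand. The sketch below hard-codes the numbers from the report above; the helper names are illustrative only. It also prints MAIN divided by STEPS, which divides evenly for this particular report. That is an observation about these numbers, not a documented constant.

```rust
// Figures copied from the -X report above.
const STEPS: u64 = 92_875_129;
const CATEGORIES: [(&str, u64); 5] = [
    ("BASE", 293_601_280),
    ("MAIN", 6_315_508_772),
    ("OPCODES", 1_334_639_984),
    ("PRECOMPILES", 2_565_960_716),
    ("MEMORY", 927_932_629),
];

/// Sum of all category costs (should match the TOTAL line).
fn total_cost() -> u64 {
    CATEGORIES.iter().map(|(_, cost)| *cost).sum()
}

/// `part` as a percentage of `total`.
fn share_pct(part: u64, total: u64) -> f64 {
    part as f64 * 100.0 / total as f64
}

fn main() {
    let total = total_cost();
    for (name, cost) in CATEGORIES {
        println!("{name:<12}{cost:>15}{:>8.2}%", share_pct(cost, total));
    }
    println!("{:<12}{total:>15}", "TOTAL");
    // MAIN is proportional to STEPS; show the per-step figure for this run.
    println!("MAIN per step: {}", CATEGORIES[1].1 / STEPS);
}
```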

Below the summary comes the per-opcode breakdown (excerpt):

COST BY OPCODE COUNT % COST % RANK
-----------------------------------------------------------------------------
OP add 7,086,411 7.63% 177,160,275 1.55% #4
OP and 3,740,044 4.03% 224,402,640 1.96% #3
OP or 7,482,273 8.06% 448,936,380 3.93% #2
OP xor 1,027,290 1.11% 61,637,400 0.54%
OP mul 409,765 0.44% 38,927,675 0.34%
OP keccak 32,650 0.04% 2,466,707,500 21.57% #1
OP secp256k1_add 17,688 0.02% 25,187,712 0.22%


FROPS BY OPCODE COUNT HIT COST % RANK
----------------------------------------------------------------------------
FROP sll 8,729,869 84.91% 462,683,057 4.05% #1
FROP eq 3,273,419 85.77% 196,405,140 1.72% #2
FROP or 1,303,629 14.84% 78,217,740 0.68% #3
FROP ltu 942,288 34.78% 56,537,280 0.49% #4

Each row shows the COUNT, its share of steps, its total COST and share of cost, and a RANK of #1 to #4 on the four most expensive. Rows keep a fixed order across runs so you can diff two profiles line by line; find the costly ones by the # markers, not by scanning. The habit that matters most: compare the COUNT share against the cost share. Here keccak ran only 32,650 times (0.04% of steps) yet accounts for 21.57% of cost: rare but expensive, the textbook optimization target. The FROPS table adds a HIT column: how often the pre-computed pattern matched.
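Dividing COST by COUNT makes the "rare but expensive" pattern explicit. The sketch below does that for four rows copied from the COST BY OPCODE excerpt. The per-operation figures it prints are derived from this one report and should be read as observations about this run, not as a published ZisK price list.

```rust
/// Rows copied from the COST BY OPCODE excerpt: (opcode, count, total cost).
const ROWS: [(&str, u64, u64); 4] = [
    ("add", 7_086_411, 177_160_275),
    ("or", 7_482_273, 448_936_380),
    ("mul", 409_765, 38_927_675),
    ("keccak", 32_650, 2_466_707_500),
];

/// Average profiling cost of one operation.
fn cost_per_op(count: u64, cost: u64) -> f64 {
    cost as f64 / count as f64
}

fn main() {
    for (op, count, cost) in ROWS {
        println!(
            "{op:<8}{count:>12} ops {:>12.1} cost/op",
            cost_per_op(count, cost)
        );
    }
}
```

In this report an add works out to 25 per operation and an or to 60, which matches the earlier point that the specialized BinaryAdd path profiles cheaper than the general Binary one.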

Find the expensive functions (-S)

Add -S (or --read-symbols) to attribute cost to functions. ZisK simulates the call stack and charges each function cumulatively — its own code plus everything it calls (the _start / main frames are filtered out, since they would always be 100%):

bash
ziskemu -e <elf> -i <input> -X -S

You get two rankings. By cost — usually the one you act on:

TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
5,255,204,123 45.95% 1 5,255,204,123 <reth_evm::execute::BasicBlockExecutor<&reth_evm
4,997,989,104 43.70% 70 71,399,844 <revm_handler::mainnet_handler::MainnetHandler<r
4,530,507,470 39.61% 41,793 108,403 <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoiz
3,759,934,537 32.87% 10,505 357,918 ziskos::zisklib::lib::keccak256::keccak256

…and a parallel TOP STEP FUNCTIONS ranked by steps instead of cost. Comparing the two is revealing: keccak256 is 14.77% of steps but 32.87% of cost. A function whose cost share dwarfs its step share is doing expensive per-step work (a precompile) and is worth a close look.

Tune the view as you go: -T / --top-roi N shows more or fewer functions; -M / --main-name names a non-standard entry point; and --roi-filter "<regex>" marks functions of interest (add --top-roi-filter to show only matches — ideal for comparing implementations of one subsystem). Rust names get long, so they're compacted to 160 characters by default (collapsing nested generics and eliding path segments); adjust with --compact-names=N or --no-compact-names.

Why a function is expensive (-D)

Once you know which function, -D (or --top-roi-detail) shows why: for each top function, a cost-by-opcode breakdown scoped to that function, plus its top callers.

bash
ziskemu -e <elf> -i <input> -X -S -D
DETAIL FUNCTION ziskos::zisklib::lib::keccak256::keccak256
----------------------------------------------------------
STEPS 13,714,388 14.77%
COST 3,759,934,537 32.87%

| COST BY OPCODE COUNT COST % RANK
| OP or 2,489,249 149,354,940 3.97% #2
| OP xor 492,192 29,531,520 0.79% #3
| OP sll 360,008 19,080,424 0.51% #4
| OP keccak 32,650 2,466,707,500 65.61% #1

| TOP STEP CALLERS (calls, steps)
| 3,974 9,749,694 71.09% <zeth_mpt_state::SparseState as stateless::trie::State
| 2,332 2,778,890 20.26% <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoize::Cac
| 1,284 217,150 1.58% revm_interpreter::instructions::system::keccak256::<re
| …

The opcode table here is scoped to this one function: keccak is 65.61% of its cost, so the hashing itself is the target, not the surrounding code. TOP STEP CALLERS reverses the view: who calls it and how much each is responsible for. SparseState accounts for 71% of keccak256's steps (from 3,974 calls), so that's the path to investigate first. -C / --roi-callers N controls how many callers are listed (default 10).
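The percentage in TOP STEP CALLERS is a share of the function's steps, which can differ a lot from a caller's share of the calls. The sketch below computes both for the three callers listed, taking keccak256's 10,505 total calls from the TOP COST FUNCTIONS table and its 13,714,388 steps from the DETAIL block. The caller names are shortened here for readability.

```rust
/// `part` as a percentage of `whole`.
fn pct(part: u64, whole: u64) -> f64 {
    part as f64 * 100.0 / whole as f64
}

fn main() {
    // keccak256 totals from the reports above.
    let total_calls: u64 = 10_505; // CALLS column in TOP COST FUNCTIONS
    let total_steps: u64 = 13_714_388; // STEPS line in the DETAIL block

    // (caller, calls, steps) from TOP STEP CALLERS; names shortened.
    let callers: [(&str, u64, u64); 3] = [
        ("SparseState", 3_974, 9_749_694),
        ("Node", 2_332, 2_778_890),
        ("revm keccak256", 1_284, 217_150),
    ];

    for (name, calls, steps) in callers {
        println!(
            "{name:<16}{:>7.2}% of calls {:>7.2}% of steps {:>9.1} steps/call",
            pct(calls, total_calls),
            pct(steps, total_steps),
            steps as f64 / calls as f64,
        );
    }
}
```

A caller with a modest share of calls but a large share of steps is passing more work per call, which is a different optimization problem from a caller that simply calls too often.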

Instruction-level hotspots (-H)

To go below functions — to the exact instructions that run most often — use the PC histogram, -H (or --histogram). It groups consecutive hot instructions and attributes each group to its function (with -S):

bash
ziskemu -e <elf> -i <input> -X -S -H 50
TOP PC HISTOGRAM (EXECUTIONS, % EXECUTIONS, PC)
-----------------------------------------------
796,670 0.86% 0x801230b8: lbu r16, 0x0(r14)
796,670 0.86% 0x801230bc: beq r16, r12, 0xffffffd4
1,593,340 1.72% ----------- <revm_bytecode::legacy::raw::LegacyRawBytecode>::into_analyzed

429,174 0.46% 0x800a38ec: ld r10, 0x60(r21)

429,174 0.46% 0x800a392c: bne r10, r0, 0xffffffc0
7,295,958 7.86% ----------- <revm_handler::mainnet_handler::MainnetHandler<revm_context::evm::Ev

Each group is a run of instructions (executions, % of steps, address, the RISC-V instruction) followed by a dashed summary line with the group's total and its function. The first group above is a tight 2-instruction byte-scanning loop: 1.72% of the whole program in two instructions. The second is a 17-instruction dispatcher at 7.86% (only its first and last instructions are shown in this excerpt), a prime target. Reach for the histogram after function-level profiling: it shows which instructions inside a hot function form the hot loop, and confirms whether a compiler optimization actually landed.
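The dashed summary line is the sum of the executions in its group, so the group figures can be checked by hand. The sketch below reproduces the first group's total and infers the second group's length from its total, under the assumption that every instruction in that group ran the same number of times. Both helper names are hypothetical.

```rust
/// Sum of the per-instruction execution counts in one histogram group.
fn group_total(executions: &[u64]) -> u64 {
    executions.iter().sum()
}

/// Group length implied by its total, ASSUMING every instruction in the
/// group ran `per_instruction` times. Returns None if that does not divide
/// evenly, in which case the assumption does not hold.
fn implied_len(total: u64, per_instruction: u64) -> Option<u64> {
    if per_instruction != 0 && total % per_instruction == 0 {
        Some(total / per_instruction)
    } else {
        None
    }
}

fn main() {
    let steps: u64 = 92_875_129; // STEPS from the -X report
    let first = group_total(&[796_670, 796_670]);
    println!(
        "first group: {first} executions, {:.2}% of steps",
        first as f64 * 100.0 / steps as f64
    );
    println!("second group length: {:?}", implied_len(7_295_958, 429_174));
}
```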

Measure your own code: profile tags

To measure a region you choose (an algorithm, a loop, a critical section), wrap it in profile-tag macros from ziskos. The companion tutorial adds a pair to a guest step by step in Using inline cost markers; here is the full set. The macros have zero overhead outside the profiler. They measure either cost or steps, and report either immediately (printed each time the matching end macro runs) or as an accumulated end-of-run summary, which gives eight macros in a 2×2 grid:

| Report style | Cost | Steps |
| --- | --- | --- |
| Immediate | profile_start! / profile_end! | profile_steps_start! / profile_steps_end! |
| Accumulated | profile_report_start! / profile_report_end! | profile_report_steps_start! / profile_report_steps_end! |

Use immediate for sections that run a few times, accumulated for sections that run many. Each macro takes a bare label and must be closed by its matching end! with the same label; you can nest and mix them freely:

profile_start!(total); // immediate: cost of the whole run

for i in 0..100 {
    profile_report_steps_start!(iteration); // accumulated steps per iteration
    expensive_computation(i);
    profile_report_steps_end!(iteration);
}

profile_report_start!(hash_phase); // accumulated cost
for item in items {
    compute_hash(item);
}
profile_report_end!(hash_phase);

profile_end!(total);

See the accumulated (report) tags by adding --profile-tags:

bash
ziskemu -e <elf> -i <input> -X --profile-tags
PROFILE TAGS COST (COST, % COST, CALLS, AVG, MIN, MAX)
-------------------------------------------------------
1,234,567,890 10.79% 100 12,345,678 10,000,000 15,000,000 total
456,789,012 3.99% 50 9,135,780 5,000,000 12,000,000 hash_phase

Each tag reports its total, share of steps or cost, call count, and the average, minimum, and maximum per call. Name tags descriptively and keep one tag per logical section — profile tags answer what is expensive, while the function rankings answer where.
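To make the columns of the accumulated report concrete, here is a small Rust model of what one tag has to track to produce total, calls, average, minimum, and maximum. It is an illustration of the report's arithmetic only: `TagStats` is a made-up type, the recorded values are invented sample inputs, and nothing here reflects how ziskemu implements tags internally.

```rust
/// Running statistics for one profile tag (illustrative model, not ZisK code).
#[derive(Debug, Default)]
struct TagStats {
    total: u64,
    calls: u64,
    min: u64,
    max: u64,
}

impl TagStats {
    /// Record the cost (or steps) measured between one start/end pair.
    fn record(&mut self, value: u64) {
        self.min = if self.calls == 0 { value } else { self.min.min(value) };
        self.max = self.max.max(value);
        self.total += value;
        self.calls += 1;
    }

    /// Integer average per call; None if the tag never ran.
    fn avg(&self) -> Option<u64> {
        if self.calls == 0 {
            None
        } else {
            Some(self.total / self.calls)
        }
    }
}

fn main() {
    let mut hash_phase = TagStats::default();
    // Invented sample measurements, one per start/end pair.
    for cost in [5_000_000, 12_000_000, 9_000_000] {
        hash_phase.record(cost);
    }
    println!("{hash_phase:?} avg={:?}", hash_phase.avg());
}
```

A wide gap between the minimum and maximum of a tag is a hint that the tagged section's cost depends heavily on its input, which is worth checking before optimizing for the average case.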

Explore visually: the Firefox Profiler

For an interactive call graph (flame graphs, a timeline, search), export with --profiler-output and open the file at profiler.firefox.com (Load a profile from file). -S is required and -X recommended; in fact, -X -S on their own already write profile.json.gz automatically. The companion tutorial produces the same file via cargo-zisk run --profiling complete and tours the views in Exploring the call stack. The output follows the Firefox Profiler format spec, so other compatible tools can read it too.

bash
ziskemu -e <elf> -i <input> -X -S --profiler-output=profile.json.gz

More tools and a workflow

A few smaller tools round out the toolset:

  • Inspect call arguments. Pair --roi-filter with --track-call-args N (1–8 arguments, the RISC-V a0–a7 registers) to log each call's arguments to a <function>.txt file (one line per call). --track-separator sets the delimiter (default ;) and --track-output-path the directory. It logs the raw register values (the actual value for scalars, but only the address for pointers), so it is best for scalar arguments or address patterns.

    bash
    ziskemu -e <elf> -i <input> -S \
    --roi-filter "hash_function" --track-call-args 4 --track-output-path ./traces
  • Quick checks. --steps prints only the step count; --with-progress reports progress every 16M steps on long runs; --no-thousands-sep gives machine-readable numbers.

  • Everything at once. The flags compose — full statistics, symbols, detailed callers, 30 functions, 15 callers each, 50 instruction groups, a filter, argument tracking, and performance metrics (-m):

    bash
    ziskemu -e <elf> -i input.bin -X -S -D -T 30 -C 15 -H 50 \
    --roi-filter "sha256|hash" --track-call-args 6 --track-output-path ./profiling_data -m

A workflow that scales: start with -X, add -S to see the functions, -D to see why they're expensive, and -H to see the hot instructions — then apply a precompile or patch (see Optimizing your program) and re-run to confirm the cost dropped. Chase the big percentages, profile realistic inputs, and filter with --roi-filter in large codebases.

Worked example: which EVM opcodes cost the most?

Putting it together on a real program: to rank the cost of individual Ethereum opcodes in a stateless block validator, filter to the EVM interpreter's instruction functions — they all live under one namespace — and show only those:

bash
ziskemu -S -X \
-e .../stateless-validator-reth/.../release/zec-reth \
-i .../benchmark_inputs/24654304_30c8b8.bin \
--roi-filter "revm_interpreter::instructions::" --top-roi-filter -T 200
TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
9,433,353,231 10.32% 5,824 1,619,737 revm_interpreter::instructions::contract::call_helpers::load_acc_
8,344,978,788 9.13% 1,695 4,923,291 revm_interpreter::instructions::contract::call::<revm_interpreter
4,599,658,812 5.03% 342,951 13,412 revm_interpreter::instructions::stack::swap::<1, revm_interpreter
2,772,734,752 3.03% 128,956 21,501 revm_interpreter::instructions::memory::mload::<revm_interpreter:
691,435,073 0.76% 5,682 121,688 revm_interpreter::instructions::system::keccak256::<revm_interpre
669,514,638 0.73% 245,798 2,723 revm_interpreter::instructions::arithmetic::add::<revm_interprete

From this single view you see the most expensive opcodes (the call family at the top), the most frequently called ones (swap/push, with huge call counts but tiny per-call cost), and the best optimization targets. No modification to the ELF is needed — profiling reads the existing symbols; the only thing to know is the naming convention of the functions you want to filter.
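The same rows can be re-ranked offline, for example by call count instead of cost. The sketch below parses a few lines copied from the excerpt above and sorts them by CALLS. It assumes the columns are whitespace-separated in the order shown in the header, with the function name taking the rest of the line; the real output's spacing may differ, and `Row`, `parse_num`, and `parse_row` are hypothetical helpers.

```rust
/// One parsed row of TOP COST FUNCTIONS (illustrative, not a ZisK type).
#[derive(Debug)]
struct Row {
    cost: u64,
    calls: u64,
    function: String,
}

/// Parse a number printed with thousands separators, e.g. "1,619,737".
fn parse_num(s: &str) -> Option<u64> {
    s.replace(',', "").parse().ok()
}

/// Parse "COST %COST CALLS COST/CALL FUNCTION..."; None for non-data lines.
fn parse_row(line: &str) -> Option<Row> {
    let mut fields = line.split_whitespace();
    let cost = parse_num(fields.next()?)?;
    let _pct = fields.next()?;
    let calls = parse_num(fields.next()?)?;
    let _cost_per_call = fields.next()?;
    // Function names can contain spaces, so keep the rest of the line.
    let function = fields.collect::<Vec<_>>().join(" ");
    if function.is_empty() {
        return None;
    }
    Some(Row { cost, calls, function })
}

fn main() {
    let report = "\
TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
9,433,353,231 10.32% 5,824 1,619,737 revm_interpreter::instructions::contract::call_helpers::load_acc_
4,599,658,812 5.03% 342,951 13,412 revm_interpreter::instructions::stack::swap::<1, revm_interpreter
669,514,638 0.73% 245,798 2,723 revm_interpreter::instructions::arithmetic::add::<revm_interprete";

    let mut rows: Vec<Row> = report.lines().filter_map(parse_row).collect();
    rows.sort_by(|a, b| b.calls.cmp(&a.calls)); // most frequently called first
    for r in &rows {
        println!("{:>9} calls {:>10} cost/call  {}", r.calls, r.cost / r.calls, r.function);
    }
}
```

If the output is meant for scripts, the --no-thousands-sep flag mentioned earlier would make the comma stripping unnecessary.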

Where to go next