Profiling in depth

ziskemu is ZisK's advanced emulator and profiler. This tutorial drives it directly to find exactly what makes a guest expensive to prove: the costly functions, the hot instructions, and the operations that dominate the cost.

It is the deep companion to Profiling your program, which uses the friendly cargo-zisk run --profiling wrapper on a merkle-tree example. The concepts introduced there — the profiling-vs-proving-cost trade-off, the cost-report categories, basic cost markers, and the Firefox call-stack view — aren't repeated here; this guide adds the full ziskemu toolset on top.

You only need a compiled guest ELF and an input. Build one with cargo-zisk build --release (the ELF lands at target/elf/riscv64ima-zisk-zkvm-elf/release/<your-guest>), then point -e at it and -i at your input. Every command below has the same shape, shortened to ziskemu -e <elf> -i <input> …:

```bash
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/<your-guest> -i input.bin -X
```

The reports shown below come from a larger program — an Ethereum block validator — so every section has rich data to read. Your own guest prints the same shapes, just smaller. ziskemu runs the guest and analyzes it; it never generates a proof, which is what makes it fast to iterate with.

Profiling cost vs. final cost

You optimize against profiling cost: the per-operation cost that moves proportionally with your code, so every change shows up immediately. The final cost — what you actually pay to prove — is measured per instance (operations are batched into fixed-capacity state-machine instances), so it stays flat and then jumps when you cross a boundary. The companion tutorial summarizes this in Understanding profiling importance; two consequences are worth internalizing:

Instance boundaries. Whether the guest performs 1 Keccak operation or 5,242, it pays for one Keccak instance; the 5,243rd operation needs a second instance, doubling that cost for a single extra operation. Profiling cost, in contrast, rises smoothly with each operation.
Specialized instances. ZisK has a cheap BinaryAdd instance for 64-bit additions versus the general Binary instance (ADD, SUB, AND, OR, XOR, …). So 1M ADDs profile cheaper than 1M ORs — and profiling cost shows that consistently, while final cost only reveals it once the program is large enough to split instances. The planner also picks strategies from your operation mix, and a function's final cost aggregates everything it calls.

Optimize with profiling cost; confirm the savings with final cost.
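
The instance-boundary effect can be sketched as a ceiling division. This is a minimal illustration rather than ZisK's planner: the capacity of 5,242 is the Keccak figure quoted above, `instances_needed` is a hypothetical helper, and treating a partially filled instance as costing a full one is the assumption being illustrated.

```rust
/// Keccak operations that fit in one instance (the figure quoted in the text).
const KECCAK_CAPACITY: u64 = 5_242;

/// Instances needed for `ops` operations: ceiling division, because a
/// partially filled instance is assumed to cost as much as a full one.
/// Hypothetical helper for illustration; not part of ZisK.
fn instances_needed(ops: u64, capacity: u64) -> u64 {
    assert!(capacity > 0, "capacity must be non-zero");
    ops / capacity + u64::from(ops % capacity != 0)
}

fn main() {
    for ops in [1, 5_242, 5_243] {
        println!(
            "{ops:>5} Keccak ops -> {} instance(s)",
            instances_needed(ops, KECCAK_CAPACITY)
        );
    }
}
```

The first two inputs land in one instance and the third spills into a second, which is the jump that final cost shows and profiling cost smooths out.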

What profiling can tell you

Profiling reads the symbols already in your ELF, so it works on optimized release builds with no instrumentation, no special build flags, and no runtime overhead. For the questions it helps you answer — which crate is cheapest to prove, whether a change helped, whether precompiles are applied, and where to patch — and the optimize-then-re-profile workflow, see Understanding profiling importance.

The statistics report (`-X`)

-X (or --stats) is the first command to run — overall cost plus a per-operation breakdown:

```bash
ziskemu -e <elf> -i <input> -X
```

REPORT
----------------------------------------
STEPS                         92,875,129

COST DISTRIBUTION                   COST       %
------------------------------------------------
BASE                         293,601,280   2.57%
MAIN                       6,315,508,772  55.22%
OPCODES                    1,334,639,984  11.67%
PRECOMPILES                2,565,960,716  22.43%
MEMORY                       927,932,629   8.11%

TOTAL                     11,437,643,381 100.00%

FROPS                        963,440,253   8.42%
RAM USAGE                     18,465,008   3.47%

STEPS is the instruction count, that is, how long the run is. COST DISTRIBUTION is the profiling cost, split across five categories: BASE (fixed overhead like tables and range checks), MAIN (the processor itself, proportional to STEPS), OPCODES (simple 64-bit arithmetic and logic), PRECOMPILES (operations too big for 64 bits, such as 256-bit math, elliptic curves, Keccak, and DMA), and MEMORY (reads, writes, and the extra cost of non-aligned or sub-8-byte access). Whichever dominates tells you what kind of work is expensive; the companion tutorial explains each on a real program in Cost distribution summary.

Two extra lines round it out. FROPS covers frequent operations, such as adding 1 (loop counters), adding 8 (pointers), or operations on values < 256, which ZisK pre-computes into BASE; the figure is what you would pay without that optimization. RAM USAGE is reported only with the default bump allocator.
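
To make the percentages concrete, the sketch below recomputes each category's share from the sample report and divides MAIN by STEPS. The numbers are copied from the report above; `share_percent` is a hypothetical helper, not something ziskemu exposes.

```rust
/// Share of `part` in `total`, as a percentage.
fn share_percent(part: u64, total: u64) -> f64 {
    part as f64 * 100.0 / total as f64
}

fn main() {
    // Figures copied from the sample report above.
    let steps: u64 = 92_875_129;
    let categories: [(&str, u64); 5] = [
        ("BASE", 293_601_280),
        ("MAIN", 6_315_508_772),
        ("OPCODES", 1_334_639_984),
        ("PRECOMPILES", 2_565_960_716),
        ("MEMORY", 927_932_629),
    ];

    // The five categories add up to the TOTAL line.
    let total: u64 = categories.iter().map(|(_, cost)| *cost).sum();
    for (name, cost) in categories {
        println!("{name:<12} {:>6.2}%", share_percent(cost, total));
    }

    // MAIN is proportional to STEPS, so the ratio is a constant per-step cost.
    let main_cost = categories[1].1;
    println!("MAIN per step: {}", main_cost / steps);
}
```

On these figures MAIN divided by STEPS comes to exactly 68, which is what "proportional to STEPS" looks like in practice: reducing the step count is the only way to move that line.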

Below the summary comes the per-opcode breakdown (excerpt):

COST BY OPCODE                     COUNT       %            COST       % RANK
-----------------------------------------------------------------------------
OP add                         7,086,411   7.63%     177,160,275   1.55% #4
OP and                         3,740,044   4.03%     224,402,640   1.96% #3
OP or                          7,482,273   8.06%     448,936,380   3.93% #2
OP xor                         1,027,290   1.11%      61,637,400   0.54%
OP mul                           409,765   0.44%      38,927,675   0.34%
OP keccak                         32,650   0.04%   2,466,707,500  21.57% #1
OP secp256k1_add                  17,688   0.02%      25,187,712   0.22%
…

FROPS BY OPCODE                    COUNT    HIT            COST       % RANK
----------------------------------------------------------------------------
FROP sll                       8,729,869  84.91%     462,683,057   4.05% #1
FROP eq                        3,273,419  85.77%     196,405,140   1.72% #2
FROP or                        1,303,629  14.84%      78,217,740   0.68% #3
FROP ltu                         942,288  34.78%      56,537,280   0.49% #4
…

Each row shows the COUNT, its share of steps, its total COST and share of cost, and a #1–#4 RANK on the four most expensive. Rows keep a fixed order across runs so you can diff two profiles line by line — find the costly ones by the # markers, not by scanning. The habit that matters most: compare COUNT against the cost %. Here keccak ran only 32,650 times (0.04% of steps) yet is 21.57% of cost — rare but expensive, the textbook optimization target. The FROPS table adds a HIT column: how often the pre-computed pattern matched.
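
Dividing COST by COUNT turns the habit above into a number. The sketch below does that for four rows of the excerpt; the figures come from the table, while `cost_per_op` and the `OpRow` alias are hypothetical names used only for illustration.

```rust
/// One row of the COST BY OPCODE table: (name, count, cost).
type OpRow = (&'static str, u64, u64);

/// Average profiling cost of one execution of an opcode.
/// `count` must be non-zero.
fn cost_per_op(count: u64, cost: u64) -> u64 {
    cost / count
}

fn main() {
    // Rows copied from the excerpt above.
    let rows: [OpRow; 4] = [
        ("add", 7_086_411, 177_160_275),
        ("or", 7_482_273, 448_936_380),
        ("mul", 409_765, 38_927_675),
        ("keccak", 32_650, 2_466_707_500),
    ];
    for (name, count, cost) in rows {
        println!("{name:<8} {:>8} per op", cost_per_op(count, cost));
    }
}
```

In this sample, add works out to 25 per operation, or to 60, mul to 95, and keccak to 75,550. That matches the earlier point that ADDs profile cheaper than ORs, and it shows why a rarely executed precompile can still top the ranking.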

Find the expensive functions (`-S`)

Add -S (or --read-symbols) to attribute cost to functions. ZisK simulates the call stack and charges each function cumulatively — its own code plus everything it calls (the _start / main frames are filtered out, since they would always be 100%):

```bash
ziskemu -e <elf> -i <input> -X -S
```

You get two rankings. By cost — usually the one you act on:

TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
  5,255,204,123  45.95%          1   5,255,204,123 <reth_evm::execute::BasicBlockExecutor<&reth_evm
  4,997,989,104  43.70%         70      71,399,844 <revm_handler::mainnet_handler::MainnetHandler<r
  4,530,507,470  39.61%     41,793         108,403 <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoiz
  3,759,934,537  32.87%     10,505         357,918 ziskos::zisklib::lib::keccak256::keccak256
  …

…and a parallel TOP STEP FUNCTIONS ranked by steps instead of cost. Comparing the two is revealing: keccak256 is 14.77% of steps but 32.87% of cost. A function whose cost share dwarfs its step share is doing expensive per-step work (a precompile) and is worth a close look.
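
Cumulative attribution also explains why the percentages in these rankings add up to more than 100%. The sketch below shows the idea on a toy trace; it is not ZisK's implementation, and the `Event` type, the function names in the trace, and the costs are all made up for illustration.

```rust
use std::collections::HashMap;

/// A simplified execution trace (hypothetical, for illustration only).
enum Event {
    Call(&'static str), // enter a function
    Cost(u64),          // cost incurred while the current stack is active
    Return,             // leave the innermost function
}

/// Charge every cost to all functions on the simulated call stack, so each
/// function's total includes everything it calls.
/// Assumes no recursion: a function on the stack twice would be charged twice.
fn inclusive_costs(events: &[Event]) -> HashMap<&'static str, u64> {
    let mut stack: Vec<&'static str> = Vec::new();
    let mut totals: HashMap<&'static str, u64> = HashMap::new();
    for event in events {
        match event {
            Event::Call(name) => stack.push(*name),
            Event::Cost(cost) => {
                for name in &stack {
                    *totals.entry(*name).or_insert(0) += *cost;
                }
            }
            Event::Return => {
                stack.pop();
            }
        }
    }
    totals
}

fn main() {
    let trace = [
        Event::Call("validate_block"),
        Event::Cost(10),
        Event::Call("keccak256"),
        Event::Cost(90),
        Event::Return,
        Event::Cost(5),
        Event::Return,
    ];
    let totals = inclusive_costs(&trace);
    println!("validate_block: {}", totals["validate_block"]);
    println!("keccak256:      {}", totals["keccak256"]);
}
```

Here the outer function is charged 105 and the inner one 90, even though only 105 units were spent in total. A caller near the top of the ranking may therefore be expensive only because of what it calls.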

Tune the view as you go: -T / --top-roi N shows more or fewer functions; -M / --main-name names a non-standard entry point; and --roi-filter "<regex>" marks functions of interest (add --top-roi-filter to show only matches — ideal for comparing implementations of one subsystem). Rust names get long, so they're compacted to 160 characters by default (collapsing nested generics and eliding path segments); adjust with --compact-names=N or --no-compact-names.

Why a function is expensive (`-D`)

Once you know which function, -D (or --top-roi-detail) shows why: for each top function, a cost-by-opcode breakdown scoped to that function, plus its top callers.

```bash
ziskemu -e <elf> -i <input> -X -S -D
```

DETAIL FUNCTION ziskos::zisklib::lib::keccak256::keccak256
----------------------------------------------------------
STEPS                         13,714,388  14.77%
COST                       3,759,934,537  32.87%

|    COST BY OPCODE                     COUNT            COST       % RANK
|    OP or                          2,489,249     149,354,940   3.97% #2
|    OP xor                           492,192      29,531,520   0.79% #3
|    OP sll                           360,008      19,080,424   0.51% #4
|    OP keccak                         32,650   2,466,707,500  65.61% #1

|    TOP STEP CALLERS (calls, steps)
|              3,974       9,749,694  71.09% <zeth_mpt_state::SparseState as stateless::trie::State
|              2,332       2,778,890  20.26% <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoize::Cac
|              1,284         217,150   1.58% revm_interpreter::instructions::system::keccak256::<re
|              …

The opcode table here is scoped to this one function: keccak is 65.61% of its cost, so the hashing itself is the target, not the surrounding code. TOP STEP CALLERS reverses the view: who calls it, and how many of its steps each caller is responsible for. SparseState accounts for 71% of keccak256's steps (9,749,694 steps across 3,974 calls), so that's the path to investigate first. -C / --roi-callers N controls how many callers are listed (default 10).

Instruction-level hotspots (`-H`)

To go below functions — to the exact instructions that run most often — use the PC histogram, -H (or --histogram). It groups consecutive hot instructions and attributes each group to its function (with -S):

```bash
ziskemu -e <elf> -i <input> -X -S -H 50
```

TOP PC HISTOGRAM (EXECUTIONS, % EXECUTIONS, PC)
-----------------------------------------------
        796,670   0.86%  0x801230b8:   lbu r16, 0x0(r14)
        796,670   0.86%  0x801230bc:   beq r16, r12, 0xffffffd4
      1,593,340   1.72%  -----------   <revm_bytecode::legacy::raw::LegacyRawBytecode>::into_analyzed

        429,174   0.46%  0x800a38ec:   ld r10, 0x60(r21)
        …
        429,174   0.46%  0x800a392c:   bne r10, r0, 0xffffffc0
      7,295,958   7.86%  -----------   <revm_handler::mainnet_handler::MainnetHandler<revm_context::evm::Ev

Each group is a run of instructions (executions, % of steps, address, the RISC-V instruction) followed by a dashed summary line with the group's total and its function. The first group above is a tight 2-instruction byte-scanning loop (1.72% of the whole program in two instructions), and the second is a 17-instruction dispatcher at 7.86%, a prime target. Reach for the histogram after function-level profiling: it shows which instructions inside a hot function form the hot loop, and confirms whether a compiler optimization actually landed.
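
The summary lines can be cross-checked from the addresses alone. The sketch below assumes fixed 4-byte instructions (the riscv64ima target in the ELF path has no compressed-instruction extension) and that every instruction in a group executed equally often, which the elided rows do not show directly; `group_len` is a hypothetical helper.

```rust
/// Instruction width in bytes; assumes fixed 4-byte encodings
/// (no compressed-instruction extension on the riscv64ima target).
const INSTRUCTION_BYTES: u64 = 4;

/// Number of instructions in a contiguous group, from its first and last PC.
fn group_len(first_pc: u64, last_pc: u64) -> u64 {
    (last_pc - first_pc) / INSTRUCTION_BYTES + 1
}

fn main() {
    // (first PC, last PC, executions per instruction) from the excerpt above.
    let groups: [(u64, u64, u64); 2] = [
        (0x8012_30b8, 0x8012_30bc, 796_670),
        (0x800a_38ec, 0x800a_392c, 429_174),
    ];
    for (first, last, executions) in groups {
        let len = group_len(first, last);
        println!("{len:>2} instructions x {executions} = {}", len * executions);
    }
}
```

The products, 1,593,340 and 7,295,958, match the two dashed summary lines, which is where the "2-instruction" and "17-instruction" figures come from.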

Measure your own code: profile tags

To measure a region you choose (an algorithm, a loop, a critical section), wrap it in profile-tag macros from ziskos. The companion tutorial adds a pair to a guest step by step in Using inline cost markers; here is the full set. The macros have zero overhead outside the profiler. Each pair measures either cost or steps, and reports either immediately (printed each time the end macro runs) or as an accumulated end-of-run summary. That gives eight macros, four start/end pairs in a 2×2 grid:

| Report style | Cost | Steps |
| --- | --- | --- |
| Immediate | `profile_start!` / `profile_end!` | `profile_steps_start!` / `profile_steps_end!` |
| Accumulated | `profile_report_start!` / `profile_report_end!` | `profile_report_steps_start!` / `profile_report_steps_end!` |

Use immediate for sections that run a few times, accumulated for sections that run many. Each macro takes a bare label and must be closed by its matching end! with the same label; you can nest and mix them freely:

profile_start!(total);                       // cost of the whole run

for i in 0..100 {
    profile_report_steps_start!(iteration);  // accumulated steps per iteration
    expensive_computation(i);
    profile_report_steps_end!(iteration);
}

profile_report_start!(hash_phase);           // accumulated cost
for item in items { compute_hash(item); }
profile_report_end!(hash_phase);

profile_end!(total);

See the accumulated (report) tags by adding --profile-tags:

```bash
ziskemu -e <elf> -i <input> -X --profile-tags
```

PROFILE TAGS COST (COST, % COST, CALLS, AVG, MIN, MAX)
-------------------------------------------------------
  1,234,567,890  10.79%  100  12,345,678  10,000,000  15,000,000  total
    456,789,012   3.99%   50   9,135,780   5,000,000  12,000,000  hash_phase

Each tag reports its total, share of steps or cost, call count, and the average, minimum, and maximum per call. Name tags descriptively and keep one tag per logical section — profile tags answer what is expensive, while the function rankings answer where.
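
The columns of the accumulated report correspond to a small running aggregate per tag. The sketch below shows one way such an aggregate could be kept; `TagStats` is a hypothetical type, the recorded values are made up, and truncating integer division for the average is an assumption, so treat it as a model of the report rather than ziskos internals.

```rust
/// Running statistics for one accumulated tag (hypothetical, for illustration).
#[derive(Debug, Default)]
struct TagStats {
    total: u64,
    calls: u64,
    min: Option<u64>,
    max: u64,
}

impl TagStats {
    /// Record the cost (or steps) measured between one start/end pair.
    fn record(&mut self, value: u64) {
        self.total += value;
        self.calls += 1;
        self.min = Some(self.min.map_or(value, |m| m.min(value)));
        self.max = self.max.max(value);
    }

    /// Average per call, or None if the tag never ran.
    fn avg(&self) -> Option<u64> {
        (self.calls > 0).then(|| self.total / self.calls)
    }
}

fn main() {
    let mut stats = TagStats::default();
    // Made-up measurements for three passes through a tagged region.
    for value in [12_000_000, 10_000_000, 15_000_000] {
        stats.record(value);
    }
    println!("{stats:?} avg={:?}", stats.avg());
}
```

A wide gap between MIN and MAX on a tag is a hint that the region's cost depends heavily on its input, which is worth knowing before optimizing for the average case.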

Explore visually: the Firefox Profiler

For an interactive call graph — flame graphs, a timeline, search — export with --profiler-output and open it at profiler.firefox.com (Load a profile from file). -S is required and -X recommended; in fact -X -S on its own already writes profile.json.gz automatically. The companion tutorial produces the same file via cargo-zisk run --profiling complete and tours the views in Exploring the call stack. The output follows the Firefox Profiler format spec, so other compatible tools can read it too.

```bash
ziskemu -e <elf> -i <input> -X -S --profiler-output=profile.json.gz
```

More tools and a workflow

A few smaller tools round out the toolset:

Inspect call arguments. Pair --roi-filter with --track-call-args N (1–8 arguments, the RISC-V a0–a7 registers) to log each call's arguments to a <function>.txt file (one line per call). --track-separator sets the delimiter (default ;) and --track-output-path the directory. It logs the raw register values — the actual value for scalars, but only the address for pointers — so it is best for scalar arguments or address patterns.
```bash
ziskemu -e <elf> -i <input> -S \
  --roi-filter "hash_function" --track-call-args 4 --track-output-path ./traces
```
Quick checks. --steps prints only the step count; --with-progress reports progress every 16M steps on long runs; --no-thousands-sep gives machine-readable numbers.
Everything at once. The flags compose — full statistics, symbols, detailed callers, 30 functions, 15 callers each, 50 instruction groups, a filter, argument tracking, and performance metrics (-m):
```bash
ziskemu -e <elf> -i input.bin -X -S -D -T 30 -C 15 -H 50 \
  --roi-filter "sha256|hash" --track-call-args 6 --track-output-path ./profiling_data -m
```
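
On the machine-readable numbers mentioned under Quick checks: if you are post-processing a report that was saved with the default separators, stripping them is straightforward. A minimal sketch, where `parse_grouped` is a hypothetical helper and not part of any ZisK tooling:

```rust
/// Parse a number printed with thousands separators, e.g. "92,875,129".
/// Lenient by design: it drops every comma without checking the grouping.
fn parse_grouped(text: &str) -> Option<u64> {
    text.trim().replace(',', "").parse().ok()
}

fn main() {
    // The TOTAL line of the sample statistics report.
    println!("{:?}", parse_grouped("11,437,643,381"));
    // Anything that is not a number yields None instead of panicking.
    println!("{:?}", parse_grouped("n/a"));
}
```

For scripted comparisons between runs, passing --no-thousands-sep avoids this step altogether.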

A workflow that scales: start with -X, add -S to see the functions, -D to see why they're expensive, and -H to see the hot instructions — then apply a precompile or patch (see Optimizing your program) and re-run to confirm the cost dropped. Chase the big percentages, profile realistic inputs, and filter with --roi-filter in large codebases.

Worked example: which EVM opcodes cost the most?

Putting it together on a real program: to rank the cost of individual Ethereum opcodes in a stateless block validator, filter to the EVM interpreter's instruction functions — they all live under one namespace — and show only those:

```bash
ziskemu -S -X \
  -e .../stateless-validator-reth/.../release/zec-reth \
  -i .../benchmark_inputs/24654304_30c8b8.bin \
  --roi-filter "revm_interpreter::instructions::" --top-roi-filter -T 200
```

TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
  9,433,353,231  10.32%      5,824       1,619,737 revm_interpreter::instructions::contract::call_helpers::load_acc_
  8,344,978,788   9.13%      1,695       4,923,291 revm_interpreter::instructions::contract::call::<revm_interpreter
  4,599,658,812   5.03%    342,951          13,412 revm_interpreter::instructions::stack::swap::<1, revm_interpreter
  2,772,734,752   3.03%    128,956          21,501 revm_interpreter::instructions::memory::mload::<revm_interpreter:
    691,435,073   0.76%      5,682         121,688 revm_interpreter::instructions::system::keccak256::<revm_interpre
    669,514,638   0.73%    245,798           2,723 revm_interpreter::instructions::arithmetic::add::<revm_interprete
  …

From this single view you see the most expensive opcodes (the call family at the top), the most frequently called ones (swap/push, with huge call counts but tiny per-call cost), and the best optimization targets. No modification to the ELF is needed — profiling reads the existing symbols; the only thing to know is the naming convention of the functions you want to filter.
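
The same rows answer two different questions depending on how you sort them. The sketch below re-ranks five rows of the report by cost per call and picks out the most frequently called one. The figures are copied from the report, the names are shortened by hand, the row whose name is truncated in the output is left out, and `by_cost_per_call` is a hypothetical helper.

```rust
/// One row of TOP COST FUNCTIONS: (short name, cost, calls).
type Row = (&'static str, u64, u64);

/// Rows sorted by cost per call, most expensive first.
fn by_cost_per_call(mut rows: Vec<Row>) -> Vec<Row> {
    rows.sort_by_key(|&(_, cost, calls)| std::cmp::Reverse(cost / calls));
    rows
}

fn main() {
    // Figures from the report above; names shortened for readability.
    let rows: Vec<Row> = vec![
        ("contract::call", 8_344_978_788, 1_695),
        ("stack::swap", 4_599_658_812, 342_951),
        ("memory::mload", 2_772_734_752, 128_956),
        ("system::keccak256", 691_435_073, 5_682),
        ("arithmetic::add", 669_514_638, 245_798),
    ];

    let most_called = rows.iter().max_by_key(|&&(_, _, calls)| calls).unwrap();
    println!("most called: {} ({} calls)", most_called.0, most_called.2);

    for (name, cost, calls) in by_cost_per_call(rows) {
        println!("{name:<20} {:>10} per call", cost / calls);
    }
}
```

Sorting by cost per call puts contract::call first, while stack::swap leads on call count. The first kind of function rewards making each call cheaper; the second rewards shaving a few steps off a path that runs hundreds of thousands of times.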

Where to go next

Profiling your program — the hands-on walkthrough with cargo-zisk run --profiling, if you skipped it.
Optimizing your program — act on what you found by routing expensive work through ZisK precompiles, then re-profile.
ziskemu CLI reference — the complete flag list.
