
Profiling your program

This guide walks through profiling a guest program using inline cost markers, summarized reports and a complete report of the call stack. By the end of this analysis, you will know how to identify which operations drive cost and potential paths to optimization.

Understanding profiling importance

Proving is computationally expensive: a program that runs in milliseconds on your machine can take minutes to prove if it isn't written with the zkVM in mind. Profiling shows you where that cost goes so you can fix it before you pay for a full proving run. Before reading any number, though, there's one idea to get straight: ZisK reports two different kinds of "cost".

Profiling cost vs. final cost

When you profile a ZisK program, two different "costs" show up, and knowing which is which keeps your optimization on track:

| Aspect | Profiling cost | Final cost |
| --- | --- | --- |
| What it is | The cost accrued directly inside a function's own instructions, priced with the best-case model per operation. | The exact cost of an execution, measured at instance level as a result of how the prover plans and allocates instances at runtime. |
| Granularity | Per individual operation. | Per instance: execution units inside state machines that batch many operations together. |
| Reaction to code changes | Direct and proportional: change the code and the number moves with it. | Non-linear: it jumps when an extra operation crosses an instance boundary, or when a change shifts the planner's strategy. |
| What it includes | Only the function's own instructions; padding and aggregation are excluded. | The function's own cost plus everything it calls, summed at the instance level and shaped by the planner's strategy. |
| Best used for | Optimization, as it shows the direct impact of each change. | Sizing the real proving run in production. |

Why optimize with profiling cost?

Profiling cost is a predictable, proportional signal tied directly to your code: every improvement shows up immediately. Final cost is the true production cost, but it moves in jumps at instance boundaries, so it's a poor guide while you're editing. Optimize using profiling cost, then read final cost to confirm the real savings in the proving system.

To make this clearer, let's walk through two theoretical examples.

Example: Instance boundaries

Imagine a program that performs Keccak hashes. Watch how the two costs diverge as the operation count grows:

| Keccak operations | Profiling cost | Final cost |
| --- | --- | --- |
| 1,000 | proportional to 1,000 | 1 Keccak instance (fits within capacity) |
| 5,000 | 5× the 1,000 case | still 1 instance (if capacity is ~5,242) |
| 5,243 | ~5.24× the 1,000 case | 2 instances, because one extra operation crossed the boundary |

Profiling cost grows smoothly with every operation, so it's easy to predict the effect of adding or removing work. Final cost stays flat until you cross an instance boundary, then jumps. That's why profiling cost is better for optimizing, while final cost tells you the real proving price in production.
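
The table above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the prover's actual planner: it assumes an instance holds a fixed number of operations (the illustrative ~5,242 figure from the table) and that instances are allocated by simple ceiling division. The names KECCAK_CAPACITY and instances_needed are hypothetical.

```rust
/// Illustrative capacity taken from the table above; an assumption, not a ZisK constant.
const KECCAK_CAPACITY: u64 = 5_242;

/// Instances required if each one holds at most `capacity` operations.
fn instances_needed(ops: u64, capacity: u64) -> u64 {
    ops.div_ceil(capacity)
}

fn main() {
    for ops in [1_000u64, 5_000, 5_243] {
        // Profiling cost scales linearly, shown here relative to the 1,000-op case.
        let relative = ops as f64 / 1_000.0;
        let instances = instances_needed(ops, KECCAK_CAPACITY);
        println!("{ops} ops: profiling cost {relative:.2}x, instances {instances}");
    }
}
```

The relative profiling cost moves with every operation, while the instance count only changes between 5,242 and 5,243 operations.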

Example: Internal planner

Suppose you have to pick between two implementations that differ by one million operations: Option A uses 1M 64-bit ADD operations, while Option B uses 1M 64-bit OR operations. ZisK has a cheap specialized instance for 64-bit additions (BinaryAdd), while the general Binary instance (which handles ADD, SUB, AND, OR, XOR, and more) is more expensive.

| How you measure | Verdict |
| --- | --- |
| Profiling cost | Option A is cheaper because it uses the efficient specialized instance. |
| Final cost, small program | Both may share a single Binary instance, making them seem equal. |
| Final cost, large program | A uses a dedicated BinaryAdd instance, B uses Binary. |

The difference between the two final-cost rows is the planner at work: in a small program it folds the additions into the shared Binary instance, so both options cost the same; once the program is large enough, it gives them a dedicated, cheaper BinaryAdd instance and Option A wins. The same code is priced differently depending on the overall operation mix. So profiling cost points to Option A at any size, while final cost only agrees once the planner separates the instances.

Spotting optimization opportunities

Profiling reads straight off any release build, no instrumentation required, so it's also the tool you use to decide where to optimize. It helps you answer questions like these:

| Question | What profiling tells you |
| --- | --- |
| Which crate or algorithm is cheapest to prove? | Compare equivalent implementations and pick the most ZisK-efficient one based on a data-driven choice instead of a guess. |
| Did my change actually help? | Compare before/after profiles to confirm the profiling cost decreased. |
| Is patching being applied? | See whether precompiles run where you expect, and catch paths still running generic code instead of the ZisK-optimized version. |
| Where should I patch? | Find the hotspot functions, such as expensive cryptography or arithmetic-heavy code, that would benefit most from an optimization. |

The work is iterative. First, profile your program to find the expensive functions. Then look for patterns that match an available precompile, such as hashing or big-integer math. Patch the code to use that precompile or a ZisK-optimized implementation, or change which operations you use, remembering you're optimizing for the ZisK architecture, not hardware. Finally, re-profile to confirm the profiling cost dropped. Guided by profiling cost, this targets the right areas and produces measurable improvements.


Profiling a guest program

In this section you will profile a guest program that reads N from the input, hashes each leaf index with SHA-256 to produce N leaves, and reduces them into a single Merkle root. You will apply the three profiling tools in sequence, each time building on what the previous one revealed.

Creating the project

You have two ways to get a working ZisK project for this guide. Pick whichever fits your situation; the rest of the guide is identical either way.

Clone the examples repository

If you want the finished version of the program this guide builds, or just want to skim a complete project before writing your own, clone the companion examples repo and move into the merkle-tree directory:

bash
git clone https://github.com/0xPolygonHermez/zisk.git
cd zisk/examples/merkle-tree/guest

Scaffold a new project

To start from an empty project and write the program yourself, use the cargo-zisk CLI to scaffold a new workspace. It handles workspace setup, toolchain configuration, and dependency wiring so you can move straight to writing logic:

bash
cargo-zisk new merkle-tree
cd merkle-tree/guest

Download the sample input this guide uses into a samples/ folder inside the guest/ crate, where the later commands expect it:

bash
mkdir samples
BASE=https://raw.githubusercontent.com/0xPolygonHermez/zisk/refs/heads/main/examples/merkle-tree
curl -L -o samples/example-input.bin $BASE/example-input.bin

Either path lands you in a project with a guest/ crate where the ZisK program lives and a common/ crate with the shared types and logic.

Define the shared logic

The shared logic hashes with SHA-256 and encodes the digest as hex, so add sha2 and hex to the common crate:

common/Cargo.toml
[package]
name = "merkle-common"
version = "0.1.0"
edition = "2024"

[dependencies]
sha2 = "0.10.9"
hex = "0.4.3"

Then open common/src/lib.rs and write the shared Merkle tree logic. The sha2 and hex items are re-exported so the guest can use them through the common crate:

common/src/lib.rs
/// Re-export the SHA-256 hasher and its `Digest` trait from the `sha2` crate.
pub use sha2::{Digest, Sha256};

/// Re-export the `hex` crate for encoding/decoding byte slices as hex strings.
pub use hex;

/// A 32-byte array that holds a raw SHA-256 digest.
pub type Hash = [u8; 32];

/// Computes the Merkle root of `leaves` using SHA-256 to hash sibling pairs.
///
/// Odd-length levels duplicate the last node before pairing. Reduces in-place
/// until a single root hash remains.
///
/// Panics if `leaves` is empty.
pub fn merkle_root(mut leaves: Vec<Hash>) -> Hash {
    // Scratch buffer holding the concatenated left and right child hashes.
    let mut buf = [0u8; 64];
    while leaves.len() > 1 {
        let next_len = leaves.len().div_ceil(2);
        for i in 0..next_len {
            let left = leaves[2 * i];
            // A missing right sibling is replaced by a copy of the left one.
            let right = leaves.get(2 * i + 1).copied().unwrap_or(left);
            buf[..32].copy_from_slice(&left);
            buf[32..].copy_from_slice(&right);
            leaves[i] = Sha256::digest(&buf).into();
        }
        leaves.truncate(next_len);
    }
    leaves[0]
}
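
To see the in-place reduction and the odd-level duplication without pulling in sha2, here is a minimal sketch that swaps the hash for string concatenation. reduce_pairs and the string combiner are hypothetical stand-ins for illustration and are not part of the common crate. Unlike merkle_root, this version returns None for an empty input instead of panicking.

```rust
/// Reduces `nodes` pairwise until one remains, duplicating the last node of
/// an odd-length level. Returns `None` for an empty input.
fn reduce_pairs<T: Clone>(mut nodes: Vec<T>, combine: impl Fn(&T, &T) -> T) -> Option<T> {
    while nodes.len() > 1 {
        let next_len = nodes.len().div_ceil(2);
        for i in 0..next_len {
            let left = nodes[2 * i].clone();
            // A missing right sibling is replaced by a copy of the left one.
            let right = nodes.get(2 * i + 1).cloned().unwrap_or_else(|| left.clone());
            // Writing to index i is safe: it never exceeds the indices still to be read.
            nodes[i] = combine(&left, &right);
        }
        nodes.truncate(next_len);
    }
    nodes.into_iter().next()
}

fn main() {
    let leaves: Vec<String> = ["a", "b", "c"].iter().map(|s| s.to_string()).collect();
    let root = reduce_pairs(leaves, |l, r| format!("({l}{r})"));
    println!("{root:?}"); // prints Some("((ab)(cc))")
}
```

With three leaves, the lone third leaf is paired with itself, which is the same rule merkle_root applies to 32-byte hashes.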

Write the guest program

Now open guest/src/main.rs and write the program. It hashes each leaf index with SHA-256 to produce N leaves and reduces them into a single Merkle root:

guest/src/main.rs
#![no_main]
ziskos::entrypoint!(main);

use merkle_common::{hex, merkle_root, Digest, Hash, Sha256};

fn main() {
    // Read the number of leaves from the guest's standard input stream.
    let n: u64 = ziskos::io::read::<u64>();

    // Build leaves by hashing each index in the range 1..=n.
    let leaves: Vec<Hash> = (1..=n).map(|i| Sha256::digest(&i.to_le_bytes()).into()).collect();

    // Compute the Merkle root over the leaf set.
    let root = merkle_root(leaves);

    // Commit the root as a public output so a verifier can inspect it
    // without re-executing the program.
    ziskos::io::commit_slice(&root);

    // Use `{}` rather than `{:?}` so the hex string is printed without quotes.
    println!("merkle-root({n}) => 0x{}", hex::encode(root));
}

Build and run it once to confirm the logic is correct before profiling:

bash
cargo-zisk build --release
cargo-zisk run --release -i samples/example-input.bin
merkle-root(1000) => 0xf94e7857a9aa655788bccc391771dbe005b9b9cbd2be8f26c56bd08c6c755da5

Using inline cost markers

Before reaching for heavier tools, drop markers into the guest to get a first look at where cost is concentrated. ZisK exposes two flavours of marker:

| Macro pair | Behavior | Use when |
| --- | --- | --- |
| profile_report_start! / profile_report_end! | Accumulates cost per region; prints a single summary table when the program exits. | You care about totals rather than individual hits. |
| profile_start! / profile_end! | Prints the cost of the region every time the marker pair is hit. | A region runs multiple times and you want per-iteration variance. |

Each pair has a matching steps variant — profile_report_steps_start! / profile_report_steps_end! and profile_steps_start! / profile_steps_end! — that records the RISC-V step count for the region instead of its cost. Wrapping a region with both the cost and the steps report markers makes the run print a steps report and a cost report, so you see how each section contributes to both metrics.

Either pair can be used, and they can be mixed in the same program. We'll use the report variants — for both steps and cost — in this example so the output stays compact and easy to read. Keep the clean program in guest/src/main.rs and create a separate guest/src/inline.rs for the instrumented version, wrapping the two phases of the program:

guest/src/inline.rs
#![no_main]
ziskos::entrypoint!(main);

use merkle_common::{hex, merkle_root, Digest, Hash, Sha256};

fn main() {
    // Read the number of leaves from the guest's standard input stream.
    let n: u64 = ziskos::io::read::<u64>();

    // Profile the leaf-preparation phase: record both cost and steps.
    ziskos::profile_report_start!(PREPARE_LEAVES);
    ziskos::profile_report_steps_start!(PREPARE_LEAVES);
    let leaves: Vec<Hash> = (1..=n).map(|i| Sha256::digest(i.to_le_bytes()).into()).collect();
    ziskos::profile_report_steps_end!(PREPARE_LEAVES);
    ziskos::profile_report_end!(PREPARE_LEAVES);

    // Profile the Merkle-root computation phase: record both cost and steps.
    ziskos::profile_report_start!(COMPUTE_ROOT);
    ziskos::profile_report_steps_start!(COMPUTE_ROOT);
    let root = merkle_root(leaves);
    ziskos::profile_report_steps_end!(COMPUTE_ROOT);
    ziskos::profile_report_end!(COMPUTE_ROOT);

    // Commit the root as a public output so a verifier can inspect it
    // without re-executing the program.
    ziskos::io::commit_slice(&root);

    println!("merkle-root({n}) => 0x{}", hex::encode(root));
}

Register this binary in guest/Cargo.toml so both sources can be built and run independently — the clean main and the instrumented inline:

guest/Cargo.toml
[[bin]]
name = "inline-guest"
path = "src/inline.rs"

Build and run the inline binary:

bash
cargo-zisk build --release --bin inline-guest
cargo-zisk run --release --bin inline-guest -i samples/example-input.bin --profiling inline
╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ REPORT SUMMARY ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ STEPS 13,285,457 ║
║ COST 1,761,570,966 ║
║ RAM 0.03 MB / 507.75 MB ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ COST DISTRIBUTION SUMMARY ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ CATEGORY COST % ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ Base █████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 293,601,280 16.7% ║
║ Main ████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 903,411,076 51.3% ║
║ Opcodes ███████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 512,255,611 29.1% ║
║ Precompiles ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 895,305 0.1% ║
║ Memory ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 51,407,694 2.9% ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ Total 1,761,570,966 100.0% ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ STEPS PROFILE TAGS ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ COMPUTE_ROOT ██████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 7,386,714 55.6% ║
║ PREPARE_LEAVES █████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 3,680,072 27.7% ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ COST PROFILE TAGS ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ COMPUTE_ROOT ██████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 979,074,124 55.6% ║
║ PREPARE_LEAVES █████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 487,471,119 27.7% ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

The run prints two reports — STEPS PROFILE TAGS and COST PROFILE TAGS — one per metric you recorded. Both agree: COMPUTE_ROOT dominates over PREPARE_LEAVES in steps and cost. But markers only tell us which section is expensive, not why. Let's run the summarized report to look inside it.
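
The roughly 2:1 split between the two tags can be sanity-checked by counting SHA-256 block compressions per phase. The sketch below assumes the hasher runs its compression function once per 64-byte block of padded input, which is how SHA-256 is specified. The helper names sha256_blocks and internal_hashes are hypothetical. An 8-byte leaf index pads to one block, while a 64-byte pair of child hashes pads to two.

```rust
/// Number of 64-byte blocks SHA-256 processes for a `len`-byte message:
/// the message, one 0x80 byte and an 8-byte length, rounded up to whole blocks.
fn sha256_blocks(len: u64) -> u64 {
    (len + 9).div_ceil(64)
}

/// Number of pair hashes a pairwise Merkle reduction performs for `n` leaves,
/// with odd levels rounded up (the last node is paired with itself).
fn internal_hashes(mut n: u64) -> u64 {
    let mut total = 0;
    while n > 1 {
        n = n.div_ceil(2);
        total += n;
    }
    total
}

fn main() {
    let n: u64 = 1000;
    let leaf_blocks = n * sha256_blocks(8);
    let root_blocks = internal_hashes(n) * sha256_blocks(64);
    // prints "PREPARE_LEAVES: 1000 blocks, COMPUTE_ROOT: 2002 blocks"
    println!("PREPARE_LEAVES: {leaf_blocks} blocks, COMPUTE_ROOT: {root_blocks} blocks");
}
```

Under these assumptions COMPUTE_ROOT compresses about twice as many blocks as PREPARE_LEAVES, in line with the 979,074,124 versus 487,471,119 cost figures reported for the two tags.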

Obtaining a summarized report

Inline markers told us which section of the guest is expensive, but they cannot say why. The summarized report fills that gap: it takes the same execution and groups its cost three different ways so you can pinpoint exactly what is driving cost and how it will move when you change the program. Run it with --profiling summary:

bash
cargo-zisk run --release -i samples/example-input.bin --profiling summary
merkle-root(1000) => 0xf94e7857a9aa655788bccc391771dbe005b9b9cbd2be8f26c56bd08c6c755da5
╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ REPORT SUMMARY ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ STEPS 13,284,475 ║
║ COST 1,760,938,930 ║
║ RAM 0.03 MB / 507.75 MB ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ COST DISTRIBUTION SUMMARY ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ CATEGORY COST % ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ Base █████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 293,601,280 16.7% ║
║ Main ████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 903,344,300 51.3% ║
║ Opcodes ███████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 511,698,015 29.1% ║
║ Precompiles ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 895,048 0.1% ║
║ Memory ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 51,400,287 2.9% ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ Total 1,760,938,930 100.0% ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ COST DISTRIBUTION BY OPCODE ║ ◆ OPS vs FROPS ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ OPCODE COST % ║ OPS + FROPS FROPS % ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ xor ██░░░░░░░░░░░░░░░░░░░░░ 123,323,220 7.0% ║ 126,985,140 3,661,920 2.9% ║
║ or █░░░░░░░░░░░░░░░░░░░░░░ 105,359,820 6.0% ║ 114,561,660 9,201,840 8.0% ║
║ srl_w █░░░░░░░░░░░░░░░░░░░░░░ 102,040,900 5.8% ║ 107,028,730 4,987,830 4.7% ║
║ sll █░░░░░░░░░░░░░░░░░░░░░░ 87,754,220 5.0% ║ 101,518,426 13,764,206 13.6% ║
║ add █░░░░░░░░░░░░░░░░░░░░░░ 49,913,675 2.8% ║ 50,804,000 890,325 1.8% ║
║ and ░░░░░░░░░░░░░░░░░░░░░░░ 35,641,440 2.0% ║ 36,996,600 1,355,160 3.7% ║
║ srl ░░░░░░░░░░░░░░░░░░░░░░░ 2,602,088 0.1% ║ 2,710,473 108,385 4.0% ║
║ signextend_b ░░░░░░░░░░░░░░░░░░░░░░░ 2,545,696 0.1% ║ 2,545,696 0 0.0% ║
║ signextend_w ░░░░░░░░░░░░░░░░░░░░░░░ 2,121,431 0.1% ║ 2,121,431 0 0.0% ║
║ dma_xmemset ░░░░░░░░░░░░░░░░░░░░░░░ 600,400 0.0% ║ ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ Total 512,593,063 29.1% ║ 547,184,854 34,591,791 6.3% ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Report summary

The first box is a global snapshot of the execution, three big numbers that summarize how big this run was:

| Field | Description |
| --- | --- |
| STEPS | Number of processor cycles executed. Proportional to how long the program runs; doubling N should roughly double the step count. |
| COST | Total profiling cost expressed in proof area units. This is the metric to minimize: higher cost means longer proof generation. |
| RAM | Heap memory used vs. total available. Only reported when using ZisK's default bump allocator (which never frees memory and avoids heap overhead). |

Use COST as your optimization target, STEPS as a sanity check that the code path you touched actually got shorter, and RAM to confirm the allocator stayed within budget.
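
A before/after comparison of these headline numbers can be sketched as follows. The figures are the REPORT SUMMARY values from the two runs shown earlier in this guide: the clean binary and the inline-marker binary. The delta here is therefore the overhead of the markers, not an optimization. The Headline struct and percent_change helper are hypothetical names for illustration.

```rust
/// Headline numbers copied from a REPORT SUMMARY box.
#[derive(Clone, Copy)]
struct Headline {
    steps: u64,
    cost: u64,
}

/// Signed percentage change from `before` to `after`.
fn percent_change(before: u64, after: u64) -> f64 {
    100.0 * (after as f64 - before as f64) / before as f64
}

fn main() {
    let clean = Headline { steps: 13_284_475, cost: 1_760_938_930 };
    let with_markers = Headline { steps: 13_285_457, cost: 1_761_570_966 };

    println!("steps: {:+.4}%", percent_change(clean.steps, with_markers.steps));
    println!("cost:  {:+.4}%", percent_change(clean.cost, with_markers.cost));
}
```

The same helper applies when comparing a baseline run against an optimized one: a negative cost change alongside a negative steps change suggests the edit shortened the path you intended.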

Cost distribution summary

The second box breaks total cost into five categories. Each one is proven by a different state machine inside the prover, so cost visibly shifts between categories when you change what the program does. For example, replacing a pure-Rust hash with a precompile moves cost from Opcodes into Precompiles, where the same work is priced at a fraction of the rate:

| Category | What it measures |
| --- | --- |
| Base | Fixed overhead such as tables, range checks, and other constant components that exist regardless of program logic. You cannot reduce this directly; it is the floor of any proof. |
| Main | Processor execution cost, directly proportional to STEPS. Shrinking a hot loop or eliminating instructions reduces Main. |
| Opcodes | Simple 64-bit operations (a op b = c, flag): arithmetic, bitwise, shifts. Cryptographic primitives written in pure Rust (hashes, big-int math) accumulate here and are usually the first thing to attack. |
| Precompiles | Complex operations whose parameters exceed 64 bits and use memory as an exchange channel. Routing work through a precompile (SHA-256, Keccak, ECDSA, …) moves cost into this column at a fraction of the rate. |
| Memory | Direct memory reads/writes plus the extra state machines needed for non-aligned or sub-8-byte accesses. Keeping hot data 8-byte aligned minimizes this. |

A useful rule of thumb: Base is structural and out of your hands, Main scales with how much code runs, and the other three scale with what the code does. When you optimize, watch how cost migrates between these columns from run to run: that migration is the clearest signal that your change is working.
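
The category figures can be checked against each other with a short sketch. It uses the values from the summarized report shown earlier; the constant and function names are hypothetical. The per-step factor it prints is only what this particular run shows, not a documented ZisK constant.

```rust
/// STEPS and per-category costs copied from the summarized report above.
const STEPS: u64 = 13_284_475;
const CATEGORIES: [(&str, u64); 5] = [
    ("Base", 293_601_280),
    ("Main", 903_344_300),
    ("Opcodes", 511_698_015),
    ("Precompiles", 895_048),
    ("Memory", 51_400_287),
];

/// Sum of all category costs.
fn total_cost(categories: &[(&str, u64)]) -> u64 {
    categories.iter().map(|(_, cost)| cost).sum()
}

/// Share of `total` taken by `cost`, in percent.
fn share(cost: u64, total: u64) -> f64 {
    100.0 * cost as f64 / total as f64
}

fn main() {
    let total = total_cost(&CATEGORIES);
    for (name, cost) in CATEGORIES {
        println!("{name:<12} {:>5.1}%", share(cost, total));
    }
    // Main divided by STEPS illustrates the "directly proportional" claim for this run.
    println!("Main cost per step: {}", CATEGORIES[1].1 / STEPS);
}
```

The five categories add up to the reported COST, and in this run Main is an exact multiple of STEPS, which fits the description of Main as the per-step execution cost.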

Cost by opcode summary

The third box has two panels side by side, both reading from the same execution but answering different questions.

The COST DISTRIBUTION BY OPCODE panel lists every RISC-V opcode the guest executed, its absolute cost in proof units, and its share of total cost. The bars give a quick visual ranking; the absolute numbers are what you compare between runs.

The OPS vs FROPS panel reports Frequently Repeated Operation patterns: short sequences of opcodes that the emulator detects at runtime and proves as a single, cheaper unit.

| Column | Meaning |
| --- | --- |
| OPS + FROPS | Total times the opcode appeared, including instances absorbed into a FROP. |
| FROPS | How many of those instances were absorbed into a FROP. |
| % | FROP hit rate. Higher means more of that opcode's cost was already discounted by the FROP optimization. |

These savings are already deducted from the COST column on the left, so you do not need to subtract anything yourself. The FROP panel is diagnostic: when an opcode still dominates cost despite a high hit rate, the underlying work is genuinely expensive and the only remaining lever is structural.
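
How the two panels relate can be checked with a few rows from the report above. In this sketch the hit rate is taken to be FROPS divided by OPS + FROPS, and the COST column is compared with OPS + FROPS minus FROPS. Both relations are read off the printed numbers rather than from a documented formula, and the OpcodeRow type is a hypothetical name.

```rust
/// One opcode row, with values copied from the report above.
struct OpcodeRow {
    name: &'static str,
    cost: u64,
    ops_plus_frops: u64,
    frops: u64,
}

const ROWS: [OpcodeRow; 3] = [
    OpcodeRow { name: "xor", cost: 123_323_220, ops_plus_frops: 126_985_140, frops: 3_661_920 },
    OpcodeRow { name: "or", cost: 105_359_820, ops_plus_frops: 114_561_660, frops: 9_201_840 },
    OpcodeRow { name: "sll", cost: 87_754_220, ops_plus_frops: 101_518_426, frops: 13_764_206 },
];

/// FROP hit rate in percent: the share of the opcode absorbed into FROPs.
fn frop_hit_rate(row: &OpcodeRow) -> f64 {
    100.0 * row.frops as f64 / row.ops_plus_frops as f64
}

fn main() {
    for row in &ROWS {
        // For these rows, COST equals OPS + FROPS with the FROPS part removed.
        let discounted = row.ops_plus_frops - row.frops;
        println!(
            "{:<4} hit rate {:.1}%, discounted {} (COST column {})",
            row.name,
            frop_hit_rate(row),
            discounted,
            row.cost
        );
    }
}
```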

Top cost functions

The fourth box ranks guest functions by cumulative cost: each entry's number includes the cost of every nested call it makes, not just its own instructions. A function listed at 80% cumulative cost is the path through which 80% of the work flows, even if its own body is only a few lines.

The summary alone cannot show which call paths reach a hot function, only that they do. To trace them, move to the complete call-stack view in the next section.

Summarized report conclusions

Three things stand out from the summarized report we've generated:

| Box | Signal | What it means |
| --- | --- | --- |
| Cost distribution summary | Precompiles at 0.1% | SHA-256 runs as plain opcodes with no native acceleration. |
| Top cost functions | sha2::compress256 at 59.8% | The hash computation dominates total cost. |
| Cost by opcode | Bitwise and shift opcodes (xor, or, srl_w, sll, and) lead the opcode table | All characteristic of the SHA-256 round function, which confirms where cost lives. |

At this point it is already clear that SHA-256 needs to be replaced. If you want to go further and understand the full call structure, which paths reach compress256 and from where, the call stack view lets you explore that in depth.

Exploring the call stack

The summarized report confirmed sha2::compress256 as the dominant cost driver, but the summary alone cannot show how execution reaches it or whether heap allocation is compounding the pressure. The call stack view lets us trace exactly how execution reaches the hot spots. Generate it with:

bash
cargo-zisk run --release -i samples/example-input.bin --profiling complete
merkle-root(1000) => 0xf94e7857a9aa655788bccc391771dbe005b9b9cbd2be8f26c56bd08c6c755da5
Saving profiler data to profile.json.gz...

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ REPORT SUMMARY ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ STEPS 13,284,475 ║
║ COST 1,760,938,930 ║
║ RAM 0.03 MB / 507.75 MB ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ◆ COST DISTRIBUTION SUMMARY ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ CATEGORY COST % ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ Base █████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 293,601,280 16.7% ║
║ Main ████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 903,344,300 51.3% ║
║ Opcodes ███████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 511,698,015 29.1% ║
║ Precompiles ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 895,048 0.1% ║
║ Memory ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 51,400,287 2.9% ║
║ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ║
║ Total 1,760,938,930 100.0% ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

This writes profile.json.gz to the current directory. Open it in profiler.firefox.com by clicking Load a profile from file and selecting the file. Firefox is recommended for large profiles.

  • Call stack (top-down) lets you understand cost in terms of entry calls: starting from main, you can follow exactly which calls are responsible for the cost at each level. The flamegraph visualises this as nested bars, each proportional to the cost accumulated through that entry path.

  • Inverted call stack (bottom-up) flips the perspective: instead of following calls down, it ranks leaf functions by cost. sha2::compress256 surfaces immediately at the top because it is called in both phases we've profiled.

The call stack leaves no ambiguity: compress256 is the single bottleneck, reached through both the leaf-hashing and tree-reduction phases. The next step is replacing it with a native precompile.

Evaluate execution cost

Profiling cost is the right signal while you are optimizing, but the number that ultimately sizes your proving run is the final cost, which depends on how the prover allocates instances at runtime rather than on how many operations the source generates (see Profiling cost vs. final cost at the top of this guide). Once the markers and the summarized report have guided you to a final version of the guest, measure that real cost with cargo-zisk execute.

The main binary is already the clean, uninstrumented version: the markers live only in inline, so there is nothing to strip. Markers add a small amount of work to every run and have no place in the binary you intend to prove. Build main:

bash
cargo-zisk build --release

Then run it through the executor. It performs the same work the prover will and reports the proof instances it would have to allocate:

bash
cargo-zisk execute --release -i samples/example-input.bin
INFO: --- EXECUTE SUMMARY -----------
INFO: Execution completed in 215ms, steps: 13284349, cost: N/A
INFO: Execution time breakdown: 215ms [ Execution 212ms + Count&Plan 2ms + Count&Plan MO 0ms ]
INFO: Plan [ Arith: 1 | Binary: 2 | BinaryExtension: 1 | Dma: 1 | Dma64AlignedMem: 1 | DmaPrePost: 1 | DmaUnaligned: 1 | InputData: 1 | Main: 4 | Mem: 1 | MemAlign: 1 | Rom: 1 | RomData: 1 ] Total instances: 17

On Linux x86_64 with enough memory, append --asm to run the execution through the native Assembly executor for a faster measurement:

bash
cargo-zisk execute --release -i samples/example-input.bin --asm
INFO: --- EXECUTE SUMMARY -----------
INFO: Execution completed in 15ms, steps: 13284349, cost: N/A
INFO: Execution time breakdown: 15ms [ Execution 15ms + Count&Plan 0ms + Count&Plan MO 0ms ]
INFO: Assembly execution speed: 9.343ms (1422 MHz)
INFO: Plan [ Arith: 1 | Binary: 2 | BinaryExtension: 1 | Dma: 1 | Dma64AlignedMem: 1 | DmaPrePost: 1 | DmaUnaligned: 1 | InputData: 1 | Main: 4 | Mem: 1 | MemAlign: 1 | Rom: 1 | RomData: 1 ] Total instances: 17

Each entry is a prover component with a fixed capacity; the number is how many instances of it this execution needed. Total instances is the headline figure and your real proving-cost budget: a change that pushes any component over a boundary adds one to the total even when profiling cost barely moved.
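
To track instance counts between runs, the Plan line can be parsed into per-component numbers. The sketch below assumes the log keeps the `Plan [ Name: count | ... ]` layout shown above, which may change between ZisK versions. parse_plan is a hypothetical helper and not part of cargo-zisk.

```rust
/// Parses the `[ Name: count | ... ]` part of a Plan line into
/// (component, instances) pairs. Returns `None` if the brackets are
/// missing or any entry is malformed.
fn parse_plan(line: &str) -> Option<Vec<(String, u64)>> {
    let start = line.find('[')? + 1;
    let end = line.find(']')?;
    line.get(start..end)?
        .split('|')
        .map(|entry| {
            let (name, count) = entry.split_once(':')?;
            Some((name.trim().to_string(), count.trim().parse::<u64>().ok()?))
        })
        .collect()
}

fn main() {
    // Plan line copied from the execute output above.
    let line = "INFO: Plan [ Arith: 1 | Binary: 2 | BinaryExtension: 1 | Dma: 1 | \
                Dma64AlignedMem: 1 | DmaPrePost: 1 | DmaUnaligned: 1 | InputData: 1 | \
                Main: 4 | Mem: 1 | MemAlign: 1 | Rom: 1 | RomData: 1 ] Total instances: 17";
    let plan = parse_plan(line).expect("Plan line did not match the assumed layout");
    let total: u64 = plan.iter().map(|(_, count)| count).sum();
    println!("{} components, {total} instances", plan.len());
}
```

Diffing the parsed pairs from two runs shows which component crossed a boundary when the total changes.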

Summary

You have profiled a real guest program from first marker to full call stack. You know which operations drove cost in the merkle-tree example. The same four-step workflow applies to any guest: mark the expensive regions, confirm with the summarized report, trace the call stack, and measure real costs with execute.


Next steps

With the bottlenecks identified, the next step is to eliminate them:

  • Optimizing your program: apply a patched crate or replace call sites with zisklib wrappers to route expensive operations through ZisK native precompiles.