Profiling Programs with ZiskEmu

ZiskEmu provides powerful profiling capabilities to analyze the cost and performance characteristics of your programs. This guide explains how to use these features to identify hotspots, optimize your code, and understand resource consumption.

What This Guide Covers

This guide walks you through ZiskEmu's profiling capabilities, progressing from high-level overviews to detailed analysis:

  1. Introduction: Understanding profiling costs vs. final costs, symbol-based analysis, and detecting optimization opportunities

  2. Basic Profiling: Global statistics showing cost distribution across major categories (base, main, opcodes, precompiles, memory)

  3. SDK Report Mode: Streamlined, compact output format ideal for CI/CD and quick checks, with selective section display options

  4. Function Name Display Options: Configure how long function names are displayed with compact and no-compact modes

  5. Profile Tags: Instrument your code to measure specific sections, with immediate or deferred reporting of steps and costs

  6. Firefox Profiler Integration: Export profiling data for advanced visualization and interactive analysis

  7. Function-Level Profiling: Identifying which functions consume the most resources with cumulative analysis

  8. Customizing ROI Display: Controlling how many functions to show and filtering by patterns

  9. Detailed Caller Analysis: In-depth breakdown showing which operations are expensive within each function and who calls them

  10. Tracking Function Calls: Logging individual call parameters to analyze usage patterns and optimize for common cases

  11. PC Histogram Analysis: Low-level view of the most frequently executed RISC-V instruction sequences

  12. Additional Options: Quick reference for other useful flags (steps, progress indicators, formatting)

  13. Practical Example: Real-world case study analyzing Ethereum opcode costs in a block validator

Introduction

Understanding Profiling Costs vs. Final Costs

When profiling a program in ZisK, it's important to understand the difference between profiling costs and final costs:

Profiling Costs

Profiling costs represent the individual operational cost accrued directly within a function's own instructions, based on the best-case cost model for each operation. These costs:

  • Exclude padding and aggregation costs
  • Reflect a direct cause-and-effect relationship between code changes and cost variations
  • Use the optimal cost for each operation type
  • Allow you to observe how small program modifications affect performance
  • Are ideal for optimization work because they show the direct impact of your code changes

For example, when you replace a function with a precompiled function or optimize a loop, the profiling cost will immediately reflect this improvement, making it easy to validate that your optimization is working as expected.

Final Costs

Final costs represent the real and exact cost of a specific execution, accounting for the actual resource consumption in the ZisK proving system. The key difference is that final costs measure cost at the instance granularity, not at the individual operation level.

In ZisK's architecture, multiple operations are grouped into instances (execution units in state machines), and the cost is determined by these instances:

  • Instance-based granularity: If you use 1 Keccak operation or 5,242 Keccak operations, you pay for one full Keccak instance. However, if you use 5,243 operations, you need a second instance, effectively doubling the cost for that single additional operation.

  • Planner strategies: The ZisK planner dynamically chooses execution strategies based on the operation mix. For example, depending on how many additions and binary operations you have, the planner might use a Binary state machine, a BinaryAdd state machine, or both. These decisions affect the final cost since each instance type has a different cost structure.

  • Aggregation across function calls: Final costs include both the function's own profiling cost and all costs from functions it calls, summed at the instance level.

Why use profiling costs for optimization? Because profiling costs provide a predictable and proportional metric directly tied to your code changes. When optimizing, you want to see the immediate effect of your changes at the operation level. Final costs, while representing the true execution cost, can show non-linear behavior due to instance boundaries and planning strategies. Once you've optimized based on profiling costs, the final costs will reflect the real resource savings in the proving system.

Example: Keccak Operations

Consider a program that performs Keccak hash operations:

Scenario 1: Using 1,000 Keccak operations

  • Profiling cost: Proportional to 1,000 operations
  • Final cost: 1 Keccak instance (fits within instance capacity)

Scenario 2: Using 5,000 Keccak operations

  • Profiling cost: 5× the cost of Scenario 1 (proportional to operations)
  • Final cost: Still 1 Keccak instance (if capacity is 5,242 operations)

Scenario 3: Using 5,243 Keccak operations

  • Profiling cost: ~5.24× the cost of Scenario 1 (proportional increase)
  • Final cost: 2 Keccak instances (crossed the instance boundary with just 1 extra operation!)

The profiling cost grows linearly with the number of operations, making it easy to predict the impact of adding or removing operations. The final cost, however, stays constant until you cross an instance boundary, then jumps significantly. This is why profiling costs are better for optimization: you can see the effect of every change, while final costs help you understand the actual proving cost in production.
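
The three scenarios can be sketched in a few lines. The instance capacity of 5,242 operations comes from the scenarios above; the per-operation cost is a made-up unit, since only the proportions matter here:

```rust
// Illustrative model only: COST_PER_KECCAK_OP is a placeholder unit;
// the capacity of 5,242 is taken from the scenarios above.
const KECCAK_INSTANCE_CAPACITY: u64 = 5_242;
const COST_PER_KECCAK_OP: u64 = 1;

/// Profiling cost grows linearly with the number of operations.
fn profiling_cost(ops: u64) -> u64 {
    ops * COST_PER_KECCAK_OP
}

/// Final cost is paid per instance: ceil(ops / capacity) full instances.
fn keccak_instances(ops: u64) -> u64 {
    ops.div_ceil(KECCAK_INSTANCE_CAPACITY)
}

fn main() {
    for ops in [1_000u64, 5_000, 5_243] {
        println!(
            "{} ops -> profiling cost {}, keccak instances {}",
            ops,
            profiling_cost(ops),
            keccak_instances(ops)
        );
    }
}
```

Running this reproduces the jump: 1,000 and 5,000 operations both fit in one instance, while 5,243 operations need two.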

Example: Comparing Optimization Alternatives

Suppose you have implemented two different optimizations for your program, and you need to decide which one is better. The difference between them is 1 million operations:

  • Option A: Uses 1M 64-bit ADD operations
  • Option B: Uses 1M 64-bit OR operations

In ZisK's architecture, there are specialized instances for 64-bit additions (BinaryAdd) that are much cheaper than the general binary instances (Binary) that can perform ADD, SUB, AND, OR, XOR, and other operations.

Analysis with Profiling Costs:

  • Option A (ADD): Lower profiling cost (uses efficient specialized instances)
  • Option B (OR): Higher profiling cost (requires general binary instances)
  • Clear winner: Option A is better ✓

Analysis with Final Costs (Small Program):

If your program is small and doesn't fill a Binary instance:

  • Both options may end up using the same Binary instance
  • Final cost: Same for both options (no clear winner)
  • Misleading conclusion: No difference between optimizations ✗

Analysis with Final Costs (Large Program):

If your program is larger and already uses separate instances:

  • Option A uses a dedicated BinaryAdd instance (cheaper)
  • Option B uses a Binary instance (more expensive)
  • Final cost: Option A is clearly cheaper ✓
  • Correct conclusion: Matches profiling cost analysis

Lesson: Profiling costs consistently show that Option A is better, regardless of program size. Final costs may give conflicting signals depending on whether instance boundaries are crossed. This is why profiling costs are the reliable metric for making optimization decisions—they provide a consistent signal that doesn't depend on the overall program context.
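
A toy model of this comparison (every number below is invented for illustration; real ZisK capacities and per-operation costs differ):

```rust
// Hypothetical figure: operations per Binary instance.
fn binary_instances(ops: u64) -> u64 {
    const CAPACITY: u64 = 5_000_000;
    ops.div_ceil(CAPACITY)
}

fn main() {
    let extra = 1_000_000u64;

    // Profiling cost (hypothetical per-op units): proportional to the
    // operation count, so the verdict is the same at any program size.
    let option_a = extra * 25; // ADD, served by the cheaper BinaryAdd path
    let option_b = extra * 60; // OR, served by the general Binary path
    assert!(option_a < option_b); // Option A always wins

    // Final cost in a small program: with only 0.5M pre-existing binary
    // ops, either option still fits in the single Binary instance the
    // program already pays for, so the final cost shows no difference.
    let existing = 500_000u64;
    assert_eq!(binary_instances(existing + extra), 1);
}
```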

Symbol-Based Analysis

One of ZiskEmu's key advantages is that profiling works on any ELF file without requiring special instrumentation or debug information. The profiler uses symbol information already present in the binary, which means:

  • Works with release builds (optimized binaries)
  • No need to recompile with special flags
  • No runtime overhead during execution
  • Analyzes production-ready binaries (as long as they are not stripped)

Detecting Optimization Opportunities

One of the most powerful uses of ZiskEmu's profiling is identifying where to apply patches and optimizations. The profiling costs help you answer critical questions:

Which crates/libraries are most performant for proof generation?

  • Compare different library implementations to see their effect on verification costs
  • Test alternative dependencies to find the most ZisK-efficient options
  • Evaluate different algorithm implementations (e.g., hash libraries, cryptographic crates, serialization libraries) to determine which performs best in the ZisK proving system
  • Make data-driven decisions when choosing between equivalent functionality from different crates

Validating optimizations:

  • After applying an optimization or patch, run the profiler again to confirm that the profiling cost decreased
  • Compare before/after profiles to ensure the optimization is effective

Is patching being applied correctly?

  • Verify that precompiles are being used where expected
  • Detect cases or paths where generic code is running instead of optimized ZisK-specific implementations
  • Identify functions that should be patched but aren't

Where should you apply patches?

  • Find hotspot functions that would benefit most from ZisK precompiles
  • Identify expensive cryptographic operations (SHA-256, Keccak, etc.) that could use hardware acceleration
  • Locate arithmetic-heavy code that could leverage ZisK's optimized arithmetic operations

Example workflow:

  1. Profile your program to identify expensive functions
  2. Look for patterns that match available precompiles (hashing, big integer math, etc.)
  3. Patch the code to use:
    • ZisK-optimized implementations
    • Precompiles
    • Restructured operations or usage patterns, chosen with the ZisK architecture in mind rather than conventional hardware
  4. Re-profile to verify the profiling cost reduction

This iterative approach, guided by profiling costs, ensures your optimizations target the right areas and produce measurable improvements.

Basic Profiling (statistics)

The simplest way to profile your program is to use the -X (or --stats) flag. This provides an overview of execution statistics including total costs, memory operations, and opcode usage.

Command

ziskemu -e <elf> -i <input> -X

Output Explanation

REPORT                                  
----------------------------------------
STEPS                         92,875,129

COST DISTRIBUTION                   COST       %
------------------------------------------------
BASE                         293,601,280   2.57%
MAIN                       6,315,508,772  55.22%
OPCODES                    1,334,639,984  11.67%
PRECOMPILES                2,565,960,716  22.43%
MEMORY                       927,932,629   8.11%

TOTAL                     11,437,643,381 100.00%

FROPS                        963,440,253   8.42%
RAM USAGE                     18,465,008   3.47%

Understanding the Report:

STEPS: The number of processor cycles or instructions executed during program execution. This is an indicator of execution length: more steps mean a longer program execution.

COST DISTRIBUTION: This shows the profiling cost (see the Understanding Profiling Costs section for detailed explanation). Each operation is costed individually using the proof area as the metric, which is the best indicator of proof generation time—higher cost means longer proof generation.

The cost is broken down into these categories:

  • BASE: Cost of fixed components such as tables, range checks, and other constant overhead that exists regardless of program logic.

  • MAIN: Cost of the processor itself without operation costs. This is directly proportional to the steps count and represents the base cost of executing instructions.

  • OPCODES: Cost of simple operations performed by the processor (additions, subtractions, etc.) in the format a operation b = c, flag, where a, b, and c are 64-bit values. These are basic arithmetic and logical operations.

  • PRECOMPILES: Cost of complex operations whose parameters don't fit in 64 bits, requiring memory as an exchange system. Examples include:

    • 256-bit additions
    • Elliptic curve operations
    • Keccak hashing
    • DMA operations
  • MEMORY: Cost of direct memory operations (read, write) and the additional state machines required for non-aligned memory access. This includes cases where:

    • The address is not aligned to 8 bytes
    • Operations don't work with 8-byte chunks (e.g., reading a single byte)
  • TOTAL: Sum of all costs. Each category shows the percentage (%) it represents of the total cost.
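
The percentages are simply each category's share of the total. Recomputing them from the sample report above:

```rust
// Values copied from the sample report; the computed total and
// percentages match the TOTAL row and the % column.
fn main() {
    let base: u64 = 293_601_280;
    let main_cost: u64 = 6_315_508_772;
    let opcodes: u64 = 1_334_639_984;
    let precompiles: u64 = 2_565_960_716;
    let memory: u64 = 927_932_629;

    let total = base + main_cost + opcodes + precompiles + memory;
    assert_eq!(total, 11_437_643_381); // matches the TOTAL row

    for (name, cost) in [
        ("BASE", base),
        ("MAIN", main_cost),
        ("OPCODES", opcodes),
        ("PRECOMPILES", precompiles),
        ("MEMORY", memory),
    ] {
        // e.g. BASE -> 2.57%, MAIN -> 55.22%, ...
        println!("{name:12} {:6.2}%", cost as f64 / total as f64 * 100.0);
    }
}
```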

FROPS (FRequent OPerationS): These are operations that are very frequently used by the processor, such as:

  • Adding 1 to a relatively small number (common in loop counters)
  • Adding 8 to an address (typical for pointer arithmetic)
  • Working with values < 256

These frequent operations are analyzed, detected, and pre-calculated, becoming part of the BASE cost but representing significant savings. In this example, FROPS show 8.42%: this is the extra cost the program would incur if these optimizations were not applied. The actual savings are already reflected in the lower costs of the affected operations.

RAM USAGE: The amount of memory used out of the total available. This information is only available with the default allocator (bump allocator), which:

  • Never frees memory - always allocates new memory
  • Avoids the CPU cycles needed to manage the entire heap (typically >10% overhead)
  • Is recommended as long as sufficient memory is available
  • Provides better performance by eliminating heap management costs
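
A rough sketch of why a bump allocator is cheap (an illustrative toy, not ZisK's actual allocator): allocation is a single pointer bump, freeing does nothing, and the reported RAM usage is simply the high-water mark.

```rust
use std::cell::Cell;

/// Toy bump arena: `alloc` bumps an offset, `dealloc` is a no-op.
/// Memory is never reclaimed, which trades RAM for the CPU cycles a
/// full heap manager would spend.
struct BumpArena {
    buffer: Vec<u8>,
    next: Cell<usize>,
}

impl BumpArena {
    fn new(size: usize) -> Self {
        Self { buffer: vec![0; size], next: Cell::new(0) }
    }

    /// Reserve `len` bytes aligned to `align`; returns an offset into the arena.
    fn alloc(&self, len: usize, align: usize) -> Option<usize> {
        let start = self.next.get().next_multiple_of(align.max(1));
        let end = start.checked_add(len)?;
        if end > self.buffer.len() {
            return None; // out of memory: a bump allocator never reclaims
        }
        self.next.set(end);
        Some(start)
    }

    /// Freeing is intentionally a no-op.
    fn dealloc(&self, _offset: usize, _len: usize) {}

    /// The "RAM USAGE" figure: the high-water mark of the arena.
    fn used(&self) -> usize {
        self.next.get()
    }
}

fn main() {
    let arena = BumpArena::new(1024);
    let first = arena.alloc(100, 8);
    arena.dealloc(0, 100); // has no effect on usage
    println!("first alloc at {:?}, used {} bytes", first, arena.used());
}
```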

Detailed Opcode Breakdown:

Below the summary, you'll see a detailed breakdown of each operation:

COST BY OPCODE                     COUNT       %            COST       % RANK
-----------------------------------------------------------------------------
OP ltu                         1,767,360   1.90%     106,041,600   0.93%
OP lt                            389,360   0.42%      23,361,600   0.20%
OP eq                            543,251   0.58%      32,595,060   0.28%
OP add                         7,086,411   7.63%     177,160,275   1.55% #4
OP sub                           693,157   0.75%      41,589,420   0.36%
OP and                         3,740,044   4.03%     224,402,640   1.96% #3
OP or                          7,482,273   8.06%     448,936,380   3.93% #2
OP xor                         1,027,290   1.11%      61,637,400   0.54%
OP add_w                          15,804   0.02%         948,240   0.01%
OP sub_w                           4,085   0.00%         245,100   0.00%
OP sll                         1,551,879   1.67%      82,249,587   0.72%
OP srl                           611,361   0.66%      32,402,133   0.28%
OP sra                           807,976   0.87%      42,822,728   0.37%
OP srl_w                          84,289   0.09%       4,467,317   0.04%
OP sra_w                              62   0.00%           3,286   0.00%
OP signextend_b                  121,977   0.13%       6,464,781   0.06%
OP signextend_h                    1,684   0.00%          89,252   0.00%
OP signextend_w                   27,460   0.03%       1,455,380   0.01%
OP pubout                             32   0.00%               0   0.00%
OP muluh                          86,682   0.09%       8,234,790   0.07%
OP mul                           409,765   0.44%      38,927,675   0.34%
OP divu                            6,368   0.01%         604,960   0.01%
OP remu                                4   0.00%             380   0.00%
OP dma_memcpy                    302,551   0.33%      12,707,142   0.11%
OP dma_memcmp                     91,454   0.10%       3,841,068   0.03%
OP dma_inputcpy                       90   0.00%           3,780   0.00%
OP dma_xmemset                    32,381   0.03%       1,360,002   0.01%
OP _dma_pre                      140,043   0.15%      12,323,784   0.11%
OP _dma_post                     164,752   0.18%      14,498,176   0.13%
OP keccak                         32,650   0.04%   2,466,707,500  21.57% #1
OP arith256_mod                      714   0.00%       1,016,736   0.01%
OP secp256k1_add                  17,688   0.02%      25,187,712   0.22%
OP secp256k1_dbl                  19,884   0.02%      28,314,816   0.25%
OP fcall_param                       652   0.00%               0   0.00%
OP fcall                             172   0.00%               0   0.00%
OP fcall_get                         156   0.00%               0   0.00%

FROPS BY OPCODE                    COUNT    HIT            COST       % RANK
----------------------------------------------------------------------------
FROP ltu                         942,288  34.78%      56,537,280   0.49% #4
FROP lt                          641,963  62.25%      38,517,780   0.34%
FROP eq                        3,273,419  85.77%     196,405,140   1.72% #2
FROP add                       1,597,142  18.39%      39,928,550   0.35%
FROP sub                         357,871  34.05%      21,472,260   0.19%
FROP and                         471,898  11.20%      28,313,880   0.25%
FROP or                        1,303,629  14.84%      78,217,740   0.68% #3
FROP xor                         105,118   9.28%       6,307,080   0.06%
FROP add_w                        75,366  82.67%       4,521,960   0.04%
FROP sub_w                         2,177  34.77%         130,620   0.00%
FROP sll                       8,729,869  84.91%     462,683,057   4.05% #1
FROP srl                         376,620  38.12%      19,960,860   0.17%
FROP sra                           5,962   0.73%         315,986   0.00%
FROP srl_w                        66,935  44.26%       3,547,555   0.03%
FROP sra_w                            60  49.18%           3,180   0.00%
FROP muluh                        25,590  22.79%       2,431,050   0.02%
FROP mul                          43,603   9.62%       4,142,285   0.04%
FROP divu                             42   0.66%           3,990   0.00%

COST BY OPCODE Table:

This table shows detailed statistics for each operation or precompile executed:

  • COUNT: Number of times this operation was called
  • %: Percentage of steps (cycles) that use this operation
  • COST: Total profiling cost for all executions of this operation
  • %: Percentage of total cost that this operation represents
  • RANK: The top 4 most expensive operations are marked with #1, #2, #3, #4

Important: Operations are not sorted by cost. They maintain a consistent order across executions to facilitate comparison between different runs. Look for the #N markers to identify the most expensive operations.

For example, in this output, keccak was executed 32,650 times (0.04% of steps) but accounts for 21.57% of the total cost, making it the #1 most expensive operation. This indicates that Keccak operations dominate the cost despite being relatively infrequent.

FROPS BY OPCODE Table:

FROPS (FRequent OPerationS) are highly common operations that have been analyzed and optimized through pre-calculation. These include operations like:

  • Incrementing by 1 (loop counters)
  • Adding 8 (pointer arithmetic)
  • Working with small values (< 256)

The table shows:

  • COUNT: Number of times the FROP variant was executed
  • HIT: Hit rate percentage - how often the frequent operation pattern was matched and the optimization applied
  • COST: Total cost with the optimization benefit already applied
  • %: Percentage of total cost
  • RANK: Top ranked FROPS by cost

High hit rates indicate that the program uses these common patterns frequently, benefiting from the pre-calculated optimizations. The FROPS total shown earlier (8.42% in this example) represents the cost that would be added if these optimizations were not available.
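
For orientation, the HIT column in the sample tables is consistent with the FROP count divided by the opcode's total executions (the FROP count plus the regular count from the COST BY OPCODE table):

```rust
/// Hit rate as it appears in the sample tables above: the share of all
/// executions of an opcode that matched the FROP pattern.
fn hit_rate(frop_count: u64, op_count: u64) -> f64 {
    frop_count as f64 / (frop_count + op_count) as f64 * 100.0
}

fn main() {
    // Counts taken from the sample tables above.
    println!("ltu hit: {:.2}%", hit_rate(942_288, 1_767_360)); // table: 34.78%
    println!("eq  hit: {:.2}%", hit_rate(3_273_419, 543_251)); // table: 85.77%
}
```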

Key Insights from Statistics:

Use this information to:

  • Identify which operation types dominate your program's cost
  • Find operations with high count but disproportionate cost (optimization candidates)
  • Verify that precompiles are being used where expected
  • Understand the balance between computation (OPCODES), memory access (MEMORY), and complex operations (PRECOMPILES)

SDK Report Mode

For a cleaner, more compact output ideal for continuous integration or quick checks, use the --sdk flag. This provides a streamlined report with only the essential summary information.

Command

ziskemu -e <elf> -i <input> --sdk

Output Example

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║  ◆ REPORT SUMMARY                                                                                                    ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║  STEPS                                                                                                    92,875,129  ║
║  COST                                                                                              11,437,643,381  ║
║  RAM                                                                                                  17.61 MB /  64.00 MB  ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║  ◆ COST DISTRIBUTION SUMMARY                                                                                         ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║  CATEGORY     ∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙          COST      %  ║
║  ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
║  Base         ▎∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙     293,601,280   2.6%  ║
║  Main         ███████████████████████████████████████████████████████∙∙∙∙   6,315,508,772  55.2%  ║
║  Opcodes      ████████████▊∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙   1,334,639,984  11.7%  ║
║  Precompiles  █████████████████████████▊∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙   2,565,960,716  22.4%  ║
║  Memory       █████████▎∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙     927,932,629   8.1%  ║
║  ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
║  Total                                                                       11,437,643,381 100.0%  ║
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

The SDK report provides:

  • Clean visual layout with box-drawing characters
  • Progress bars showing the proportional cost of each category
  • Essential metrics only: steps, total cost, RAM usage, and cost distribution
  • No detailed breakdowns - ideal for automated testing or quick cost checks

SDK Selective Sections

By default, the SDK report shows only the summary. You can selectively enable additional sections:

Show Opcode Details (--opcodes)

Adds a section showing the top 10 most expensive opcodes with their cost distribution and FROPS hit rates:

ziskemu -e <elf> -i <input> --sdk --opcodes

This adds a COST DISTRIBUTION BY OPCODE section comparing regular operations vs frequent operations (FROPS).

Show Top Functions (--top-functions)

Lists the functions with the highest cost:

ziskemu -e <elf> -i <input> --sdk --top-functions -S

This adds a TOP COST FUNCTIONS section with automatic compacting of long function names.

Note: --top-functions automatically enables symbol reading, so you can omit the -S flag when this is the only feature that needs symbols.

Show Profile Tags (--profile-tags)

Displays accumulated profile tag measurements from your code. Requires profile tags in your program (see Profile Tags section):

ziskemu -e <elf> -i <input> --sdk --profile-tags

This shows sections like STEPS PROFILE TAGS and COST PROFILE TAGS if you've instrumented your code with profile markers.

Combining Options

You can combine multiple flags to customize the report:

# Show summary + opcodes + top functions
ziskemu -e <elf> -i <input> --sdk --opcodes --top-functions -S

# Show all optional sections
ziskemu -e <elf> -i <input> --sdk --opcodes --top-functions --profile-tags -S

Behavior Note: If you specify any of the selective flags (--opcodes, --top-functions, --profile-tags), only the summary plus the explicitly requested sections will be shown. If you don't specify any selective flags, you get only the summary.

SDK Width Configuration

Control the width of the SDK report output with --sdk-width:

# Use wider report (150 characters)
ziskemu -e <elf> -i <input> --sdk --sdk-width=150

# Use narrower report (100 characters) 
ziskemu -e <elf> -i <input> --sdk --sdk-width=100

Default width: 120 characters. Wider reports provide more space for progress bars and function names, while narrower reports fit better in smaller terminals or log viewers.

Function Name Display Options

When displaying function-level profiling information with -S, function names can become very long, especially in Rust with its fully-qualified paths and generic parameters. ZiskEmu provides options to control how these names are displayed.

Compact Names (Default)

By default, long function names are automatically shortened to 160 characters using intelligent compacting:

# Default behavior - compact to 160 characters
ziskemu -e <elf> -i <input> -X -S

The compacting algorithm:

  1. Collapses nested generic parameters: <A<B<C>>> → <A<…>>
  2. Elides intermediate path segments: std::io::default_write_fmt::Adapter → std::..::Adapter
  3. Maintains readability while reducing length
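
A minimal sketch of the first rule, collapsing generic parameters (this is illustrative only, not ZiskEmu's actual implementation):

```rust
/// Replace the contents of every top-level `<...>` group with `…`.
/// Sketch only: real compacting also elides path segments and enforces
/// a maximum length.
fn collapse_generics(name: &str) -> String {
    let mut out = String::new();
    let mut depth = 0usize;
    for c in name.chars() {
        match c {
            '<' => {
                depth += 1;
                if depth == 1 {
                    out.push_str("<…>"); // summarize the whole group
                }
            }
            '>' => depth = depth.saturating_sub(1),
            _ if depth == 0 => out.push(c),
            _ => {} // inside a generic group: dropped
        }
    }
    out
}

fn main() {
    println!("{}", collapse_generics("std::io::Adapter<A<B<C>>>"));
}
```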

Custom Compact Length

Specify a different maximum length:

# Compact to 80 characters
ziskemu -e <elf> -i <input> -X -S --compact-names=80

# Compact to 200 characters  
ziskemu -e <elf> -i <input> -X -S --compact-names=200

Disable Compacting

To see complete, uncompacted function names:

ziskemu -e <elf> -i <input> -X -S --no-compact-names

When to use each option:

  • Default (160 chars): Good balance for most terminal widths and readability
  • Shorter (80-100 chars): When viewing in narrow terminals or when you want very concise output
  • Longer (200+ chars): When you need more context from the function path
  • No compacting: When you need to see the complete, exact function signatures (e.g., for copy-pasting into code searches)

Profile Tags

Profile tags allow you to instrument your code to measure specific code sections, loops, or algorithms. This is useful when you want to:

  • Measure the cost or steps of a specific algorithm
  • Compare different implementation approaches
  • Track performance of critical sections across multiple calls
  • Identify hotspots within a single function

How Profile Tags Work

You add markers in your guest code using macros provided by ziskos. These markers:

  • Have zero overhead when not running in the ZiskEmu profiler
  • Work at the source code level - you decide what to measure
  • Can measure either steps (execution cycles) or cost (profiling cost)
  • Can either print immediately or accumulate for a summary report

Setting Up Profile Tags

In your guest code's Cargo.toml, add the ziskos dependency:

[dependencies]
ziskos = { path = "../../ziskos" }  # Adjust path as needed

In your guest source code:

use ziskos::{profile_start, profile_end};
use ziskos::{profile_report_start, profile_report_end};
use ziskos::{profile_steps_start, profile_steps_end};
use ziskos::{profile_report_steps_start, profile_report_steps_end};

fn main() {
    // Example usage in your code
    profile_start!(hash_computation);
    let result = expensive_hash_function(&data);
    profile_end!(hash_computation);
    
    // ... more code
}

Profile Tag Macros

There are 8 macros organized in 2 dimensions:

Dimension 1 - What to measure:

  • Cost macros (profile_start! / profile_end!): Measure profiling cost
  • Steps macros (profile_steps_start! / profile_steps_end!): Measure execution steps

Dimension 2 - When to report:

  • Immediate (profile_start! / profile_end!): Print result after each end! call
  • Report (profile_report_start! / profile_report_end!): Accumulate and show at program end

Immediate Output Macros

Print the measurement immediately after the end! call:


// Measure and print COST after each execution
profile_start!(my_algorithm);
run_my_algorithm();
profile_end!(my_algorithm);
// Prints: [my_algorithm] 12345

// Measure and print STEPS after each execution
profile_steps_start!(my_loop);
for i in 0..1000 {
    expensive_operation(i);
}
profile_steps_end!(my_loop);
// Prints: [my_loop] 45678

Use case: When you want to track each individual execution, or when the measured section is called only once or a few times.

Report Macros

Accumulate measurements and show statistics at the end:


for batch in batches {
    profile_report_start!(process_batch);
    process_batch(&batch);
    profile_report_end!(process_batch);
}
// No output during execution

// At program end, you'll see accumulated statistics:
// Total, average, min, max for all executions

Use case: When measuring sections called many times (loops, repeated operations) and you want aggregate statistics rather than individual measurements.

Complete Example

use ziskos::{
    profile_start, profile_end,
    profile_report_start, profile_report_end,
    profile_steps_start, profile_steps_end,
    profile_report_steps_start, profile_report_steps_end
};

fn main() {
    // Measure total cost once
    profile_start!(total_execution);
    
    // Accumulate statistics for repeated calls
    for i in 0..100 {
        profile_report_steps_start!(loop_iteration);
        expensive_computation(i);
        profile_report_steps_end!(loop_iteration);
    }
    
    // Nested measurements
    profile_steps_start!(data_processing);
    
    profile_report_start!(hash_phase);
    for item in items {
        compute_hash(item);
    }
    profile_report_end!(hash_phase);
    
    profile_steps_end!(data_processing);
    
    profile_end!(total_execution);
}

Viewing Profile Tag Results

To see the accumulated profile tag statistics, add --profile-tags to your command:

# With standard report
ziskemu -e <elf> -i <input> -X --profile-tags

# With SDK report  
ziskemu -e <elf> -i <input> --sdk --profile-tags

The output shows aggregated statistics for all profile tags used with the report variants:

PROFILE TAGS STEPS (STEPS, % STEPS, CALLS, AVG, MIN, MAX)
----------------------------------------------------------
     10,234,567  11.02%        100     102,345     98,123     125,678  loop_iteration
      3,456,789   3.72%         50      69,135     45,000      89,000  hash_phase

PROFILE TAGS COST (COST, % COST, CALLS, AVG, MIN, MAX)
-------------------------------------------------------
  1,234,567,890  10.79%        100  12,345,678  10,000,000  15,000,000  total_execution
    456,789,012   3.99%         50   9,135,780   5,000,000  12,000,000  hash_phase

Statistics shown:

  • TOTAL: Sum of all measurements
  • % TOTAL: Percentage of total steps or cost
  • CALLS: Number of times the tag was executed
  • AVG: Average per call
  • MIN: Minimum value observed
  • MAX: Maximum value observed
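
The aggregation performed by the report variants boils down to these five statistics. A small sketch (the per-call sample values are hypothetical):

```rust
/// Aggregate per-call measurements into the report statistics:
/// total, calls, average, min, max.
fn aggregate(samples: &[u64]) -> (u64, usize, u64, u64, u64) {
    let total: u64 = samples.iter().sum();
    let calls = samples.len();
    let avg = total / calls as u64;
    let min = *samples.iter().min().unwrap();
    let max = *samples.iter().max().unwrap();
    (total, calls, avg, min, max)
}

fn main() {
    // Hypothetical per-call step counts for one profile tag.
    let samples = [102_000u64, 98_000, 110_000, 90_000];
    let (total, calls, avg, min, max) = aggregate(&samples);
    println!("{total} total, {calls} calls, avg {avg}, min {min}, max {max}");
}
```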

Best Practices

  1. Use descriptive tag names: hash_computation is better than tag1
  2. Choose report vs. immediate based on frequency:
    • Few calls (1-10): Use immediate variants
    • Many calls (100+): Use report variants
  3. Match start/end pairs: Always use matching macro pairs (same tag name, same variant)
  4. Don't nest same tag names: Each tag should represent a unique code section
  5. Combine with function profiling: Profile tags show "what", function profiling shows "where"

Firefox Profiler Integration

ZiskEmu can export profiling data to Firefox Profiler format, enabling advanced visualization and analysis of your program's execution.

Generating Profiler Data

Use --profiler-output to specify the output file:

# Generate compressed profiler data (recommended)
ziskemu -e <elf> -i <input> -X -S --profiler-output=profile.json.gz

# Generate uncompressed JSON
ziskemu -e <elf> -i <input> -X -S --profiler-output=profile.json

Requirements: The -S flag is required to load symbol information. The -X flag is recommended for complete profiling data.

Default: If you use -X -S without specifying --profiler-output, a file named profile.json.gz is created automatically.

Viewing in Firefox Profiler

  1. Go to https://profiler.firefox.com
  2. Click "Load a profile from file"
  3. Select your profile.json.gz file

The Firefox Profiler provides:

  • Call tree visualization showing the function call hierarchy
  • Flame graphs for identifying performance hotspots
  • Timeline view showing execution progress over time
  • Function details with cumulative costs
  • Search and filtering capabilities

Use Cases

Firefox Profiler is particularly useful when:

  • You need to visualize complex call graphs
  • Standard text reports are too verbose
  • You want to share profiling results with team members
  • You need to compare multiple profiling runs
  • You want interactive exploration of the call stack

File Format

The exported file follows the Firefox Profiler format specification, making it compatible with other tools that support this format.

Function-Level Profiling

To understand which functions contribute most to your program's cost, add the -S (or --read-symbols) flag to read symbol information from the ELF file.

Command

ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S

Output Explanation

When symbol reading is enabled, ZiskEmu simulates a call stack to evaluate functions cumulatively: it tracks not only the cycles and cost of each function's own code, but also those of every call made within it. This cumulative analysis gives a complete picture of each function's contribution to the total execution cost.

Note: Initial calls to _start or _main are filtered out as they represent 100% of the program and don't provide useful optimization insights.
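
The cumulative accounting can be pictured with a small sketch (illustrative Python, not ZiskEmu's actual implementation): every function on the simulated call stack is charged for each step the program executes.

```python
# Toy sketch of cumulative accounting (hypothetical, not ZiskEmu's code):
# every function on the current call stack is charged for each executed step.
from collections import Counter

def cumulative_steps(trace):
    """trace: a list of ('call', name), ('ret',) and ('step',) events."""
    stack, totals = [], Counter()
    for event in trace:
        if event[0] == 'call':
            stack.append(event[1])
        elif event[0] == 'ret':
            stack.pop()
        else:  # one executed instruction: charge every frame on the stack
            for name in stack:
                totals[name] += 1
    return totals

# 'a' runs two steps of its own and calls 'b', which runs two steps
trace = [('call', 'a'), ('step',), ('call', 'b'),
         ('step',), ('step',), ('ret',), ('step',), ('ret',)]
print(cumulative_steps(trace))  # 'a' is charged all 4 steps, 'b' only its 2
```

This is why a high-level entry function can approach 100% of steps while doing almost no work of its own.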

ZiskEmu provides two complementary analyses:

1. TOP STEP FUNCTIONS - Analysis by execution cycles:

TOP STEP FUNCTIONS (STEPS, % STEPS, CALLS, STEPS/CALL, FUNCTION)
----------------------------------------------------------------
     54,831,894  59.04%          1      54,831,894 <reth_evm::execute::BasicBlockExecutor<&reth_evm
     53,951,767  58.09%          1      53,951,767 <alloy_evm::eth::block::EthBlockExecutor<alloy_e
     52,133,363  56.13%         70         744,762 <revm_handler::mainnet_handler::MainnetHandler<r
     48,406,973  52.12%     41,793           1,158 <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoiz
     26,004,168  28.00%          1      26,004,168 <zeth_mpt_state::SparseState as stateless::trie:
     21,389,831  23.03%     41,590             514 <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoiz
     16,104,120  17.34%      1,039          15,499 <revm_context::journal::inner::JournalInner<revm
     15,999,662  17.23%        841          19,024 <revm_context::journal::inner::JournalInner<revm
     15,635,579  16.84%      1,239          12,619 <revm_database::states::state::State<stateless::
     15,498,490  16.69%        388          39,944 <&mut revm_database::states::state::State<statel
     15,014,347  16.17%        770          19,499 <revm_context::context::Context<revm_context::bl
     14,994,327  16.14%        770          19,473 <revm_context::journal::Journal<&mut revm_databa
     14,299,020  15.40%        618          23,137 revm_interpreter::instructions::contract::call_h
     14,253,493  15.35%        618          23,063 revm_interpreter::instructions::contract::call_h
     14,230,009  15.32%        618          23,025 revm_interpreter::instructions::contract::call_h
     13,714,388  14.77%     10,505           1,305 ziskos::zisklib::lib::keccak256::keccak256

Shows for each function:

  • STEPS: Total cumulative cycles used by the function (including all nested calls)
  • % STEPS: Percentage of total program cycles this function represents
  • CALLS: Number of times this function was called
  • STEPS/CALL: Average cycles per call to this function
  • FUNCTION: Function name from symbol table

2. TOP COST FUNCTIONS - Analysis by profiling cost:

TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
  5,255,204,123  45.95%          1   5,255,204,123 <reth_evm::execute::BasicBlockExecutor<&reth_evm
  5,172,696,823  45.23%          1   5,172,696,823 <alloy_evm::eth::block::EthBlockExecutor<alloy_e
  4,997,989,104  43.70%         70      71,399,844 <revm_handler::mainnet_handler::MainnetHandler<r
  4,530,507,470  39.61%     41,793         108,403 <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoiz
  4,014,605,785  35.10%          1   4,014,605,785 <zeth_mpt_state::SparseState as stateless::trie:
  3,759,934,537  32.87%     10,505         357,918 ziskos::zisklib::lib::keccak256::keccak256

Shows for each function:

  • COST: Total cumulative profiling cost of the function (including all nested calls)
  • % COST: Percentage of total program cost this function represents
  • CALLS: Number of times this function was called
  • COST/CALL: Average profiling cost per call to this function
  • FUNCTION: Function name from symbol table

Key insights:

Both tables show cumulative metrics - each function includes the cost/cycles of everything it calls. This helps identify:

  • Which high-level functions consume the most resources
  • Whether optimization should focus on a function's implementation or the functions it calls
  • Functions with high cost per call that might benefit from caching or optimization
  • Functions called frequently that could benefit from batching or precompiles

By comparing the STEPS and COST analyses, you can identify cases where functions have many cycles but relatively low cost (efficient operations) versus high cost per cycle (expensive operations like precompiles).

For example, ziskos::zisklib::lib::keccak256::keccak256 shows:

  • Called 10,505 times
  • 13,714,388 steps (14.77% of total) with ~1,305 steps/call
  • 3,759,934,537 cost (32.87% of total) with ~357,918 cost/call

This indicates that while Keccak uses 14.77% of cycles, it represents 32.87% of the total cost - showing it's an expensive operation relative to its cycle count, typical of precompile operations.
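
A quick back-of-the-envelope calculation (plain Python, using the figures from the tables above) makes the comparison concrete:

```python
# Figures taken from the TOP STEP / TOP COST FUNCTIONS tables above
keccak_steps = 13_714_388
keccak_cost = 3_759_934_537

# Implied program totals: 13,714,388 / 0.1477 is roughly 92.9M steps and
# 3,759,934,537 / 0.3287 is roughly 11.4B cost, i.e. ~123 cost units per
# step for the program as a whole.
print(round(keccak_cost / keccak_steps))  # ~274 cost units per step
```

At more than twice the program-wide average cost per step, keccak256's cost is clearly dominated by its expensive precompile operation rather than by plain instructions.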

Customizing ROI Display

Showing More or Fewer Functions

Use the -T (or --top-roi) flag to control how many top functions are displayed:

# Show top 50 functions
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S -T 50

# Show only top 10 functions
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S -T 10

Specifying the Main Entry Point

If your program's entry point isn't named main, use the -M (or --main-name) flag:

ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S -M custom_entry

Filtering Functions by Pattern

For large programs, you may want to focus analysis on specific functions or modules. Use the --roi-filter flag with a regular expression pattern to mark functions of interest:

# Filter functions containing "sha256" in their name
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S --roi-filter "sha256"

# Filter multiple patterns
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S --roi-filter "hash|crypto|encode"

When combined with --top-roi-filter, the display will show only functions that match the specified pattern:

# Show only functions matching the filter pattern
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S \
  --roi-filter "keccak" --top-roi-filter

This is useful when you want to:

  • Focus optimization efforts on a specific subsystem or module
  • Analyze only cryptographic functions
  • Compare different implementations of similar functionality
  • Filter out noise from unrelated code

Detailed Caller Analysis

The -D (or --top-roi-detail) flag provides an in-depth breakdown of each top function, showing exactly where costs come from and who calls the function. This detailed analysis helps pinpoint optimization opportunities at a granular level.

Command

ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S -D

What This Shows

For each top function, the detailed analysis provides:

  1. Overall metrics: Total steps and cost for the function
  2. Cost by opcode: Breakdown showing which operations (opcodes and precompiles) consume the most resources within this function, with ranking of the top 4 most expensive operations
  3. Top step callers: List of functions that call this function, showing:
    • Number of calls from each caller
    • Total steps attributed to calls from that caller
    • Percentage of this function's total steps coming from each caller

This information helps you understand:

  • What makes a function expensive (which operations dominate)
  • Who is responsible for calling it (caller distribution)
  • Where to focus optimization (expensive operations vs. frequent callers)

Output Explanation

DETAIL FUNCTION ziskos::zisklib::lib::keccak256::keccak256
----------------------------------------------------------
STEPS                         13,714,388  14.77%
COST                       3,759,934,537  32.87%

|    COST BY OPCODE                     COUNT            COST       % RANK
|    ---------------------------------------------------------------------
|    OP ltu                            28,516       1,710,960   0.05%
|    OP add                           169,207       4,230,175   0.11%
|    OP sub                             3,644         218,640   0.01%
|    OP and                            94,545       5,672,700   0.15%
|    OP or                          2,489,249     149,354,940   3.97% #2
|    OP xor                           492,192      29,531,520   0.79% #3
|    OP sll                           360,008      19,080,424   0.51% #4
|    OP dma_memcpy                     21,010         882,420   0.02%
|    OP dma_xmemset                    21,010         882,420   0.02%
|    OP _dma_pre                        2,346         206,448   0.01%
|    OP _dma_post                       9,863         867,944   0.02%
|    OP keccak                         32,650   2,466,707,500  65.61% #1

|    TOP STEP CALLERS (calls, steps)
|    -------------------------------
|              3,974       9,749,694  71.09% <zeth_mpt_state::SparseState as stateless::trie::State
|              2,332       2,778,890  20.26% <zeth_mpt::mpt::node::Node<zeth_mpt::mpt::memoize::Cac
|              1,284         217,150   1.58% revm_interpreter::instructions::system::keccak256::<re
|              1,266         188,634   1.38% <revm_database::states::state::State<stateless::witnes
|                720         107,280   0.78% <alloy_primitives::bits::bloom::Bloom>::accrue_log
|                429          63,921   0.47% <reth_trie_common::hashed_state::HashedPostState>::fro
|                202          30,098   0.22% <revm_database::states::state::State<stateless::witnes
|                144         350,053   2.55% <alloy_trie::hash_builder::HashBuilder>::update
|                 66         102,536   0.75% stateless::recover_block::verify_and_compute_sender
|                 58         110,681   0.81% alloy_primitives::utils::keccak256_impl

Understanding the detailed report:

Function Header:

DETAIL FUNCTION ziskos::zisklib::lib::keccak256::keccak256
----------------------------------------------------------
STEPS                         13,714,388  14.77%
COST                       3,759,934,537  32.87%

Shows the total cumulative steps and profiling cost for this function (including nested calls).

COST BY OPCODE section:

|    COST BY OPCODE                     COUNT            COST       % RANK
|    ---------------------------------------------------------------------
|    OP keccak                         32,650   2,466,707,500  65.61% #1
|    OP or                          2,489,249     149,354,940   3.97% #2
|    OP xor                           492,192      29,531,520   0.79% #3

Breaks down which operations consume resources within this function:

  • COUNT: Number of times each operation was executed
  • COST: Total profiling cost for all executions
  • %: Percentage of this function's total cost
  • RANK: Top 4 most expensive operations marked #1 through #4

This shows that the keccak precompile dominates this function's cost at 65.61%, making it the primary optimization target.

TOP STEP CALLERS section:

|    TOP STEP CALLERS (calls, steps)
|    -------------------------------
|              3,974       9,749,694  71.09% <zeth_mpt_state::SparseState...
|              2,332       2,778,890  20.26% <zeth_mpt::mpt::node::Node...

Shows which functions call this function and how steps are distributed:

  • First column: Number of calls from this caller
  • Second column: Total steps consumed when called from this caller
  • Percentage: How much of this function's total steps come from this caller
  • Function name: The calling function

This reveals that SparseState is responsible for 71% of this function's execution, making it the primary call path to analyze.

Controlling Detail Level

Use the -C (or --roi-callers) flag to control how many callers are shown in the detailed analysis for each function:

# Show top 20 callers for each function in the detailed report
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S -D -C 20

# Show only top 5 callers for each function
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S -D -C 5

The default value is 10 callers per function. Increasing this number provides more complete call path information but may make the output more verbose.

Tracking Function Calls

Sometimes you need to analyze each individual call to a function to understand:

  • Which parameter values are most frequently used
  • What patterns exist in the arguments
  • Which specific input values trigger expensive code paths

This information is valuable for optimization strategies. For example, if you discover that certain parameter values are very common, you could:

  • Add fast paths for those frequent values
  • Use lookup tables or caching for common inputs
  • Optimize the general case based on typical parameter distributions

How It Works

Use the --track-call-args feature combined with --roi-filter to log parameter values for each call to matching functions:

  • --roi-filter "pattern": Specifies which functions to track (using a regular expression)
  • --track-call-args N: Specifies how many parameters to log (up to 8, corresponding to RISC-V a0-a7 registers)

Important limitation: The tool logs the raw parameter values from registers. This means:

  • For scalar values (integers, booleans): You get the actual value
  • For pointers/addresses: You get only the address itself, not the data it points to
  • This makes tracking most useful for functions with scalar parameters or when you're interested in address patterns

Command

# Track calls to filtered functions, logging first 4 parameters
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -S \
  --roi-filter "hash_function" --track-call-args 4 --track-output-path ./traces

Options

  • --roi-filter "pattern": Regular expression to match function names you want to track (required)
  • --track-call-args N: Number of parameters to log (1-8, corresponding to RISC-V a0-a7 registers)
  • --track-separator "SEP": Character used to separate parameter values in output (default: ;)
  • --track-output-path PATH: Directory where tracking files will be written (default: current directory)

Output

For each matched function, a text file is created (<function_name>.txt) with one line per call:

# ROI: hash_function (PC: 0x00012a0-0x00012f8)
# Separator: ';'
# Parameters: a0-a3
0x7fff8200;0x00000100;0x7fff8400;0x00000000
0x7fff8300;0x00000040;0x7fff8400;0x00000001
0x7fff8450;0x00000080;0x7fff8400;0x00000002

Each line contains the parameter values (in hexadecimal) for one function call, separated by the chosen separator. You can then analyze this file to:

  • Find the most common parameter combinations
  • Identify patterns in memory addresses
  • Detect outliers or unusual parameter values
  • Build histograms of value distributions
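
A short script can surface the most frequent parameter combinations from such a file. The sketch below assumes the output format shown above; the file name comes from the earlier hash_function example:

```python
# Sketch: rank the most common parameter combinations in a tracking file.
from collections import Counter

def top_combinations(path, n=3):
    """Count identical parameter lines in a --track-call-args output file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip the header comment lines
            counts[line] += 1
    return counts.most_common(n)

# Example: top_combinations("./traces/hash_function.txt")
```

To analyze individual parameters instead of whole combinations, split each line on the separator you passed via --track-separator (default ;).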

PC Histogram Analysis

The -H (or --histogram) flag provides a low-level view of the most frequently executed code positions in your program. Unlike function-level profiling, this analysis operates at the program counter (PC) level, showing you the exact assembly instructions that execute most often.

What This Shows

This analysis:

  • Identifies the most executed individual instructions by their program counter address
  • Groups consecutive instructions together automatically
  • Attributes these instruction groups to their parent function (when symbols are loaded with -S)
  • Helps identify hot loops, critical paths, and instruction-level bottlenecks

This is particularly useful for:

  • Understanding which specific code sequences dominate execution time
  • Identifying tight loops that could benefit from optimization
  • Verifying that optimizations are affecting the intended code paths
  • Finding unexpected hotspots at the instruction level

Command

# Show top 50 most executed instruction groups
ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X -S -H 50

The histogram requires -S to display function names. The number after -H controls how many instruction groups to display.

Output Explanation

TOP PC HISTOGRAM (EXECUTIONS, % EXECUTIONS, PC)
-----------------------------------------------
        796,670   0.86%  0x801230b8:   lbu r16, 0x0(r14)
        796,670   0.86%  0x801230bc:   beq r16, r12, 0xffffffd4
      1,593,340   1.72%  -----------   <revm_bytecode::legacy::raw::LegacyRawBytecode>::into_analyzed

        755,644   0.81%  0x801230c0:   slli r17, r16, 0x38
        755,644   0.81%  0x801230c4:   srai r17, r17, 0x38
        755,644   0.81%  0x801230c8:   bge r15, r17, 0x14
      2,266,932   2.44%  -----------   <revm_bytecode::legacy::raw::LegacyRawBytecode>::into_analyzed

        547,858   0.59%  0x801230dc:   addi r14, r14, 0x1
        547,858   0.59%  0x801230e0:   bltu r14, r10, 0xffffffd8
      1,095,716   1.18%  -----------   <revm_bytecode::legacy::raw::LegacyRawBytecode>::into_analyzed

        429,174   0.46%  0x800a38ec:   ld r10, 0x60(r21)
        429,174   0.46%  0x800a38f0:   lbu r11, 0x0(r10)
        429,174   0.46%  0x800a38f4:   addi r10, r10, 0x1
        429,174   0.46%  0x800a38f8:   sd r10, 0x60(r21)
        429,174   0.46%  0x800a38fc:   slli r10, r11, 0x4
        429,174   0.46%  0x800a3900:   add r10, r19, r10
        429,174   0.46%  0x800a3904:   ld r11, 0x8(r10)
        429,174   0.46%  0x800a3908:   ld r12, 0x180(r21)
        429,174   0.46%  0x800a390c:   sub r13, r12, r11
        429,174   0.46%  0x800a3910:   sd r13, 0x180(r21)
        429,174   0.46%  0x800a3914:   bltu r12, r11, 0x20
        429,174   0.46%  0x800a3918:   ld r12, 0x0(r10)
        429,174   0.46%  0x800a391c:   addi r10, r21, 0x0 => copyb
        429,174   0.46%  0x800a3920:   addi r11, r9, 0x0 => copyb
        429,174   0.46%  0x800a3924:   jalr r1, r12, 0x0
        429,174   0.46%  0x800a3928:   lbu r10, 0x68(r21)
        429,174   0.46%  0x800a392c:   bne r10, r0, 0xffffffc0
      7,295,958   7.86%  -----------   <revm_handler::mainnet_handler::MainnetHandler<revm_context::evm::Ev

Understanding the histogram:

The output is organized into instruction groups, where each group consists of:

  1. Individual instruction lines: Each shows:

    • EXECUTIONS: Number of times this specific instruction was executed
    • % EXECUTIONS: Percentage of total program steps
    • PC: Program counter address in hexadecimal
    • Instruction: The RISC-V assembly instruction at that address
  2. Group summary line (with dashes):

    • Total executions: Sum of all instructions in this group
    • % EXECUTIONS: Cumulative percentage for the entire group
    • Function name: The function to which these instructions belong

Key insights from the example:

The first group shows a simple loop checking bytes:

        796,670   0.86%  0x801230b8:   lbu r16, 0x0(r14)     # Load byte
        796,670   0.86%  0x801230bc:   beq r16, r12, 0xffffffd4  # Branch if equal
      1,593,340   1.72%  -----------   <revm_bytecode::legacy::raw::LegacyRawBytecode>::into_analyzed

This tight 2-instruction loop body executed 796,670 times, for 1,593,340 instruction executions (1.72% of total execution).

The large group at the bottom represents a complex instruction dispatcher:

        429,174   0.46%  0x800a38ec:   ld r10, 0x60(r21)     # Load from context
        ...
        429,174   0.46%  0x800a392c:   bne r10, r0, 0xffffffc0   # Loop back
      7,295,958   7.86%  -----------   <revm_handler::mainnet_handler::MainnetHandler...

This 17-instruction sequence accounts for 7.86% of total execution, making it a prime optimization target.

When to use histogram analysis:

  • After function-level profiling: Once you identify expensive functions, use histograms to see which specific instruction sequences within those functions dominate
  • Validating compiler optimizations: Verify that loops are unrolled or optimized as expected
  • Finding unexpected hotspots: Sometimes a small instruction sequence accounts for disproportionate execution time
  • Comparing implementations: See how different code structures affect instruction-level execution patterns

Additional Options

Show Steps Without Full Statistics

For quick execution time checks without generating full statistics, use the --steps flag:

ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin --steps

Progress Indicators

For long-running programs, show progress updates every 16M steps with --with-progress:

ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin --with-progress

Disable Thousands Separator

For machine-readable output, disable the thousands separator with --no-thousands-sep:

ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest -i input.bin -X --no-thousands-sep

Complete Example: Comprehensive Profiling

Here's a complete example that uses most profiling features together:

ziskemu -e target/elf/riscv64ima-zisk-zkvm-elf/release/guest \
  -i input.bin \
  -X \
  -S \
  -D \
  -T 30 \
  -C 15 \
  -H 50 \
  --roi-filter "sha256|hash" \
  --track-call-args 6 \
  --track-output-path ./profiling_data \
  -m

This command will:

  1. Generate full statistics (-X)
  2. Read and use symbol information (-S)
  3. Show detailed caller analysis (-D)
  4. Display top 30 functions by cost (-T 30)
  5. Show top 15 callers for each function (-C 15)
  6. Display top 50 most executed instructions (-H 50)
  7. Filter to sha256/hash-related functions (--roi-filter)
  8. Track first 6 parameters of filtered function calls (--track-call-args)
  9. Save tracking data to ./profiling_data directory
  10. Show performance metrics (-m)

Tips for Effective Profiling

Start Simple, Add Detail

Begin with basic statistics (-X) to get an overview, then progressively add more detailed analysis:

  1. Basic: ziskemu -e program.elf -i input.bin -X
  2. Functions: ziskemu -e program.elf -i input.bin -X -S
  3. Callers: ziskemu -e program.elf -i input.bin -X -S -D
  4. Detailed: Add -H as needed

Focus on High Impact

Use the final_cost percentage to identify functions with the highest impact. Optimizing a function that represents 50% of execution time will have much more effect than optimizing one at 1%.
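
This is Amdahl's law in miniature; a quick illustration with hypothetical numbers:

```python
# Amdahl's law: overall gain from speeding up a fraction of the total cost.
def overall_speedup(fraction, local_speedup):
    return 1 / ((1 - fraction) + fraction / local_speedup)

# Halving a function that is 50% of total cost vs. one that is 1%:
print(round(overall_speedup(0.50, 2), 3))  # 1.333x overall
print(round(overall_speedup(0.01, 2), 3))  # 1.005x overall
```

Even a perfect optimization of a 1% function cannot improve the total by more than about 1%.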

Understand Profiling Cost vs. Final Cost

When a function has high final cost but low profiling cost, the optimization opportunity lies in the functions it calls, not in the function itself. Focus your optimization efforts where profiling costs are highest, as these represent direct computational work that can be improved through code changes or patching with precompiles.

Use Filtering for Large Codebases

In programs with hundreds of functions, use --roi-filter to focus on specific subsystems or modules of interest.

Track Representative Inputs

Profile with realistic, representative inputs. The cost distribution can vary significantly based on input characteristics.

Practical Example: Analyzing Ethereum Opcode Costs

This example demonstrates how to analyze the cost distribution of Ethereum opcodes in a real-world client implementation. By filtering for the EVM instruction interpreter functions, we can obtain a detailed breakdown of which Ethereum operations consume the most resources during block validation.

Scenario

You want to understand which Ethereum opcodes are most expensive in terms of ZisK proving costs when validating a specific block. This information helps you:

  • Identify which EVM operations would benefit most from optimization
  • Understand the cost profile of real-world Ethereum transactions
  • Guide decisions about which precompiles or patches to prioritize

Command

target/release/ziskemu \
  -S \
  -X \
  -e ../zisk-eth-client/bin/guests/stateless-validator-reth/target/riscv64ima-zisk-zkvm-elf/release/zec-reth \
  -i ../data/benchmark_inputs/24654304_30c8b8.bin \
  --roi-filter "revm_interpreter::instructions::" \
  --top-roi-filter \
  -T 200

What this does:

  • -S: Load symbol information from the ELF file
  • -X: Generate full statistics with cost breakdown
  • -e <path>: Path to the compiled Ethereum client (reth implementation)
  • -i <input>: Block data to validate (block 24,654,304)
  • --roi-filter "revm_interpreter::instructions::": Filter to show only functions in the EVM instruction interpreter namespace (where all Ethereum opcodes are implemented)
  • --top-roi-filter: Display only the filtered functions in the top ROI lists
  • -T 200: Show top 200 functions (to capture all EVM opcodes)

Expected Output

The output will show the TOP COST FUNCTIONS filtered to only include EVM instruction implementations, giving you a clear view of which Ethereum opcodes dominate the proving cost for this specific block:

TOP COST FUNCTIONS (COST, % COST, CALLS, COST/CALL, FUNCTION)
-------------------------------------------------------------
  9,433,353,231  10.32%      5,824       1,619,737 revm_interpreter::instructions::contract::call_helpers::load_acc_
  9,396,093,086  10.28%      5,824       1,613,340 revm_interpreter::instructions::contract::call_helpers::load_acco
  9,377,741,662  10.26%      5,824       1,610,189 revm_interpreter::instructions::contract::call_helpers::load_acco
  8,344,978,788   9.13%      1,695       4,923,291 revm_interpreter::instructions::contract::call::<revm_interpreter
  4,599,658,812   5.03%    342,951          13,412 revm_interpreter::instructions::stack::swap::<1, revm_interpreter
  2,772,734,752   3.03%    128,956          21,501 revm_interpreter::instructions::memory::mload::<revm_interpreter:
  2,580,388,569   2.82%     10,675         241,722 revm_interpreter::instructions::host::sload::<revm_interpreter::i
  1,726,257,923   1.89%    105,903          16,300 revm_interpreter::instructions::memory::mstore::<revm_interpreter
  1,599,904,068   1.75%    119,289          13,412 revm_interpreter::instructions::stack::swap::<2, revm_interpreter
  1,576,416,043   1.72%     13,627         115,683 revm_interpreter::instructions::arithmetic::mulmod::<revm_interpr
  1,499,796,900   1.64%    111,825          13,412 revm_interpreter::instructions::stack::swap::<3, revm_interpreter
  1,430,041,088   1.56%    106,624          13,412 revm_interpreter::instructions::stack::swap::<4, revm_interpreter
  1,045,628,445   1.14%      2,201         475,069 revm_interpreter::instructions::contract::static_call::<revm_inte
    896,353,301   0.98%    184,312           4,863 revm_interpreter::instructions::control::jumpi::<revm_interpreter
    812,869,552   0.89%    561,374           1,448 revm_interpreter::instructions::stack::push::<1, revm_interpreter
    806,652,474   0.88%    465,922           1,731 revm_interpreter::instructions::stack::push::<2, revm_interpreter
    763,874,190   0.84%      6,781         112,649 revm_interpreter::instructions::host::sstore::<revm_interpreter::
    691,435,073   0.76%      5,682         121,688 revm_interpreter::instructions::system::keccak256::<revm_interpre
    669,514,638   0.73%    245,798           2,723 revm_interpreter::instructions::arithmetic::add::<revm_interprete
    638,632,995   0.70%    102,549           6,227 revm_interpreter::instructions::arithmetic::mul::<revm_interprete
    620,675,903   0.68%    239,701           2,589 revm_interpreter::instructions::control::jump::<revm_interpreter:
    527,546,726   0.58%     83,391           6,326 revm_interpreter::instructions::bitwise::shr::<revm_interpreter::
    452,376,936   0.49%    302,391           1,496 revm_interpreter::instructions::stack::dup::<2, revm_interpreter:
    325,487,994   0.36%     41,683           7,808 revm_interpreter::instructions::bitwise::sar::<revm_interpreter::
    311,851,955   0.34%     25,502          12,228 revm_interpreter::instructions::system::codecopy::<revm_interpret
    289,141,110   0.32%    120,407           2,401 revm_interpreter::instructions::bitwise::iszero::<revm_interprete
    264,613,976   0.29%    176,881           1,496 revm_interpreter::instructions::stack::dup::<3, revm_interpreter:
    262,969,735   0.29%     18,608          14,132 revm_interpreter::instructions::system::calldataload::<revm_inter
    252,430,047   0.28%     41,031           6,152 revm_interpreter::instructions::bitwise::sgt::<revm_interpreter::
    248,940,076   0.27%      1,928         129,118 revm_interpreter::instructions::contract::delegate_call::<revm_in
    242,086,315   0.26%        192       1,260,866 revm_interpreter::instructions::host::extcodesize::<revm_interpre
    229,785,355   0.25%     10,852          21,174 revm_interpreter::instructions::stack::push::<32, revm_interprete

This filtered view allows you to quickly identify:

  • Most expensive opcodes: Which EVM operations have the highest total cost
  • Frequently called opcodes: Operations with many calls but lower individual cost
  • Optimization targets: Opcodes that would benefit most from ZisK-specific optimizations or precompiles

Important note: With this method, no modification to the ELF file is required. The profiling works directly on the compiled binary using existing symbol information. However, you do need to know the naming convention used for the functions that implement each opcode. In this case, the REVM interpreter uses the namespace revm_interpreter::instructions:: consistently, making it easy to filter all opcode implementations with a single pattern.

Conclusion

ZiskEmu's profiling capabilities provide deep insights into your program's resource consumption and performance characteristics. By understanding profiling and final costs, analyzing regions of interest, and using the various filtering and tracking options, you can effectively identify optimization opportunities and improve the efficiency of your ZisK programs.

Use profiling costs as your primary optimization metric, as they provide a direct cause-and-effect relationship with code changes. This makes them ideal for detecting where patches should be applied, validating that optimizations are working correctly, and ensuring that precompiles are being used where expected.

Remember that profiling works on any ELF file with symbols, including release builds, making it easy to analyze production-ready code without special compilation flags or instrumentation.