How to Run Timing and Performance Benchmarks for Heterogeneous Embedded Systems
Practical 2026 guide to measure CPU cycles, WCET and GPU latency on heterogeneous RISC‑V+GPU embedded systems—step‑by‑step and CI ready.
Stop guessing — measure end-to-end timing across RISC‑V CPUs and GPUs
You’re wrestling with a common embedded-engineering pain: an application that looks fast in microbenchmarks but misses deadlines in production once the GPU or interconnect adds unexpected latency. On mixed RISC‑V CPU + GPU designs, that gap is bigger because CPU cycle counts, WCET estimates and GPU latencies live in different worlds. This guide gives you a reproducible, 2026‑forward workflow to measure and reason about both sides of the coin.
What you’ll get (quick)
- A step‑by‑step benchmarking plan combining CPU timing, WCET estimation, and GPU latency measurements for heterogeneous embedded systems.
- Concrete commands, code snippets, and configuration tips for RISC‑V platforms and modern GPU stacks (including NVLink/Fusion‑style interconnects).
- Advice on correlating timestamps across domains, automating tests, and integrating results into CI and safety toolchains.
Why this matters in 2026
Two ecosystem trends turned this from “nice‑to‑have” to “must‑have”:
- In January 2026, Vector announced the acquisition and planned integration of StatInf’s RocqStat timing analysis technology into VectorCAST — a clear sign that tooling vendors are unifying WCET and verification workflows for safety‑critical embedded software.
- SiFive’s moves to integrate NVIDIA’s NVLink Fusion with RISC‑V IP point to more tight coupling between RISC‑V hosts and high‑performance GPUs. That increases the need to measure interconnect and GPU DMA latencies end‑to‑end, not just CPU cycles.
"Timing safety is becoming a critical capability for software verification in safety‑critical systems." — industry source, 2026
High‑level benchmarking strategy
Think of the system as four measurable stages:
- Host enqueue / API call overhead on the RISC‑V CPU.
- Host → device transfer over the interconnect (PCIe, NVLink/Fusion, or SoC DMA).
- GPU kernel startup and execution time.
- Device → host transfer and completion notifications.
Your benchmark must measure each stage, correlate timestamps across domains, and combine the results with static WCET analysis to establish safety margins.
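A simple way to keep those four stages comparable across runs is to log them in a single per-iteration record. The sketch below is illustrative only — the field names are ours, not from any vendor API — and later steps show how to fill it in from CPU counters, driver timestamps, and device clocks.
#include <stdint.h>

/* One record per iteration; all values in nanoseconds, expressed in the host
   clock domain after correlation (see Step 5). Field names are illustrative. */
typedef struct {
    uint64_t host_enqueue_ns;   /* API call / enqueue overhead on the RISC-V host */
    uint64_t h2d_transfer_ns;   /* host -> device copy over PCIe / NVLink / DMA */
    uint64_t kernel_ns;         /* GPU kernel start-to-end */
    uint64_t d2h_transfer_ns;   /* device -> host copy + completion notification */
} stage_sample_t;

static inline uint64_t end_to_end_ns(const stage_sample_t *s) {
    return s->host_enqueue_ns + s->h2d_transfer_ns + s->kernel_ns + s->d2h_transfer_ns;
}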
Prerequisites & hardware/software checklist
- Board or SoC with RISC‑V host and a GPU device (or a development platform that simulates the interconnect).
- Linux toolchain for RISC‑V (GCC, perf support), or a bare‑metal SDK if you run without an OS.
- GPU profiling tools: NVTX/CUPTI/Nsight for NVIDIA stacks; vendor profiler for Mali/Vivante when applicable.
- WCET/static timing analysis tools — include RocqStat/VectorCAST if available, and a measurement‑based toolkit (MBPTA) for probabilistic analysis.
- Automation CI runner (GitHub Actions/Buildkite with access to hardware or a lab runner).
Step 1 — Define workloads and acceptance criteria
Start by listing the critical code paths and acceptable latencies. Define at least two workload classes:
- Microbenchmarks: small kernels, short DMA transfers, kernel launch latency.
- End‑to‑end scenarios: full sensor → CPU preprocessing → GPU inference → actuator loop.
For each path, record target deadlines and which metric matters most: mean, p99, or true worst‑case. Safety‑critical flows often require conservative WCET upper bounds; soft real‑time may accept p99 with monitoring.
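One way to make those acceptance criteria concrete and machine-checkable is to encode them alongside the benchmark harness. A minimal sketch — path names, deadlines, and the metric enum are placeholders, not measurements:
/* Illustrative acceptance table; adapt names and numbers to your system. */
typedef enum { METRIC_MEAN, METRIC_P99, METRIC_WCET } metric_t;

typedef struct {
    const char *path;        /* critical code path under test */
    double      deadline_us; /* target latency for that path */
    metric_t    metric;      /* statistic the deadline is judged against */
} acceptance_t;

static const acceptance_t criteria[] = {
    { "sensor->preproc->inference->actuator", 5000.0, METRIC_WCET },
    { "kernel_launch_only",                     50.0, METRIC_P99  },
};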
Step 2 — RISC‑V CPU timing: cycles and OS timers
Measure CPU timing at the cycle level using the RISC‑V CSRs and corroborate with high‑resolution OS timers when running Linux.
Read cycle counters (bare‑metal or user space)
Use the rdcycle/rdtime CSRs for cycle-accurate readings. The example below builds with GCC for RV64 RISC‑V targets:
#include <stdio.h>

/* RV64 only: rdcycle returns the full 64-bit cycle CSR in one read.
   On RV32 you must read cycleh/cycle in a loop to get a consistent value.
   Under Linux, user-space access to the cycle CSR may be disabled by default;
   enable it or fall back to rdtime/clock_gettime if this traps. */
static inline unsigned long long rdcycle64(void) {
    unsigned long long c;
    asm volatile ("rdcycle %0" : "=r" (c));
    return c;
}

void timed_section(void) {
    unsigned long long t0 = rdcycle64();
    /* code you want to measure */
    unsigned long long t1 = rdcycle64();
    printf("cycles: %llu\n", t1 - t0);
}
Convert cycles to time: time_us = cycles / cpu_mhz (a 1 GHz core retires 1000 cycles per microsecond, so 120,000 cycles at 1200 MHz is 100 µs). Make sure you know the effective clock at the moment of measurement — dynamic frequency scaling silently changes the conversion factor.
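A minimal conversion helper, assuming the core clock is locked and known (here it is passed in explicitly rather than read from the platform):
/* Convert a cycle delta to microseconds. cpu_mhz must be the effective core
   clock while the measurement ran (lock the governor; see the tips below). */
static inline double cycles_to_us(unsigned long long cycles, double cpu_mhz) {
    return (double)cycles / cpu_mhz;   /* cycles / (cycles per microsecond) */
}

/* Example: cycles_to_us(120000, 1200.0) == 100.0 */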
Using Linux timers and perf
For kernel + user‑space combined timing and PMU events, use perf (Linux 5.x+ has good RISC‑V perf support). Example:
# Record cycles and instructions for a process
perf record -e cycles:u,instructions:u -p <pid> -- sleep 10
# Report
perf report --stdio
Key tips:
- Pin measurement threads to cores with taskset (or sched_setaffinity) and run them under SCHED_FIFO to avoid scheduler jitter — see the sketch after this list.
- Disable CPU frequency scaling: for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done (or use cpupower frequency-set -g performance).
- Isolate a core with isolcpus for deterministic runs.
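If the harness should configure itself instead of relying on taskset/chrt, the equivalent syscalls look roughly like this (Linux-specific; error handling trimmed, and the core number and priority are assumptions):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one isolated core and switch it to SCHED_FIFO.
   Requires CAP_SYS_NICE (or root); pick a core you passed to isolcpus. */
static int lock_down_thread(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    struct sched_param sp = { .sched_priority = 80 };  /* arbitrary RT priority */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}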
Step 3 — WCET estimation (static + measurement)
WCET is the guaranteed upper bound you’ll use for safety certifications. Use a hybrid approach:
- Static analysis: use tools that model pipelines, caches, and interconnect delays. In 2026, tooling consolidation (Vector + RocqStat integration) is making it easier to connect static timing analysis to testing toolchains.
- Measurement‑based techniques: enumerate paths and stress rare branches. Attack worst‑case execution paths with fuzzed inputs and deliberate cache/branch predictor pollution.
- Combine: use static analysis to bound unexplored paths; use MBPTA (Measurement Based Probabilistic Timing Analysis) to estimate tail latencies where static models are intractable.
Practical steps:
- Compile with full debug symbols and no aggressive link‑time optimizations if the WCET tool needs CFG reconstruction.
- Annotate I/O and interrupt boundaries. WCET tools assume worst‑case hardware states — model DMA contention and caches explicitly.
- Run stress harnesses for long durations to observe rare scheduling/interrupt spikes and record p99/p999 values (a minimal harness is sketched after this list).
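A minimal measurement harness along those lines. It reuses rdcycle64() from Step 2, runs the section under test many times, and reports tail percentiles; the iteration count and the placeholder timed_section_body() are illustrative:
#include <stdio.h>
#include <stdlib.h>

#define ITERS 100000

extern void timed_section_body(void);   /* hypothetical: the path under test */

static int cmp_ull(const void *a, const void *b) {
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

void stress_harness(void) {
    static unsigned long long samples[ITERS];
    for (int i = 0; i < ITERS; i++) {
        unsigned long long t0 = rdcycle64();   /* helper from Step 2 */
        timed_section_body();
        samples[i] = rdcycle64() - t0;
    }
    qsort(samples, ITERS, sizeof(samples[0]), cmp_ull);
    printf("p99:  %llu cycles\n", samples[(ITERS * 99) / 100]);
    printf("p999: %llu cycles\n", samples[(ITERS * 999) / 1000]);
    printf("max:  %llu cycles\n", samples[ITERS - 1]);
}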
Step 4 — Measure GPU latency and interconnect costs
GPU latency breaks into four measurable pieces. Use GPU APIs to timestamp on the device and host, and measure transfers explicitly.
NVIDIA-style (CUDA/CUPTI) measurements
If you have an NVIDIA GPU or NVLink/Fusion path, use CUDA events or CUPTI timestamps to measure kernel and transfer latencies precisely:
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);

cudaEventRecord(start, stream);
// enqueue the async memcpy / kernel on the same stream here
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // device-side elapsed time between the two events
printf("Elapsed ms: %f\n", ms);
To measure DMA/transfer latency across NVLink/Fusion, record the host enqueue timestamp and the corresponding device-side timestamp and compare them against the device global clock; CUPTI can supply device timestamps and activity API traces for this.
Generic GPU stacks and embedded GPUs
For Mali, Vivante, or vendor embedded GPUs, use the vendor profiler or GL/CL timestamp queries. The key idea is identical: place device timestamps as early and late as possible, measure host enqueue and completion, and subtract to isolate interconnect cost.
Measure kernel launch overhead
Kernel launch overhead (host API latency) is often a dominant contributor for small kernels. Measure with empty kernels and record both host wall time and device start times to separate driver overhead from device start latency.
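A vendor-neutral sketch of that measurement. enqueue_empty_kernel() and wait_stream() are placeholders, not real functions — substitute whatever your driver API provides (e.g. a CUDA kernel launch plus stream synchronize, or an OpenCL enqueue plus clFinish):
#include <stdio.h>
#include <time.h>

/* Hypothetical driver hooks -- replace with your API's launch and sync calls. */
extern void enqueue_empty_kernel(void *stream);
extern void wait_stream(void *stream);

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

void measure_launch_overhead(void *stream) {
    /* 1. Host cost of the enqueue call itself (driver/API overhead). */
    double t0 = now_us();
    enqueue_empty_kernel(stream);
    double t1 = now_us();

    /* 2. Enqueue-to-completion for an empty kernel: adds scheduling and
          device start latency on top of the API overhead above. */
    wait_stream(stream);
    double t2 = now_us();

    printf("enqueue API overhead: %.1f us\n", t1 - t0);
    printf("enqueue-to-idle:      %.1f us\n", t2 - t0);
}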
Step 5 — Correlate timestamps across CPU and GPU domains
Clocks drift. To get robust end‑to‑end numbers you must correlate host and device clocks. Use one of these strategies:
- Synchronized hardware timer: if the SoC exposes a shared system timer accessible to both CPU and GPU, use it as the single source of truth.
- API timestamp translation: use CUPTI or driver APIs that report both host and device timestamps and provide translation routines.
- Ping‑pong calibration: issue a trivial GPU kernel that writes the device time back to host, and estimate offset and drift from repeated samples.
Example correlation (ping‑pong calibration):
// Host t0 -> enqueue
// Device records device_t1 at kernel start and writes it back
// Host reads device_t1 at host time t2
// Estimate offset: offset = device_t1 - ((t0 + t2) / 2)
Repeat calibration runs periodically to correct for drift when long tests are required.
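A host-side sketch of that calibration, reusing now_us() from the launch-overhead example above and assuming a hypothetical read_device_time_us() that launches the trivial kernel and returns the device timestamp it wrote back. Keeping the sample with the smallest round trip bounds the uncertainty of the offset estimate:
/* Hypothetical helper: launches a trivial kernel that samples the device
   clock, writes it to pinned host memory, and returns it in microseconds. */
extern double read_device_time_us(void);

/* Estimate the device-minus-host clock offset from n ping-pong samples. */
double estimate_offset_us(int n) {
    double best_rtt = 1e30, best_offset = 0.0;
    for (int i = 0; i < n; i++) {
        double t0  = now_us();                    /* host, before enqueue */
        double dev = read_device_time_us();       /* device clock sample */
        double t2  = now_us();                    /* host, after readback */
        double rtt    = t2 - t0;
        double offset = dev - (t0 + t2) / 2.0;    /* midpoint estimate */
        if (rtt < best_rtt) { best_rtt = rtt; best_offset = offset; }
    }
    printf("offset %.1f us (rtt %.1f us)\n", best_offset, best_rtt);
    return best_offset;
}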
Step 6 — Compute end‑to‑end and WCET margins
Once you have per‑stage measurements and WCET candidate values, produce a report with:
- Per‑stage median, p95, p99, and observed max.
- Static WCET bound for CPU stages and static/device guarantees for GPU where applicable.
- End‑to‑end worst case = static WCET of the CPU stages + worst observed (or MBPTA‑bounded) interconnect latency + static or measured GPU bound + an explicit safety margin.
Document assumptions and sources of non‑determinism (e.g., shared DMA channels, bus arbitration, peer‑to‑peer transfers). If your static WCET tool reports per‑path conservative bounds, add those directly. For measurement‑driven parts, use MBPTA to convert observed tail behavior into conservative probabilistic bounds acceptable to your project.
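As a worked illustration of that combination — the placeholder numbers match the hypothetical case study later in this guide, not real measurements:
/* Combine per-stage bounds into an end-to-end worst case (all in us).
   Which bound feeds each stage depends on its source: static WCET for CPU
   code, observed worst case or an MBPTA bound for interconnect and driver. */
double end_to_end_bound_us(double wcet_cpu_us,
                           double interconnect_worst_us,
                           double gpu_bound_us,
                           double safety_margin_us) {
    return wcet_cpu_us + interconnect_worst_us + gpu_bound_us + safety_margin_us;
}

/* e.g. end_to_end_bound_us(400.0, 390.0, 1350.0, 200.0) == 2340.0 us */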
Step 7 — Automate, baseline, and integrate into CI
Automate benchmark runs and store raw traces and artifacts for reproducibility:
- Write a harness that runs N iterations, collects per‑iteration timestamps, and exports CSV/JSON.
- Push artifacts to trace storage (an object store) and track regressions against a baseline using p95/p99 thresholds.
- Fail CI if the end-to-end worst case grows beyond an acceptable delta — a minimal gate is sketched after the script below.
Automation sample (pseudo):
#!/bin/bash
set -euo pipefail
mkdir -p traces
# Run N iterations and keep one raw trace per iteration for later analysis.
for i in {1..1000}; do
  ./run_benchmark --iter "$i" > "traces/iter-$i.json"
done
# Aggregate the traces and emit the report for archiving and review.
python3 analyze.py traces/*.json --out report.html
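A minimal version of the CI gate mentioned above, assuming analyze.py (or any other step) has already extracted a baseline and a current p99 value — the command-line interface and units are illustrative:
#include <stdio.h>
#include <stdlib.h>

/* Tiny CI gate: exit nonzero if the current p99 exceeds the baseline by more
   than the allowed delta. Usage (illustrative):
     ./gate <baseline_p99_us> <current_p99_us> <allowed_delta_pct> */
int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: gate baseline_us current_us delta_pct\n");
        return 2;
    }
    double baseline = atof(argv[1]);
    double current  = atof(argv[2]);
    double delta    = atof(argv[3]);
    double limit = baseline * (1.0 + delta / 100.0);
    if (current > limit) {
        fprintf(stderr, "FAIL: p99 %.1f us exceeds limit %.1f us (+%.1f%% over baseline)\n",
                current, limit, delta);
        return 1;   /* nonzero exit fails the CI job */
    }
    printf("OK: p99 %.1f us within %.1f us\n", current, limit);
    return 0;
}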
Advanced strategies and common pitfalls
- Cache and pipeline modeling: static tools must know cache sizes, associativity, and pipeline hazards. If you can, disable certain caches for deterministic runs or model them explicitly in your WCET tool.
- Interrupt and DMA noise: IRQs, DMA bursts, and shared bus contention can create rare spikes. Use isolated cores and dedicate DMA channels where possible during WCET runs.
- Power management: CPU/GPU DVFS affects timings. Lock frequencies for benchmark runs but remember that production may use DVFS — measure both.
- GPU batching and driver optimizations: drivers may fuse small launches or hide latency through scheduling. For safety analysis, measure the worst case under driver stress (e.g., concurrent streams) and treat driver behavior as part of the timing model.
Short case study (hypothetical)
On a RISC‑V host with an NVLink‑connected GPU, we measured a small inference path:
- Host enqueue: median 40 µs, p99 120 µs (measured via clock_gettime + rdcycle).
- Host→device transfer (128 KB): median 180 µs, p99 250 µs (using CUPTI DMA traces).
- Kernel execution: median 1.2 ms, p99 1.35 ms (device timestamps).
- Device→host transfer (results): median 90 µs, p99 140 µs.
Static WCET analysis for the CPU preprocessing and enqueue path returned 400 µs. Taking the p99 figures as the measured worst case for the remaining stages and adding a conservative GPU driver margin of 200 µs gives 400 + 250 + 1350 + 140 + 200 ≈ 2340 µs, i.e. an end-to-end worst-case bound of roughly 2.4 ms. Repeating the runs with interrupts disabled and frequencies locked reduced the p99 tail, improving the margin and allowing a smaller safety buffer.
Actionable checklist before you ship
- Pin critical threads and isolate cores for timing runs.
- Lock frequencies and record the effective CPU/GPU clocks.
- Collect both device and host timestamps and run regular clock calibration.
- Use static WCET tools for control code; use measurement‑based analysis for complex drivers and DMA behavior.
- Automate benchmarks and include p95/p99 regression detection in CI.
Final recommendations & future outlook
In 2026, expect tooling consolidation. Vector’s acquisition of RocqStat signals tighter integration of timing analysis into verification toolchains, while RISC‑V + NVLink Fusion announcements promise more mixed CPU/GPU SoCs. That means two things:
- Toolchains will increasingly give you the ability to link static WCET analysis, runtime traces, and test harnesses in one pipeline — adopt those integrations when available.
- End‑to‑end timing will become a first‑class quality metric for embedded AI workloads; invest in cross‑domain timestamping and automated regression checks now.
Key takeaways
- Measure everywhere: CPU cycle counters, host timers, device timestamps, and interconnect traces.
- Combine approaches: static WCET for control code, measurement for DMA and driver behavior, and MBPTA for tail estimation.
- Automate and version: store traces, baseline results, and fail CI when p99 regressions appear.
Call to action
Start a disciplined timing program today: build a small benchmark harness that logs CPU cycles (rdcycle), GPU device timestamps (CUPTI/Nsight or vendor queries), and per‑stage CSV outputs. Commit that harness to CI and run nightly. If you want a lightweight way to share snippets, trace outputs and example configs with teammates and reviewers, sign up at pasty.cloud to store, annotate, and embed your benchmark artifacts into design reviews and verification workflows.