Status: Preprint on arXiv
GitHub: github.com/rylanmalarchick/lindblad-bench
Paper: arXiv:2603.18052 -- "Cache Hierarchy and Vectorization Analysis of Lindblad Master Equation Simulation for Near-Term Quantum Control"
Overview
For near-term quantum control -- optimal pulse engineering, feedback controllers, robustness sweeps -- the simulation bottleneck is dense matrix-vector multiplication on small systems (d=3 to d=27). At these scales, Python dispatch overhead dominates actual arithmetic, and the working set fits entirely in CPU cache. This paper analyzes where the time actually goes and what compiler/memory layout choices matter.
The core observation: a 9x9 complex matrix-vector product (d=3 transmon) is 648 FLOPs and 1.3 KB. That fits in L1 cache. The bottleneck isn't the math -- it's everything around it.
Key Results
| Metric | Value |
|---|---|
| Speedup (SoA + O3 + ffast-math) | 2-4x over scalar baseline |
| Arithmetic intensity | ~0.5 FLOP/byte (bandwidth-bound) |
| d=3 working set | 1.3 KB (L1 cache) |
| d=9 working set | 105 KB (L2 boundary) |
| d=27 working set | 8.5 MB (L3 boundary) |
Cache Hierarchy Scaling
| Hilbert dim | Liouvillian size | Fits in | Regime |
|---|---|---|---|
| d=3 | 9x9 = 1.3 KB | L1 (48 KB) | Compute-bound |
| d=9 | 81x81 = 105 KB | L2 (2 MB) | Transitional |
| d=27 | 729x729 = 8.5 MB | L3 (36 MB) | Bandwidth-bound |
| d=81 | 6561x6561 = 690 MB | RAM | Memory-bound |
Approach
- Bare-metal C library -- no dependencies, C11, cache-aligned allocations
- Pade [13/13] matrix exponential -- Higham 2005 scaling-and-squaring for propagator precomputation
- Roofline modeling -- measured arithmetic intensity vs peak throughput to identify bottleneck regime
- Compiler flag sweep -- scalar, O2, O3, O3+native, O3+native+ffast-math across GCC builds
- Assembly analysis -- Godbolt verification of AVX2 FMA generation (vfmadd231pd)
The key finding: -ffast-math is essential for GCC to auto-vectorize complex arithmetic. Without it, strict IEEE 754 compliance prevents the compiler from reordering floating-point operations needed for SIMD.
Structure-of-Arrays Layout
The standard approach stores complex numbers as interleaved (re, im) pairs (Array-of-Structures). Switching to separate real and imaginary arrays (Structure-of-Arrays) enables wider SIMD loads and eliminates cross-lane shuffles:
```
AoS: [re0, im0, re1, im1, re2, im2, ...]  -- 128-bit loads mix re/im
SoA: [re0, re1, re2, re3, ...]            -- 256-bit loads are pure re
     [im0, im1, im2, im3, ...]            -- 256-bit loads are pure im
```
Connection to QubitOS
This library is the standalone performance study informing the QubitOS Tier 2 solver (C fast path for d <= 27). The FFI interface in QubitOS is designed to call into this library. The cache hierarchy analysis explains why small-system simulation should use precomputed propagators rather than adaptive ODE integrators.
Technology Stack
| Category | Technology |
|---|---|
| Language | C11, POSIX |
| Build | CMake, multiple build variants |
| Analysis | Python (NumPy, matplotlib), perf counters |
| Validation | QuTiP 5.x reference, Hilbert-Schmidt norm < 1e-6 |
| Paper | LaTeX, quantumarticle class |
Citation
Malarchick, R. (2026). "Cache Hierarchy and Vectorization Analysis of
Lindblad Master Equation Simulation for Near-Term Quantum Control."
arXiv:2603.18052.