Status: Preprint on arXiv
GitHub: github.com/rylanmalarchick/lindblad-bench
Paper: arXiv:2603.18052 -- "Cache Hierarchy and Vectorization Analysis of Lindblad Master Equation Simulation for Near-Term Quantum Control"
Overview
For near-term quantum control -- optimal pulse engineering, feedback controllers, robustness sweeps -- the simulation bottleneck is dense matrix-vector multiplication on small systems (d=3 to d=27). At these scales, Python dispatch overhead dominates actual arithmetic, and the working set fits entirely in CPU cache. This paper analyzes where the time actually goes and what compiler/memory layout choices matter.
The core observation: a 9x9 complex matrix-vector product (d=3 transmon) is 648 FLOPs and 1.3 KB. That fits in L1 cache. The bottleneck isn't the math -- it's everything around it.
Key Results
| Metric | Value |
|---|---|
| Speedup (SoA + O3 + ffast-math) | 2-4x over scalar baseline |
| Arithmetic intensity | ~0.5 FLOP/byte (bandwidth-bound) |
| d=3 working set | 1.3 KB (L1 cache) |
| d=9 working set | 105 KB (L2 boundary) |
| d=27 working set | 8.5 MB (L3 boundary) |
Cache Hierarchy Scaling
| Hilbert dim | Liouvillian size | Fits in | Regime |
|---|---|---|---|
| d=3 | 9x9 = 1.3 KB | L1 (48 KB) | Compute-bound |
| d=9 | 81x81 = 105 KB | L2 (2 MB) | Transitional |
| d=27 | 729x729 = 8.5 MB | L3 (36 MB) | Bandwidth-bound |
| d=81 | 6561x6561 = 690 MB | RAM | Memory-bound |
Approach
- Bare-metal C library -- no dependencies, C11, cache-aligned allocations
- Pade [13/13] matrix exponential -- Higham 2005 scaling-and-squaring for propagator precomputation
- Roofline modeling -- measured arithmetic intensity vs peak throughput to identify bottleneck regime
- Compiler flag sweep -- scalar, O2, O3, O3+native, O3+native+ffast-math across GCC builds
- Assembly analysis -- Godbolt verification of AVX2 FMA generation (vfmadd231pd)
The key finding: -ffast-math is essential for GCC to auto-vectorize complex arithmetic. Without it, strict IEEE 754 compliance prevents the compiler from reordering floating-point operations needed for SIMD.
Structure-of-Arrays Layout
The standard approach stores complex numbers as interleaved (re, im) pairs (Array-of-Structures). Switching to separate real and imaginary arrays (Structure-of-Arrays) enables wider SIMD loads and eliminates cross-lane shuffles:
```
AoS: [re0, im0, re1, im1, re2, im2, ...]  -- 128-bit loads mix re/im
SoA: [re0, re1, re2, re3, ...]            -- 256-bit loads are pure re
     [im0, im1, im2, im3, ...]            -- 256-bit loads are pure im
```
Connection to QubitOS
This library is the standalone performance study informing the QubitOS Tier 2 solver (C fast path for d <= 27). The FFI interface in QubitOS is designed to call into this library. The cache hierarchy analysis explains why small-system simulation should use precomputed propagators rather than adaptive ODE integrators.
Technology Stack
| Category | Technology |
|---|---|
| Language | C11, POSIX |
| Build | CMake, multiple build variants |
| Analysis | Python (NumPy, matplotlib), perf counters |
| Validation | QuTiP 5.x reference, Hilbert-Schmidt norm < 1e-6 |
| Paper | LaTeX, quantumarticle class |
Citation
Malarchick, R. (2026). "Cache Hierarchy and Vectorization Analysis of
Lindblad Master Equation Simulation for Near-Term Quantum Control."
arXiv:2603.18052.