
lindblad-bench: Cache Hierarchy Analysis of Lindblad Simulation

Tags: Quantum Computing, C, Performance Analysis, HPC, Lindblad, SIMD

Status: Preprint on arXiv

GitHub: github.com/rylanmalarchick/lindblad-bench

Paper: arXiv:2603.18052 -- "Cache Hierarchy and Vectorization Analysis of Lindblad Master Equation Simulation for Near-Term Quantum Control"

Overview

For near-term quantum control -- optimal pulse engineering, feedback controllers, robustness sweeps -- the simulation bottleneck is dense matrix-vector multiplication on small systems (d=3 to d=27). At these scales, Python dispatch overhead dominates actual arithmetic, and the working set fits entirely in CPU cache. This paper analyzes where the time actually goes and what compiler/memory layout choices matter.

The core observation: a 9x9 complex matrix-vector product (d=3 transmon) is 648 FLOPs and 1.3 KB. That fits in L1 cache. The bottleneck isn't the math -- it's everything around it.

Key Results

Metric                              Value
Speedup (SoA + O3 + ffast-math)     2-4x over scalar baseline
Arithmetic intensity                ~0.5 FLOP/byte (bandwidth-bound)
d=3 working set                     1.3 KB (L1 cache)
d=9 working set                     105 KB (L2 boundary)
d=27 working set                    8.5 MB (L3 boundary)

Cache Hierarchy Scaling

Hilbert dim   Liouvillian size      Fits in       Regime
d=3           9x9 = 1.3 KB          L1 (48 KB)    Compute-bound
d=9           81x81 = 105 KB        L2 (2 MB)     Transitional
d=27          729x729 = 8.5 MB      L3 (36 MB)    Bandwidth-bound
d=81          6561x6561 = 690 MB    RAM           Memory-bound

Approach

  1. Bare-metal C library -- no dependencies, C11, cache-aligned allocations
  2. Pade [13/13] matrix exponential -- Higham 2005 scaling-and-squaring for propagator precomputation
  3. Roofline modeling -- measured arithmetic intensity vs peak throughput to identify bottleneck regime
  4. Compiler flag sweep -- scalar, O2, O3, O3+native, O3+native+ffast-math across GCC builds
  5. Assembly analysis -- Godbolt verification of AVX2 FMA generation (vfmadd231pd)
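The flag sweep in step 4 could be driven by something like the following (the source file name and the exact flags of the scalar baseline are assumptions, not taken from the repository):

```shell
# Illustrative build sweep over the five variants named above.
# SRC and the scalar-baseline flags are assumptions, not from the repo.
CC=gcc
SRC=lindblad.c
set -- \
  "-O2 -fno-tree-vectorize" \
  "-O2" \
  "-O3" \
  "-O3 -march=native" \
  "-O3 -march=native -ffast-math"
i=0
for FLAGS in "$@"; do
  echo "variant$i: $CC $FLAGS -c $SRC"
  i=$((i + 1))
done
```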

The key finding: -ffast-math is essential for GCC to auto-vectorize the complex arithmetic. Without it, strict IEEE 754 semantics forbid reassociating floating-point sums and simplifying complex multiplication, exactly the reorderings SIMD code generation requires.

Structure-of-Arrays Layout

The standard approach stores complex numbers as interleaved (re, im) pairs (Array-of-Structures). Switching to separate real and imaginary arrays (Structure-of-Arrays) enables wider SIMD loads and eliminates cross-lane shuffles:

AoS: [re0, im0, re1, im1, re2, im2, ...]  -- 128-bit loads mix re/im
SoA: [re0, re1, re2, re3, ...]             -- 256-bit loads are pure re
     [im0, im1, im2, im3, ...]             -- 256-bit loads are pure im
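With split arrays, the inner loop becomes four independent real multiply-add streams with no shuffles. A minimal sketch of an SoA complex matvec (illustrative kernel, not the library's actual API); with -O3 -march=native -ffast-math, GCC can typically compile the inner loop to packed FMAs:

```c
/* Sketch of a Structure-of-Arrays complex matrix-vector product:
 * real and imaginary parts live in separate arrays, so each lane
 * of a 256-bit load is pure re or pure im -- no cross-lane shuffles.
 * (Illustrative kernel, not the library's actual API.) */
#include <stddef.h>

static void matvec_soa(size_t n,
                       const double *Ar, const double *Ai,  /* n*n, row-major */
                       const double *xr, const double *xi,  /* length n */
                       double *yr, double *yi) {            /* length n */
    for (size_t i = 0; i < n; i++) {
        double sr = 0.0, si = 0.0;
        for (size_t j = 0; j < n; j++) {
            double ar = Ar[i * n + j], ai = Ai[i * n + j];
            /* (ar + i*ai)(xr + i*xi): four real multiply-adds */
            sr += ar * xr[j] - ai * xi[j];
            si += ar * xi[j] + ai * xr[j];
        }
        yr[i] = sr;
        yi[i] = si;
    }
}
```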

Connection to QubitOS

This library is the standalone performance study informing the QubitOS Tier 2 solver (C fast path for d <= 27). The FFI interface in QubitOS is designed to call into this library. The cache hierarchy analysis explains why small-system simulation should use precomputed propagators rather than adaptive ODE integrators.

Technology Stack

Category     Technology
Language     C11, POSIX
Build        CMake, multiple build variants
Analysis     Python (NumPy, matplotlib), perf counters
Validation   QuTiP 5.x reference, Hilbert-Schmidt norm < 1e-6
Paper        LaTeX, quantumarticle class

Citation

Malarchick, R. (2026). "Cache Hierarchy and Vectorization Analysis of
Lindblad Master Equation Simulation for Near-Term Quantum Control."
arXiv:2603.18052.