Status: Core Complete
GitHub: github.com/rylanmalarchick/cuda-quantum-simulator
Overview
A CUDA-based quantum state vector simulator that runs quantum circuits on the GPU. It is written in modern C++17 with RAII memory management, includes comprehensive noise modeling, and is validated against a CPU reference implementation.
Performance
Benchmarks (RTX 4070, 8GB VRAM)
| Metric | Result |
|---|---|
| GPU faster than CPU | Starting at 20 qubits (1.1× speedup) |
| Peak speedup | 4.1× at 22 qubits (4M states) |
| vs. NVIDIA cuStateVec | Competitive (faster at small qubit counts due to lower dispatch overhead; converges at scale) |
| Maximum qubits | 28 qubits (8GB VRAM limit) |
| Practical range | 20-25 qubits for fast iteration |
Validation
- GPU kernels validated against CPU reference implementation
- 156 unit tests across 9 test suites (GoogleTest)
- Comparison benchmarks with NVIDIA cuStateVec
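A GPU-versus-CPU parity check of the kind listed above can be sketched as an element-wise comparison of two state vectors. This is a minimal illustration, not the project's test code: the names `max_abs_diff` and `states_match` and the `1e-10` tolerance are assumptions, and the GPU result is assumed to have already been copied back to the host.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Largest element-wise distance between two state vectors of equal length.
double max_abs_diff(const std::vector<Amp>& a, const std::vector<Amp>& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::abs(a[i] - b[i]));
    }
    return worst;
}

// Hypothetical parity check; the default tolerance is illustrative only.
bool states_match(const std::vector<Amp>& gpu, const std::vector<Amp>& cpu,
                  double tol = 1e-10) {
    return gpu.size() == cpu.size() && max_abs_diff(gpu, cpu) <= tol;
}
```

An element-wise check like this is strict: it treats two states that differ only by a global phase as different, which is usually what is wanted when both paths apply the same gate matrices.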
Architecture
| Stage | Component | Description |
|---|---|---|
| 1 | Circuit Parser | Simple text format, or the parser reused from the quantum-circuit-optimizer project |
| 2 | State Vector (GPU) | 2^n complex amplitudes on GPU |
| 3 | Gate Kernels | CUDA kernels for H, X, CNOT, Rz, etc. |
| 4 | Measurement | Probability extraction, sampling |
| 5 | Validation | Compare against CPU reference |
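The measurement stage (probability extraction, then sampling) can be illustrated on the host with a few lines of C++. This is a sketch under stated assumptions: the function names are hypothetical, the state is assumed to be normalized, and sampling uses a plain inverse-CDF scan driven by a caller-supplied uniform number `u` in [0, 1). The project's GPU implementation may be organized differently.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Probability of each basis state: |amplitude|^2.
std::vector<double> probabilities(const std::vector<Amp>& state) {
    std::vector<double> p;
    p.reserve(state.size());
    for (const Amp& a : state) {
        p.push_back(std::norm(a));  // std::norm returns the squared magnitude
    }
    return p;
}

// Inverse-CDF sampling: returns the first index whose cumulative
// probability exceeds u, where u is uniform in [0, 1).
std::size_t sample_index(const std::vector<double>& probs, double u) {
    assert(!probs.empty());
    double cumulative = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        cumulative += probs[i];
        if (u < cumulative) return i;
    }
    return probs.size() - 1;  // rounding guard if the total falls just short of u
}
```

Passing `u` in explicitly keeps the function deterministic and easy to test; a caller would typically draw it from `std::uniform_real_distribution<double>`.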
Key Features
Gate Set (17 gates)
| Type | Gates |
|---|---|
| Single-qubit | X, Y, Z, H, S, T, S†, T†, Rx, Ry, Rz |
| Two-qubit | CNOT, CZ, SWAP, CRY, CRZ |
| Three-qubit | Toffoli (CCX) |
Noise Models (6 channels)
- Depolarizing: Random Pauli errors with probability p
- Amplitude damping: T1 relaxation (|1⟩ → |0⟩)
- Phase damping: T2 dephasing
- Bit flip, Phase flip, Bit-phase flip
- Monte Carlo trajectory simulation with batched execution
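In a Monte Carlo trajectory, a noise channel is applied by randomly choosing one of its outcomes at each step. The sketch below shows that choice for the depolarizing channel only. It assumes the common convention in which X, Y, and Z each occur with probability p/3; the project may parameterize p differently, and `Pauli` and `sample_depolarizing` are hypothetical names.

```cpp
#include <cassert>

enum class Pauli { I, X, Y, Z };

// Chooses the Pauli error for one trajectory step.
// p is the total error probability; u is uniform in [0, 1).
// Assumed convention: X, Y, Z are equally likely, each with probability p/3.
Pauli sample_depolarizing(double p, double u) {
    assert(p >= 0.0 && p <= 1.0);
    if (u >= p) return Pauli::I;            // no error with probability 1 - p
    if (u < p / 3.0) return Pauli::X;
    if (u < 2.0 * p / 3.0) return Pauli::Y;
    return Pauli::Z;
}
```

A batched run would draw one `u` per trajectory and per noisy gate, apply the selected Pauli to that trajectory's state, and average measurement statistics over all trajectories.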
Simulation Modes
- State Vector: Scales to 28 qubits on RTX 4070 (8GB VRAM)
- Density Matrix: Mixed state evolution (1-14 qubits, O(4^n) memory)
- Batched Monte Carlo: Parallel trajectories for noise simulation
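The qubit limits above follow from simple memory arithmetic. The sketch below assumes 16 bytes per amplitude (a double-precision complex number, as `cuDoubleComplex` suggests) and ignores any scratch buffers. The helper names are hypothetical.

```cpp
#include <cstdint>

// Assumption: one amplitude is a double-precision complex number (2 x 8 bytes).
constexpr std::uint64_t kBytesPerAmplitude = 16;
constexpr std::uint64_t kGiB = std::uint64_t{1} << 30;

// A state vector stores 2^n amplitudes.
constexpr std::uint64_t state_vector_bytes(unsigned n) {
    return (std::uint64_t{1} << n) * kBytesPerAmplitude;
}

// A density matrix stores 2^n x 2^n = 4^n entries.
constexpr std::uint64_t density_matrix_bytes(unsigned n) {
    return (std::uint64_t{1} << (2 * n)) * kBytesPerAmplitude;
}

static_assert(state_vector_bytes(28) == 4 * kGiB, "28 qubits need 4 GiB");
static_assert(state_vector_bytes(29) == 8 * kGiB, "29 qubits need 8 GiB");
static_assert(density_matrix_bytes(14) == 4 * kGiB, "14 qubits need 4 GiB");
```

Under these assumptions a 29-qubit state vector would need the full 8 GiB on its own, which is why 28 qubits is the ceiling on an 8GB card, and a 14-qubit density matrix costs the same 4 GiB as a 28-qubit state vector.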
Software Engineering
- MIT License with SPDX headers on all 31 source files
- Modern C++17: [[nodiscard]], noexcept, move semantics
- RAII Memory Management: CudaMemory<T> wrapper eliminates raw pointer handling
- CUDA error checking: CUDA_CHECK() macro with file/line info
- Fluent API: circuit.h(0).cnot(0, 1).rz(1, M_PI/4)
- GoogleTest integration with CMake
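The fluent style in the list above works by having each gate method return a reference to the circuit. The following is a minimal sketch of that pattern, not the project's class: `Circuit`, `GateOp`, and the `ops()` accessor are illustrative names, and only three gates are shown.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical gate record; field names are illustrative.
struct GateOp {
    std::string name;
    std::vector<unsigned> qubits;
    double angle;  // 0.0 for gates without a rotation parameter
};

class Circuit {
public:
    explicit Circuit(unsigned num_qubits) : num_qubits_(num_qubits) {}

    // Each builder method returns *this so that calls can be chained.
    Circuit& h(unsigned q) { return add("h", {q}, 0.0); }
    Circuit& cnot(unsigned control, unsigned target) {
        return add("cnot", {control, target}, 0.0);
    }
    Circuit& rz(unsigned q, double theta) { return add("rz", {q}, theta); }

    [[nodiscard]] const std::vector<GateOp>& ops() const noexcept { return ops_; }

private:
    Circuit& add(std::string name, std::vector<unsigned> qubits, double angle) {
        for (unsigned q : qubits) assert(q < num_qubits_);  // bounds check in debug builds
        ops_.push_back(GateOp{std::move(name), std::move(qubits), angle});
        return *this;
    }

    unsigned num_qubits_;
    std::vector<GateOp> ops_;
};
```

With this sketch, `Circuit c(2); c.h(0).cnot(0, 1).rz(1, 0.785);` records three operations in order, and a simulator backend would later walk `ops()` and launch one kernel per entry.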
Test Coverage
| Test Suite | Tests | Focus |
|---|---|---|
| test_statevector | 15 | State initialization, normalization |
| test_gates | 26 | Gate correctness |
| test_gate_algebra | 24 | Gate composition |
| test_gpu_cpu_equivalence | 13 | CPU/GPU parity |
| test_boundary | 18 | Edge cases |
| test_noise | 22 | Noise channels |
| test_density_matrix | 26 | Mixed states |
| test_optimized_gates | 8 | Kernel optimizations |
| test_warmup | 4 | GPU initialization |
| Total | 156 | |
Technical Approach
RAII Memory Management
All GPU memory uses the CudaMemory<T> RAII wrapper:
CudaMemory<cuDoubleComplex> d_state(size); // Allocates on construction
// Automatically freed when out of scope - no manual cudaFree needed
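The definition of `CudaMemory<T>` is not shown here, so the following is only a sketch of the RAII pattern it describes. `ScopedBuffer` and `HostBackend` are hypothetical names, and the backend uses host `malloc`/`free` so the example compiles without the CUDA toolkit. A CUDA build would presumably call `cudaMalloc` and `cudaFree` at the marked points instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

// Hypothetical allocation backend. A CUDA version would call
// cudaMalloc / cudaFree here; host malloc is a stand-in.
struct HostBackend {
    static int& live() { static int n = 0; return n; }  // live allocations
    static void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p == nullptr) throw std::bad_alloc();
        ++live();
        return p;
    }
    static void release(void* p) noexcept {
        if (p != nullptr) { std::free(p); --live(); }
    }
};

template <typename T, typename Backend = HostBackend>
class ScopedBuffer {
public:
    explicit ScopedBuffer(std::size_t count)
        : ptr_(static_cast<T*>(Backend::allocate(count * sizeof(T)))),
          count_(count) {}
    ~ScopedBuffer() { Backend::release(ptr_); }  // freed when out of scope

    // Owning handle: copying is disabled, moving transfers ownership.
    ScopedBuffer(const ScopedBuffer&) = delete;
    ScopedBuffer& operator=(const ScopedBuffer&) = delete;
    ScopedBuffer(ScopedBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}
    ScopedBuffer& operator=(ScopedBuffer&& other) noexcept {
        if (this != &other) {
            Backend::release(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    [[nodiscard]] T* get() const noexcept { return ptr_; }
    [[nodiscard]] std::size_t size() const noexcept { return count_; }

private:
    T* ptr_ = nullptr;
    std::size_t count_ = 0;
};
```

Deleting the copy operations is the key design choice: exactly one object owns each allocation, so the destructor can free it unconditionally and double frees are ruled out at compile time.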
Single-Qubit Gate Strategy
For a single-qubit gate on qubit q:
- State vector has 2^n elements
- Gate affects pairs of elements differing only in bit q
- Each CUDA thread handles one pair
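The pairing scheme above can be written as a CPU reference loop in which the loop counter `k` plays the role of the CUDA thread index. This is a sketch under assumptions: qubit q is taken to correspond to bit q of the amplitude index (little-endian), the gate is a row-major 2×2 matrix, and all names are hypothetical rather than taken from the project.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;
using Gate2x2 = std::array<Amp, 4>;  // row-major: {u00, u01, u10, u11}

// Expands pair index k into a full index whose bit q is 0.
inline std::size_t insert_zero_bit(std::size_t k, unsigned q) {
    const std::size_t low = k & ((std::size_t{1} << q) - 1);
    const std::size_t high = (k >> q) << (q + 1);
    return high | low;
}

// Applies a single-qubit gate to qubit q. Each iteration touches one
// disjoint pair, which is what one GPU thread would do.
void apply_single_qubit_gate(std::vector<Amp>& state, unsigned q, const Gate2x2& u) {
    const std::size_t stride = std::size_t{1} << q;
    assert(stride < state.size());
    const std::size_t pairs = state.size() / 2;
    for (std::size_t k = 0; k < pairs; ++k) {
        const std::size_t i0 = insert_zero_bit(k, q);  // bit q = 0
        const std::size_t i1 = i0 | stride;            // bit q = 1
        const Amp a0 = state[i0];
        const Amp a1 = state[i1];
        state[i0] = u[0] * a0 + u[1] * a1;
        state[i1] = u[2] * a0 + u[3] * a1;
    }
}

// Hadamard matrix, included as an example gate.
Gate2x2 hadamard() {
    const double r = 1.0 / std::sqrt(2.0);
    return Gate2x2{{Amp{r, 0.0}, Amp{r, 0.0}, Amp{r, 0.0}, Amp{-r, 0.0}}};
}
```

Because every pair is disjoint, the 2^(n-1) iterations are independent and can run in any order, which is what makes the one-thread-per-pair mapping safe.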
Two-Qubit Gate Strategy
For CNOT on control c, target t:
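- Only basis states whose control bit is 1 are affected
- Within those, the two amplitudes that differ only in the target bit are swapped
- Indexing is arranged so that no two threads touch the same amplitude, which avoids race conditions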
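One way to get race-free CNOT indexing is to enumerate only the 2^(n-2) index patterns with both the control and target bits cleared, then set the control bit and swap the two amplitudes that differ in the target bit. The CPU reference sketch below uses that scheme. It assumes qubit q maps to bit q of the index, and the function names are hypothetical. The project's kernel may index differently.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Amp = std::complex<double>;

// Expands index k into a wider index whose bit q is 0.
inline std::size_t insert_zero_bit(std::size_t k, unsigned q) {
    const std::size_t low = k & ((std::size_t{1} << q) - 1);
    const std::size_t high = (k >> q) << (q + 1);
    return high | low;
}

// CNOT: swap amplitudes of |..c=1..t=0..> and |..c=1..t=1..>.
// Each iteration (one GPU thread) owns exactly one such pair.
void apply_cnot(std::vector<Amp>& state, unsigned control, unsigned target) {
    assert(control != target);
    const unsigned lo = std::min(control, target);
    const unsigned hi = std::max(control, target);
    assert((std::size_t{1} << hi) < state.size());
    const std::size_t groups = state.size() / 4;
    for (std::size_t k = 0; k < groups; ++k) {
        // Insert the lower bit first so the higher position is still valid.
        const std::size_t base = insert_zero_bit(insert_zero_bit(k, lo), hi);
        const std::size_t i0 = base | (std::size_t{1} << control);  // c=1, t=0
        const std::size_t i1 = i0 | (std::size_t{1} << target);     // c=1, t=1
        std::swap(state[i0], state[i1]);
    }
}
```

States with the control bit equal to 0 are never visited, so a quarter as many iterations as amplitudes are needed, and no amplitude is read or written by more than one iteration.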
Optimized Kernels
- Shared memory tiling for data locality
- Coalesced memory access patterns
- Occupancy optimization
Integration with Other Projects
| Stage | Project | Role |
|---|---|---|
| 1 | quantum-circuit-optimizer | Optimizes circuits |
| 2 | cuda-quantum-simulator (this project) | Validates circuits |
| 3 | QubitPulseOpt | Generates control pulses |
Use cases:
- Validate quantum circuit optimizer (run before/after, compare results)
- Test pulse sequences from QubitPulseOpt (simulate ideal vs noisy)
- Benchmark optimization quality (depth reduction → faster simulation)
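For the before/after comparison in the first use case, a global-phase-insensitive metric is often more appropriate than element-wise equality, because circuit optimizations may change the global phase without changing any measurement outcome. The sketch below computes the state fidelity |⟨a|b⟩|² for two normalized state vectors. The function name is hypothetical, and this is one possible comparison rather than the project's documented method.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Fidelity |<a|b>|^2 between two normalized pure states.
// Equals 1 when the states differ only by a global phase, 0 when orthogonal.
double fidelity(const std::vector<Amp>& a, const std::vector<Amp>& b) {
    assert(a.size() == b.size());
    Amp overlap{0.0, 0.0};
    for (std::size_t i = 0; i < a.size(); ++i) {
        overlap += std::conj(a[i]) * b[i];
    }
    return std::norm(overlap);  // squared magnitude of the inner product
}
```

A validation harness could simulate the original and optimized circuits from the same initial state and require the fidelity to be within a small tolerance of 1.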
Technology Stack
| Category | Technology |
|---|---|
| Language | CUDA C++ with C++17 host code |
| Build | CMake with CUDA support |
| Memory | RAII wrappers (CudaMemory<T>) |
| Testing | GoogleTest (156 tests, 9 suites) |
| Profiling | Nsight Systems |
| License | MIT (SPDX headers on all files) |
Roadmap
- State vector simulation (done)
- Density matrix simulation (done)
- Comprehensive noise models (done)
- cuStateVec comparison benchmarks (done)
- Multi-GPU support (only remaining TODO)
Why Build From Scratch?
Using cuQuantum would hide the interesting parts. Building a simulator from scratch teaches:
- GPU memory hierarchy and kernel optimization
- Coalescing, occupancy, and memory access patterns
- Real CUDA programming beyond PyTorch abstractions