CUDA Quantum Simulator: GPU-Accelerated State Vector Simulation

Status: Core Complete

GitHub: github.com/rylanmalarchick/cuda-quantum-simulator

Overview

A CUDA-based quantum state vector simulator that runs quantum circuits on the GPU. It is written in modern C++17 with RAII memory management, includes comprehensive noise modeling, and is validated against a CPU reference implementation.

Performance

Benchmarks (RTX 4070, 8GB VRAM)

Metric	Result
GPU faster than CPU	Starting at 20 qubits (1.1× speedup)
Peak speedup	4.1× at 22 qubits (4M states)
vs. NVIDIA cuStateVec	Competitive (faster at small qubit counts due to lower dispatch overhead; converges at scale)
Maximum qubits	28 qubits (2^28 double-precision amplitudes = 4 GiB; 29 qubits would need the card's full 8GB)
Practical range	20-25 qubits for fast iteration

Validation

GPU kernels validated against CPU reference implementation
156 unit tests across 9 test suites (GoogleTest)
Comparison benchmarks with NVIDIA cuStateVec
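
A GPU/CPU parity check of this kind usually reduces to an element-wise comparison within a floating-point tolerance. The following is a minimal sketch, not the project's test code: the names max_abs_diff and states_match and the default tolerance of 1e-10 are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Largest element-wise distance |a[i] - b[i]| over the common prefix.
double max_abs_diff(const std::vector<Amp>& a, const std::vector<Amp>& b) {
    double worst = 0.0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::abs(a[i] - b[i]));
    }
    return worst;
}

// True when both vectors have the same length and agree within tol.
// The default tolerance is an illustrative assumption, not a project value.
bool states_match(const std::vector<Amp>& a, const std::vector<Amp>& b,
                  double tol = 1e-10) {
    return a.size() == b.size() && max_abs_diff(a, b) <= tol;
}
```

In a test, the GPU result would be copied back to the host and passed to states_match alongside the CPU reference result.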

Architecture

Stage	Component	Description
1	Circuit Parser	Simple text format, or the parser reused from the optimizer project
2	State Vector (GPU)	2^n complex amplitudes on GPU
3	Gate Kernels	CUDA kernels for H, X, CNOT, Rz, etc.
4	Measurement	Probability extraction, sampling
5	Validation	Compare against CPU reference
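
The measurement stage (probability extraction, then sampling) can be sketched on the host as follows. This is a CPU illustration under stated assumptions, not the project's GPU code: the function names are hypothetical, and the uniform draw u is passed in explicitly so the behaviour is deterministic.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Probability of each basis state: p[i] = |amplitude[i]|^2.
std::vector<double> probabilities(const std::vector<Amp>& state) {
    std::vector<double> p;
    p.reserve(state.size());
    for (const Amp& a : state) {
        p.push_back(std::norm(a));  // std::norm returns the squared magnitude
    }
    return p;
}

// Inverse-CDF sampling: return the first index whose cumulative
// probability exceeds u, where u is a uniform draw in [0, 1).
std::size_t sample_index(const std::vector<double>& p, double u) {
    double cumulative = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        cumulative += p[i];
        if (u < cumulative) return i;
    }
    // Rounding can leave the total slightly below 1; fall back to the last index.
    return p.empty() ? 0 : p.size() - 1;
}
```

Repeated shots would draw u from a generator such as std::mt19937 with std::uniform_real_distribution<double> and tally the returned indices.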

Key Features

Gate Set (17 gates)

Type	Gates
Single-qubit	X, Y, Z, H, S, T, S†, T†, Rx, Ry, Rz
Two-qubit	CNOT, CZ, SWAP, CRY, CRZ
Three-qubit	Toffoli (CCX)

Noise Models (6 channels)

Depolarizing: Random Pauli errors with probability p
Amplitude damping: T1 relaxation (|1⟩ → |0⟩)
Phase damping: Pure dephasing (loss of coherence without energy loss; contributes to T2)
Bit flip, Phase flip, Bit-phase flip
Monte Carlo trajectory simulation with batched execution
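
In a trajectory simulation, a depolarizing channel becomes a random choice of which Pauli gate (if any) to apply in each run. The sketch below assumes the common convention that identity is applied with probability 1 - p and each of X, Y, Z with probability p/3; the project may parameterize the channel differently. The enum and function name are hypothetical.

```cpp
#include <cstddef>

enum class Pauli { I, X, Y, Z };

// Map a uniform draw u in [0, 1) to the Pauli applied in one trajectory.
// Assumed convention: I with probability 1 - p, and X, Y, Z with p/3 each.
Pauli sample_depolarizing(double p, double u) {
    if (u >= p) return Pauli::I;        // no error in this trajectory
    const double third = p / 3.0;
    if (u < third) return Pauli::X;
    if (u < 2.0 * third) return Pauli::Y;
    return Pauli::Z;
}
```

Averaging measurement results over many trajectories approximates the mixed state that a density matrix simulation would evolve directly.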

Simulation Modes

State Vector: Scales to 28 qubits on RTX 4070 (8GB VRAM)
Density Matrix: Mixed state evolution (1-14 qubits, O(4^n) memory)
Batched Monte Carlo: Parallel trajectories for noise simulation
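
The qubit limits above follow from simple memory arithmetic. The sketch below assumes 16 bytes per amplitude (the size of a cuDoubleComplex) and ignores any scratch buffers; the function names are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>

// Two 8-byte doubles per complex amplitude (cuDoubleComplex).
constexpr std::uint64_t kBytesPerAmplitude = 16;

// State vector: 2^n amplitudes.
constexpr std::uint64_t state_vector_bytes(unsigned n) {
    return (std::uint64_t{1} << n) * kBytesPerAmplitude;
}

// Density matrix: 2^n x 2^n = 4^n entries.
constexpr std::uint64_t density_matrix_bytes(unsigned n) {
    return (std::uint64_t{1} << (2 * n)) * kBytesPerAmplitude;
}

// Largest n whose state vector fits in `budget` bytes (assumes budget >= 16).
constexpr unsigned max_state_vector_qubits(std::uint64_t budget) {
    unsigned n = 0;
    while (n < 58 && state_vector_bytes(n + 1) <= budget) ++n;
    return n;
}
```

A 28-qubit state vector and a 14-qubit density matrix both come to 2^32 bytes (4 GiB). A 29-qubit state vector needs 2^33 bytes, which is the entire capacity of an 8GB card, so it cannot fit once anything else is resident.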

Software Engineering

MIT License with SPDX headers on all 31 source files
Modern C++17: [[nodiscard]], noexcept, move semantics
RAII Memory Management: CudaMemory<T> wrapper eliminates raw pointer handling
CUDA error checking: CUDA_CHECK() macro with file/line info
Fluent API: circuit.h(0).cnot(0, 1).rz(1, M_PI/4)
GoogleTest integration with CMake
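
The fluent style in circuit.h(0).cnot(0, 1).rz(1, M_PI/4) works by having each gate method return a reference to the circuit. The class below is a minimal sketch of that pattern, not the project's Circuit type: GateOp, ops(), and the string gate names are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one gate application.
struct GateOp {
    std::string name;
    std::vector<unsigned> qubits;
    double angle = 0.0;
};

class Circuit {
public:
    // Each method appends a gate and returns *this so calls can be chained.
    Circuit& h(unsigned q) { return add("h", {q}); }
    Circuit& cnot(unsigned c, unsigned t) { return add("cnot", {c, t}); }
    Circuit& rz(unsigned q, double theta) { return add("rz", {q}, theta); }

    [[nodiscard]] const std::vector<GateOp>& ops() const noexcept { return ops_; }

private:
    Circuit& add(std::string name, std::vector<unsigned> qubits, double angle = 0.0) {
        ops_.push_back(GateOp{std::move(name), std::move(qubits), angle});
        return *this;
    }

    std::vector<GateOp> ops_;
};
```

Returning a reference rather than a copy keeps the chain operating on a single gate list.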

Test Coverage

Test Suite	Tests	Focus
test_statevector	15	State initialization, normalization
test_gates	26	Gate correctness
test_gate_algebra	24	Gate composition
test_gpu_cpu_equivalence	13	CPU/GPU parity
test_boundary	18	Edge cases
test_noise	22	Noise channels
test_density_matrix	26	Mixed states
test_optimized_gates	8	Kernel optimizations
test_warmup	4	GPU initialization
Total	156	9 suites

Technical Approach

RAII Memory Management

All GPU memory uses the CudaMemory<T> RAII wrapper:

CudaMemory<cuDoubleComplex> d_state(size);  // Allocates on construction
// Freed automatically when d_state goes out of scope; no manual cudaFree needed
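
The ownership pattern behind such a wrapper can be sketched as below. This is not the project's CudaMemory<T>: the class name OwnedBuffer is hypothetical, and std::malloc/std::free stand in for cudaMalloc/cudaFree so the example compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

// Hypothetical stand-in for a device-memory RAII wrapper.
// std::malloc/std::free take the place of cudaMalloc/cudaFree.
template <typename T>
class OwnedBuffer {
public:
    explicit OwnedBuffer(std::size_t count) : count_(count) {
        if (count_ > 0) {
            ptr_ = static_cast<T*>(std::malloc(count_ * sizeof(T)));
            if (ptr_ == nullptr) throw std::bad_alloc{};
        }
    }

    ~OwnedBuffer() { std::free(ptr_); }  // freeing nullptr is a no-op

    // Copying is disabled so exactly one object owns the allocation.
    OwnedBuffer(const OwnedBuffer&) = delete;
    OwnedBuffer& operator=(const OwnedBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    OwnedBuffer(OwnedBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}

    OwnedBuffer& operator=(OwnedBuffer&& other) noexcept {
        if (this != &other) {
            std::free(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    [[nodiscard]] T* get() const noexcept { return ptr_; }
    [[nodiscard]] std::size_t size() const noexcept { return count_; }

private:
    T* ptr_ = nullptr;
    std::size_t count_ = 0;
};
```

Deleting the copy operations and keeping only moves means the allocation is released exactly once, even when a buffer is returned from a function or stored in a container.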

Single-Qubit Gate Strategy

For a single-qubit gate on qubit q:

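The state vector has 2^n complex amplitudes
The gate mixes pairs of amplitudes whose indices differ only in bit q, giving 2^(n-1) independent pairs
Each CUDA thread handles one pair, so no two threads write the same amplitude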
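
The pair indexing can be illustrated with a CPU loop in which each iteration plays the role of one thread. This is a sketch rather than the project's kernel, and it assumes qubit q corresponds to bit q of the amplitude index (the project's bit ordering is not stated here). The names Gate2x2 and apply_single_qubit_gate are hypothetical.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;
using Gate2x2 = std::array<Amp, 4>;  // row-major: {u00, u01, u10, u11}

// Apply a 2x2 gate to qubit q of an n-qubit state (state.size() == 2^n).
// Assumes qubit q maps to bit q of the amplitude index.
void apply_single_qubit_gate(std::vector<Amp>& state, unsigned q, const Gate2x2& u) {
    const std::size_t mask = std::size_t{1} << q;
    const std::size_t num_pairs = state.size() / 2;
    for (std::size_t k = 0; k < num_pairs; ++k) {  // k would be the thread index
        // Insert a 0 at bit position q of k to get the index with bit q clear.
        const std::size_t low = k & (mask - 1);
        const std::size_t i0 = ((k >> q) << (q + 1)) | low;
        const std::size_t i1 = i0 | mask;  // partner index with bit q set
        const Amp a0 = state[i0];
        const Amp a1 = state[i1];
        state[i0] = u[0] * a0 + u[1] * a1;
        state[i1] = u[2] * a0 + u[3] * a1;
    }
}
```

Because every value of k yields a distinct (i0, i1) pair, the iterations are independent and can run in any order or in parallel.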

Two-Qubit Gate Strategy

For CNOT on control c, target t:

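Only basis states whose control bit c is 1 are affected
Within that subset, amplitudes whose indices differ only in target bit t are swapped
Indexing assigns each swap pair to exactly one thread, so the in-place update has no race conditions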
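
One way to get race-free indexing is to enumerate the 2^(n-2) settings of the other qubits and build each swap pair from them. The CPU sketch below illustrates that idea under the assumption that qubit k maps to bit k of the amplitude index; it is not taken from the project's kernel, and the function names are hypothetical.

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Amp = std::complex<double>;

// Insert a 0 bit at position pos, shifting the higher bits up by one.
std::size_t insert_zero_bit(std::size_t value, unsigned pos) {
    const std::size_t low = value & ((std::size_t{1} << pos) - 1);
    return ((value >> pos) << (pos + 1)) | low;
}

// CNOT on an n-qubit state (state.size() == 2^n, control != target).
void apply_cnot(std::vector<Amp>& state, unsigned control, unsigned target) {
    const unsigned lo = control < target ? control : target;
    const unsigned hi = control < target ? target : control;
    const std::size_t cmask = std::size_t{1} << control;
    const std::size_t tmask = std::size_t{1} << target;
    const std::size_t num_work = state.size() / 4;
    for (std::size_t k = 0; k < num_work; ++k) {  // k would be the thread index
        // Open up zero bits at both positions, lower position first.
        const std::size_t base = insert_zero_bit(insert_zero_bit(k, lo), hi);
        const std::size_t i0 = base | cmask;  // control = 1, target = 0
        const std::size_t i1 = i0 | tmask;    // control = 1, target = 1
        std::swap(state[i0], state[i1]);
    }
}
```

Amplitudes with the control bit clear are never read or written, and each affected amplitude belongs to exactly one pair, which is what makes the in-place swap safe to parallelize.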

Optimized Kernels

Shared memory tiling for data locality
Coalesced memory access patterns
Occupancy optimization

Integration with Other Projects

Stage	Project	Role
1	quantum-circuit-optimizer	Optimizes circuits
2	cuda-quantum-simulator (this project)	Validates circuits
3	QubitPulseOpt	Generates control pulses

Use cases:

Validate quantum circuit optimizer (run before/after, compare results)
Test pulse sequences from QubitPulseOpt (simulate ideal vs noisy)
Benchmark optimization quality (depth reduction → faster simulation)
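
For the before/after comparison, one reasonable metric is state fidelity, which ignores a global phase that an optimizer may legitimately introduce. The function below is a minimal sketch assuming both inputs are normalized state vectors of equal length; it is an illustration, not a description of how the project compares results.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Fidelity |<a|b>|^2 between two normalized state vectors.
// A value of 1 means the states are equal up to a global phase.
double fidelity(const std::vector<Amp>& a, const std::vector<Amp>& b) {
    Amp overlap{0.0, 0.0};
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        overlap += std::conj(a[i]) * b[i];
    }
    return std::norm(overlap);  // squared magnitude of the inner product
}
```

Running the original and optimized circuits from the same initial state and checking that the fidelity is within a small tolerance of 1 gives a simple equivalence test for that input.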

Technology Stack

Category	Technology
Language	CUDA C++ with C++17 host code
Build	CMake with CUDA support
Memory	RAII wrappers (CudaMemory<T>)
Testing	GoogleTest (156 tests, 9 suites)
Profiling	Nsight Systems
License	MIT (SPDX headers on all files)

Roadmap

State vector simulation (done)
Density matrix simulation (done)
Comprehensive noise models (done)
cuStateVec comparison benchmarks (done)
Multi-GPU support (only remaining TODO)

Why Build From Scratch?

Using cuQuantum would hide the interesting parts. Building a simulator from scratch teaches:

GPU memory hierarchy and kernel optimization
Coalescing, occupancy, and memory access patterns
Real CUDA programming beyond PyTorch abstractions