Status: Core Complete
Overview
A CUDA-based quantum state vector simulator that efficiently simulates quantum circuits on GPU. It is written in modern C++17 with RAII memory management, includes noise modeling (depolarizing, amplitude damping, and phase damping channels), and is validated against Qiskit Aer.
Architecture
Key Features
Implemented (Core Complete)
- State Vector Simulation: Up to 29 qubits on single GPU (12GB VRAM)
- Single-qubit gates: X, Y, Z, H, S, T, Sdag, Tdag, Rx, Ry, Rz
- Multi-qubit gates: CNOT, CZ, CRY, CRZ, SWAP (two-qubit), Toffoli (three-qubit)
- Optimized kernels: Coalesced memory access, shared memory variants
- Density matrix simulation: Mixed state evolution with Lindblad noise
- Noise models: Depolarizing, amplitude damping, phase damping channels
- Batched simulation: Multiple circuits in parallel for Monte Carlo
- RAII memory management: CudaMemory<T> wrapper eliminates raw pointer handling
- 9 test suites passing: Gates, algebra, boundary, noise, density matrix, optimized gates, GPU/CPU equivalence
Software Engineering
- MIT License with SPDX headers on all 31 source files
- Modern C++17: [[nodiscard]], noexcept, move semantics
- CUDA error checking: CUDA_CHECK() macro with file/line info
- Fluent API: circuit.h(0).cnot(0, 1).rz(1, M_PI/4)
- GoogleTest integration with CMake
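A call chain like the fluent API example works when every gate method returns a reference to the circuit. The following is a minimal host-side sketch under that assumption. The `Circuit` class here only records operations, and `Op` and `ops()` are illustrative names, not the project's actual API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical gate record; the project's real representation may differ.
struct Op {
    std::string name;
    std::vector<int> qubits;
    double angle;  // only meaningful for rotation gates
};

class Circuit {
public:
    // Each builder appends one operation and returns *this,
    // which is what makes circuit.h(0).cnot(0, 1) chainable.
    Circuit& h(int q) {
        assert(q >= 0);
        ops_.push_back({"h", {q}, 0.0});
        return *this;
    }
    Circuit& cnot(int control, int target) {
        assert(control >= 0 && target >= 0 && control != target);
        ops_.push_back({"cnot", {control, target}, 0.0});
        return *this;
    }
    Circuit& rz(int q, double theta) {
        assert(q >= 0);
        ops_.push_back({"rz", {q}, theta});
        return *this;
    }

    [[nodiscard]] const std::vector<Op>& ops() const noexcept { return ops_; }

private:
    std::vector<Op> ops_;
};
```

Recording operations first and executing them later is one possible design; it keeps circuit construction free of GPU calls, so the same circuit can be replayed on a CPU reference path.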
Hardware Target
- RTX 4070 (12GB VRAM)
- Maximum qubits: ~29 (2^29 amplitudes * 16 bytes = 8 GiB; 30 qubits would need 16 GiB, more than the 12GB available)
- Practical target: 20-25 qubits for fast iteration
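The qubit limit follows directly from the amplitude count. A small sketch of that arithmetic, assuming 16-byte double-precision complex amplitudes (the function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for an n-qubit state vector: 2^n amplitudes, 16 bytes each
// (two doubles, the size of cuDoubleComplex).
inline std::uint64_t state_vector_bytes(unsigned n_qubits) {
    assert(n_qubits < 60);  // keeps 2^n * 16 inside 64 bits
    return (std::uint64_t{1} << n_qubits) * 16u;
}

// Largest qubit count whose state vector fits in budget_bytes.
// Assumes the budget holds at least one amplitude (16 bytes).
inline unsigned max_qubits(std::uint64_t budget_bytes) {
    assert(budget_bytes >= 16);
    unsigned n = 0;
    while (n < 59 && state_vector_bytes(n + 1) <= budget_bytes) ++n;
    return n;
}
```

This counts only the state vector itself; any scratch buffers or batched copies reduce the usable budget further.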
Technical Approach
RAII Memory Management
All GPU memory uses the CudaMemory<T> RAII wrapper:
CudaMemory<cuDoubleComplex> d_state(size); // Allocates on construction
// Automatically freed when out of scope - no manual cudaFree needed
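The ownership pattern behind such a wrapper can be sketched without the CUDA toolkit. In the host-only illustration below, `std::malloc` and `std::free` stand in for `cudaMalloc` and `cudaFree`, and `HostBuffer` is a hypothetical name; the real `CudaMemory<T>` may differ in interface and error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <type_traits>
#include <utility>

// Host-only stand-in: malloc/free replace cudaMalloc/cudaFree here.
template <typename T>
class HostBuffer {
    static_assert(std::is_trivially_copyable_v<T>, "raw storage only");

public:
    explicit HostBuffer(std::size_t count)
        : ptr_(static_cast<T*>(std::malloc(count * sizeof(T)))), count_(count) {
        if (count != 0 && ptr_ == nullptr) throw std::bad_alloc{};
    }
    ~HostBuffer() { std::free(ptr_); }  // released when the owner leaves scope

    // Copying is disabled so two objects never free the same allocation.
    HostBuffer(const HostBuffer&) = delete;
    HostBuffer& operator=(const HostBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    HostBuffer(HostBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}
    HostBuffer& operator=(HostBuffer&& other) noexcept {
        if (this != &other) {
            std::free(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    T& operator[](std::size_t i) {
        assert(i < count_);
        return ptr_[i];
    }
    [[nodiscard]] T* get() const noexcept { return ptr_; }
    [[nodiscard]] std::size_t size() const noexcept { return count_; }

private:
    T* ptr_ = nullptr;
    std::size_t count_ = 0;
};
```

A device version would not offer `operator[]` on the host, since device pointers cannot be dereferenced from host code; it is included here only to make the stand-in usable.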
Single-Qubit Gate Strategy
For a single-qubit gate on qubit q:
- State vector has 2^n elements
- Gate affects pairs of elements differing only in bit q
- Each CUDA thread handles one pair
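The pair indexing can be checked on the CPU before writing a kernel. The sketch below is a host reference, not the project's kernel: the loop body is what one thread would do for pair `k`, and it assumes qubit q corresponds to bit q of the amplitude index. `pair_base` and `apply_gate` are illustrative names.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Index of the pair member with bit q cleared, for pair number k.
// Inserts a 0 at bit position q of k: the low q bits stay, the rest shift up.
inline std::size_t pair_base(std::size_t k, unsigned q) {
    const std::size_t low_mask = (std::size_t{1} << q) - 1;
    return ((k & ~low_mask) << 1) | (k & low_mask);
}

// Apply a 2x2 gate, given row-major as {u00, u01, u10, u11}, to qubit q.
inline void apply_gate(std::vector<Amp>& state, unsigned q, const Amp (&u)[4]) {
    const std::size_t stride = std::size_t{1} << q;
    assert(stride < state.size());
    // 2^n amplitudes form 2^(n-1) independent pairs.
    for (std::size_t k = 0; k < state.size() / 2; ++k) {
        const std::size_t i0 = pair_base(k, q);  // bit q = 0
        const std::size_t i1 = i0 | stride;      // bit q = 1
        const Amp a0 = state[i0];
        const Amp a1 = state[i1];
        state[i0] = u[0] * a0 + u[1] * a1;
        state[i1] = u[2] * a0 + u[3] * a1;
    }
}
```

Because every index belongs to exactly one pair, the iterations are independent, which is what allows mapping them one-to-one onto threads.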
Two-Qubit Gate Strategy
For CNOT on control c, target t:
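- Only amplitudes whose control bit is 1 are affected
- Within that subset, amplitudes that differ only in the target bit are swapped
- Indexing must assign each swapped pair to exactly one thread, so that no two threads write the same amplitude (this is what avoids race conditions)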
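One way to get disjoint pairs for CNOT is to enumerate only the indices with both the control and target bits cleared, then set the control bit. The host reference below sketches that indexing under the assumption that qubit q is bit q of the index; it is not the project's kernel, and `insert_zero_bit` and `apply_cnot` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Amp = std::complex<double>;

// Insert a 0 at bit position pos of k, shifting the higher bits up by one.
inline std::size_t insert_zero_bit(std::size_t k, unsigned pos) {
    const std::size_t low_mask = (std::size_t{1} << pos) - 1;
    return ((k & ~low_mask) << 1) | (k & low_mask);
}

inline void apply_cnot(std::vector<Amp>& state, unsigned control, unsigned target) {
    assert(control != target);
    const std::size_t cbit = std::size_t{1} << control;
    const std::size_t tbit = std::size_t{1} << target;
    assert(cbit < state.size() && tbit < state.size());
    const unsigned lo = std::min(control, target);
    const unsigned hi = std::max(control, target);
    // A quarter of the indices have control = 1 and target = 0;
    // each one names a distinct pair to swap.
    for (std::size_t k = 0; k < state.size() / 4; ++k) {
        // Free the lower position first, then the higher one.
        const std::size_t base = insert_zero_bit(insert_zero_bit(k, lo), hi);
        const std::size_t i0 = base | cbit;  // control = 1, target = 0
        const std::size_t i1 = i0 | tbit;    // control = 1, target = 1
        std::swap(state[i0], state[i1]);
    }
}
```

Amplitudes with control bit 0 are never read or written, so only a quarter as many work items are needed as there are amplitudes.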
Noise Modeling
- Depolarizing: Random Pauli errors with probability p
- Amplitude damping: T1 relaxation (|1⟩ → |0⟩)
- Phase damping: T2 dephasing
- Monte Carlo trajectory simulation with batched execution
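In a trajectory simulation, each run draws one concrete error per noisy gate instead of evolving a density matrix. The sketch below shows that draw for a depolarizing channel, assuming the common convention of identity with probability 1 - p and X, Y, or Z with probability p/3 each; the project may parameterize p differently. `Pauli` and `sample_depolarizing` are illustrative names.

```cpp
#include <cassert>
#include <random>

enum class Pauli { I, X, Y, Z };

// Pick the error for one trajectory: identity with probability 1 - p,
// otherwise X, Y, or Z with probability p/3 each (assumed convention).
template <typename Rng>
Pauli sample_depolarizing(double p, Rng& rng) {
    assert(p >= 0.0 && p <= 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double r = uniform(rng);
    if (r >= p) return Pauli::I;
    if (r < p / 3.0) return Pauli::X;
    if (r < 2.0 * p / 3.0) return Pauli::Y;
    return Pauli::Z;
}
```

Averaging measurement statistics over many such trajectories approximates the mixed-state result, which is where running a batch of circuits in parallel pays off.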
Benchmarking
Validation
- Compare against Qiskit Aer statevector simulator
- Test cases: Bell state, GHZ state, random circuits
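Two simulators can return the same state up to a global phase, so an element-wise comparison can fail spuriously. One phase-insensitive check is the state fidelity |⟨a|b⟩|^2, sketched below. This is an illustration of the comparison, not necessarily the metric the project's tests use, and it assumes both vectors are normalized and use the same qubit ordering.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Fidelity |<a|b>|^2 between two normalized state vectors.
// Equals 1 when the states match up to a global phase, 0 when orthogonal.
inline double fidelity(const std::vector<Amp>& a, const std::vector<Amp>& b) {
    assert(a.size() == b.size());
    Amp overlap = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        overlap += std::conj(a[i]) * b[i];
    }
    return std::norm(overlap);  // squared magnitude
}
```

Qiskit orders qubits little-endian (qubit 0 is the least significant bit), so exported amplitudes may need reindexing before comparison if the simulator uses a different convention.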
Performance Metrics
- Gates per second for different gate types
- Qubit scaling (time vs number of qubits)
- GPU vs CPU speedup comparison
- Profiling via Nsight Systems
Integration with Other Projects
Use cases:
- Validate quantum circuit optimizer (run before/after, compare results)
- Test pulse sequences from QubitPulseOpt (simulate ideal vs noisy)
- Benchmark optimization quality (depth reduction → faster simulation)
Technology Stack
- Language: CUDA C++ with C++17 host code
- Build: CMake with CUDA support
- Memory: RAII wrappers (CudaMemory<T>)
- Testing: GoogleTest (9 test suites)
- Profiling: Nsight Systems
- License: MIT (SPDX headers on all files)
Why Build From Scratch?
Using cuQuantum would hide the interesting parts. Building a simulator from scratch teaches:
- GPU memory hierarchy and kernel optimization
- Coalescing, occupancy, and memory access patterns
- Real CUDA programming beyond PyTorch abstractions