CUDA Quantum Simulator: GPU-Accelerated State Vector Simulation

Tags: Quantum Computing, CUDA, GPU, C++, HPC

Status: Core Complete

GitHub: github.com/rylanmalarchick/cuda-quantum-simulator

Overview

A CUDA-based quantum state vector simulator that simulates quantum circuits efficiently on the GPU. It is written in modern C++17 with RAII memory management, includes comprehensive noise modeling, and is validated against a CPU reference implementation.

Performance

Benchmarks (RTX 4070, 8GB VRAM)

| Metric | Result |
| --- | --- |
| GPU faster than CPU | Starting at 20 qubits (1.1× speedup) |
| Peak speedup | 4.1× at 22 qubits (2^22 ≈ 4.2M amplitudes) |
| vs. NVIDIA cuStateVec | Competitive (faster at small qubit counts due to lower dispatch overhead; converges at scale) |
| Maximum qubits | 28 (8GB VRAM limit) |
| Practical range | 20-25 qubits for fast iteration |

Validation

  • GPU kernels validated against a CPU reference implementation
  • 156 unit tests across 9 test suites (GoogleTest)
  • Comparison benchmarks with NVIDIA cuStateVec
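
The GPU/CPU parity check can be pictured as an element-wise comparison of two state vectors. The following is a minimal sketch rather than the project's actual test code: the helper names `maxAbsDiff` and `statesMatch` and the default tolerance of 1e-10 are assumptions made for illustration.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Largest element-wise distance between two state vectors.
// (Hypothetical helper; a real test would first copy the GPU state to host.)
double maxAbsDiff(const std::vector<Amp>& a, const std::vector<Amp>& b) {
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        worst = std::max(worst, std::abs(a[i] - b[i]));
    }
    return worst;
}

// True when both vectors have the same length and agree within tol.
bool statesMatch(const std::vector<Amp>& gpu, const std::vector<Amp>& cpu,
                 double tol = 1e-10) {
    return gpu.size() == cpu.size() && maxAbsDiff(gpu, cpu) <= tol;
}
```

A tolerance is needed because GPU and CPU floating-point operations may be ordered differently, so bit-exact equality is usually too strict.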

Architecture

| Stage | Component | Description |
| --- | --- | --- |
| 1 | Circuit Parser | Simple format, or reused from the optimizer project |
| 2 | State Vector (GPU) | 2^n complex amplitudes on GPU |
| 3 | Gate Kernels | CUDA kernels for H, X, CNOT, Rz, etc. |
| 4 | Measurement | Probability extraction, sampling |
| 5 | Validation | Compare against CPU reference |
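
The measurement stage (probability extraction, then sampling) can be sketched on the host as below. This is an illustrative CPU version under stated assumptions, not the project's kernel code: the function names `probabilities` and `sampleShots` are hypothetical, and the state is assumed to be normalized.

```cpp
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using Amp = std::complex<double>;

// Probability of each basis state: p_i = |a_i|^2.
std::vector<double> probabilities(const std::vector<Amp>& state) {
    std::vector<double> p(state.size());
    for (std::size_t i = 0; i < state.size(); ++i) {
        p[i] = std::norm(state[i]);  // std::norm returns the squared magnitude
    }
    return p;
}

// Draw `shots` basis-state indices distributed according to the probabilities.
std::vector<std::size_t> sampleShots(const std::vector<Amp>& state,
                                     std::size_t shots, std::mt19937_64& rng) {
    const std::vector<double> p = probabilities(state);
    std::discrete_distribution<std::size_t> dist(p.begin(), p.end());
    std::vector<std::size_t> out(shots);
    for (auto& s : out) s = dist(rng);
    return out;
}
```

On the GPU, the probability step is naturally one thread per amplitude; the sampling step could stay on the host once the probabilities are copied back.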

Key Features

Gate Set (17 gates)

| Type | Gates |
| --- | --- |
| Single-qubit | X, Y, Z, H, S, T, S†, T†, Rx, Ry, Rz |
| Two-qubit | CNOT, CZ, SWAP, CRY, CRZ |
| Three-qubit | Toffoli (CCX) |

Noise Models (6 channels)

  • Depolarizing: Random Pauli errors with probability p
  • Amplitude damping: T1 relaxation (|1⟩ → |0⟩)
  • Phase damping: T2 dephasing
  • Bit flip, Phase flip, Bit-phase flip
  • Monte Carlo trajectory simulation with batched execution
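
To illustrate the trajectory idea, here is a minimal host-side sketch of one bit-flip step: with probability p an X error is injected, otherwise the state is left alone. Averaging results over many independent trajectories approximates the channel. The names `applyX` and `bitFlipStep` are hypothetical, and the sketch assumes qubit q corresponds to bit q of the amplitude index.

```cpp
#include <complex>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Amp = std::complex<double>;

// Pauli-X on qubit q: swap each pair of amplitudes that differ only in bit q.
void applyX(std::vector<Amp>& state, unsigned q) {
    const std::size_t bit = std::size_t{1} << q;
    for (std::size_t i = 0; i < state.size(); ++i) {
        if ((i & bit) == 0) std::swap(state[i], state[i | bit]);
    }
}

// One trajectory step of a bit-flip channel with error probability p
// (0 <= p <= 1). Returns true if an X error was injected.
bool bitFlipStep(std::vector<Amp>& state, unsigned q, double p,
                 std::mt19937_64& rng) {
    std::bernoulli_distribution flip(p);
    if (!flip(rng)) return false;
    applyX(state, q);
    return true;
}
```

A batched version would run many such trajectories side by side, each with its own random stream and its own copy of the state.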

Simulation Modes

  • State Vector: Scales to 28 qubits on RTX 4070 (8GB VRAM)
  • Density Matrix: Mixed state evolution (1-14 qubits, O(4^n) memory)
  • Batched Monte Carlo: Parallel trajectories for noise simulation
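
The qubit limits above follow from simple memory arithmetic. As a rough sketch, assuming 16 bytes per amplitude (one double-precision complex number, as with cuDoubleComplex) and ignoring any scratch buffers:

```cpp
#include <cstdint>

// Bytes per amplitude: two 8-byte doubles (real and imaginary parts).
constexpr std::uint64_t kBytesPerAmp = 16;

// State vector: 2^n amplitudes.
constexpr std::uint64_t stateVectorBytes(unsigned n) {
    return (std::uint64_t{1} << n) * kBytesPerAmp;
}

// Density matrix: 2^n x 2^n = 4^n amplitudes.
constexpr std::uint64_t densityMatrixBytes(unsigned n) {
    return (std::uint64_t{1} << (2 * n)) * kBytesPerAmp;
}

static_assert(stateVectorBytes(28) == 4ull << 30);    // 4 GiB
static_assert(densityMatrixBytes(14) == 4ull << 30);  // also 4 GiB
```

Under these assumptions a 29-qubit state vector or a 15-qubit density matrix would need at least 8 GiB for the amplitudes alone, which leaves no headroom on an 8GB card.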

Software Engineering

  • MIT License with SPDX headers on all 31 source files
  • Modern C++17: [[nodiscard]], noexcept, move semantics
  • RAII Memory Management: CudaMemory<T> wrapper eliminates raw pointer handling
  • CUDA error checking: CUDA_CHECK() macro with file/line info
  • Fluent API: circuit.h(0).cnot(0, 1).rz(1, M_PI/4)
  • GoogleTest integration with CMake
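
The fluent style works by having every gate method return a reference to the circuit. The class below is a minimal sketch of that pattern only: the `GateOp` struct, its field names, and the three methods shown are illustrative assumptions, not the project's actual interface.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One recorded gate. Field names are illustrative, not the project's own.
struct GateOp {
    std::string name;
    std::vector<unsigned> qubits;
    double angle = 0.0;  // used only by rotation gates
};

class Circuit {
public:
    explicit Circuit(unsigned numQubits) : numQubits_(numQubits) {}

    // Each builder appends one gate and returns *this so calls can be chained.
    Circuit& h(unsigned q) { return add({"h", {q}, 0.0}); }
    Circuit& cnot(unsigned c, unsigned t) { return add({"cnot", {c, t}, 0.0}); }
    Circuit& rz(unsigned q, double theta) { return add({"rz", {q}, theta}); }

    [[nodiscard]] std::size_t size() const noexcept { return ops_.size(); }
    [[nodiscard]] const std::vector<GateOp>& ops() const noexcept { return ops_; }
    [[nodiscard]] unsigned numQubits() const noexcept { return numQubits_; }

private:
    Circuit& add(GateOp op) {
        ops_.push_back(std::move(op));
        return *this;
    }

    unsigned numQubits_;
    std::vector<GateOp> ops_;
};
```

With this sketch, `Circuit circuit(2); circuit.h(0).cnot(0, 1).rz(1, 0.785);` records three gates in order. Recording gates first and executing later keeps circuit construction independent of the backend that runs it.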

Test Coverage

| Test Suite | Tests | Focus |
| --- | --- | --- |
| test_statevector | 15 | State initialization, normalization |
| test_gates | 26 | Gate correctness |
| test_gate_algebra | 24 | Gate composition |
| test_gpu_cpu_equivalence | 13 | CPU/GPU parity |
| test_boundary | 18 | Edge cases |
| test_noise | 22 | Noise channels |
| test_density_matrix | 26 | Mixed states |
| test_optimized_gates | 8 | Kernel optimizations |
| test_warmup | 4 | GPU initialization |
| Total | 156 | |

Technical Approach

RAII Memory Management

All GPU memory uses the CudaMemory<T> RAII wrapper:

CudaMemory<cuDoubleComplex> d_state(size);  // Allocates on construction
// Automatically freed when out of scope - no manual cudaFree needed
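
The general shape of such a wrapper can be sketched as follows. This is not the project's CudaMemory<T>: the class name `DeviceBuffer` and the helpers `rawAlloc`/`rawFree` are hypothetical, and std::malloc/std::free stand in for cudaMalloc/cudaFree so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

// Stand-ins so the sketch builds without CUDA. In a real CUDA build these
// would call cudaMalloc / cudaFree, ideally through an error-checking macro.
inline void* rawAlloc(std::size_t bytes) { return std::malloc(bytes); }
inline void rawFree(void* p) noexcept { std::free(p); }

template <typename T>
class DeviceBuffer {  // hypothetical name, for illustration only
public:
    explicit DeviceBuffer(std::size_t count)
        : ptr_(static_cast<T*>(rawAlloc(count * sizeof(T)))), count_(count) {
        if (count != 0 && ptr_ == nullptr) throw std::bad_alloc();
    }
    ~DeviceBuffer() { rawFree(ptr_); }  // released when the owner leaves scope

    // Owning handle: copying is disabled, moving transfers ownership.
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
    DeviceBuffer(DeviceBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}
    DeviceBuffer& operator=(DeviceBuffer&& other) noexcept {
        if (this != &other) {
            rawFree(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    [[nodiscard]] T* get() const noexcept { return ptr_; }
    [[nodiscard]] std::size_t size() const noexcept { return count_; }

private:
    T* ptr_ = nullptr;
    std::size_t count_ = 0;
};
```

Deleting the copy operations prevents two owners from freeing the same allocation, while the noexcept move operations let buffers be returned from functions and stored in containers cheaply.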

Single-Qubit Gate Strategy

For a single-qubit gate on qubit q:

  • State vector has 2^n elements
  • Gate affects pairs of elements differing only in bit q
  • Each CUDA thread handles one pair
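
The pair indexing described above can be written out as a host-side reference loop, where the loop index k plays the role of the CUDA thread index. This is a sketch under two assumptions: qubit q corresponds to bit q of the amplitude index, and the names `Gate2x2` and `applySingleQubitGate` are hypothetical.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;
using Gate2x2 = std::array<Amp, 4>;  // row-major: {g00, g01, g10, g11}

// Apply a 2x2 gate to qubit q of an n-qubit state (state.size() == 2^n).
// Each k owns exactly one pair (i0, i1), so iterations never touch the
// same element and could run as independent GPU threads.
void applySingleQubitGate(std::vector<Amp>& state, unsigned q, const Gate2x2& g) {
    const std::size_t bit = std::size_t{1} << q;
    const std::size_t lowMask = bit - 1;
    const std::size_t numPairs = state.size() / 2;
    for (std::size_t k = 0; k < numPairs; ++k) {
        // Insert a 0 at bit position q of k: the index with bit q clear.
        const std::size_t i0 = ((k & ~lowMask) << 1) | (k & lowMask);
        const std::size_t i1 = i0 | bit;  // partner index with bit q set
        const Amp a0 = state[i0];
        const Amp a1 = state[i1];
        state[i0] = g[0] * a0 + g[1] * a1;
        state[i1] = g[2] * a0 + g[3] * a1;
    }
}
```

For example, with q = 1 on a 2-qubit state the pairs are (0, 2) and (1, 3), so 2^(n-1) = 2 threads cover all four amplitudes.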

Two-Qubit Gate Strategy

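For CNOT with control c and target t:

  • Only basis states whose control bit is 1 are affected
  • Among those, the two amplitudes of each pair differing only in target bit t are swapped
  • Indexing assigns each pair to exactly one thread, which avoids race conditions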
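
One way to realize this indexing is to enumerate only the indices with control = 1 and target = 0, and swap each with its target = 1 partner. The host-side sketch below shows that scheme; it assumes qubit q corresponds to bit q of the amplitude index, and `insertZeroBit` and `applyCnot` are hypothetical names. The loop index k again stands in for the thread index.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Amp = std::complex<double>;

// Insert a 0 bit at position pos of k, shifting higher bits up by one.
inline std::size_t insertZeroBit(std::size_t k, unsigned pos) {
    const std::size_t lowMask = (std::size_t{1} << pos) - 1;
    return ((k & ~lowMask) << 1) | (k & lowMask);
}

// CNOT with control c and target t (c != t) on a state of size 2^n.
void applyCnot(std::vector<Amp>& state, unsigned c, unsigned t) {
    const unsigned lo = std::min(c, t);
    const unsigned hi = std::max(c, t);
    const std::size_t cBit = std::size_t{1} << c;
    const std::size_t tBit = std::size_t{1} << t;
    const std::size_t numPairs = state.size() / 4;
    for (std::size_t k = 0; k < numPairs; ++k) {
        // Open zero slots at both qubit positions, lowest position first.
        const std::size_t base = insertZeroBit(insertZeroBit(k, lo), hi);
        const std::size_t i0 = base | cBit;  // control = 1, target = 0
        const std::size_t i1 = i0 | tBit;    // control = 1, target = 1
        std::swap(state[i0], state[i1]);
    }
}
```

Because each k maps to a distinct (i0, i1) pair, 2^(n-2) threads can perform the swaps concurrently without synchronization, and amplitudes with control = 0 are never read or written.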

Optimized Kernels

  • Shared memory tiling for data locality
  • Coalesced memory access patterns
  • Occupancy optimization

Integration with Other Projects

| Stage | Project | Role |
| --- | --- | --- |
| 1 | quantum-circuit-optimizer | Optimizes circuits |
| 2 | cuda-quantum-simulator (this project) | Validates circuits |
| 3 | QubitPulseOpt | Generates control pulses |

Use cases:

  • Validate quantum circuit optimizer (run before/after, compare results)
  • Test pulse sequences from QubitPulseOpt (simulate ideal vs noisy)
  • Benchmark optimization quality (depth reduction → faster simulation)

Technology Stack

| Category | Technology |
| --- | --- |
| Language | CUDA C++ with C++17 host code |
| Build | CMake with CUDA support |
| Memory | RAII wrappers (CudaMemory<T>) |
| Testing | GoogleTest (156 tests, 9 suites) |
| Profiling | Nsight Systems |
| License | MIT (SPDX headers on all files) |

Roadmap

  • State vector simulation (done)
  • Density matrix simulation (done)
  • Comprehensive noise models (done)
  • cuStateVec comparison benchmarks (done)
  • Multi-GPU support (only remaining TODO)

Why Build From Scratch?

Using cuQuantum would hide the interesting parts. Building a simulator from scratch teaches:

  • GPU memory hierarchy and kernel optimization
  • Coalescing, occupancy, and memory access patterns
  • Real CUDA programming beyond PyTorch abstractions