Status: Core Complete

Overview

A CUDA-based quantum state vector simulator that runs quantum circuits on the GPU. It is written in modern C++17 with RAII memory management, includes noise modeling (depolarizing, amplitude damping, phase damping), and is validated against Qiskit Aer.

Architecture

[Diagram: Quantum Circuit input, Circuit Parser, State Vector (GPU, 2^n complex amplitudes), Gate Kernels (CUDA: H, X, CNOT, Rz, etc.), Measurement, Validation]

Key Features

Implemented (Core Complete)

  • State vector simulation: Up to 29 qubits on a single GPU (12GB VRAM)
  • Single-qubit gates: X, Y, Z, H, S, T, Sdag, Tdag, Rx, Ry, Rz
  • Multi-qubit gates: CNOT, CZ, CRY, CRZ, SWAP (two-qubit), Toffoli (three-qubit)
  • Optimized kernels: Coalesced memory access, shared memory variants
  • Density matrix simulation: Mixed state evolution with Lindblad noise
  • Noise models: Depolarizing, amplitude damping, phase damping channels
  • Batched simulation: Multiple circuits in parallel for Monte Carlo
  • RAII memory management: CudaMemory<T> wrapper eliminates raw pointer handling
  • 9 test suites passing, including: gates, algebra, boundary, noise, density matrix, optimized gates, GPU/CPU equivalence

Software Engineering

  • MIT License with SPDX headers on all 31 source files
  • Modern C++17: [[nodiscard]], noexcept, move semantics
  • CUDA error checking: CUDA_CHECK() macro with file/line info
  • Fluent API: circuit.h(0).cnot(0, 1).rz(1, M_PI/4)
  • GoogleTest integration with CMake
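
The fluent call chain listed above relies on each builder method returning a reference to the circuit itself. The following is a minimal sketch of that pattern only; the `Op` struct and the internals of `Circuit` are hypothetical, since the project's actual class is not shown here.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical recorded operation; the project's real representation may differ.
struct Op {
    std::string name;  // gate name, e.g. "h", "cnot", "rz"
    int q0;            // first (or only) qubit
    int q1;            // second qubit, or -1 for single-qubit gates
    double angle;      // rotation angle in radians, 0 if unused
};

class Circuit {
public:
    explicit Circuit(int num_qubits) : num_qubits_(num_qubits) {}

    // Each builder appends one op and returns *this so calls can be chained.
    Circuit& h(int q)                { return push({"h", q, -1, 0.0}); }
    Circuit& cnot(int c, int t)      { return push({"cnot", c, t, 0.0}); }
    Circuit& rz(int q, double theta) { return push({"rz", q, -1, theta}); }

    [[nodiscard]] const std::vector<Op>& ops() const noexcept { return ops_; }
    [[nodiscard]] int num_qubits() const noexcept { return num_qubits_; }

private:
    Circuit& push(Op op) {
        ops_.push_back(std::move(op));
        return *this;
    }

    int num_qubits_;
    std::vector<Op> ops_;
};
```

With this sketch, `Circuit c(2); c.h(0).cnot(0, 1).rz(1, M_PI/4);` records three operations in order; a simulator backend would then walk `c.ops()` and launch the matching kernel for each entry.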

Hardware Target

  • RTX 4070 (12GB VRAM)
  • Maximum qubits: ~29 (2^29 * 16 bytes = 8 GiB; 30 qubits would need 16 GiB, more than the card has)
  • Practical target: 20-25 qubits for fast iteration
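
The qubit limit follows directly from the size of the state vector. As a rough sketch (the function names are illustrative, and it ignores any extra workspace the simulator allocates), the arithmetic can be checked like this:

```cpp
#include <cstdint>

// Bytes for a full state vector of n qubits: 2^n amplitudes,
// 16 bytes each (two doubles, the layout of cuDoubleComplex).
constexpr std::uint64_t state_vector_bytes(int n_qubits) {
    return (std::uint64_t{1} << n_qubits) * 16u;
}

// Largest n whose state vector fits in vram_bytes.
// Capped at 59 so the shift above cannot overflow 64 bits.
constexpr int max_qubits(std::uint64_t vram_bytes) {
    int n = 0;
    while (n < 59 && state_vector_bytes(n + 1) <= vram_bytes) {
        ++n;
    }
    return n;
}
```

For a 12 GiB budget, `max_qubits` returns 29: the 29-qubit vector takes 8 GiB, while 30 qubits would take 16 GiB.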

Technical Approach

RAII Memory Management

All GPU memory uses the CudaMemory<T> RAII wrapper:

CudaMemory<cuDoubleComplex> d_state(size);  // Allocates on construction
// Automatically freed when out of scope - no manual cudaFree needed
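
The project's `CudaMemory<T>` source is not reproduced here. As an illustration of the ownership pattern only, the sketch below shows a move-only RAII buffer under a hypothetical name, `Buffer`. Allocation sits behind a small policy (`HostAlloc`, also hypothetical) so the example builds without the CUDA toolkit; a device policy would presumably call `cudaMalloc` and `cudaFree` in the same two places.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

// Stand-in allocator (assumption) so the sketch builds without CUDA.
// A device policy would call cudaMalloc / cudaFree here instead.
struct HostAlloc {
    static void* allocate(std::size_t bytes) {
        if (bytes == 0) return nullptr;
        void* p = std::malloc(bytes);
        if (!p) throw std::bad_alloc{};
        return p;
    }
    static void release(void* p) noexcept { std::free(p); }
};

template <typename T, typename Alloc = HostAlloc>
class Buffer {  // hypothetical name; not the project's CudaMemory<T>
public:
    // Acquire on construction.
    explicit Buffer(std::size_t count)
        : ptr_(static_cast<T*>(Alloc::allocate(count * sizeof(T)))),
          count_(count) {}

    // Release on destruction; moved-from objects hold nullptr.
    ~Buffer() { if (ptr_) Alloc::release(ptr_); }

    // Copying is disabled so exactly one object owns the allocation.
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    Buffer(Buffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}

    Buffer& operator=(Buffer&& other) noexcept {
        if (this != &other) {
            if (ptr_) Alloc::release(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    [[nodiscard]] T* get() const noexcept { return ptr_; }
    [[nodiscard]] std::size_t size() const noexcept { return count_; }

private:
    T* ptr_ = nullptr;
    std::size_t count_ = 0;
};
```

Deleting the copy operations is the important design choice: a copied wrapper would free the same allocation twice.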

Single-Qubit Gate Strategy

For a single-qubit gate on qubit q:

  • State vector has 2^n elements
  • Gate affects pairs of elements differing only in bit q
  • Each CUDA thread handles one pair, so a gate launches 2^(n-1) threads that touch disjoint elements
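
The pair indexing can be sketched as a CPU reference loop, where each loop iteration stands in for one GPU thread. This is an illustration, not the project's kernel: the names are hypothetical, and it assumes qubit q corresponds to bit q of the amplitude index (the little-endian convention Qiskit uses).

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;
using Gate2x2 = std::array<Amp, 4>;  // row-major: {u00, u01, u10, u11}

// Index of the pair member with bit q = 0: split k at position q and
// insert a zero bit there. The partner index is i0 | (1 << q).
inline std::size_t insert_zero_bit(std::size_t k, int q) {
    const std::size_t low_mask = (std::size_t{1} << q) - 1;
    return ((k & ~low_mask) << 1) | (k & low_mask);
}

// CPU reference: the loop body is the work one GPU thread would do for pair k.
inline void apply_single_qubit_gate(std::vector<Amp>& state, int q,
                                    const Gate2x2& u) {
    const std::size_t num_pairs = state.size() / 2;
    for (std::size_t k = 0; k < num_pairs; ++k) {
        const std::size_t i0 = insert_zero_bit(k, q);
        const std::size_t i1 = i0 | (std::size_t{1} << q);
        const Amp a0 = state[i0];
        const Amp a1 = state[i1];
        state[i0] = u[0] * a0 + u[1] * a1;  // new amplitude with bit q = 0
        state[i1] = u[2] * a0 + u[3] * a1;  // new amplitude with bit q = 1
    }
}
```

Because every pair index k maps to a distinct (i0, i1), iterations never read or write each other's elements, which is what makes the one-thread-per-pair mapping safe.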

Two-Qubit Gate Strategy

For CNOT on control c, target t:

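  • Only amplitudes whose control bit is 1 are touched; the rest stay as they are
  • Within that subset, the gate swaps each pair of amplitudes that differ only in the target bit
  • Each thread owns one such pair, so no two threads write the same element and no race conditions arise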
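
A CPU reference sketch of this indexing is shown below, with one loop iteration per pair (one pair per GPU thread). The function names are hypothetical, and it again assumes qubit q maps to bit q of the amplitude index.

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Amp = std::complex<double>;

// Insert a zero bit at position pos of k (bits at pos and above shift up).
inline std::size_t insert_zero(std::size_t k, int pos) {
    const std::size_t low = (std::size_t{1} << pos) - 1;
    return ((k & ~low) << 1) | (k & low);
}

// CPU reference for CNOT: 2^(n-2) disjoint swaps, one per loop iteration.
inline void apply_cnot(std::vector<Amp>& state, int control, int target) {
    const int lo = control < target ? control : target;
    const int hi = control < target ? target : control;
    const std::size_t c_bit = std::size_t{1} << control;
    const std::size_t t_bit = std::size_t{1} << target;
    const std::size_t num_pairs = state.size() / 4;
    for (std::size_t k = 0; k < num_pairs; ++k) {
        // Insert zeros at the lower position first, then the higher one,
        // so both the control and target bits of base are 0.
        const std::size_t base = insert_zero(insert_zero(k, lo), hi);
        const std::size_t i0 = base | c_bit;  // control = 1, target = 0
        const std::size_t i1 = i0 | t_bit;    // control = 1, target = 1
        std::swap(state[i0], state[i1]);
    }
}
```

On a two-qubit state with control 0 and target 1, the single pair is (index 1, index 3), so applying this after a Hadamard on qubit 0 of |00⟩ yields the Bell state with equal amplitudes at indices 0 and 3.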

Noise Modeling

  • Depolarizing: Random Pauli errors with probability p
  • Amplitude damping: T1 relaxation (|1⟩ → |0⟩)
  • Phase damping: T2 dephasing
  • Monte Carlo trajectory simulation with batched execution
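
One trajectory step of the depolarizing channel can be sketched on a state vector as follows. This is an illustration under a stated assumption: with probability p, one of X, Y, Z is applied with equal weight. Other parameterizations of the channel exist, and the project's own convention is not specified here. The function names are hypothetical.

```cpp
#include <complex>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Amp = std::complex<double>;

// Apply Pauli X (which = 0), Y (1), or Z (2) to qubit q in place.
inline void apply_pauli(std::vector<Amp>& state, int q, int which) {
    const std::size_t bit = std::size_t{1} << q;
    const Amp i_unit(0.0, 1.0);
    for (std::size_t i0 = 0; i0 < state.size(); ++i0) {
        if (i0 & bit) continue;  // visit each pair once, via its bit-q = 0 member
        const std::size_t i1 = i0 | bit;
        if (which == 0) {
            std::swap(state[i0], state[i1]);
        } else if (which == 1) {  // Y = [[0, -i], [i, 0]]
            const Amp a0 = state[i0];
            state[i0] = -i_unit * state[i1];
            state[i1] = i_unit * a0;
        } else {                  // Z flips the sign of the |1> amplitude
            state[i1] = -state[i1];
        }
    }
}

// One Monte Carlo trajectory step on qubit q.
// Returns true if a Pauli error was injected.
template <typename Rng>
bool depolarize(std::vector<Amp>& state, int q, double p, Rng& rng) {
    std::bernoulli_distribution error(p);
    if (!error(rng)) return false;
    std::uniform_int_distribution<int> pick(0, 2);
    apply_pauli(state, q, pick(rng));
    return true;
}
```

Averaging measurement statistics over many such trajectories, each with its own random draws, approximates the mixed-state result; running the trajectories as a batch is what the batched execution mode is for.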

Benchmarking

Validation

  • Compare against Qiskit Aer statevector simulator
  • Test cases: Bell state, GHZ state, random circuits
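
A common way to compare two state vectors is the fidelity |⟨a|b⟩|², which ignores global phase differences between simulators. The sketch below is one possible comparison helper, not the project's test code; the names are hypothetical, and the reference amplitudes are assumed to have been exported from Qiskit Aer separately.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Amp = std::complex<double>;

// |<a|b>|^2 for two normalized state vectors; insensitive to global phase.
inline double state_fidelity(const std::vector<Amp>& a,
                             const std::vector<Amp>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("state size mismatch");
    }
    Amp overlap(0.0, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        overlap += std::conj(a[i]) * b[i];
    }
    return std::norm(overlap);  // squared magnitude of the overlap
}

// Expected Bell state (|00> + |11>) / sqrt(2) as a reference vector.
inline std::vector<Amp> bell_state() {
    const double r = 1.0 / std::sqrt(2.0);
    return {Amp(r), Amp(0.0), Amp(0.0), Amp(r)};
}
```

A test would then assert that `state_fidelity(simulated, reference)` is within a small tolerance of 1, for example `1e-10` in double precision.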

Performance Metrics

  • Gates per second for different gate types
  • Qubit scaling (time vs number of qubits)
  • GPU vs CPU speedup comparison
  • Profiling via Nsight Systems

Integration with Other Projects

[Diagram: quantum-circuit-optimizer (optimizes circuits), cuda-quantum-simulator (THIS PROJECT: validates circuits), and QubitPulseOpt (optimizes pulses), linked by "optimized circuits" and "validated pulse fidelity"]

Use cases:

  • Validate quantum circuit optimizer (run before/after, compare results)
  • Test pulse sequences from QubitPulseOpt (simulate ideal vs noisy)
  • Benchmark optimization quality (depth reduction → faster simulation)

Technology Stack

  • Language: CUDA C++ with C++17 host code
  • Build: CMake with CUDA support
  • Memory: RAII wrappers (CudaMemory<T>)
  • Testing: GoogleTest (9 test suites)
  • Profiling: Nsight Systems
  • License: MIT (SPDX headers on all files)

Why Build From Scratch?

Using cuQuantum would hide the interesting parts. Building a simulator from scratch teaches:

  • GPU memory hierarchy and kernel optimization
  • Coalescing, occupancy, and memory access patterns
  • Real CUDA programming beyond PyTorch abstractions