Memory & Data Layout¶
This page describes the project’s memory subsystem in depth: Structure-of-Arrays (SoA) fields with linear indexing, a single owner with non-owning views, explicit alignment and pools on CPU/GPU, efficient host↔device transfers (pinned + streams), optional Unified Memory with prefetch/advice, and scalable halo exchange (faces/edges/corners) via non-blocking MPI.
High-level goals¶
Keep inner loops pointer-fast and vectorizable on CPU and coalesced on GPU. [BPG]
Centralize ownership (RAII) and expose zero-overhead views for kernels. [SPAN]
Minimize (de)allocation cost via pooling on device and aligned pools on host. [CUDA-POOLS]
Overlap data movement and compute where practical (streams + non-blocking MPI). [BPG], [MPI-NB]
Keep the API policy-agnostic: explicit copies by default; UM as an opt-in. [UM-API]
Core concepts¶
Structure-of-Arrays (SoA)¶
Each primitive (rho, ux, uy, uz, p, …) lives in a separate, contiguous 1-D array. Neighboring threads/lanes touch neighboring elements, enabling: (1) CPU SIMD unit-stride vector loads/stores; (2) GPU coalesced global memory transactions. [INTEL-OPT], [BPG]
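A minimal sketch of the idea, assuming std::vector-backed host storage (the struct and member names here are illustrative, not the project's actual types):
#include <cstddef>
#include <vector>

// Structure-of-Arrays: one contiguous array per primitive variable, so the
// innermost loop over cells reads/writes unit-stride data in every field.
struct SoAFields3D {
    std::vector<double> rho, ux, uy, uz, p;
    std::size_t Nx = 0, Ny = 0, Nz = 0;

    void resize(std::size_t nx, std::size_t ny, std::size_t nz) {
        Nx = nx; Ny = ny; Nz = nz;
        const std::size_t n = nx * ny * nz;
        for (auto* f : {&rho, &ux, &uy, &uz, &p})
            f->assign(n, 0.0);   // contiguous, zero-initialized per field
    }
};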
Layout & linear indexing¶
Data are stored row-major and indexed linearly:
// (i,j,k) → idx for dimensions Nx, Ny, Nz (row-major: x varies fastest)
inline size_t idx(size_t i, size_t j, size_t k,
                  size_t Nx, size_t Ny) noexcept {
    return (k*Ny + j)*Nx + i;
}
Row-major formulas and tradeoffs are standard; use unit-stride in the innermost loop. [ROWMAJOR]
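For example, a k-j-i loop nest keeps the x index innermost so every access through idx is unit-stride (a sketch; u, rhs, and dt are illustrative names):
// Innermost loop over i (x) walks memory contiguously in each SoA array.
for (size_t k = 0; k < Nz; ++k)
    for (size_t j = 0; j < Ny; ++j)
        for (size_t i = 0; i < Nx; ++i) {
            const size_t c = idx(i, j, k, Nx, Ny);
            u[c] += dt * rhs[c];   // unit-stride read/modify/write
        }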
Ownership & views¶
MemoryManager owns raw buffers and returns non-owning views that carry shape/strides (host) or device pointers for kernels. On CPU, views model std::span semantics (contiguous, non-owning). For portability, Kokkos “unmanaged views” are an analogous concept. [SPAN], [KOKKOS-VIEW]
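A minimal sketch of the owner/view split on the host side, assuming C++20 std::span (the method names and string-keyed registry are illustrative, not the project's actual interface):
#include <cstddef>
#include <span>
#include <string>
#include <unordered_map>
#include <vector>

class MemoryManager {
public:
    // Owner: allocates and keeps the canonical host buffer alive (RAII).
    std::span<double> allocate_host(const std::string& name, std::size_t n) {
        auto& buf = host_buffers_[name];
        buf.assign(n, 0.0);
        return {buf.data(), buf.size()};
    }
    // Non-owning, contiguous view handed to kernels; the view never frees.
    std::span<double> view(const std::string& name) {
        auto& buf = host_buffers_.at(name);
        return {buf.data(), buf.size()};
    }
private:
    std::unordered_map<std::string, std::vector<double>> host_buffers_;
};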
Alignment & allocators¶
Host (CPU)¶
Cache lines are 64 B on modern x86; aligning arrays to ≥64 B avoids line splits and false sharing. [INTEL-OPT]
Use std::aligned_alloc for explicitly aligned pools; size must be a multiple of alignment (C++17 rule). [ALIGNED-ALLOC]
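For example, a host pool can round every request up to the alignment before calling the allocator (a sketch; the 64-byte constant mirrors the cache-line guidance above):
#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free

constexpr std::size_t kHostAlign = 64;   // at least one x86 cache line

// std::aligned_alloc (C++17) requires size to be an integral multiple of the
// alignment, so pad the byte count up before allocating.
inline void* host_alloc_aligned(std::size_t bytes) {
    const std::size_t padded = ((bytes + kHostAlign - 1) / kHostAlign) * kHostAlign;
    return std::aligned_alloc(kHostAlign, padded);
}

inline void host_free_aligned(void* p) { std::free(p); }   // matching release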
Device (GPU)¶
CUDA runtime/driver allocations are ≥256-byte aligned; some I/O paths (e.g., GDS) require larger (4 KiB) alignment. [RMM-ALIGN]
Prefer stream-ordered memory pools (cudaMallocAsync / memory pools) to amortize allocation overhead and reduce sync. [CUDA-POOLS]
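A sketch of a stream-ordered scratch allocation (error checking omitted; the kernel and stream names are illustrative):
// Allocation and free are both ordered on `stream`, so repeated
// allocate/compute/free cycles reuse the device pool without global syncs.
double* d_scratch = nullptr;
cudaMallocAsync(reinterpret_cast<void**>(&d_scratch), n * sizeof(double), stream);
scratch_kernel<<<blocks, threads, 0, stream>>>(d_scratch, n);   // illustrative kernel
cudaFreeAsync(d_scratch, stream);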
Host↔Device data movement¶
Pinned (page-locked) memory¶
Use pinned host buffers with cudaMemcpyAsync in streams to overlap copies with kernels and to reach higher PCIe/NVLink bandwidth; copies from pageable memory fall back to synchronous behavior. [BPG]
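For example, the staging buffers behind those async copies can be allocated page-locked up front (a sketch; the buffer name and size are illustrative):
// Page-locked (pinned) host staging buffer: a prerequisite for cudaMemcpyAsync
// to overlap with kernels and to reach full PCIe/NVLink bandwidth.
double* h_staging = nullptr;
cudaMallocHost(reinterpret_cast<void**>(&h_staging), n * sizeof(double));
// ... reuse h_staging for the lifetime of the run ...
cudaFreeHost(h_staging);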
Policy: explicit mirrors (default)¶
Host owns canonical SoA arrays; device mirrors are created once and reused.
Transfers: pack halo faces (if needed), enqueue H2D/D2H on dedicated streams, record events, and overlap with compute. [BPG]
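A sketch of that enqueue/overlap pattern, assuming a pinned host mirror h_rho, a device mirror d_rho, and pre-created streams and an event (all names illustrative):
// H2D refresh on a dedicated copy stream; the compute stream waits only on
// the event, so unrelated work keeps overlapping with the transfer.
cudaMemcpyAsync(d_rho, h_rho, n * sizeof(double),
                cudaMemcpyHostToDevice, copy_stream);
cudaEventRecord(h2d_done, copy_stream);
cudaStreamWaitEvent(compute_stream, h2d_done, 0);
update_kernel<<<blocks, threads, 0, compute_stream>>>(d_rho, n);  // illustrative kernel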
Policy: Unified Memory (opt-in)¶
UM simplifies ownership (single pointer) but still benefits from prefetch and advice for performance-critical paths:
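A sketch of that prefetch/advise pattern (device ID, stream, and kernel names are illustrative):
// Single managed allocation visible to host and device code paths.
double* um_rho = nullptr;
cudaMallocManaged(&um_rho, n * sizeof(double));

// Advice: keep the pages resident on the GPU that does the heavy work.
cudaMemAdvise(um_rho, n * sizeof(double),
              cudaMemAdviseSetPreferredLocation, device_id);

// Prefetch to the device before the first kernel touch (avoids fault storms),
// and back to the host before host-side post-processing.
cudaMemPrefetchAsync(um_rho, n * sizeof(double), device_id, stream);
update_kernel<<<blocks, threads, 0, stream>>>(um_rho, n);   // illustrative kernel
cudaMemPrefetchAsync(um_rho, n * sizeof(double), cudaCpuDeviceId, stream);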
Halo exchange¶
Ghost layers & neighborhoods¶
We maintain a one-cell (configurable) ghost layer around local subdomains and exchange faces, edges, and corners (26-neighbor in 3-D) each step for stencil updates. [MPI-HALO]
Non-blocking progression & overlap¶
The exchange uses MPI_Irecv/Isend + MPI_Waitall; interior compute proceeds while messages progress. Overlap is implementation-dependent, but the non-blocking pattern is the standard route to expose concurrency. [MPI-NB]
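A minimal sketch of that pattern for the six face exchanges (neighbor ranks, buffers, counts, and helper functions are illustrative; edges and corners follow the same scheme):
// Faces are paired (0/1 = -x/+x, 2/3 = -y/+y, 4/5 = -z/+z); ghost data for
// face f arrives from the neighbor's opposite face, hence the f^1 tag.
MPI_Request reqs[12];
int r = 0;
for (int f = 0; f < 6; ++f) {
    MPI_Irecv(recv_buf[f], count[f], MPI_DOUBLE, nbr[f], f ^ 1, comm, &reqs[r++]);
    MPI_Isend(send_buf[f], count[f], MPI_DOUBLE, nbr[f], f,     comm, &reqs[r++]);
}
compute_interior_cells();                  // overlap: interior needs no ghosts
MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE); // complete before reusing buffers
unpack_ghost_faces();                      // boundary stencils can now run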
Datatype option (packing-free faces)¶
Where convenient, use MPI_Type_create_subarray (or vector/contiguous types) to describe faces/edges directly in memory and avoid manual pack/unpack. [MPI-SUBARRAY]
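For example, a y-z face (one x-plane) of the local block, including ghosts, can be described in place (a sketch; i_plane selects the plane, and the block is viewed as A[Nz][Ny][Nx] to match the linear index above):
int sizes[3]    = { (int)Nz, (int)Ny, (int)Nx };   // full local block, C order
int subsizes[3] = { (int)Nz, (int)Ny, 1 };         // one x-plane: a y-z face
int starts[3]   = { 0, 0, (int)i_plane };          // which x index to exchange

MPI_Datatype face_x;
MPI_Type_create_subarray(3, sizes, subsizes, starts,
                         MPI_ORDER_C, MPI_DOUBLE, &face_x);
MPI_Type_commit(&face_x);
// Usable directly, no packing: MPI_Isend(rho, 1, face_x, nbr, tag, comm, &req);
MPI_Type_free(&face_x);   // once the exchange no longer needs the type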
Threading notes¶
On hybrid nodes, OpenMP tasks/threads can dedicate one thread (or a small subset of the team) to halo progress while the others compute local cells. [MPI-OMP]
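One possible shape for that split (a sketch; it assumes MPI was initialized with at least MPI_THREAD_SERIALIZED, pending requests in reqs, and illustrative helper names):
#pragma omp parallel
{
    #pragma omp single nowait
    {
        // One thread completes the halo requests and unpacks ghost cells
        // while the rest of the team starts on the interior below.
        MPI_Waitall(n_reqs, reqs, MPI_STATUSES_IGNORE);
        unpack_ghost_faces();                    // illustrative helper
    }
    // All threads (the communication thread joins once it is done) share the
    // interior update; the implicit barrier at the end synchronizes the team.
    #pragma omp for schedule(static)
    for (long c = 0; c < n_interior; ++c)
        update_interior_cell(c);                 // illustrative helper
}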
Error handling & invariants¶
All allocations come from a single owner; views never free.
Host allocations meet alignment invariants (≥64 B); device meets ≥256 B alignment.
Transfers that claim asynchrony must originate from pinned buffers.
MPI requests are completed before buffer reuse.
UM mode must prefetch before first-touch kernels in tight loops.
References¶
NVIDIA, CUDA C++ Best Practices Guide. Coalesced access, pinned memory & async copies with streams; guidance on overlapping copy/compute. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ (accessed Aug 25 2025)
NVIDIA, CUDA Runtime API — Memory Pools / Stream-Ordered Allocator (cudaMallocAsync, cudaMemPool*). https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html
RAPIDS RMM Docs, Memory Resources — CUDA allocations are aligned to at least 256 bytes; some paths (e.g., GDS) need larger alignment. https://docs.rapids.ai/api/rmm/nightly/librmm_docs/memory_resources/
Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual — cache line is 64 B; unit-stride & alignment guidance. https://cdrdv2-public.intel.com/814198/248966-Optimization-Reference-Manual-V1-049.pdf
cppreference, std::aligned_alloc (C++17) — size must be an integral multiple of alignment. https://en.cppreference.com/w/cpp/memory/c/aligned_alloc
Wikipedia, Row- and column-major order — linear index formulas & row-major background. https://en.wikipedia.org/wiki/Row-_and_column-major_order
cppreference, std::span — non-owning view over a contiguous sequence (analogy for host views). https://en.cppreference.com/w/cpp/container/span.html
Kokkos, View — Multidimensional array — unmanaged/wrapping existing allocations. https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html
NVIDIA Docs, CUDA C++ Programming Guide / Runtime API — Unified Memory (cudaMemPrefetchAsync, cudaMemAdvise). https://docs.nvidia.com/cuda/cuda-c-programming-guide/
NVIDIA Developer Blog, Maximizing Unified Memory Performance in CUDA — when/why to prefetch & advise. https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
ORNL OLCF Training, CUDA Unified Memory slides — concise overview & best practices. https://www.olcf.ornl.gov/wp-content/uploads/2019/06/06_Managed_Memory.pdf
NASA HECC (2025), Simplifying GPU Programming with Unified Memory. https://www.nas.nasa.gov/hecc/support/kb/simplifying-gpu-programming-with-unified-memory_703.html
SC’24 Poster / arXiv (2025), Persistent and Partitioned MPI for Stencil Communication — defines halo exchange (3-D faces/edges/corners). https://arxiv.org/html/2508.13370v1
ENCCS, Non-blocking point-to-point — performant stencil workflow — overlap is implementation-dependent, pattern for correctness. https://enccs.github.io/intermediate-mpi/non-blocking-communication-pt1/
RookieHPC, MPI_Type_create_subarray — using subarray datatypes for strided faces. https://rookiehpc.org/mpi/docs/mpi_type_create_subarray/index.html
ENCCS, MPI and threads in practice — OpenMP tasking with halo exchange. https://enccs.github.io/intermediate-mpi/mpi-and-threads-pt2/