Memory & Data-Layout

This page describes the project’s memory subsystem in depth: Structure-of-Arrays (SoA) fields with linear indexing, a single owner with non-owning views, explicit alignment and pools on CPU/GPU, efficient host↔device transfers (pinned + streams), optional Unified Memory with prefetch/advice, and scalable halo exchange (faces/edges/corners) via non-blocking MPI.

High-level goals

  • Keep inner loops pointer-fast and vectorizable on CPU and coalesced on GPU. [BPG]

  • Centralize ownership (RAII) and expose zero-overhead views for kernels. [SPAN]

  • Minimize (de)allocation cost via pooling on device and aligned pools on host. [CUDA-POOLS]

  • Overlap data movement and compute where practical (streams + non-blocking MPI). [BPG], [MPI-NB]

  • Keep the API policy-agnostic: explicit copies by default; UM as an opt-in. [UM-API]

Core concepts

Structure-of-Arrays (SoA)

Each primitive (rho, ux, uy, uz, p, …) lives in a separate, contiguous 1-D array. Neighboring threads/lanes touch neighboring elements, enabling: (1) CPU SIMD unit-stride vector loads/stores; (2) GPU coalesced global memory transactions. [INTEL-OPT], [BPG]
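
A minimal sketch of the contrast (SoAFields and CellAoS are illustrative names, not the project's actual types):

#include <vector>

// SoA: one contiguous array per primitive; neighboring lanes/threads
// read neighboring elements of the same field.
struct SoAFields {
  std::vector<double> rho, ux, uy, uz, p;   // each sized Nx*Ny*Nz
};

// AoS (avoided): fields interleave in memory, so per-field access is strided.
struct CellAoS { double rho, ux, uy, uz, p; };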

Layout & linear indexing

Data are stored row-major and indexed linearly:

// (i,j,k) → idx for an Nx×Ny×Nz block (row-major: x varies fastest; Nz is not needed)
inline size_t idx(size_t i, size_t j, size_t k,
                  size_t Nx, size_t Ny) noexcept {
  return (k*Ny + j)*Nx + i;
}

Row-major index formulas and their tradeoffs are standard; keep the innermost loop unit-stride (iterate i fastest), as in the sketch below. [ROWMAJOR]
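
For example, a field update written against idx above keeps the i loop innermost so consecutive iterations touch consecutive addresses (scale_pressure and c2 are illustrative names):

// Unit-stride sweep: i (x) is innermost, matching the row-major layout.
void scale_pressure(const double* rho, double* p,
                    size_t Nx, size_t Ny, size_t Nz, double c2) {
  for (size_t k = 0; k < Nz; ++k)
    for (size_t j = 0; j < Ny; ++j)
      for (size_t i = 0; i < Nx; ++i)
        p[idx(i, j, k, Nx, Ny)] = c2 * rho[idx(i, j, k, Nx, Ny)];
}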

Ownership & views

MemoryManager owns raw buffers and returns non-owning views that carry shape/strides (host) or device pointers for kernels. On CPU, views model std::span semantics (contiguous, non-owning). For portability, Kokkos “unmanaged views” are an analogous concept. [SPAN], [KOKKOS-VIEW]
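
One possible shape for such a host view, a sketch only (FieldView3D is a hypothetical name; the real class may differ):

#include <cstddef>

// Non-owning host view: pointer + extents, std::span-like semantics.
// MemoryManager (the single owner) hands these out; views never allocate or free.
struct FieldView3D {
  double*     data;          // borrowed pointer, owned by MemoryManager
  std::size_t Nx, Ny, Nz;    // extents (contiguous, row-major, x fastest)
  double& operator()(std::size_t i, std::size_t j, std::size_t k) const noexcept {
    return data[(k*Ny + j)*Nx + i];
  }
};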

Alignment & allocators

Host (CPU)

  • Cache lines are 64 B on modern x86; aligning arrays to ≥64 B avoids line splits and false sharing. [INTEL-OPT]

  • Use std::aligned_alloc for explicitly aligned pools; the size must be an integral multiple of the alignment (C++17 rule); see the sketch after this list. [ALIGNED-ALLOC]
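
A minimal sketch of a cache-line-aligned host allocation, assuming C++17 (alloc_aligned64 is an illustrative helper, not a project API):

#include <cstdlib>
#include <cstddef>

// Allocate n doubles aligned to 64 B (one cache line on modern x86).
// std::aligned_alloc requires size to be a multiple of the alignment,
// so round the byte count up before calling it.
double* alloc_aligned64(std::size_t n) {
  constexpr std::size_t align = 64;
  std::size_t bytes = n * sizeof(double);
  bytes = (bytes + align - 1) / align * align;   // round up to multiple of 64
  return static_cast<double*>(std::aligned_alloc(align, bytes));
}
// Release with std::free(ptr).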

Device (GPU)

  • CUDA runtime/driver allocations are ≥256-byte aligned; some I/O paths (e.g., GDS) require larger (4 KiB) alignment. [RMM-ALIGN]

  • Prefer stream-ordered memory pools (cudaMallocAsync/mem pools) to amortize allocation overhead and reduce synchronization; see the sketch after this list. [CUDA-POOLS]
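
A minimal sketch of stream-ordered allocation from the device's default pool (device_scratch_example is an illustrative helper):

#include <cuda_runtime.h>

void device_scratch_example(cudaStream_t stream, size_t nbytes) {
  void* d_buf = nullptr;
  cudaMallocAsync(&d_buf, nbytes, stream);  // allocation ordered in `stream`
  // ... launch kernels that use d_buf on `stream` ...
  cudaFreeAsync(d_buf, stream);             // freed once prior stream work completes
}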

Host↔Device data movement

Pinned (page-locked) memory

Use pinned host buffers with cudaMemcpyAsync in streams to overlap copies with kernels and to reach higher PCIe/NVLink bandwidth. With pageable memory, cudaMemcpyAsync is staged through an internal buffer and may synchronize with the host, losing the overlap. [BPG]
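
A minimal sketch, assuming a pre-created stream (h2d_async and the buffer names are illustrative):

#include <cuda_runtime.h>

// Pinned host buffer: the precondition for a truly asynchronous copy.
void h2d_async(size_t nbytes, cudaStream_t stream) {
  void *h_buf = nullptr, *d_buf = nullptr;
  cudaMallocHost(&h_buf, nbytes);   // page-locked host allocation
  cudaMalloc(&d_buf, nbytes);
  // ... fill h_buf on the host ...
  cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
  // The copy proceeds in `stream`; the host thread is free to continue.
  // (cudaFreeHost/cudaFree at shutdown, omitted here.)
}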

Policy: explicit mirrors (default)

  • Host owns canonical SoA arrays; device mirrors are created once and reused.

  • Transfers: pack halo faces (if needed), enqueue H2D/D2H copies on dedicated streams, record events, and overlap with compute (see the sketch below). [BPG]
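
A sketch of the two-stream overlap, assuming interior cells do not depend on the halo (interior_kernel, boundary_kernel, and the stream/buffer names are illustrative):

#include <cuda_runtime.h>

__global__ void interior_kernel(double*);   // hypothetical kernels
__global__ void boundary_kernel(double*);

void step(double* d_field, double* d_halo, const double* h_halo,
          size_t halo_bytes, cudaStream_t copy_s, cudaStream_t comp_s) {
  cudaEvent_t halo_done;
  cudaEventCreate(&halo_done);
  cudaMemcpyAsync(d_halo, h_halo, halo_bytes, cudaMemcpyHostToDevice, copy_s);
  cudaEventRecord(halo_done, copy_s);
  interior_kernel<<<256, 256, 0, comp_s>>>(d_field);   // overlaps the copy
  cudaStreamWaitEvent(comp_s, halo_done, 0);           // boundary waits on the copy
  boundary_kernel<<<256, 256, 0, comp_s>>>(d_field);
  cudaEventDestroy(halo_done);
}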

Policy: Unified Memory (opt-in)

UM simplifies ownership (single pointer) but still benefits from prefetch and advice on performance-critical paths (a sketch follows the list):

  • cudaMemPrefetchAsync(ptr, nbytes, device, stream) to stage pages near the next kernel.

  • cudaMemAdvise (SetPreferredLocation, SetAccessedBy, SetReadMostly) to reduce page thrash. [UM-API], [UM-BLOG], [UM-ORNL], [UM-NASA]
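
A minimal sketch combining allocation, advice, and prefetch (um_setup is an illustrative helper):

#include <cuda_runtime.h>

void um_setup(size_t n, int device, cudaStream_t stream) {
  double* u = nullptr;
  cudaMallocManaged(&u, n * sizeof(double));
  // Prefer residency on `device`; keep a CPU mapping so occasional host
  // reads do not migrate pages back.
  cudaMemAdvise(u, n * sizeof(double), cudaMemAdviseSetPreferredLocation, device);
  cudaMemAdvise(u, n * sizeof(double), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
  // Stage pages on the device before the first kernel touches them.
  cudaMemPrefetchAsync(u, n * sizeof(double), device, stream);
}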

Halo exchange

Ghost layers & neighborhoods

We maintain a one-cell (configurable) ghost layer around each local subdomain and exchange faces, edges, and corners (26-neighbor connectivity in 3-D) every step for stencil updates. [MPI-HALO]
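
One way to enumerate the 26 neighbors, assuming the ranks live in a 3-D Cartesian communicator built with MPI_Cart_create and that all dimensions are periodic (non-periodic grids must bound-check coordinates before calling MPI_Cart_rank):

#include <mpi.h>

void neighbor_ranks(MPI_Comm cart, int nbr[26]) {
  int rank, coords[3];
  MPI_Comm_rank(cart, &rank);
  MPI_Cart_coords(cart, rank, 3, coords);
  int n = 0;
  for (int dz = -1; dz <= 1; ++dz)
    for (int dy = -1; dy <= 1; ++dy)
      for (int dx = -1; dx <= 1; ++dx) {
        if (!dx && !dy && !dz) continue;   // skip self
        int c[3] = {coords[0] + dx, coords[1] + dy, coords[2] + dz};
        MPI_Cart_rank(cart, c, &nbr[n++]); // wraps in periodic dimensions
      }
}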

Non-blocking progression & overlap

The exchange uses MPI_Irecv/MPI_Isend followed by MPI_Waitall; interior compute proceeds while messages progress. Overlap is implementation-dependent, but the non-blocking pattern is the standard route to exposing concurrency. [MPI-NB]
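
A sketch of the pattern (the buffer/count arrays are illustrative; tags pair each direction with its opposite, which is index 25-n in the enumeration above):

#include <mpi.h>

void halo_exchange(double* recv_buf[26], double* send_buf[26],
                   const int count[26], const int nbr[26], MPI_Comm cart) {
  MPI_Request reqs[52];
  // Post all receives first, then all sends; tag by direction index.
  for (int n = 0; n < 26; ++n)
    MPI_Irecv(recv_buf[n], count[n], MPI_DOUBLE, nbr[n], 25 - n, cart, &reqs[n]);
  for (int n = 0; n < 26; ++n)
    MPI_Isend(send_buf[n], count[n], MPI_DOUBLE, nbr[n], n, cart, &reqs[26 + n]);
  // ... update interior cells that need no ghost values ...
  MPI_Waitall(52, reqs, MPI_STATUSES_IGNORE);
  // ... update boundary cells using the freshly received ghosts ...
}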

Datatype option (packing-free faces)

Where convenient, use MPI_Type_create_subarray (or vector/contiguous types) to describe faces/edges directly in memory and avoid manual pack/unpack. [MPI-SUBARRAY]
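
For example, a datatype for one x-plane of a local Nx×Ny×Nz block (a sketch; extents are assumed to include the one-cell ghost layer, and MPI_ORDER_C puts the fastest-varying x dimension last):

#include <mpi.h>

MPI_Datatype make_x_face(int Nx, int Ny, int Nz) {
  int sizes[3]    = {Nz, Ny, Nx};  // memory order: x fastest => x is last dim
  int subsizes[3] = {Nz, Ny, 1};   // full y/z extent, one cell thick in x
  int starts[3]   = {0, 0, 1};     // first interior x-plane (skip ghost at x = 0)
  MPI_Datatype face;
  MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C,
                           MPI_DOUBLE, &face);
  MPI_Type_commit(&face);
  return face;                     // usable directly in MPI_Isend/MPI_Irecv
}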

Threading notes

On hybrid nodes, OpenMP tasks/threads can dedicate a team to halo progress while others compute local cells. [MPI-OMP]
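
A minimal sketch with OpenMP (MPI_THREAD_FUNNELED suffices here because only thread 0, the main thread, calls MPI; compute_interior is hypothetical):

#include <mpi.h>
#include <omp.h>

void overlap_step(MPI_Request* reqs, int nreq /*, grid arguments ... */) {
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0) {
      // Thread 0 pokes the MPI progress engine until the exchange completes.
      int done = 0;
      while (!done) MPI_Testall(nreq, reqs, &done, MPI_STATUSES_IGNORE);
    } else {
      // Remaining threads update interior cells (no ghost dependence).
      // compute_interior(...);
    }
  }
}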

Error handling & invariants

  • All allocations come from a single owner; views never free.

  • Host allocations meet alignment invariants (≥64 B); device meets ≥256 B alignment.

  • Transfers that claim asynchrony must originate from pinned buffers.

  • MPI requests are completed before buffer reuse.

  • In UM mode, prefetch pages before first-touch kernels in tight loops.

References

[BPG]

NVIDIA, CUDA C++ Best Practices Guide. Coalesced access, pinned memory & async copies with streams; guidance on overlapping copy/compute. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ (accessed Aug 25 2025)

[CUDA-POOLS]

NVIDIA, CUDA Runtime API — Memory Pools / Stream-Ordered Allocator (cudaMallocAsync, cudaMemPool*). https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html

[RMM-ALIGN]

RAPIDS RMM Docs, Memory Resources — CUDA allocations are aligned to at least 256 bytes; some paths (e.g., GDS) need larger alignment. https://docs.rapids.ai/api/rmm/nightly/librmm_docs/memory_resources/

[INTEL-OPT]

Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual — cache line is 64 B; unit-stride & alignment guidance. https://cdrdv2-public.intel.com/814198/248966-Optimization-Reference-Manual-V1-049.pdf

[ALIGNED-ALLOC]

cppreference, std::aligned_alloc (C++17) — size must be an integral multiple of alignment. https://en.cppreference.com/w/cpp/memory/c/aligned_alloc

[ROWMAJOR]

Wikipedia, Row- and column-major order — linear index formulas & row-major background. https://en.wikipedia.org/wiki/Row-_and_column-major_order

[SPAN]

cppreference, std::span — non-owning view over a contiguous sequence (analogy for host views). https://en.cppreference.com/w/cpp/container/span.html

[KOKKOS-VIEW]

Kokkos, View — Multidimensional array — unmanaged/wrapping existing allocations. https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html

[UM-API]

NVIDIA Docs, CUDA C++ Programming Guide / Runtime API — Unified Memory (cudaMemPrefetchAsync, cudaMemAdvise). https://docs.nvidia.com/cuda/cuda-c-programming-guide/

[UM-BLOG]

NVIDIA Developer Blog, Maximizing Unified Memory Performance in CUDA — when/why to prefetch & advise. https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/

[UM-ORNL]

ORNL OLCF Training, CUDA Unified Memory slides — concise overview & best practices. https://www.olcf.ornl.gov/wp-content/uploads/2019/06/06_Managed_Memory.pdf

[UM-NASA]

NASA HECC (2025), Simplifying GPU Programming with Unified Memory. https://www.nas.nasa.gov/hecc/support/kb/simplifying-gpu-programming-with-unified-memory_703.html

[MPI-HALO]

SC’24 Poster / arXiv (2025), Persistent and Partitioned MPI for Stencil Communication — defines halo exchange (3-D faces/edges/corners). https://arxiv.org/html/2508.13370v1

[MPI-NB]

ENCCS, Non-blocking point-to-point — performant stencil workflow — overlap is implementation-dependent, pattern for correctness. https://enccs.github.io/intermediate-mpi/non-blocking-communication-pt1/

[MPI-SUBARRAY]

RookieHPC, MPI_Type_create_subarray — using subarray datatypes for strided faces. https://rookiehpc.org/mpi/docs/mpi_type_create_subarray/index.html

[MPI-OMP]

ENCCS, MPI and threads in practice — OpenMP tasking with halo exchange. https://enccs.github.io/intermediate-mpi/mpi-and-threads-pt2/