.. _memory:

Memory & Data-Layout
====================

This page describes the project’s memory subsystem in depth: **Structure-of-Arrays (SoA) fields with linear indexing**, a **single owner** with non-owning **views**, explicit **alignment** and **pools** on CPU/GPU, efficient **host↔device transfers** (pinned + streams), optional **Unified Memory** with prefetch/advice, and scalable **halo exchange** (faces/edges/corners) via non-blocking MPI.

High-level goals
----------------

- Keep inner loops pointer-fast and vectorizable on CPU, and coalesced on GPU. [BPG]_
- Centralize ownership (RAII) and expose zero-overhead views for kernels. [SPAN]_
- Minimize (de)allocation cost via pooling on device and aligned pools on host. [CUDA-POOLS]_
- Overlap data movement and compute where practical (streams + non-blocking MPI). [BPG]_, [MPI-NB]_
- Keep the API policy-agnostic: explicit copies by default; UM as an opt-in. [UM-API]_

Core concepts
-------------

Structure-of-Arrays (SoA)
^^^^^^^^^^^^^^^^^^^^^^^^^

Each primitive (``rho``, ``ux``, ``uy``, ``uz``, ``p``, …) lives in a separate, contiguous 1-D array. Neighboring threads/lanes therefore touch neighboring elements, enabling (1) unit-stride **CPU SIMD** vector loads/stores and (2) coalesced **GPU** global-memory transactions. [INTEL-OPT]_, [BPG]_

Layout & linear indexing
^^^^^^^^^^^^^^^^^^^^^^^^

Data are stored row-major and indexed linearly:

.. code-block:: c++

   // (i,j,k) → idx for an Nx × Ny × Nz grid (row-major: x varies fastest).
   // Nz does not enter the formula; it only bounds k.
   inline size_t idx(size_t i, size_t j, size_t k,
                     size_t Nx, size_t Ny) noexcept {
       return (k*Ny + j)*Nx + i;
   }

Row-major formulas and their tradeoffs are standard; keep the innermost loop unit-stride. [ROWMAJOR]_

Ownership & views
^^^^^^^^^^^^^^^^^

``MemoryManager`` owns the raw buffers and returns non-owning views that carry shape/strides (host) or device pointers for kernels. On CPU, views model ``std::span`` semantics (contiguous, non-owning); for portability, Kokkos “unmanaged views” are an analogous concept. [SPAN]_, [KOKKOS-VIEW]_

Alignment & allocators
----------------------

Host (CPU)
^^^^^^^^^^

- **Cache lines are 64 B** on modern x86; aligning arrays to ≥64 B avoids line splits and false sharing. [INTEL-OPT]_
- Use ``std::aligned_alloc`` for explicitly aligned pools; **the size must be a multiple of the alignment** (C++17 rule). [ALIGNED-ALLOC]_

Device (GPU)
^^^^^^^^^^^^

- CUDA runtime/driver allocations are **≥256-byte aligned**; some I/O paths (e.g., GDS) require larger (4 KiB) alignment. [RMM-ALIGN]_
- Prefer **stream-ordered memory pools** (``cudaMallocAsync``/memory pools) to amortize allocation overhead and reduce synchronization. [CUDA-POOLS]_

Host↔Device data movement
-------------------------

Pinned (page-locked) memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use **pinned** host buffers with ``cudaMemcpyAsync`` in streams to overlap copies with kernels and to reach higher PCIe/NVLink bandwidth. Copies from pageable memory fall back to synchronous behavior. [BPG]_

Policy: explicit mirrors (default)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Host owns the canonical SoA arrays; device mirrors are created once and reused.
- Transfers: pack halo faces (if needed), enqueue H2D/D2H on dedicated streams, record events, and overlap with compute. [BPG]_

Policy: Unified Memory (opt-in)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

UM simplifies ownership (a single pointer) but still benefits from **prefetch** and **advice** on performance-critical paths:

- ``cudaMemPrefetchAsync(ptr, nbytes, device, stream)`` to stage pages near the next kernel.
- ``cudaMemAdvise`` (``SetPreferredLocation``, ``SetAccessedBy``, ``SetReadMostly``) to reduce page thrash. [UM-API]_, [UM-BLOG]_, [UM-ORNL]_, [UM-NASA]_

Minimal sketches of the aligned host pool, the explicit-mirror transfer path, and the UM prefetch path follow below.
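To ground the SoA layout and the host-side alignment rules, here is a minimal single-owner sketch, assuming C++17. The ``FieldsSoA`` type, its field names, and ``round_up`` are illustrative stand-ins, not the project's actual ``MemoryManager`` API. The one non-obvious detail is the size rounding: C++17 requires the size passed to ``std::aligned_alloc`` to be a multiple of the alignment. [ALIGNED-ALLOC]_

.. code-block:: c++

   #include <cstddef>   // std::size_t
   #include <cstdlib>   // std::aligned_alloc, std::free

   // C++17 requires the size passed to std::aligned_alloc to be a
   // multiple of the alignment, so round the byte count up first.
   inline std::size_t round_up(std::size_t bytes, std::size_t align) noexcept {
       return (bytes + align - 1) / align * align;
   }

   // Illustrative SoA bundle: each primitive is a separate, contiguous,
   // cache-line-aligned 1-D array; the bundle is the single owner.
   struct FieldsSoA {
       static constexpr std::size_t kAlign = 64;   // x86 cache-line size
       std::size_t n = 0;
       float* rho = nullptr;
       float* ux  = nullptr;                       // uy, uz, p, ... elided

       explicit FieldsSoA(std::size_t count) : n(count) {
           const std::size_t bytes = round_up(n * sizeof(float), kAlign);
           rho = static_cast<float*>(std::aligned_alloc(kAlign, bytes));
           ux  = static_cast<float*>(std::aligned_alloc(kAlign, bytes));
       }
       ~FieldsSoA() { std::free(rho); std::free(ux); }

       FieldsSoA(const FieldsSoA&) = delete;            // single owner:
       FieldsSoA& operator=(const FieldsSoA&) = delete; // views never free
   };

Kernels would receive ``rho``/``ux`` through non-owning views (``std::span`` on host); checks on the allocation results are omitted for brevity.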
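The default explicit-mirror policy might look as follows: a minimal sketch, assuming CUDA 11.2+ for the stream-ordered allocator, with an illustrative ``scale`` kernel and no error checking. The essential ingredients are the pinned host buffer, without which ``cudaMemcpyAsync`` degrades to effectively synchronous behavior, and the single stream that orders copy, kernel, and copy-back. [BPG]_, [CUDA-POOLS]_

.. code-block:: c++

   #include <cuda_runtime.h>
   #include <cstdio>

   __global__ void scale(float* x, int n, float s) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) x[i] *= s;
   }

   int main() {
       const int n = 1 << 20;
       const size_t bytes = n * sizeof(float);

       cudaStream_t stream;
       cudaStreamCreate(&stream);

       float* h_rho = nullptr;                               // pinned host mirror:
       cudaMallocHost(reinterpret_cast<void**>(&h_rho), bytes); // async-capable
       for (int i = 0; i < n; ++i) h_rho[i] = 1.0f;

       float* d_rho = nullptr;                               // device mirror from
       cudaMallocAsync(reinterpret_cast<void**>(&d_rho),     // the stream-ordered
                       bytes, stream);                       // memory pool

       // Enqueue H2D copy, kernel, and D2H copy on one stream; the host is
       // free to do other work (e.g., pack halos) until the sync point.
       cudaMemcpyAsync(d_rho, h_rho, bytes, cudaMemcpyHostToDevice, stream);
       scale<<<(n + 255) / 256, 256, 0, stream>>>(d_rho, n, 2.0f);
       cudaMemcpyAsync(h_rho, d_rho, bytes, cudaMemcpyDeviceToHost, stream);
       cudaFreeAsync(d_rho, stream);   // stream-ordered: runs after the copy

       cudaStreamSynchronize(stream);  // complete before reusing h_rho
       std::printf("h_rho[0] = %f\n", h_rho[0]);

       cudaFreeHost(h_rho);
       cudaStreamDestroy(stream);
       return 0;
   }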
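And for the opt-in UM policy, a prefetch-and-advise sketch using the classic runtime-API signatures (``cudaMemAdvise``/``cudaMemPrefetchAsync`` with an ``int`` device argument); the ``axpy`` kernel, sizes, and device index are again illustrative. [UM-API]_

.. code-block:: c++

   #include <cuda_runtime.h>
   #include <cstdio>

   __global__ void axpy(float* y, const float* x, float a, int n) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) y[i] += a * x[i];
   }

   int main() {
       const int n = 1 << 20;
       const size_t bytes = n * sizeof(float);
       const int dev = 0;
       cudaStream_t stream;
       cudaStreamCreate(&stream);

       float *x = nullptr, *y = nullptr;      // single pointers, valid on
       cudaMallocManaged(reinterpret_cast<void**>(&x), bytes); // host and
       cudaMallocManaged(reinterpret_cast<void**>(&y), bytes); // device
       for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 0.0f; }

       // Advice: x is read-mostly (read-only replicas are cheap);
       // y should preferentially live on the GPU doing the writes.
       cudaMemAdvise(x, bytes, cudaMemAdviseSetReadMostly, dev);
       cudaMemAdvise(y, bytes, cudaMemAdviseSetPreferredLocation, dev);

       // Stage pages near the next kernel instead of faulting them in.
       cudaMemPrefetchAsync(x, bytes, dev, stream);
       cudaMemPrefetchAsync(y, bytes, dev, stream);
       axpy<<<(n + 255) / 256, 256, 0, stream>>>(y, x, 2.0f, n);

       // Bring results back before CPU-side post-processing.
       cudaMemPrefetchAsync(y, bytes, cudaCpuDeviceId, stream);
       cudaStreamSynchronize(stream);
       std::printf("y[0] = %f\n", y[0]);

       cudaFree(x); cudaFree(y);
       cudaStreamDestroy(stream);
       return 0;
   }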
Halo exchange
-------------

Ghost layers & neighborhoods
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We maintain a one-cell (configurable) ghost layer around each local subdomain and exchange **faces, edges, and corners** (26 neighbors in 3-D) every step for stencil updates. [MPI-HALO]_

Non-blocking progression & overlap
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The exchange uses ``MPI_Irecv``/``MPI_Isend`` + ``MPI_Waitall``; interior compute proceeds while messages progress. Overlap is **implementation-dependent**, but the non-blocking pattern is the standard route to exposing concurrency. [MPI-NB]_

Datatype option (packing-free faces)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Where convenient, use ``MPI_Type_create_subarray`` (or vector/contiguous types) to describe faces/edges directly in memory and avoid manual pack/unpack; see the sketch after the threading notes. [MPI-SUBARRAY]_

Threading notes
^^^^^^^^^^^^^^^

On hybrid nodes, OpenMP tasks/threads can dedicate one team to halo progress while the others compute local cells. [MPI-OMP]_
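To tie the subsections above together, here is a minimal sketch of one direction of the exchange, assuming a periodic 1-D Cartesian decomposition along x, a ghost width of 1, and an illustrative ``face_at`` helper; the full 26-neighbor exchange repeats the same ``MPI_Irecv``/``MPI_Isend``/``MPI_Waitall`` pattern per direction. Faces are described in place with ``MPI_Type_create_subarray``, so no pack/unpack buffers are needed. [MPI-NB]_, [MPI-SUBARRAY]_

.. code-block:: c++

   #include <mpi.h>
   #include <vector>

   // Subarray type for the y-z plane at x-index i (interior extent Ny*Nz,
   // skipping the ghost frame in y and z). sizes[] is in C order: k, j, i.
   static MPI_Datatype face_at(int i, const int sizes[3], int Ny, int Nz) {
       int subsizes[3] = {Nz, Ny, 1};
       int starts[3]   = {1, 1, i};
       MPI_Datatype t;
       MPI_Type_create_subarray(3, sizes, subsizes, starts,
                                MPI_ORDER_C, MPI_DOUBLE, &t);
       MPI_Type_commit(&t);
       return t;
   }

   int main(int argc, char** argv) {
       MPI_Init(&argc, &argv);

       // Illustrative local grid: Nx*Ny*Nz interior cells + 1-cell ghosts.
       const int Nx = 32, Ny = 32, Nz = 32;
       const int sizes[3] = {Nz + 2, Ny + 2, Nx + 2};
       std::vector<double> rho((size_t)sizes[0] * sizes[1] * sizes[2], 0.0);

       // Periodic 1-D decomposition along x, for brevity.
       MPI_Comm cart;
       int nranks, dims[1] = {0}, periods[1] = {1};
       MPI_Comm_size(MPI_COMM_WORLD, &nranks);
       MPI_Dims_create(nranks, 1, dims);
       MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);
       int left, right;
       MPI_Cart_shift(cart, 0, 1, &left, &right);

       MPI_Datatype send_lo = face_at(1,      sizes, Ny, Nz); // first interior
       MPI_Datatype send_hi = face_at(Nx,     sizes, Ny, Nz); // last interior
       MPI_Datatype recv_lo = face_at(0,      sizes, Ny, Nz); // left ghost
       MPI_Datatype recv_hi = face_at(Nx + 1, sizes, Ny, Nz); // right ghost

       // Post receives first, then sends; interior-cell compute could run
       // here, before MPI_Waitall completes the exchange.
       MPI_Request req[4];
       MPI_Irecv(rho.data(), 1, recv_lo, left,  0, cart, &req[0]);
       MPI_Irecv(rho.data(), 1, recv_hi, right, 1, cart, &req[1]);
       MPI_Isend(rho.data(), 1, send_lo, left,  1, cart, &req[2]);
       MPI_Isend(rho.data(), 1, send_hi, right, 0, cart, &req[3]);
       MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   // complete before reuse

       MPI_Type_free(&send_lo); MPI_Type_free(&send_hi);
       MPI_Type_free(&recv_lo); MPI_Type_free(&recv_hi);
       MPI_Comm_free(&cart);
       MPI_Finalize();
       return 0;
   }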
Error handling & invariants
---------------------------

- All allocations come from a **single owner**; views never free.
- Host allocations meet **alignment** invariants (≥64 B); device meets **≥256 B** alignment.
- Transfers that claim asynchrony **must** originate from **pinned** buffers.
- MPI requests are completed before buffer reuse.
- UM mode must prefetch before first-touch kernels in tight loops.

References
----------

.. [BPG] NVIDIA, *CUDA C++ Best Practices Guide* — coalesced access; pinned memory and async copies with streams; guidance on overlapping copy/compute. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ (accessed Aug 25 2025)

.. [CUDA-POOLS] NVIDIA, *CUDA Runtime API — Memory Pools / Stream-Ordered Allocator* (``cudaMallocAsync``, ``cudaMemPool*``). https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html

.. [RMM-ALIGN] RAPIDS RMM docs, *Memory Resources* — CUDA allocations are aligned to at least 256 bytes; some paths (e.g., GDS) need larger alignment. https://docs.rapids.ai/api/rmm/nightly/librmm_docs/memory_resources/

.. [INTEL-OPT] Intel, *Intel® 64 and IA-32 Architectures Optimization Reference Manual* — 64 B cache lines; unit-stride and alignment guidance. https://cdrdv2-public.intel.com/814198/248966-Optimization-Reference-Manual-V1-049.pdf

.. [ALIGNED-ALLOC] cppreference, ``std::aligned_alloc`` (C++17) — the size must be an integral multiple of the alignment. https://en.cppreference.com/w/cpp/memory/c/aligned_alloc

.. [ROWMAJOR] Wikipedia, *Row- and column-major order* — linear-index formulas and row-major background. https://en.wikipedia.org/wiki/Row-_and_column-major_order

.. [SPAN] cppreference, ``std::span`` — non-owning view over a contiguous sequence (the analogy for host views). https://en.cppreference.com/w/cpp/container/span.html

.. [KOKKOS-VIEW] Kokkos, *View — Multidimensional array* — unmanaged views wrapping existing allocations. https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html

.. [UM-API] NVIDIA, *CUDA C++ Programming Guide / Runtime API — Unified Memory* (``cudaMemPrefetchAsync``, ``cudaMemAdvise``). https://docs.nvidia.com/cuda/cuda-c-programming-guide/

.. [UM-BLOG] NVIDIA Developer Blog, *Maximizing Unified Memory Performance in CUDA* — when and why to prefetch and advise. https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/

.. [UM-ORNL] ORNL OLCF training, *CUDA Unified Memory* slides — concise overview and best practices. https://www.olcf.ornl.gov/wp-content/uploads/2019/06/06_Managed_Memory.pdf

.. [UM-NASA] NASA HECC (2025), *Simplifying GPU Programming with Unified Memory*. https://www.nas.nasa.gov/hecc/support/kb/simplifying-gpu-programming-with-unified-memory_703.html

.. [MPI-HALO] SC’24 poster / arXiv (2025), *Persistent and Partitioned MPI for Stencil Communication* — defines halo exchange (3-D faces/edges/corners). https://arxiv.org/html/2508.13370v1

.. [MPI-NB] ENCCS, *Non-blocking point-to-point — performant stencil workflow* — overlap is implementation-dependent; the pattern is the route to correctness. https://enccs.github.io/intermediate-mpi/non-blocking-communication-pt1/

.. [MPI-SUBARRAY] RookieHPC, *MPI_Type_create_subarray* — using subarray datatypes for strided faces. https://rookiehpc.org/mpi/docs/mpi_type_create_subarray/index.html

.. [MPI-OMP] ENCCS, *MPI and threads in practice* — OpenMP tasking with halo exchange. https://enccs.github.io/intermediate-mpi/mpi-and-threads-pt2/