Managed Memory

Objectives

  • Understand the difference between host memory, device memory, and managed memory

  • Allocate managed memory and understand its synchronisation requirements

  • Understand synchronisation between host and device

Instructor note

  • 30 min teaching

  • 30 min exercises

Memory spaces

A CUDA system has two main memory spaces that are physically separate:

  • Host (main) memory — CPU RAM, accessible by host code

  • Device (global) memory — GPU VRAM, accessible by device code

Without special annotations or allocation functions, data resides in host memory and is not accessible from the device. To use data on the GPU, you must either:

  1. Use managed memory — the CUDA runtime handles transfers automatically, or

  2. Explicitly allocate device memory and copy data between host and device.
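Option 2 can be sketched with the standard `cudaMalloc`/`cudaMemcpy` calls. This is a minimal, hedged example; the array name and sizes are illustrative:

```c
#include <cuda_runtime.h>

int main(void) {
    const int N = 1024;
    double host_data[1024];
    for (int i = 0; i < N; i++)
        host_data[i] = (double)i;

    // Allocate device (global) memory
    double *device_data;
    cudaMalloc((void**)&device_data, N * sizeof(double));

    // Copy host -> device before launching kernels that use device_data
    cudaMemcpy(device_data, host_data, N * sizeof(double), cudaMemcpyHostToDevice);

    // ... launch kernels operating on device_data ...

    // Copy results back device -> host (cudaMemcpy is synchronous by default)
    cudaMemcpy(host_data, device_data, N * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(device_data);
    return 0;
}
```

With managed memory, both `cudaMemcpy` calls disappear; the trade-off is that the runtime decides when data moves.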

Figure: Memory spaces — host (main) memory and device (global) memory.

Managed memory (unified memory)

Managed memory is the simplest way to share data between host and device. The CUDA runtime automatically migrates data as needed.

Since CUDA 4, Unified Virtual Addressing (UVA) provides a single address space shared by CPU and all GPUs. Managed memory (available since CUDA 6) builds on UVA to provide automatic data migration.

#include <cuda_runtime.h>

// Statically declared managed array: accessible from both host and device
__device__ __managed__ double X[1024];

__global__ void add_kernel(double* x, double* y, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length)
        x[index] += y[index];
}

int main(void) {
    // Dynamically allocated managed memory
    double *Y;
    cudaMallocManaged((void**)&Y, 1024 * sizeof(double));

    // Managed data can be initialised directly from host code
    for (int i = 0; i < 1024; i++) {
        X[i] = (double)i;
        Y[i] = 2.0 * i;
    }

    add_kernel<<<4, 256>>>(X, Y, 1024);
    cudaDeviceSynchronize();  // Must synchronise before host access!

    cudaFree(Y);
    return 0;
}

Advantages:

  • Data is moved automatically — no need to manage transfers

  • “Oversubscription” is possible (data can exceed GPU memory)

Disadvantages:

  • The programmer must still synchronise before accessing data on the other side

  • Automatic transfers may not be optimal — the runtime cannot always predict access patterns

Important rules:

  • const-qualified variables and C++ references cannot be declared as managed memory

  • C++ classes/structs with __managed__ members have many restrictions; Fortran derived types with managed members work more freely

  • Managed memory can only be allocated and freed in host code

Managed memory API
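The core calls are summarised in the sketch below. The prefetch and advise hints are optional performance tools (available since CUDA 8); the variable names are illustrative:

```c
double *data;
size_t bytes = 1024 * sizeof(double);

// Allocate managed memory (host code only)
cudaMallocManaged((void**)&data, bytes);

// Optional: prefetch to a given device (or cudaCpuDeviceId for the host)
// before a kernel launch, to avoid page faults on first access
int device = 0;
cudaMemPrefetchAsync(data, bytes, device);

// Optional: advise the runtime about the expected access pattern
cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

// Free with the same call used for ordinary device memory
cudaFree(data);
```

Without hints, the runtime migrates pages on demand, which is correct but can be slow for large, predictable access patterns.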

Asynchronous execution and synchronisation

Since kernel launches are asynchronous, you must call cudaDeviceSynchronize() before accessing managed memory on the host after a kernel that modifies it. Without synchronisation, the host may read stale or partially written data; on systems without hardware page-fault support (pre-Pascal GPUs), touching managed memory from the host while a kernel is running can even cause a segmentation fault.

  • Incorrect: the host reads data before the kernel finishes, or reads partially written data

  • Correct: the host waits for the kernel to finish before reading
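These patterns can be sketched as follows, reusing the add_kernel example from above (kernel and sizes are illustrative):

```c
// Incorrect: the kernel may still be running when the host reads X
add_kernel<<<4, 256>>>(X, Y, 1024);
printf("%f\n", X[0]);  // race: may read stale or partially written data

// Correct: block the host until all preceding kernels have finished
add_kernel<<<4, 256>>>(X, Y, 1024);
cudaDeviceSynchronize();
printf("%f\n", X[0]);  // safe: the kernel's writes are complete
```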

Key synchronisation primitives:

  • cudaDeviceSynchronize() — blocks the host until all preceding kernels and memory operations complete

  • __syncthreads() (C) / call syncthreads() (Fortran) — barrier for all threads in a thread block

  • cudaMemcpy — synchronous by default (acts as an implicit barrier)

  • __threadfence() — stalls the calling thread until its prior global/shared memory writes are visible to other threads; it orders memory accesses but is not an execution barrier
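To illustrate __syncthreads(), here is a hedged sketch of a block-level sum reduction; the barrier is needed between the shared-memory load and each reduction step (kernel name and block size are illustrative):

```c
__global__ void block_sum(const double* input, double* block_sums, int length) {
    __shared__ double partial[256];  // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    // Each thread loads one element into shared memory
    partial[tid] = (index < length) ? input[index] : 0.0;
    __syncthreads();  // all loads must complete before any thread reduces

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();  // barrier between reduction steps
    }

    if (tid == 0)
        block_sums[blockIdx.x] = partial[0];
}
```

Note that __syncthreads() synchronises only threads within one block; synchronising across blocks requires ending the kernel (or cooperative groups).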

Exercise: Vector-scalar multiplication on GPU

Vector-Scalar Multiplication with Managed Memory

Given the following CPU code that computes c[i] = a[i] * b for a vector:
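The referenced CPU code is not reproduced here; a plausible serial baseline for c[i] = a[i] * b might look like this (function and variable names are hypothetical, not the course's original):

```c
void scalar_multiply(const double* a, double* c, double b, int length) {
    for (int i = 0; i < length; i++)
        c[i] = a[i] * b;
}
```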

Tasks:

  1. Modify the code to use cudaMallocManaged (C) or the managed attribute (Fortran) for the arrays.

  2. Write a kernel that performs the multiplication.

  3. Start with <<<1, 1>>> (single thread), then modify to use many threads and blocks.

  4. Add appropriate synchronisation.

Parallel Vector-Scalar Multiplication

Modify your solution to use many threads in many blocks:

  1. Use threadIdx.x, blockDim.x, blockIdx.x, and gridDim.x to distribute work across threads.

  2. Advanced: Can you remove the loop from the kernel entirely by using enough blocks to cover the vector?
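One possible shape for the loop-free kernel in step 2 — a sketch under the assumption of one element per thread, not the course's reference solution:

```c
__global__ void scale_kernel(const double* a, double* c, double b, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length)          // guard: the grid may overshoot the vector length
        c[index] = a[index] * b;
}

// Launch with enough blocks to cover every element:
//   int threads = 256;
//   int blocks = (length + threads - 1) / threads;  // ceiling division
//   scale_kernel<<<blocks, threads>>>(a, c, b, length);
//   cudaDeviceSynchronize();
```

The bounds check is essential because the last block usually contains more threads than remaining elements.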

Keypoints

  • Managed memory (cudaMallocManaged) is the simplest approach — data migrates automatically, but you must synchronise before host access

  • Always synchronise (cudaDeviceSynchronize) before accessing GPU-modified data on the host