Pinned and Constant Memory¶

Objectives

Know when and how to use pinned (page-locked) and constant memory

Instructor note

15 min teaching
0 min exercises

Page-locked (pinned) memory¶

Regular host memory allocated with malloc may be paged out by the OS. When cudaMemcpy copies data from pageable memory, it must first copy it to a pinned staging buffer, then transfer it over PCIe — this adds overhead.

Pinned (page-locked) memory is guaranteed to stay in physical memory and can be transferred directly, providing higher bandwidth.

Data transfer with staging buffer vs. pinned memory

Direct transfer with pinned memory

Advantages:

Faster copy to/from device (no staging buffer needed)
Can be accessed directly from device code (with UVA, no explicit cudaHostGetDevicePointer needed)
Required for asynchronous memory copies (covered in the Streams episode)

Caveats:

Be careful with race conditions when accessing mapped pinned memory from both host and device
CPU locks are not visible to the GPU; GPU atomics do not work on pinned host memory
Access over PCIe is slow compared to device memory
CUDA Fortran does not allow pinned memory in kernel arguments

cudaError_t cudaHostAlloc(void** pHost, size_t size, unsigned int flags);
cudaError_t cudaMallocHost(void** ptr, size_t size);
cudaError_t cudaFreeHost(void* ptr);
// flags: cudaHostAllocMapped to map into device address space

real, allocatable, pinned :: q(:)
allocate(q(1024))

Constant memory¶

Constant memory is a special read-only memory space that resides in global memory but uses a dedicated constant cache on each SM (8 KB). It is useful for data that is:

Read by all threads
Never written from device code
Small (maximum 64 KB per device)

Recall that variables passed by value to kernels are automatically stored in constant memory.

__constant__ float constData[256];

__global__ void myKernel() {
    float a = constData[0];  // Read from constant memory
}

// Host code: copy to/from constant memory
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data), 0, cudaMemcpyDefault);
cudaMemcpyFromSymbol(data, constData, sizeof(data), 0, cudaMemcpyDefault);

module kernels
  real :: c_d(100)
  attributes(constant) :: c_d

  attributes(global) subroutine init()
    real :: c2
    c2 = c_d(2)  ! Read from constant memory
  end subroutine
end module

program main
  use kernels
  real :: c(100)
  c_d = c       ! Copy host → constant memory
end program

Restrictions:

Cannot be dynamically allocated
Maximum 64 KB per device
Must not be written from device code
In C/C++ host code, constant memory variables are symbols — do not use pointer arithmetic on them

Keypoints

Pinned memory enables faster transfers and is required for asynchronous copies
Constant memory provides cached, read-only access for small data (max 64 KB)