Pinned and Constant Memory

Objectives

  • Know when and how to use pinned (page-locked) and constant memory

Instructor note

  • 15 min teaching

  • 0 min exercises

Page-locked (pinned) memory

Regular host memory allocated with malloc may be paged out by the OS. When cudaMemcpy copies data from pageable memory, it must first copy it to a pinned staging buffer, then transfer it over PCIe — this adds overhead.

Pinned (page-locked) memory is guaranteed to stay in physical memory and can be transferred directly, providing higher bandwidth.

Data transfer with staging buffer vs. pinned memory

Direct transfer with pinned memory

Advantages:

  • Faster copy to/from device (no staging buffer needed)

  • Can be accessed directly from device code (with UVA, no explicit cudaHostGetDevicePointer needed)

  • Required for asynchronous memory copies (covered in the Streams episode)

Caveats:

  • Be careful with race conditions when accessing mapped pinned memory from both host and device

  • CPU locks are not visible to the GPU; GPU atomics do not work on pinned host memory

  • Access over PCIe is slow compared to device memory

  • CUDA Fortran does not allow pinned memory in kernel arguments

cudaError_t cudaHostAlloc(void** pHost, size_t size, unsigned int flags);
cudaError_t cudaMallocHost(void** ptr, size_t size);
cudaError_t cudaFreeHost(void* ptr);
// flags: cudaHostAllocMapped to map into device address space

Constant memory

Constant memory is a special read-only memory space that resides in global memory but uses a dedicated constant cache on each SM (8 KB). It is useful for data that is:

  • Read by all threads

  • Never written from device code

  • Small (maximum 64 KB per device)

Recall that variables passed by value to kernels are automatically stored in constant memory.

__constant__ float constData[256];

__global__ void myKernel() {
    float a = constData[0];  // Read from constant memory
}

// Host code: copy to/from constant memory
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data), 0, cudaMemcpyDefault);
cudaMemcpyFromSymbol(data, constData, sizeof(data), 0, cudaMemcpyDefault);

Restrictions:

  • Cannot be dynamically allocated

  • Maximum 64 KB per device

  • Must not be written from device code

  • In C/C++ host code, constant memory variables are symbols — do not use pointer arithmetic on them

Keypoints

  • Pinned memory enables faster transfers and is required for asynchronous copies

  • Constant memory provides cached, read-only access for small data (max 64 KB)