Synchronisation and Summary

Objectives

  • Understand synchronisation between host and device

  • Know the characteristics and trade-offs of the different CUDA memory types

  • Understand pointer rules for host and device code in C/C++

Instructor note

  • 15 min teaching

  • 0 min exercises

Memory kinds summary

Memory            When to use                                       Size                Notes
Global memory     Always — default place to store data              4–384 GB            Accessible from all threads
Shared memory     User-defined cache, sharing data within a block   64–228 KiB per SM   Fast; shares space with L1 cache
Constant memory   Device-constant data, read by all threads         64 KiB              Part of global memory with special caching
Local memory      Implicitly used for register spills               –                   Part of global memory, private per thread
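The table above can be illustrated with a single kernel that touches each explicitly managed memory kind. This is a minimal sketch, not from the lesson; the kernel name, the constant `scale`, and the block size of 256 are illustrative choices.

```cuda
#include <cuda_runtime.h>

__constant__ float scale;          // constant memory: read-only in kernels, 64 KiB total

__global__ void scaleInput(const float *in, float *out, int n)
{
    __shared__ float tile[256];    // shared memory: one copy per thread block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i]; // global -> shared
    __syncthreads();               // writes to tile[] are now visible block-wide

    if (i < n)
        out[i] = scale * tile[threadIdx.x];  // shared -> global
}

int main(void)
{
    const int n = 1024;
    float h_scale = 2.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));          // global memory allocations
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float)); // set the constant from the host
    scaleInput<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Local memory does not appear in the source: the compiler uses it implicitly when a thread's registers spill.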

Pointer rules (C/C++)

  • In host code, only pointers to host memory can be dereferenced.

  • In device code with UVA, both host and device pointers can be dereferenced. Without UVA, device and host address spaces are separate.

  • However, only host memory that was allocated with cudaMallocHost (or another CUDA function) and is therefore mapped into the device address space can be accessed from the device.

  • Conclusion: pointers passed to kernels as arguments should (and without UVA: must) point to device or managed memory.
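These rules can be made concrete with a short sketch (not from the lesson; `increment`, `h_p`, and `d_p` are illustrative names). The commented-out lines show the dereferences that are invalid on each side.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void increment(int *p) { *p += 1; }   // p must point to device-accessible memory

int main(void)
{
    int *h_p = (int *)malloc(sizeof(int));   // plain host pointer
    int *d_p;
    cudaMalloc(&d_p, sizeof(int));           // device pointer

    *h_p = 41;                               // OK: host code, host pointer
    // *d_p = 41;                            // WRONG: host code cannot dereference a device pointer

    cudaMemcpy(d_p, h_p, sizeof(int), cudaMemcpyHostToDevice);
    increment<<<1, 1>>>(d_p);                // OK: device pointer passed as a kernel argument
    // increment<<<1, 1>>>(h_p);             // WRONG: plain malloc'd memory is not mapped to the device

    cudaMemcpy(h_p, d_p, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_p);
    free(h_p);
    return 0;
}
```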

Synchronisation recap

Key synchronisation primitives covered in this module:

  • cudaDeviceSynchronize() — blocks the host until all preceding kernels and memory operations complete

  • __syncthreads() (C) / call syncthreads() (Fortran) — barrier for all threads in a thread block

  • cudaMemcpy — synchronous by default (acts as an implicit barrier)

  • __threadfence() — ensures all global/shared memory writes by the calling thread are visible to other threads

Rule of thumb: always call cudaDeviceSynchronize() before the host accesses data that was modified by a kernel.
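The rule of thumb matters most with managed memory, where no explicit cudaMemcpy separates the kernel from the host read. A minimal sketch (kernel and array names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void square(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= a[i];
}

int main(void)
{
    const int n = 256;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));  // managed memory: same pointer valid on host and device
    for (int i = 0; i < n; ++i) a[i] = (float)i;

    square<<<1, n>>>(a, n);        // kernel launch returns immediately (asynchronous)
    cudaDeviceSynchronize();       // without this, the host read below may see stale data

    printf("a[3] = %f\n", a[3]);   // safe only after the synchronisation
    cudaFree(a);
    return 0;
}
```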

Keypoints

  • Global memory is the default; shared memory provides fast intra-block communication

  • Constant memory uses a dedicated cache for small, read-only data (max 64 KiB)

  • Always synchronise (cudaDeviceSynchronize) before accessing GPU-modified data on the host

  • In C/C++ host code, only host pointers can be dereferenced; pointers passed to kernels should point to device or managed memory