Synchronisation and Summary

Objectives

  • Understand synchronisation between host and device

  • Know the characteristics and trade-offs of the different CUDA memory types

  • Understand pointer rules for host and device code in C/C++

Instructor note

  • 15 min teaching

  • 0 min exercises

Memory kinds summary

Memory            When to use                                       Size                Notes
Global memory     Always — default place to store data              4–384 GB            Accessible from all threads
Shared memory     User-defined cache, sharing data within a block   64–228 KiB per SM   Fast; shares space with L1 cache
Constant memory   Device-constant data, read by all threads         64 KiB              Part of global memory with special caching
Local memory      Implicitly used for register spills               –                   Part of global memory, private per thread
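The table above can be illustrated with a single kernel that touches each explicitly managed memory kind. This is a minimal sketch, not from the lesson; the kernel name, the constant `scale`, and the block size of 256 are illustrative choices.

```cuda
#include <cuda_runtime.h>

__constant__ float scale;          // constant memory: read-only in kernels, 64 KiB total

__global__ void scaleInput(const float *in, float *out, int n)
{
    __shared__ float tile[256];    // shared memory: one copy per thread block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i]; // global -> shared
    __syncthreads();               // writes to tile[] are now visible block-wide

    if (i < n)
        out[i] = scale * tile[threadIdx.x];  // shared -> global
}

int main(void)
{
    const int n = 1024;
    float h_scale = 2.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));          // global memory allocations
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float)); // set the constant from the host
    scaleInput<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Local memory does not appear in the source: the compiler uses it implicitly when a thread's registers spill.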

Pointer rules (C/C++)

  • In host code, only pointers to host memory can be dereferenced.

  • In device code with UVA, both host and device pointers can be dereferenced. Without UVA, device and host address spaces are separate.

  • However, only host memory that was allocated with cudaMallocHost (or another CUDA function) and is therefore mapped into the device address space can be accessed from the device.

  • Conclusion: pointers passed to kernels as arguments should (and without UVA: must) point to device or managed memory.
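These rules can be made concrete with a short sketch (not from the lesson; `increment`, `h_p`, and `d_p` are illustrative names). The commented-out lines show the dereferences that are invalid on each side.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void increment(int *p) { *p += 1; }   // p must point to device-accessible memory

int main(void)
{
    int *h_p = (int *)malloc(sizeof(int));   // plain host pointer
    int *d_p;
    cudaMalloc(&d_p, sizeof(int));           // device pointer

    *h_p = 41;                               // OK: host code, host pointer
    // *d_p = 41;                            // WRONG: host code cannot dereference a device pointer

    cudaMemcpy(d_p, h_p, sizeof(int), cudaMemcpyHostToDevice);
    increment<<<1, 1>>>(d_p);                // OK: device pointer passed as a kernel argument
    // increment<<<1, 1>>>(h_p);             // WRONG: plain malloc'd memory is not mapped to the device

    cudaMemcpy(h_p, d_p, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_p);
    free(h_p);
    return 0;
}
```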

Synchronisation recap

Key synchronisation primitives covered in this module:

  • cudaDeviceSynchronize() — blocks the host until all preceding kernels and memory operations complete

  • __syncthreads() (C) / call syncthreads() (Fortran) — barrier for all threads in a thread block

  • cudaMemcpy — synchronous by default (acts as an implicit barrier)

  • __threadfence() — ensures all global/shared memory writes by the calling thread are visible to other threads

Rule of thumb: always call cudaDeviceSynchronize() before the host accesses data that was modified by a kernel.
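The rule of thumb matters most with managed memory, where no explicit cudaMemcpy separates the kernel from the host read. A minimal sketch (kernel and array names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void square(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= a[i];
}

int main(void)
{
    const int n = 256;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));  // managed memory: same pointer valid on host and device
    for (int i = 0; i < n; ++i) a[i] = (float)i;

    square<<<1, n>>>(a, n);        // kernel launch returns immediately (asynchronous)
    cudaDeviceSynchronize();       // without this, the host read below may see stale data

    printf("a[3] = %f\n", a[3]);   // safe only after the synchronisation
    cudaFree(a);
    return 0;
}
```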

Keypoints

  • Global memory is the default; shared memory provides fast intra-block communication

  • Constant memory uses a dedicated cache for small, read-only data (max 64 KiB)

  • Always synchronise (cudaDeviceSynchronize) before accessing GPU-modified data on the host

  • In C/C++ host code, only host pointers can be dereferenced; pointers passed to kernels should point to device or managed memory