Synchronization and Summary¶
Objectives
Understand synchronisation between host and device
Know the characteristics and trade-offs of the different CUDA memory types
Understand pointer rules for host and device code in C/C++
Instructor note
15 min teaching
0 min exercises
Memory kinds summary¶
| Memory | When to use | Size | Notes |
|---|---|---|---|
| Global memory | Always; the default place to store data | 4–384 GB | Accessible from all threads |
| Shared memory | User-defined cache, sharing data within a block | 64–228 KiB per SM | Fast; shares space with L1 cache |
| Constant memory | Device-constant data, read by all threads | 64 KiB | Part of global memory with special caching |
| Local memory | Implicitly used for register spills | n/a | Part of global memory, private per thread |
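As a minimal sketch (the kernel and variable names are illustrative, not part of this lesson), the following CUDA C kernel touches three of the memory kinds above: it reads input from global memory, stages it in a per-block shared-memory tile, and scales it by a value held in constant memory. Register spills into local memory happen implicitly and are not shown.

```cuda
#include <cstdio>

__constant__ float scale;          // constant memory: read-only in kernels, cached

__global__ void scaleAndSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];    // shared memory: one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from global memory into the block's shared tile
    tile[threadIdx.x] = (i < n) ? in[i] * scale : 0.0f;
    __syncthreads();               // wait until the whole tile is filled

    // Thread 0 of each block writes the block's sum back to global memory
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; ++t)
            sum += tile[t];
        out[blockIdx.x] = sum;
    }
}
```

On the host side, `scale` would be set with `cudaMemcpyToSymbol` before launching the kernel; `in` and `out` must point to device (or managed) memory.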
Pointer rules (C/C++)¶
- In host code, only pointers to host memory can be dereferenced.
- In device code with unified virtual addressing (UVA), both host and device pointers can be dereferenced; without UVA, the host and device address spaces are separate. However, only memory regions allocated with `cudaMallocHost()` (or other CUDA functions) and mapped into the device address space can be accessed from the device.
- Conclusion: pointers passed to kernels as arguments should (and without UVA: must) point to device or managed memory.
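A minimal sketch of these rules (the `increment` kernel is illustrative): the host allocates device memory with `cudaMalloc`, copies data over, and passes only the device pointer to the kernel. Passing the address of a plain host variable would be invalid without UVA or mapped memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: dereferences a pointer, so it must receive a device pointer
__global__ void increment(int *d_value) { *d_value += 1; }

int main()
{
    int h_value = 41;          // host memory: dereferenceable only on the host
    int *d_value = nullptr;

    cudaMalloc(&d_value, sizeof(int));                      // device memory
    cudaMemcpy(d_value, &h_value, sizeof(int), cudaMemcpyHostToDevice);

    increment<<<1, 1>>>(d_value);     // OK: kernel gets a device pointer
    // increment<<<1, 1>>>(&h_value); // WRONG without UVA/mapped memory

    cudaMemcpy(&h_value, d_value, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_value);
    cudaFree(d_value);
    return 0;
}
```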
Synchronisation recap¶
Key synchronisation primitives covered in this module:
- `cudaDeviceSynchronize()`: blocks the host until all preceding kernels and memory operations complete
- `__syncthreads()` (C) / `call syncthreads()` (Fortran): barrier for all threads in a thread block
- `cudaMemcpy`: synchronous by default (acts as an implicit barrier)
- `__threadfence()`: ensures all global/shared memory writes by the calling thread are visible to other threads
Rule of thumb: always call `cudaDeviceSynchronize()` before the host accesses data that was modified by a kernel.
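The rule of thumb can be sketched as follows (a hypothetical `fill` kernel, using managed memory so both host and device can touch the same allocation). Because kernel launches are asynchronous, the barrier is what guarantees the host reads finished results.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float *out, float value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // visible to host and device

    fill<<<(n + 255) / 256, 256>>>(data, 3.0f, n); // launch is asynchronous

    // Without this barrier, the host might read data before the kernel finishes
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```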
Keypoints
- Global memory is the default; shared memory provides fast intra-block communication
- Constant memory uses a dedicated cache for small, read-only data (max 64 KiB)
- Always synchronise (`cudaDeviceSynchronize()`) before accessing GPU-modified data on the host
- In C/C++ host code, only host pointers can be dereferenced; pointers passed to kernels should point to device or managed memory