Wavefronts and Workgroups


Programming Guide:

Wavefronts and work-groups are two concepts relating to compute kernels that provide
data-parallel granularity. A wavefront executes a number of work-items in lock
step relative to each other. Sixteen work-items are executed in parallel across the
vector unit, and the whole wavefront is covered over four clock cycles. It is the
lowest level at which flow control can affect execution. This means that if two work-items
inside a wavefront take divergent paths of flow control, all work-items in the wavefront
execute both paths of flow control.
Grouping is a higher-level granularity of data parallelism that is enforced in
software, not hardware. Synchronization points in a kernel guarantee that all
work-items in a work-group reach that point (barrier) in the code before the next
statement is executed.
Work-groups are composed of wavefronts. Best performance is attained when
the group size is an integer multiple of the wavefront size.
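
As a minimal sketch of such a synchronization point (a hypothetical work-group
reduction; the kernel name and the local scratch buffer are assumptions, not
taken from the guide):

__kernel void local_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    /* Every work-item in the work-group must reach this barrier
       before any work-item executes the statement after it. */
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;
    }
}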

The GPU consists of multiple compute units. Each compute unit contains local
(on-chip) memory, L1 cache, registers, and 16 processing elements (PEs).
Individual work-items execute on a single processing element; one or more work-
groups execute on a single compute unit. On a GPU, hardware schedules groups
of work-items, called wavefronts, onto compute units; thus, work-items within a
wavefront execute in lock-step; the same instruction is executed on different
data. 

Generally, it is not a good idea to make the work-group size something other than an integer multiple
of the wavefront size, but that is usually less important than avoiding channel conflicts.

Hardware acceleration also takes place when all work-items in a wavefront
reference the same constant address. In this case, the data is loaded from
memory one time, stored in the L1 cache, and then broadcast to all wavefronts.
This can significantly reduce the required memory bandwidth.
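
A sketch of a kernel that benefits from this broadcast path (the names here are
hypothetical): every work-item in the wavefront reads the same constant
address, so the value is fetched only once:

__kernel void scale_all(__global float *data,
                        __constant float *coeff)
{
    size_t gid = get_global_id(0);
    /* All work-items read coeff[0]: loaded from memory once,
       held in L1, and broadcast across the wavefront. */
    data[gid] *= coeff[0];
}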


The fundamental unit of work on AMD GPUs is called a wavefront. Each
wavefront consists of 64 work-items; thus, the optimal local work size is an
integer multiple of 64 (specifically 64, 128, 192, or 256) work-items per work-
group.


5.6.3.3  Work-Group Dimensions vs Size
The local NDRange can contain up to three dimensions, here labeled X, Y, and
Z. The X dimension is returned by get_local_id(0), Y is returned by
get_local_id(1), and Z is returned by get_local_id(2). The GPU hardware
schedules the kernels so that the X dimension moves fastest as the work-items
are packed into wavefronts. For example, the 128 threads in a 2D work-group of
dimension 32x4 (X=32 and Y=4) would be packed into two wavefronts as follows
(notation shown in X,Y order):
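
The guide's packing table is not reproduced in this excerpt; reconstructed from
the rule above (X moves fastest, 64 work-items per wavefront), it would be:

Wavefront 0: (0,0) (1,0) ... (31,0) (0,1) (1,1) ... (31,1)
Wavefront 1: (0,2) (1,2) ... (31,2) (0,3) (1,3) ... (31,3)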


5.6.4 Summary of NDRange Optimizations

• Select the work-group size to be a multiple of 64, so that the wavefronts are
fully populated.
• Use a work-group size of 64, and schedule four work-groups per compute
unit.
• Latency hiding depends on both the number of wavefronts/compute unit, as
well as the execution time for each kernel. Generally, two to eight
wavefronts/compute unit is desirable, but this can vary significantly,
depending on the complexity of the kernel and the available memory
bandwidth. The AMD APP Profiler and associated performance counters can
help to select an optimal value.


The global_work_size parameter specifies the number of work-items in each
dimension of the NDRange, and local_work_size specifies the number of work-items
in each dimension of the work-groups.
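
As a sketch of a launch matching the 32x4 example above (queue and kernel are
assumed to have been created earlier; error checking omitted):

size_t global_work_size[2] = {1024, 1024};  /* work-items per dimension of the NDRange */
size_t local_work_size[2]  = {32, 4};       /* 128 work-items per group = 2 wavefronts */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       global_work_size, local_work_size,
                       0, NULL, NULL);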


OpenCL Chinese tutorial:

Accesses to the LDS proceed in parallel in groups of 32 threads;
if the current wave size is 64, an LDS access operation is therefore split into two passes.

On the current Evergreen GPU architecture, all threads are scheduled in groups of 64 (one wave),
and once the threads within a wave hit a control-flow branch, the different branch paths are
executed serially. Taking a simple if-else branch as an example: only when the if condition
holds for all 64 threads (or for none of them) is just one branch path executed; conversely,
if the condition holds for some threads and not for others, both the if and the else branches
are executed serially, lowering the execution efficiency of the code inside the branch.

For control-flow branches that cannot be avoided, try to make their "branch width"
no smaller than the number of threads in the current wave.

On graphics accelerators, every 64 threads are grouped together in hardware; such a group
is called a wave. The total running time of a group of threads is determined by its
longest-running thread, so an uneven workload forces the threads that finish early to wait
for the few most heavily loaded ones. Workload balancing should therefore be considered at
wave granularity: try to give the threads within the same wave the same amount of work.
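
A hypothetical kernel sketch contrasting the two cases (assumes a 64-thread wave):

__kernel void branch_width_demo(__global const float *in,
                                __global float *out)
{
    size_t gid = get_global_id(0);
    /* Divergent within a wave: (gid % 2) differs between neighboring
       work-items, so both paths would run serially in every wave:
       if (gid % 2 == 0) { ... } else { ... }                        */

    /* Wave-uniform: (gid / 64) is constant across each group of 64
       consecutive work-items, so each wave takes exactly one path.  */
    if ((gid / 64) % 2 == 0)
        out[gid] = in[gid] * 2.0f;
    else
        out[gid] = in[gid] + 1.0f;
}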


Heterogeneous Computing:

Through the OpenCL API, we can query the number of compute units on the
device at runtime:
clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ... );
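
Spelled out, the call might look like the following (a sketch assuming `device`
was obtained earlier via clGetDeviceIDs):

#include <stdio.h>
#include <CL/cl.h>

void print_compute_units(cl_device_id device)
{
    cl_uint num_cus = 0;
    /* CL_DEVICE_MAX_COMPUTE_UNITS returns a cl_uint. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, NULL);
    printf("Compute units: %u\n", num_cus);
}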


On AMD GPUs, the hardware scheduling unit is a wavefront: a logical vector of
32 or 64 work-items that runs on a hardware SIMD unit. OpenCL workgroups
execute as sets of wavefronts.

A single wavefront per compute unit is inadequate to fully utilize the machine.
The hardware schedules two wavefronts at a time interleaved through the ALUs of
the SIMD unit, and it schedules additional wavefronts to perform memory
transactions. An absolute minimum to satisfy the hardware and fill instruction
slots on an AMD GPU is then three wavefronts per compute unit, or 192 work-items
on high-end APU GPUs. However, to enable the GPU to perform efficient hiding of
global memory latency in fetch-bound kernels, at least seven wavefronts are
required per SIMD.

We cannot exceed the maximum number of work-items that the device supports
in a workgroup, which is 256 on current AMD GPUs. At a minimum, we need 64
work-items per group and at least three groups per compute unit. At a maximum,
we have 256 work-items per group, and the number of groups that fit on the
compute unit will be resource limited. There is little benefit to launching a
very large number of work-items. Once enough work-items to fill the machine and
cover memory latency have been dispatched, nothing is gained from scheduling
additional work-items, and there may be additional overhead incurred. Within
this range, it is necessary to fine-tune the implementation to find the optimal
performance point.
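
Putting those numbers together for a 64-wide wavefront:

3 wavefronts x 64 work-items = 192 work-items per compute unit (bare minimum)
7 wavefronts x 64 work-items = 448 work-items per SIMD (to hide fetch latency)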


My understanding: a workgroup is the set of work-items that execute on the same CU, and it is scheduled by wave; in other words, one workgroup contains n wavefronts, and one wavefront (the unit actually executing) contains 32 or 64 work-items.

原创粉丝点击