Maximise Utilisation and Occupancy#

Utilisation is the efficiency of execution for each task, in terms of how many ALU Pipes are being used at any given time. If a given task does not have a full 32 threads performing useful work, then available processing power is being wasted and utilisation is low. Developers should always strive to keep the hardware busy.

Occupancy refers to the number of tasks that are queued and ready for execution in a USC. The FGS can seamlessly swap to another task if one is already resident, hiding any latency that would otherwise be caused by stalls in the currently running task. The Execution section gives more details on how the hardware manages tasks and threads.

Work on Large Datasets#

The most important part of ensuring high utilisation and occupancy is to ensure there are enough data points to make best use of the hardware. Datasets larger than about 512 items per USC on the device typically provide enough work to maintain high utilisation and occupancy, with larger numbers of items increasing efficiency further.

Aim for a Dataset Size that is a Multiple of 32#

A dataset with a size that is a multiple of 32 offers the best opportunity for full utilisation. For sufficiently large datasets, any difference in utilisation from padding should be negligible. However, care needs to be taken when padding to ensure that either the filler data does not affect the final result, or that the kernel explicitly guards against processing the padded items. For OpenCL, a kernel’s global work size can be set to a padded size, whilst the actual dataset is kept at its original size. In this case, the kernel must be careful to avoid out-of-bounds memory accesses, as shown in the sketch below.
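As an illustration, the following OpenCL sketch guards a kernel against a padded global work size. The kernel name, arguments, and the simple scale operation are hypothetical; `count` holds the original, unpadded dataset size.

```c
// Hypothetical kernel: the real dataset has `count` items, but the
// global work size has been rounded up to a multiple of 32 on the
// host, e.g. global_size = ((count + 31) / 32) * 32.
__kernel void scale_array(__global float* data,
                          const float factor,
                          const uint count)
{
    uint gid = get_global_id(0);

    // Work-items in the padded range do nothing, avoiding
    // out-of-bounds accesses.
    if (gid >= count)
        return;

    data[gid] *= factor;
}
```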

Choose an Appropriate Work-Group Size#

The number of ALU Pipes active in any given cycle depends on the number of threads assigned to each task. A task consists of 32 threads, and several factors affect whether all of them perform useful work. In OpenCL and OpenGL ES Compute, a developer has precise control over this through the choice of work-group size, so it should be chosen carefully.

Aim for a Work-Group Size of 32#

A work-group size of 32 is ideal for execution on PowerVR hardware, as this exactly matches the size of a task. This is not always possible, depending on the size of the total dataset being worked on and the local problem space. Suggested work-group sizes are listed below, in order of decreasing efficiency:

  • 32;

  • Multiples of 32, up to and including 512 - which is practically ideal;

  • 16, eight, or four, specified at compile time - which is only recommended if no alternative exists;

  • Any other size, though this should be avoided - several threads in each task will be idle;

  • Two, though again, this is best avoided - half of the threads will be idle.

For a discussion on exactly how the compiler handles work-group sizes, see the section concerning execution.
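To make this concrete, here is a minimal host-side sketch that enqueues a one-dimensional kernel with a work-group size of 32; the function name is illustrative, and an already-created command queue and kernel are assumed.

```c
#include <CL/cl.h>

/* Enqueues an already-created 1D kernel with a work-group size of 32,
 * padding the global size up to the next multiple of 32. */
static cl_int enqueue_with_wg32(cl_command_queue queue, cl_kernel kernel,
                                size_t num_items)
{
    size_t local_size  = 32;
    size_t global_size = ((num_items + local_size - 1) / local_size)
                         * local_size;

    return clEnqueueNDRangeKernel(queue, kernel,
                                  1,             /* one dimension */
                                  NULL,          /* no global offset */
                                  &global_size,  /* padded global size */
                                  &local_size,   /* work-group size of 32 */
                                  0, NULL, NULL);
}
```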

Tell the Compiler the Work-Group Size#

Applications should always tell the compiler how large a work-group a kernel expects to work on.

For OpenGL and OpenGL ES, this means declaring layout(local_size_x = X, local_size_y = Y, local_size_z = Z) in; in the compute shader.

For OpenCL, this means using __attribute__((reqd_work_group_size(X, Y, Z))) on the kernel.
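For example, a one-dimensional OpenCL kernel intended to run with work-groups of 32 threads might be declared as follows; the kernel body is purely illustrative.

```c
__attribute__((reqd_work_group_size(32, 1, 1)))
__kernel void process(__global const float* in,
                      __global float* out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;
}
```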

Note

This is an important step to take to ensure best performance, especially when local memory is used alongside barriers.

By knowing the size, the compiler can better allocate on-chip resources, such as shared memory, and perform targeted optimisations at compile time. Deferring the choice until the kernel is enqueued or letting the runtime choose misses out on these opportunities.

In the case where a kernel is intended for use with multiple different work-group sizes, compiling the kernel multiple times with different work-group sizes is usually a much better option than deferring the choice to runtime. The performance gains can be quite significant in some cases and are usually worth the additional application complexity.
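One possible approach, sketched below, is to bake the work-group size into the kernel source with a preprocessor define (for example, `__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))`) and build one program per size. The helper name, `WG_SIZE`, and the two chosen sizes are assumptions for illustration.

```c
#include <CL/cl.h>

/* Builds one program variant per work-group size by baking the size
 * in with a preprocessor define; `src` is kernel source that uses
 * WG_SIZE in its reqd_work_group_size attribute. */
static cl_program build_variant(cl_context ctx, cl_device_id device,
                                const char* src, const char* options)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;
    if (clBuildProgram(prog, 1, &device, options, NULL, NULL) != CL_SUCCESS)
        return NULL;
    return prog;
}

/* Usage (illustrative):
 *   cl_program prog32 = build_variant(ctx, dev, src, "-DWG_SIZE=32");
 *   cl_program prog64 = build_variant(ctx, dev, src, "-DWG_SIZE=64");
 * At enqueue time, select the kernel built from whichever program
 * matches the work-group size actually being used. */
```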

For work-group sizes of 16, eight, or four where barrier synchronisation is used, full utilisation can only be achieved if this attribute is specified.

Reduce Unified Store Usage (Private Memory)#

The USC’s Unified Stores are shared between all work-items that are resident on a USC at any given time, and are used to store any temporary variables and OpenCL private memory allocations in the kernel.

The compiler prioritises high occupancy over high utilisation, and will reduce the utilisation of each task based only on private memory usage. Having at least 512 threads resident per USC provides the best opportunity to hide any data fetch latency whilst maximising utilisation. This equates to a maximum of 40 scalar values used by the kernel.

If a kernel uses less private memory than this, more tasks can be made resident on most cores. The compiler will attempt to optimise unified store private memory use where possible, reducing the final amount. Under NDA, Imagination can provide a compiler which gives exact values for private register usage.
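Without that compiler, a rough indication can still be obtained through the standard OpenCL query sketched below; the reported value is implementation-defined, so treat it as a guide only.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Prints the private memory usage of an already-built kernel on a
 * given device, using the standard CL_KERNEL_PRIVATE_MEM_SIZE query. */
static void print_private_mem(cl_kernel kernel, cl_device_id device)
{
    cl_ulong private_bytes = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_bytes), &private_bytes, NULL);
    printf("Kernel private memory: %llu bytes\n",
           (unsigned long long)private_bytes);
}
```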

Reduce Common Store Usage (Shared/Local Memory)#

Shared (OpenCL local) memory is extremely useful as it can act as a software-controlled caching mechanism, reducing bandwidth requirements in an application, as discussed in Group Memory Access Together. Many optimisations start by caching data into this store.

However, there are limits to how much can be used before occupancy is affected. As with the Unified Stores, resident work-groups share the local memory available on a USC. The more work-groups that can fit within the maximum available local memory, the greater the opportunity the core has to hide latencies in kernel execution.
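The sketch below shows the common staging pattern: each work-group copies a tile of global memory into local memory once, synchronises, and then every work-item reads its neighbours from the tile instead of from global memory. The kernel name, the neighbour-sum operation, and the edge handling are all illustrative.

```c
#define WG_SIZE 32

__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
__kernel void sum_neighbours(__global const float* in,
                             __global float* out,
                             const uint count)
{
    __local float tile[WG_SIZE];

    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    /* Stage one element per work-item into the shared tile. */
    tile[lid] = (gid < count) ? in[gid] : 0.0f;

    /* Every work-item must reach the barrier before any may pass it. */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid >= count)
        return;

    /* Neighbours come from local memory; tile edges simply reuse the
     * centre element in this simplified example. */
    float left  = (lid > 0)           ? tile[lid - 1] : tile[lid];
    float right = (lid < WG_SIZE - 1) ? tile[lid + 1] : tile[lid];
    out[gid] = left + tile[lid] + right;
}
```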

Be Aware of Padding#

To improve data access speed, all data types smaller than 32 bits are padded to 32 bits in both the Unified and Common Stores. This means that a char or a short takes up 32 bits of register space.

Note

When querying the amount of local memory available to an OpenCL device, the number returned is how many 32-bit values can be stored, which applies equally to float, int, or char values.
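As a sketch, the query itself is the standard one below; the function name is illustrative.

```c
#include <CL/cl.h>

/* Queries a device's available local memory. Per the note above, every
 * value stored there (float, int, or char alike) occupies a full
 * 32-bit slot of this capacity on PowerVR hardware. */
static cl_ulong query_local_mem(cl_device_id device)
{
    cl_ulong local_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    return local_mem;
}
```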

For some kernels, this padding means that register usage may exceed the limits for best performance discussed previously, where it would not if the values were left unpadded. In these cases, it may be useful to manually pack or unpack values into one or more 32-bit types, so that values are stored as smaller types for their lifetime and only expanded when needed. There is a computation cost associated with packing and unpacking, so care should be taken.
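As a sketch of manual packing, the helpers below keep four 8-bit values resident in a single uint and expand them only at the point of use; the function names are hypothetical.

```c
/* Extract the 8-bit value at `index` (0-3) from a packed uint. */
uchar unpack_byte(uint packed, uint index)
{
    return (uchar)((packed >> (index * 8u)) & 0xFFu);
}

/* Return `packed` with the 8-bit slot at `index` replaced by `value`. */
uint pack_byte(uint packed, uint index, uchar value)
{
    uint shift = index * 8u;
    return (packed & ~(0xFFu << shift)) | ((uint)value << shift);
}
```

Each call costs a shift and a mask, so this trade-off is only worthwhile when the register savings actually improve occupancy.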