Execute Tasks on the Most Appropriate Processor

If the hardware platform includes both a CPU and Graphics Core, an application should be structured so that it exposes both serial and parallel tasks. Serial tasks are most efficiently executed on a CPU, whereas parallel tasks are good candidates for executing on a Graphics Core.

To maximise system performance, tasks should run on the CPU and Graphics Core at the same time: for example, the CPU can prepare the next batch of data while the Graphics Core processes the current batch. The Graphics Core is often the more power-efficient processor when an algorithm can be expressed as a highly parallel task. If the aim is to reduce power consumption, it is therefore worth considering whether an algorithm can be expressed as a parallel task for execution on the Graphics Core, letting the CPU idle during this time.
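This overlap can be sketched with OpenCL-style double buffering. The sketch below assumes a command queue, kernel, and two device buffers (`bufA`/`bufB`) already exist; `prepare_next_batch_on_cpu` and the surrounding variable names are illustrative, not part of the OpenCL API.

```
/* Sketch: overlap CPU preparation with Graphics Core execution (OpenCL).
 * queue, kernel, bufA, bufB, global_size, bytes are assumed to be set up. */
for (int batch = 0; batch < num_batches; ++batch) {
    cl_mem current = (batch % 2 == 0) ? bufA : bufB;
    cl_mem next    = (batch % 2 == 0) ? bufB : bufA;

    /* Non-blocking: the Graphics Core starts on the current batch. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &current);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
    clFlush(queue);

    /* Meanwhile the CPU prepares the next batch into the other buffer. */
    prepare_next_batch_on_cpu(next_host_data);
    clEnqueueWriteBuffer(queue, next, CL_FALSE, 0, bytes, next_host_data,
                         0, NULL, NULL);
}
clFinish(queue);
```

With an in-order queue the enqueued write is ordered after the kernel on the device, while the CPU-side preparation runs concurrently with the kernel, which is the overlap this section describes.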

Choose a Suitable Device#

Choosing a device varies depending on the API. The developer should divide the application's work into tasks suited to sequential execution, which belong on the CPU, and tasks suited to parallel execution, which belong on the Graphics Core. Tasks for the CPU can be implemented in any language supported by the platform, such as C++ or Java.

In OpenGL and OpenGL ES, tasks for the Graphics Core are normally dispatched as Compute Shaders. In some cases suitable alternatives exist and can be successfully implemented: for example, using fragment shaders for image processing when rendering directly to the screen, or implementing time-evolving physical systems with vertex shaders and Transform Feedback. This document generally assumes compute shaders, as they are by far the most flexible of the three and have the widest range of applications.
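A minimal compute shader dispatch might look as follows. This sketch assumes `program` is a linked program built from the embedded GLSL source, `ssbo` is a buffer object holding `num_elements` floats, and `num_elements` is a multiple of the work-group size; compilation and error-handling boilerplate is omitted.

```
/* Sketch: dispatching a compute shader (OpenGL ES 3.1 / OpenGL 4.3+). */
const char *src =
    "#version 310 es\n"
    "layout(local_size_x = 32) in;\n"
    "layout(std430, binding = 0) buffer Data { float v[]; };\n"
    "void main() {\n"
    "    uint i = gl_GlobalInvocationID.x;\n"
    "    v[i] = v[i] * 2.0;\n"
    "}\n";

glUseProgram(program);                                  /* built from src */
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
glDispatchCompute(num_elements / 32, 1, 1);             /* one thread per element */
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);         /* make writes visible */
```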

For OpenCL, the Installable Client Driver (ICD) enables multiple OpenCL drivers to coexist on the same system. If a hardware platform contains OpenCL drivers for other devices, such as a CPU or DSP, an application can select between these devices at run-time.
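Run-time selection can be sketched by enumerating the installed platforms and preferring a GPU device, falling back to a CPU device if no GPU driver is present. The helper name `pick_device` is illustrative.

```
/* Sketch: selecting a device at run-time through the OpenCL ICD. */
#include <CL/cl.h>
#include <stddef.h>

cl_device_id pick_device(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    cl_device_id device = NULL;

    /* Prefer a GPU device on any installed platform. */
    for (cl_uint i = 0; i < num_platforms; ++i)
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU,
                           1, &device, NULL) == CL_SUCCESS)
            return device;

    /* Otherwise fall back to a CPU device, if one is available. */
    for (cl_uint i = 0; i < num_platforms; ++i)
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_CPU,
                           1, &device, NULL) == CL_SUCCESS)
            return device;

    return NULL;
}
```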

RenderScript dynamically chooses whether a task can run on CPU or Graphics Core at runtime, depending on the workloads of each system and the suitability of the kernel for execution on each. Thus, a developer cannot explicitly choose Graphics Core execution, but by using RenderScript and noting any device restrictions, they give RenderScript the choice to do so.

Identifying and Creating Work for the Graphics Core

The USCs rely on hardware multithreading to maximise use of their ALUs. As described in the Architecture Overview, a USC can execute instructions from a pool of resident threads, switching between them at zero cost.

Keep the USCs Busy

The most common reason a thread is not ready to execute is that its next instruction is a pending memory access, which can incur hundreds of cycles of latency. If other threads are ready to execute throughout this period, the memory latency is completely hidden, keeping the USC fully utilised and effectively eliminating the cost of the memory transfer.

If all threads are waiting for memory accesses, the USC will idle, reducing its efficiency. In other words, the application is bandwidth limited. This usually happens if the ratio of arithmetic to memory instructions in a kernel is low or there are not enough threads available to hide the latency. Other system operations could also be reducing the bandwidth available.

If an application proves to be bandwidth limited, it is worth trying to reduce the number of memory fetches in each kernel, or favouring run-time calculations over lookup tables. Providing more work items can also help hide the latency. Tools such as PVRTune can help detect when this happens.

Avoid Overly Short Kernels with Small Datasets

For a given kernel execution, the first few threads often run with reduced efficiency due to the time taken to allocate work to the USCs. The last few threads also often run with reduced efficiency, as the pool of resident tasks empties. The more work that is done per execution, the lower the relative cost of this overhead. This can be achieved by:

  • Using somewhat longer kernels, which can hide the latency of scheduling and descheduling;

  • Working on larger datasets, which create more threads per execution that can all be scheduled together. Typically, this should be on the order of several thousand to millions of data points.
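Sizing a dispatch for a large dataset usually comes down to rounding the work-group count up so every data point is covered; the kernel must then guard against out-of-range indices. The helper name below is illustrative.

```c
/* Sketch: number of work-groups needed to cover num_items data points
 * with the given work-group size, rounding up (ceiling division). */
unsigned num_groups(unsigned num_items, unsigned group_size)
{
    return (num_items + group_size - 1) / group_size;
}
```

For example, covering one million data points with a work-group size of 32 yields 31,250 work-groups, comfortably enough threads to keep the USCs' resident pools full.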

If an algorithm has a small kernel length or works on a small dataset, it may be difficult to achieve high USC efficiency. In this case, the algorithm may be more suited for execution on a CPU.

Sharing the Graphics Core between Compute and Graphics Tasks

A long-running kernel can starve system components that rely on 3D graphics processing, such as a user interface. In this situation, the long-running kernel can be split into multiple shorter kernels executed in sequence. This typically requires the first kernel to write intermediate data to global memory and the second to read it back, which introduces additional bandwidth usage.
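The two-pass pattern can be sketched in plain C, with each function standing in for one of the shorter kernels; the `intermediate` buffer plays the role of global memory between the passes, and the squaring/offset work is purely illustrative.

```c
#include <stddef.h>

/* Pass 1 does the first half of the work and writes its results to the
 * intermediate (global-memory) buffer. */
void pass1(const float *in, float *intermediate, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        intermediate[i] = in[i] * in[i];
}

/* Pass 2 reads the intermediate results back and finishes the work.
 * Between the two passes, the driver can schedule pending graphics work. */
void pass2(const float *intermediate, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = intermediate[i] + 1.0f;
}
```

The extra write and read of `intermediate` is the bandwidth cost mentioned above; whether it is worth paying depends on how badly the graphics workload is being starved.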