Identifying and Creating Work for the GPU Core
The USCs rely on hardware multithreading to maximise use of their ALUs. As described in the Architecture Overview, a USC can execute instructions from a pool of resident threads, switching between them at zero cost.
Keep the USCs Busy
The most common reason a thread is not ready to execute is that it is waiting on a memory access, which can take hundreds of cycles to complete. If other threads are ready to execute throughout this period, the memory latency is completely hidden: the USC stays fully utilised and the memory transfer effectively costs nothing.
If all threads are waiting on memory accesses, the USC idles and its efficiency drops. In this situation, the application is bandwidth limited. This usually happens when the ratio of arithmetic to memory instructions in a kernel is low, or when there are not enough threads available to hide the latency. Other activity in the system may also be reducing the available bandwidth.
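The relationship between thread count, arithmetic intensity and memory latency can be sketched with a deliberately simplified model. It assumes every thread repeats the same pattern (a burst of arithmetic followed by one memory access) and that switching threads is free; the function names and cycle counts are hypothetical and do not describe the actual USC scheduler.

```c
#include <assert.h>

/* Illustrative model only (not the real USC scheduler).
 * Each thread alternates between `alu_cycles` of arithmetic and one memory
 * access that takes `mem_latency` cycles. With `threads` resident threads
 * and zero-cost switching, the ALU is busy for this fraction of the time,
 * capped at 1.0 (fully utilised). */
static double usc_utilisation(unsigned threads, double alu_cycles,
                              double mem_latency)
{
    assert(alu_cycles > 0.0 && mem_latency >= 0.0);
    double u = threads * alu_cycles / (alu_cycles + mem_latency);
    return u > 1.0 ? 1.0 : u;
}

/* Smallest number of resident threads that fully hides the memory latency
 * in this model. */
static unsigned threads_to_hide_latency(double alu_cycles, double mem_latency)
{
    unsigned n = 1;
    while (usc_utilisation(n, alu_cycles, mem_latency) < 1.0)
        n++;
    return n;
}
```

In this model, a kernel with 20 arithmetic cycles per 200-cycle memory access needs 11 resident threads to keep the ALU busy, while one with only 4 arithmetic cycles per access needs 51. This is why a low arithmetic-to-memory ratio and a small thread pool both lead to idle time.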
If an application proves to be bandwidth limited, try reducing the number of memory fetches in each kernel, or favour run-time calculations over lookup tables. Providing more work items can also help to mask the latency. Tools such as PVRTune can help to detect when this is happening.
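As an illustration of trading a lookup table for run-time arithmetic, the following sketch shows the same curve evaluated both ways. It is written in plain C as a stand-in for kernel code; the choice of curve (a smoothstep polynomial), the table size and all function names are assumptions made for the example. Whether the calculated version is actually faster depends on the kernel and should be confirmed by profiling.

```c
#include <assert.h>

#define LUT_SIZE 256

/* Hypothetical precomputed table; in a kernel this would live in memory
 * and cost one extra fetch per work item. */
static float smoothstep_lut[LUT_SIZE];

/* Run-time calculation: a few multiplies and a subtract, no memory fetch. */
static float smoothstep_calc(float x)
{
    return x * x * (3.0f - 2.0f * x);
}

/* Fill the table once, before any lookups. */
static void build_lut(void)
{
    for (int i = 0; i < LUT_SIZE; ++i)
        smoothstep_lut[i] = smoothstep_calc((float)i / (LUT_SIZE - 1));
}

/* Lookup-table variant: quantises x in [0, 1] to the nearest table entry. */
static float smoothstep_lookup(float x)
{
    assert(x >= 0.0f && x <= 1.0f);
    int i = (int)(x * (LUT_SIZE - 1) + 0.5f);
    return smoothstep_lut[i];
}
```

Both functions return closely matching values, but only the lookup variant adds to the kernel's memory traffic. In a bandwidth-limited kernel the extra arithmetic of `smoothstep_calc` may run on ALU cycles that would otherwise be spent idle.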
Avoid Overly Short Kernels with Small Datasets
For a given kernel execution, the first few threads often run with reduced efficiency because of the time taken to allocate work to the USCs. The last few threads also often run with reduced efficiency, as the pool of resident tasks has emptied. The more work that is done per execution, the smaller this cost is as a proportion of the total. This can be achieved by:
Using somewhat longer kernels, which can mask the latency of scheduling and descheduling;
Working on larger datasets, which create more threads per execution that can all be scheduled together. Typically, this should be on the order of several thousand to millions of data points.
If an algorithm has a short kernel or works on a small dataset, it may be difficult to achieve high USC efficiency. In this case, the algorithm may be better suited to execution on a CPU.
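The effect of kernel length and dataset size on this overhead can be sketched with a simple amortisation model. It assumes a fixed ramp-up and ramp-down cost per kernel execution and treats everything else as useful work; the function name and the cycle counts in the usage note are hypothetical, chosen only to show the trend.

```c
#include <assert.h>

/* Illustrative model only: fraction of an execution spent on useful work
 * when a fixed overhead (allocating work at the start, draining the pool
 * of resident tasks at the end) is paid once per kernel execution. */
static double execution_efficiency(double data_points,
                                   double cycles_per_point,
                                   double overhead_cycles)
{
    assert(data_points > 0.0 && cycles_per_point > 0.0);
    assert(overhead_cycles >= 0.0);
    double useful = data_points * cycles_per_point;
    return useful / (useful + overhead_cycles);
}
```

With an assumed overhead of 10,000 cycles, 1,000 data points at 10 cycles each give an efficiency of 0.5 in this model. Making the kernel ten times longer raises that to about 0.91, and processing a million data points instead raises it to about 0.999.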