Optimising Register Pressure on PowerVR

Like any other GPU architecture, PowerVR has a limited amount of register space to use. Failure to stay within the bounds of available register space could cause register spilling, which may result in sub-optimal performance.

There are several ways to reduce register pressure:

The number of registers in use at any moment equals the number of values a shader invocation needs to “remember” at that moment. Register pressure can therefore be reduced by keeping global variable usage to a minimum, and by keeping local variables in scope for as short a time as possible.
Vector processors can be efficient because multiple values can be operated on at the same time, and in computer graphics it is common for operations to be performed on four-component vectors. However, when fewer than four components are used, the spare processing capacity is wasted, which affects power consumption. Scalar processors are more flexible, can therefore tackle general-purpose processing, and are especially useful in compute-related applications. The PowerVR Rogue and Volcanic architectures have scalar ALUs, so components that are never used only waste register space. If the last components are not needed, it is therefore better to work with two- or three-component vectors rather than four-component ones. This also applies to matrix operations: for example, when transforming a three-component vector, it is better to use a 3x3 matrix to save register space. The same applies to a four-component vector whose w component is 0, such as a direction vector, because the translation part of the matrix then has no effect. Finally, affine transformations (rotate, scale, translate) usually only require a 4x3 matrix, as the omitted row or column (depending on the matrix convention) is always (0,0,0,1).
The PowerVR Rogue and Volcanic architectures are particularly good at handling FP16 operations, as this often results in twice as many instructions being executed in a single cycle. FP16 math also has the advantage that two values can be packed into a single 32-bit (FP32) register. FP16 is therefore not only faster, but also results in less register pressure.
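
The matrix and FP16 points above can be sketched on the CPU. The C++ below is an illustration only, not shader code and not part of any PowerVR API: the type names, the row-major layout, and the `registers_fp32` / `registers_fp16` helpers are all assumptions made for this example. The register counts use a deliberately simplified model (one 32-bit register per FP32 scalar, two FP16 scalars per 32-bit register); a real shader compiler may allocate registers differently.

```cpp
#include <array>

// Hypothetical helper types for illustration only.
using Vec3   = std::array<float, 3>;
using Vec4   = std::array<float, 4>;
using Mat4   = std::array<float, 16>;  // row-major, 4 rows x 4 columns: 16 scalars
using Affine = std::array<float, 12>;  // row-major, 3 rows x 4 columns: 12 scalars

// Full 4x4 transform of a four-component vector.
Vec4 transform_full(const Mat4& m, const Vec4& v) {
    Vec4 r{};  // value-initialised to zeros
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            r[row] += m[row * 4 + col] * v[col];
        }
    }
    return r;
}

// Affine transform of a point: w is implicitly 1 and the constant
// (0, 0, 0, 1) part of the matrix is never stored or multiplied.
Vec3 transform_affine(const Affine& m, const Vec3& p) {
    Vec3 r{};
    for (int row = 0; row < 3; ++row) {
        r[row] = m[row * 4 + 0] * p[0] + m[row * 4 + 1] * p[1] +
                 m[row * 4 + 2] * p[2] + m[row * 4 + 3];
    }
    return r;
}

// Simplified register model (an assumption, not a compiler guarantee):
// one 32-bit register per FP32 scalar, two FP16 scalars per 32-bit register.
constexpr int registers_fp32(int scalars) { return scalars; }
constexpr int registers_fp16(int scalars) { return (scalars + 1) / 2; }
```

Under this model a full 4x4 FP32 matrix needs `registers_fp32(16)` = 16 registers, the affine form needs 12, and a 3x3 matrix needs 9. Stored as FP16, the same three matrices would need 8, 6, and 5 registers respectively. For any affine matrix, `transform_affine` returns the same x, y, and z as `transform_full` applied to the point with w set to 1.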

Note

Minimising the use of branching is recommended. On most GPU architectures, branching is costly because GPUs are designed to handle parallel workloads. Branching not only consumes extra cycles, but also increases register pressure.