Maximise Instruction Throughput

To get the best performance, the hardware needs to do as many calculations as it can in as short a time as possible. Ensuring the right algorithm is used is the most important thing to do here, as detailed in Identifying and Creating Work for the Graphics Core. Once the correct algorithm is chosen, several optimisations can further improve performance, as detailed below.

Trade Precision for Speed

Unless specifically needed for a particular scenario, precision should liberally be traded for speed. This minimises the use of arithmetic instructions with low throughput, as long as care is taken to ensure there is enough precision to produce acceptable results.

For OpenGL and OpenGL ES, consider using mediump floats whenever practical. Because they require less precision, they may use optimised paths, which can give reasonable gains for some common instructions.

For OpenCL kernels, always use the -cl-fast-relaxed-math build option, as this enables aggressive compiler optimisations for floating-point arithmetic which the standard otherwise disallows. This flag will often result in good performance gains at the cost of very little precision. native_* math built-ins should also be used, as they are much faster than their un-prefixed variants for the cost of some arithmetic precision.

For RenderScript, use #pragma rs_fp_relaxed to enable some optimisations, or #pragma rs_fp_imprecise to allow the compiler to perform more aggressive optimisations.

Tweak Work for the Graphics Core

The PowerVR Graphics Core has an impressive arithmetic throughput and fully supports integer and bitwise operations. Nevertheless, it is still beneficial to avoid instructions that are excessively long or complicated. It is important to remember that all Graphics Cores are optimised for floating point calculations and may achieve better throughput with them.

The integer path is reasonably fast, especially for combinations of arithmetic (except division) and bit-shifts. When tests are needed, it may be possible to process several of them per cycle.

That said, it will normally be faster still to work with floating point. Ideally, the numbers should be float in the first place. For kernels with a lot of arithmetic on comparatively few values per kernel, it may be beneficial to even turn integer values into float and back to integers after calculating. If in doubt, use a profiling compiler as these kinds of gains will be reflected in the cycle count of the kernel.

Integer division should always be avoided if possible. Even the worst case scenario of casting to float, dividing, and casting back to an integer will be a lot faster than true integer division. The caveat is the reduced input range: not all integers can be represented by a float, so the results are only accurate within a specific range. A 32-bit float can exactly represent every integer up to 2^24 (roughly 16.7M). However, the reciprocal involved may cost extra accuracy, so be wary with numbers even in the low millions.
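The cast, divide, and cast back pattern can be sketched in plain C as follows. This is a host-side illustration only: the helper names div_via_float and is_exact_in_float are hypothetical, and it assumes IEEE 754 single precision with correctly rounded division. A Graphics Core that divides through a reciprocal may lose accuracy earlier than this sketch suggests, so profile and validate on the target hardware.

```c
#include <stdint.h>

/* Hypothetical helper: integer division carried out in floating point.
 * Cast both operands to float, divide, then truncate back to an integer. */
static int32_t div_via_float(int32_t numerator, int32_t denominator)
{
    return (int32_t)((float)numerator / (float)denominator);
}

/* Hypothetical helper: returns 1 when the value survives a round trip
 * through a 32-bit float, i.e. when it is exactly representable. */
static int is_exact_in_float(int32_t value)
{
    return (int32_t)(float)value == value;
}
```

With these assumptions, 16777216 (2^24) survives the round trip but 16777217 does not, which is why the inputs must stay well inside that range for the result to be trusted.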

Avoid Excessive Flow Control

Any instruction that changes what code is executed based on some condition is considered to be flow control. Any flow control statements (if, ?, switch, do, while, for) in a kernel may cause dynamic branching as discussed in the Divergence – Dynamic Branching section.

PowerVR GPUs are very flexible and efficient with typical branching, but a branch is always a branch and inherently inefficient in a parallel environment. Branching should be avoided where practical, either with compile-time conditionals using the pre-processor or by replacing them with an arithmetic statement instead. Note that when conditionals are used to decide whether to access memory, the bandwidth savings will normally far outweigh the cost of the branch.

Also note that when very few instructions are present in any path of a branch statement, predicates will often be used instead, which results in fewer instructions used overall. Whether this happens or not can be seen when using a disassembling compiler, such as PVRShaderEditor.

As an alternative to using actual branches, it is often possible to use built-in functions such as clamp, min, max, or even casting a bool to a float or int value and performing arithmetic. Replacing branches with arithmetic can be tricky, but in some cases can result in decent performance boosts when done well, as these do not cause any kind of divergence among threads.
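As a minimal sketch of casting a comparison result to float and using arithmetic, the two functions below compute the same selection, first with a branch and then without one. It is written in plain C rather than kernel code, the function names are hypothetical, and it assumes finite inputs (multiplying an infinity or NaN by zero would not give the branching result). Whether the arithmetic form is actually faster depends on the compiler, which may already emit predicates for such a short branch.

```c
/* Branching version: the path taken depends on the data. */
static float select_branch(float x, float threshold,
                           float if_above, float otherwise)
{
    if (x > threshold) {
        return if_above;
    }
    return otherwise;
}

/* Arithmetic version: the comparison result (0 or 1) is cast to float
 * and used as a blend weight, so every thread runs the same instructions. */
static float select_arith(float x, float threshold,
                          float if_above, float otherwise)
{
    float mask = (float)(x > threshold);
    return mask * if_above + (1.0f - mask) * otherwise;
}
```

The arithmetic version always evaluates both operands, so it only pays off when both are cheap to compute.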

Avoid Barrier Directly after Local or Constant Memory Access

The hardware may have a latency of up to two cycles for calculating an array index and using it to fetch shared memory from the Common Store (see the Memory section). The compiler is generally able to hide this latency by rearranging instructions to carry out arithmetic during this time. If a barrier is present immediately after an access, as is often the case, then the compiler typically cannot do this. To allow the compiler to hide the latency, as much arithmetic that does not depend on the result of the access as possible should be performed before the barrier is inserted.

When the size of the workgroup is the same as the slot size, barriers are free.

Use Built-ins for Type Conversions

Conversions between data types are generally not very expensive and can sometimes be hidden completely, but the compiler may require additional instructions to perform them. When suitable, explicit pack/unpack or conversion functions should generally be preferred to manual operations.

OpenCL

OpenCL provides a very clear and complete suite of type conversion functions. Where a suitable one exists, developers should use the convert_type() functions. When conversions are required, follow these conventions:

  • For conversions from int to float, round to zero (rtz) is fastest.

  • For conversions from uint to float, round to zero (rtz) and round to negative infinity (rtn) are fastest. Round to positive infinity (rtp) is relatively fast, and round to nearest even (rte) is slow.

Also useful are the as_type() functions, which are analogous to reinterpret casts in C++. These allow two types that are conceptually the same size (e.g., int and char4) to be interpreted interchangeably. Types that are not actually the same size on-chip have a higher cost than those that are, as additional instructions are required to extract the correct bits. For example, a char4 uses 128 bits of register space, whereas as_type() treats it as 32 bits, so packing instructions need to be used to compensate. By contrast, an int4 treated as a float4 would typically be a NO-OP.
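To illustrate the difference between reinterpreting bits and converting a value, here is a plain C analogue. It is not OpenCL C: in a kernel, as_uint(x) would play the role of the hypothetical float_bits helper and convert_uint(x) the role of float_to_uint. The sketch assumes a 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer.
 * No numeric conversion takes place; only the type changes. */
static uint32_t float_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* A numeric conversion, by contrast, produces a different bit pattern. */
static uint32_t float_to_uint(float value)
{
    return (uint32_t)value;
}
```

For 1.0f the reinterpretation yields 0x3F800000, while the conversion yields 1.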

OpenGL and OpenGL ES

For OpenGL and OpenGL ES, the relevant conversions are the packXXXXXX and unpackXXXXXX families of functions, which generally let developers pack and unpack floating point values relevant to the graphics pipeline. Consult the specification for the API version in use to see which functions are available.
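As a rough picture of what one such built-in does, the plain C sketch below emulates the behaviour of packUnorm4x8 and its inverse. It assumes the GLSL definition of that function (each component clamped to [0, 1], scaled by 255, rounded, with the first component stored in the least significant byte), so check the specification for the API version in use. The helper names are hypothetical, and the rounding of exact halves here may differ from a given implementation. In a shader, the whole of this would be a single built-in call, which is the reason to prefer it over manual operations.

```c
#include <stdint.h>

/* Clamp to [0, 1] and scale to an 8-bit unsigned value, rounding to nearest. */
static uint32_t unorm8(float c)
{
    c = c < 0.0f ? 0.0f : (c > 1.0f ? 1.0f : c);
    return (uint32_t)(c * 255.0f + 0.5f);
}

/* Emulates packing four normalised floats into one 32-bit word,
 * first component in the least significant byte. */
static uint32_t pack_unorm_4x8(float x, float y, float z, float w)
{
    return unorm8(x) | (unorm8(y) << 8) | (unorm8(z) << 16) | (unorm8(w) << 24);
}

/* Emulates unpacking one component (index 0 to 3) back to a float in [0, 1]. */
static float unpack_unorm_component(uint32_t packed, int index)
{
    return (float)((packed >> (8 * index)) & 0xFFu) / 255.0f;
}
```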