Optimisation Strategy Cheat Sheet

A brief list of points on how to improve performance.

Choose the correct device for the task

  • Graphics Core: Large datasets, simple flow, coherent data, large numbers of computations in general
  • CPU: complex flow control and branching, one to a few threads of execution, little parallelism, one-shot tasks

Optimise the algorithm for convergence

  • Avoid algorithms with complex flow control, or ones that leave threads in a work-group idle or exiting early
  • Define and execute the kernel for maximum data coherency of nearby items
  • If flow control is unavoidable, try to maximise the probability that all threads in a work-group will take the same path

If possible, define a work-group size that is compile-time known and a multiple of 32

  • Occupancy, register pressure, and many aggressive optimisations benefit from a work-group size that is known at compile time and is a multiple of 32
  • Meeting both conditions maximises the benefit, but each helps on its own: a compile-time work-group size enables several optimisations, while a work-group size that is a multiple of 32 maximises occupancy

Use a large dataset

  • Provide enough work for the Graphics Core to schedule so that latency can be hidden
  • A few thousand to a few million work-items should be the target
  • Only consider fewer than a few thousand work-items if the kernel is long enough to justify the setup cost

Minimise bandwidth / maximise utilisation

  • Use shared memory to minimise bandwidth use
  • Use a healthy amount of arithmetic instructions interspersed with memory reads/writes to hide latency
  • Schedule a lot of work simultaneously to help the scheduler hide memory fetch latency
  • Consider using arithmetic instead of using lookup tables

Maximise occupancy

  • Try to use temporary store (unified store) and shared memory (common store) sensibly so as to not hurt occupancy. Registers are not infinite. Up to ~40 32-bit registers per work-item should be sufficient.
  • Minimise necessary temporaries.
  • Use a healthy amount of arithmetic instructions interspersed with memory reads/writes to hide latency. Consider using arithmetic instead of lookups when bandwidth is limited.

Balance minimising bandwidth use with maximising occupancy

  • Work-group size, shared memory use, and private memory use must all be tuned together
  • Bandwidth should be the first target of optimisation in most applications
  • However, too much shared or private memory use may force the compiler to reduce utilisation

Trade precision for speed wherever possible

  • Consider using half instead of float (OpenCL), mediump instead of highp (OpenGL ES)

Access shared memory sequentially on 2D accesses, in row-major order

  • Access sequential elements from kernel instances to maximise shared memory access speed and minimise bank conflicts.
  • Avoid strided access: in most cases it directly increases the number of bank conflicts. Maximum shared memory speed is achieved by accessing sequential elements.

Access raw global memory linearly and textures in square patterns

  • Accessing memory linearly helps cache locality and burst transfers
  • Texture hardware is optimised for spatial locality of pixels

Balance the length of the kernel by aiming for a few tens to a few hundreds of hardware instructions

  • If the kernel is not long enough, the cost of setting up might end up being a big part of overall execution time.
  • Excessively long kernels, with thousands of instructions or long loops, are generally inefficient. They can also starve other system components, notably Graphics.

Use shared objects with API Interoperability to avoid expensive redundant copies

  • See what options and extensions are available for the combination of APIs being used, such as EGLImage
  • Using a shared image or buffer and avoiding a round-trip to the CPU will almost always prove a big gain

Use burst transfers

  • Help the compiler identify burst transfers by grouping reads and writes together. Use explicit block-copy instructions wherever possible, such as OpenCL's async_work_group_copy.

Try to hide common-store latency

  • If possible, try to have a few unrelated calculations between accesses to shared memory and their corresponding barrier syncs