Minimise Bandwidth Usage#
If an application does not need to access memory or move data around, then it should not do so. Depending on the algorithm, several changes can help an application avoid unnecessary traffic. Performing calculations at run-time instead of using a lookup table, or reducing the number of memory fetches made by an image filter, can help greatly. Other optimisations specific to PowerVR Graphics Cores are explained in the following sections.
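As a rough illustration of reducing fetches in a filter, the sketch below compares two plain C versions of a 3-tap box filter over one row. The function names and the fetch counters are hypothetical and exist only for this example. It also assumes that one thread produces several neighbouring outputs, which is not always how a kernel is organised.

```c
#include <stddef.h>

/* Hypothetical 3-tap box filter over one row of floats.
 * Both functions return the number of reads they issue to `in`. */

/* Naive version: every output re-reads all three neighbours. */
size_t box3_naive(const float *in, float *out, size_t n)
{
    size_t fetches = 0;
    for (size_t i = 1; i + 1 < n; ++i) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
        fetches += 3;
    }
    return fetches;
}

/* Sliding-window version: the two previous inputs stay in local
 * variables, so each output needs only one new read. */
size_t box3_sliding(const float *in, float *out, size_t n)
{
    if (n < 3) {
        return 0;
    }
    float a = in[0];
    float b = in[1];
    size_t fetches = 2;
    for (size_t i = 1; i + 1 < n; ++i) {
        const float c = in[i + 1];
        fetches += 1;
        out[i] = (a + b + c) / 3.0f;
        a = b;
        b = c;
    }
    return fetches;
}
```

For a row of eight values, the naive version issues 18 reads and the sliding version issues 8, with identical results. Caches or the compiler may already hide part of this difference, so it is worth measuring before restructuring a kernel.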
Avoid Redundant Copies#
One of the first steps in reducing memory throughput for an application is to ensure that hardware components using the same memory all access the same data without any intermediate copying required. Examples of hardware components include the CPU, Graphics Core, camera interfaces, and video decoders.
Create Allocations with Correct USAGE Flags#
PowerVR devices are capable of allocating memory and then mapping the allocation to a CPU pointer but cannot map arbitrary host-allocated memory into the Graphics Core. Allocations should be explicitly synchronised across usages each time they are used in a different context to ensure that everything has the same view of memory. Synchronisations can cause copies to occur when memory is allocated in different memory spaces and so should only be used when necessary. If an algorithm can be designed to avoid synchronisations at any point, it should do so.
Create Buffers with the Correct USAGE Hints#
In OpenGL and OpenGL ES, buffer allocation is handled semi-opaquely by the driver, guided by the usage hints supplied when the buffer is created. The same recommendations apply as for the explicit allocations described above, so it is necessary to supply the hints that match how the application actually uses the buffer. Mapping solutions (glMapBufferRange) are preferred to explicit uploads (glBufferSubData) when the data is going to be changed frequently.
Use Zero-Copy Paths if Possible#
This technique is particularly relevant to interop tasks. When processing input data from an external source such as a camera module, a zero-copy path can be enabled between the camera data and the API image used by the Graphics Core. This is achieved by using EGLImages. This typically means rendering the image directly; see the example below for communicating an image from the camera module.
OpenGL and OpenGL ES: If the output image is to be used outside the graphics pipeline, it can also be shared with an EGLImage. If the data is not required outside the graphics pipeline, consider rendering directly to the framebuffer with a fragment shader instead of using Image Load/Store.
OpenCL: A zero-copy path can also be enabled between the OpenCL output image and the OpenGL ES input texture used for rendering by using EGLImages, as illustrated below. An example of how to do this (OpenCL Example) is included in the PowerVR SDK.
Group Memory Access Together#
The compiler uses several heuristics and can identify memory access patterns in a kernel that can be combined into burst transfers for read or write operations. To allow this, memory accesses should be grouped together as closely as possible so that they are easy for the compiler to identify. Generally, putting reads at the start of a kernel and writes at the end allows for the best efficiency. Accesses to larger data types, such as vectors, also compile into single transfers wherever possible. This means that loading a single float4 is preferred over loading four separate float values.
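The grouping advice above can be sketched in plain C. The float4 struct here is a stand-in for the built-in OpenCL C float4 type, and the function names and the scale-and-bias operation are hypothetical. Whether a given compiler actually emits a burst transfer for the grouped form is not guaranteed by this sketch.

```c
#include <stddef.h>

/* Plain C stand-in for the OpenCL C float4 vector type (assumption:
 * used here only so the example compiles as ordinary C). */
typedef struct { float x, y, z, w; } float4;

/* Interleaved form: four scalar reads and four scalar writes,
 * alternating with each other. */
void scale_bias_interleaved(const float *in, float *out, size_t i,
                            float s, float b)
{
    out[4 * i + 0] = in[4 * i + 0] * s + b;
    out[4 * i + 1] = in[4 * i + 1] * s + b;
    out[4 * i + 2] = in[4 * i + 2] * s + b;
    out[4 * i + 3] = in[4 * i + 3] * s + b;
}

/* Grouped form: one float4 read at the start, arithmetic on local
 * values, one float4 write at the end. */
void scale_bias_grouped(const float4 *in, float4 *out, size_t i,
                        float s, float b)
{
    const float4 v = in[i];   /* single grouped read */
    float4 r;
    r.x = v.x * s + b;
    r.y = v.y * s + b;
    r.z = v.z * s + b;
    r.w = v.w * s + b;
    out[i] = r;               /* single grouped write */
}
```

Both functions compute the same values. The second only makes the access pattern explicit: all reads first, all writes last, each expressed as one vector-sized access.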
Access Memory in Row-Major Order#
To make best use of the hardware’s memory architecture, all memory should be accessed in row-major order across threads. This applies to all memory banks in the architecture, though for subtly different reasons as described in Memory. Generally, memory can be considered as linear, even when allocated as an n-dimensional array. This means that two contiguous values in the last dimension are physically next to each other in memory and can often be accessed together in a single transfer. This is much more important when working on mobile devices with limited bandwidth on large data sets.
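To make the linear view of memory concrete, the following plain C sketch computes the address touched by consecutive thread ids under two mappings. All names are hypothetical, and "address" here means an element index into a linear buffer, not a hardware address.

```c
#include <stddef.h>

/* Element index of (row, col) in a row-major 2D array. */
size_t linear_index(size_t row, size_t col, size_t width)
{
    return row * width + col;
}

/* Row-major mapping: consecutive ids walk along a row, so
 * neighbouring ids touch neighbouring elements (stride 1). */
size_t addr_row_major(size_t id, size_t width)
{
    return linear_index(id / width, id % width, width);
}

/* Column-first mapping: consecutive ids walk down a column, so
 * neighbouring ids are a whole row apart (stride `width`). */
size_t addr_column_first(size_t id, size_t width, size_t height)
{
    return linear_index(id % height, id / height, width);
}
```

With a width of 8 and a height of 4, ids 0 and 1 map to elements 0 and 1 under the row-major mapping, but to elements 0 and 8 under the column-first mapping. Only the row-major case gives adjacent threads adjacent elements that can share a transfer.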
Note
This applies to most kinds of memory and buffers, but specifically does not apply to texture or image objects. These are optimised for cache efficiency based on n-dimensional access and dedicated hardware handles this. In these cases, it is important to access nearby pixels to ensure cache coherency. For example, a typical two-dimensional texture should generally be accessed in square/neighbouring pixels starting from (0, 0) and increasing. See Texture Processing Unit for more details.
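One way to visit a texture in square blocks of neighbouring pixels is sketched below in plain C. The pixel_coord type, the tiled_coord function, and the tile size are hypothetical choices made for this illustration. The block size that suits a particular Graphics Core is not specified here.

```c
#include <stddef.h>

typedef struct { size_t x, y; } pixel_coord;

/* Map a flat id to a pixel coordinate so that ids are visited in
 * tile x tile squares, starting from (0, 0) and increasing.
 * Assumes `width` is a non-zero multiple of `tile`. */
pixel_coord tiled_coord(size_t id, size_t width, size_t tile)
{
    const size_t per_tile = tile * tile;
    const size_t tiles_per_row = width / tile;
    const size_t t = id / per_tile;   /* which tile */
    const size_t k = id % per_tile;   /* position inside the tile */
    pixel_coord c;
    c.x = (t % tiles_per_row) * tile + (k % tile);
    c.y = (t / tiles_per_row) * tile + (k / tile);
    return c;
}
```

With a width of 8 and a tile size of 4, ids 0 to 15 cover the square from (0, 0) to (3, 3) before id 16 moves on to (4, 0), so each group of 16 consecutive ids stays within one small neighbourhood of the image.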