Minimise Bandwidth Usage#

If an application does not need to access memory or move data around, then it should not do so. Depending on the algorithm used, several changes can help an application avoid this, such as calculating values at run-time instead of reading them from a lookup table, or reducing the number of memory fetches an image filter performs. Other optimisations specific to the PowerVR Graphics Cores are explained over the following sections.

Avoid Redundant Copies#

One of the first steps in reducing memory throughput for an application is to ensure that hardware components using the same memory all access the same data without any intermediate copying required. Examples of hardware components include the CPU, Graphics Core, camera interfaces, and video decoders.

Create Allocations with Correct USAGE Flags#

PowerVR devices are capable of allocating memory and then mapping the allocation to a CPU pointer, but cannot map arbitrary host-allocated memory into the Graphics Core. Allocations should be explicitly synchronised each time they are used in a different context, to ensure that every component has the same view of memory. Synchronisations can cause copies to occur when memory is allocated in different memory spaces, so they should only be used when necessary. If an algorithm can be designed to avoid a synchronisation point, it should be.

Create Shared Memory Objects with CL_MEM_ALLOC_HOST_PTR#

In OpenCL, the host and device memory models are specified independently of each other, with interaction performed either by explicitly copying data or by sharing regions of a memory object. PowerVR devices are capable of allocating memory and then mapping the allocation to a CPU pointer, but cannot map arbitrary host-allocated memory into the Graphics Core. Therefore, memory objects that are to be shared should be created using the flag CL_MEM_ALLOC_HOST_PTR. They must also be accessed using the mapping functions clEnqueueMapBuffer or clEnqueueMapImage, and clEnqueueUnmapMemObject.

If an object is created using the flag CL_MEM_USE_HOST_PTR, the driver will create a copy of it that can be mapped into the Graphics Core address space before a kernel is enqueued. This copy occurs every time data is transferred between the host and device, and increases the bandwidth demand on the system.
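As a minimal sketch of the preferred path, the code below creates a buffer with CL_MEM_ALLOC_HOST_PTR and fills it through a map/unmap pair rather than an explicit copy. The helper name upload_input is hypothetical, and error handling is omitted; context and queue are assumed to be a valid cl_context and cl_command_queue.

```c
#include <CL/cl.h>
#include <string.h>

/* Hypothetical helper: create a shared buffer and fill it via map/unmap. */
cl_mem upload_input(cl_context context, cl_command_queue queue,
                    const float *src, size_t count)
{
    cl_int err;

    /* Let the driver allocate memory visible to both the CPU and the
     * Graphics Core, instead of wrapping an arbitrary host pointer. */
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                count * sizeof(float), NULL, &err);

    /* Map the buffer into the host address space; because the driver
     * made the allocation, the mapping can alias the same memory the
     * Graphics Core will read. */
    float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                             CL_MAP_WRITE, 0,
                                             count * sizeof(float),
                                             0, NULL, NULL, &err);
    memcpy(ptr, src, count * sizeof(float));

    /* Unmap before any kernel uses the buffer. */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    return buf;
}
```

With this pattern, the only transfer is the application's own memcpy into the mapping; the driver does not need to maintain a second, shadow copy of the buffer.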

Create Buffers with the Correct USAGE Hints#

In OpenGL and OpenGL ES, buffer allocation is handled semi-opaquely by the driver, guided by usage hints supplied at buffer allocation. The same principle applies as with OpenCL memory objects: it is necessary to use the hints that correctly describe how the application will use the buffer. Mapping functions (glMapBufferRange) are preferred to explicit uploads (glBufferSubData) when the data is going to be changed frequently.
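For example, a buffer that is rewritten every frame might be created with a DYNAMIC hint and updated through glMapBufferRange, as in the sketch below. The helper names are hypothetical, an OpenGL ES 3.0 context is assumed, and the orphaning strategy shown (GL_MAP_INVALIDATE_BUFFER_BIT) is one common choice rather than a universal rule.

```c
#include <GLES3/gl3.h>
#include <string.h>

/* Hypothetical helper: allocate a buffer whose contents change often. */
GLuint create_dynamic_vbo(GLsizeiptr size)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* GL_DYNAMIC_DRAW hints that the data will be respecified often. */
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);
    return vbo;
}

/* Hypothetical helper: update via a mapping rather than glBufferSubData. */
void update_vbo(GLuint vbo, const void *src, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* Invalidating the old contents lets the driver hand back fresh
     * storage instead of stalling on draws still reading the buffer. */
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_INVALIDATE_BUFFER_BIT);
    memcpy(ptr, src, (size_t)size);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
```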

Use Zero-Copy Paths if Possible#

This technique is of particular note for interop tasks. When processing input data from an external source, such as a camera module, a zero-copy path can be enabled between the camera data and the API image used by the Graphics Core. This is achieved by using EGLImages, and typically means rendering the image directly - see the example below for communicating an image from the camera module.

../../_images/zero-copy.svg

Fig. 9 Zero copy from a camera to OpenGL ES#
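A minimal sketch of the import side of this path is given below. It assumes an Android-style native buffer and the EGL_KHR_image_base and OES_EGL_image_external extensions; the buffer source, the target EGL_NATIVE_BUFFER_ANDROID, and the helper name are platform-specific assumptions, and error handling is omitted.

```c
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

/* Hypothetical helper: wrap a native buffer exported by the camera
 * stack in an EGLImage and bind it as an external texture. */
GLuint import_camera_buffer(EGLDisplay dpy, EGLClientBuffer native_buf)
{
    static const EGLint attribs[] = { EGL_IMAGE_PRESERVED_KHR, EGL_TRUE,
                                      EGL_NONE };

    PFNEGLCREATEIMAGEKHRPROC peglCreateImageKHR =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
    PFNGLEGLIMAGETARGETTEXTURE2DOESPROC pglEGLImageTargetTexture2DOES =
        (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)
            eglGetProcAddress("glEGLImageTargetTexture2DOES");

    EGLImageKHR image = peglCreateImageKHR(dpy, EGL_NO_CONTEXT,
                                           EGL_NATIVE_BUFFER_ANDROID,
                                           native_buf, attribs);

    /* Bind the EGLImage as the texture's storage: the Graphics Core
     * samples the camera buffer directly, with no pixel copy. */
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    pglEGLImageTargetTexture2DOES(GL_TEXTURE_EXTERNAL_OES,
                                  (GLeglImageOES)image);
    return tex;
}
```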

OpenGL and OpenGL ES: If the output image is to be used outside the graphics pipeline, it can also be shared via an EGLImage. If the data is not required outside the graphics pipeline, consider rendering directly to the framebuffer with a fragment shader instead of using Image Load/Store.

OpenCL: A zero-copy path can also be enabled between the OpenCL output image and the OpenGL ES input texture used for rendering by using EGLImages, as illustrated below. An example of how to do this (OpenCL Example) is included in the PowerVR SDK.

../../_images/zero-copy-2.svg

Fig. 10 Zero copy from a camera to OpenCL and from OpenCL to OpenGL ES#
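The OpenCL side of this sharing can be expressed with the cl_khr_egl_image extension, assuming the platform exposes it. The sketch below wraps an existing EGLImage in a cl_mem; the helper name wrap_egl_image is hypothetical and error handling is omitted.

```c
#include <CL/cl.h>
#include <CL/cl_egl.h>

/* Hypothetical helper: share an EGLImage with OpenCL so kernels write
 * straight into the storage the GL texture samples from. */
cl_mem wrap_egl_image(cl_platform_id platform, cl_context context,
                      CLeglDisplayKHR egl_display, CLeglImageKHR egl_image)
{
    clCreateFromEGLImageKHR_fn pclCreateFromEGLImageKHR =
        (clCreateFromEGLImageKHR_fn)clGetExtensionFunctionAddressForPlatform(
            platform, "clCreateFromEGLImageKHR");

    cl_int err;
    /* The returned image aliases the EGLImage storage; acquire/release
     * synchronisation is still required around kernel use. */
    return pclCreateFromEGLImageKHR(context, egl_display, egl_image,
                                    CL_MEM_WRITE_ONLY, NULL, &err);
}
```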

Group Memory Access Together#

The compiler uses several heuristics to identify memory access patterns in a kernel that can be combined into burst transfers for read or write operations. To allow this, memory accesses should be grouped as closely together as possible, making them easier to identify. Generally, putting reads at the start of a kernel and writes at the end allows for the best efficiency. Accesses to larger data types, such as vectors, also compile into single transfers wherever possible, so loading a single float4 is preferable to loading four separate float values.
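The hypothetical OpenCL C kernel below illustrates the pattern: a single vector load at the start, arithmetic in the middle, and a single vector store at the end.

```c
/* Hypothetical OpenCL C kernel: reads grouped at the start, the write
 * at the end, and vector accesses used instead of four scalar ones so
 * the compiler can emit burst transfers. */
__kernel void scale4(__global const float *in,
                     __global float *out,
                     const float factor)
{
    size_t gid = get_global_id(0);

    /* One vector load rather than four separate float loads. */
    float4 v = vload4(gid, in);

    v *= factor;

    /* One vector store, placed at the end of the kernel. */
    vstore4(v, gid, out);
}
```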

Use Shared/Local Memory Sensibly#

Unlike Series5, the Type1 to Type4 GPUs have a good amount of shared memory available, making work-groups worthwhile. Kernels should generally pre-fetch data into shared memory if it is going to be accessed by multiple work-items. This is effectively a software-controlled cache, augmenting the hardware-controlled caches on the device.

Typically, the way to make good use of this is to evaluate how many accesses are needed across work-items, then cache these values in shared memory at the start of the kernel. This should ideally be done with explicit copy commands where available, such as OpenCL's async_work_group_copy; otherwise, the work should be split between the available work-items as evenly as possible, with no two work-items caching the same values. Barriers should be used to synchronise all work-items as late as possible, but before the cached values are used. Each work-item should then proceed as usual, using the cached values.
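A minimal OpenCL C sketch of this pattern is shown below, assuming a one-dimensional work-group of exactly TILE work-items; the neighbourhood sum itself is purely illustrative.

```c
/* Hypothetical OpenCL C kernel: pre-fetch a tile of input into local
 * memory with async_work_group_copy, synchronise, then read the cached
 * values. */
#define TILE 128

__kernel void local_prefetch(__global const float *in,
                             __global float *out)
{
    __local float tile[TILE];
    size_t lid   = get_local_id(0);
    size_t group = get_group_id(0);

    /* Cooperative copy from global to local memory; the runtime splits
     * the transfer across the work-group. */
    event_t e = async_work_group_copy(tile, in + group * TILE,
                                      (size_t)TILE, 0);

    /* Synchronise as late as possible, but before the values are read. */
    wait_group_events(1, &e);

    /* Every work-item now reads its neighbours from fast local memory. */
    float left  = tile[(lid + TILE - 1) % TILE];
    float right = tile[(lid + 1) % TILE];
    out[group * TILE + lid] = tile[lid] + 0.5f * (left + right);
}
```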

Access Memory in Row-Major Order#

To make best use of the hardware’s memory architecture, all memory should be accessed in row-major order across threads. This applies to all memory banks in the architecture, though for subtly different reasons, as described in Memory. Generally, memory can be considered linear, even when allocated as an n-dimensional array. This means that two contiguous values in the last dimension are physically next to each other in memory and can often be accessed together in a single transfer. This matters even more on mobile devices, where bandwidth is limited and data sets can be large.
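As a simple illustration, the hypothetical kernel below indexes a two-dimensional array in row-major order, so that consecutive work-items in dimension 0 access consecutive addresses.

```c
/* Hypothetical OpenCL C kernel: a 2D array stored in row-major order is
 * indexed so that consecutive work-items in dimension 0 touch
 * consecutive addresses, allowing accesses to be combined. */
__kernel void copy2d(__global const float *in,
                     __global float *out,
                     const int width)
{
    int x = get_global_id(0);   /* varies fastest across work-items */
    int y = get_global_id(1);

    /* Row-major: the last dimension (x) is contiguous in memory. */
    int idx = y * width + x;
    out[idx] = in[idx];
}
```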

Note

This applies to most kinds of memory and buffers, but specifically does not apply to texture or image objects. These are optimised for cache efficiency based on n-dimensional access, which is handled by dedicated hardware. In these cases, it is important to access nearby pixels to maintain cache coherency. For example, a typical two-dimensional texture should generally be accessed in small square neighbourhoods of pixels, starting from (0, 0) and increasing. See Texture Processing Unit for more details.