Do Use On-chip Memory Efficiently for Deferred Rendering

Making greater use of on-chip memory reduces overall system memory bandwidth usage.

Graphics techniques such as deferred lighting are often implemented by attaching multiple colour render targets to a frame buffer object, rendering the required intermediate data, and then sampling from this data as textures. While flexible, this approach, even when implemented optimally, still consumes a large amount of system memory bandwidth, which comes at a premium on mobile devices.
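To give a rough sense of scale, the cost of writing a G-buffer out to system memory and sampling it back can be estimated with simple arithmetic. The sketch below is illustrative only: the function names, the resolution, the 16 bytes per pixel, and the 60 frames per second are assumptions chosen for the example, and the estimate ignores framebuffer compression, caching, and overdraw.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved through system memory per frame when the G-buffer is
 * written out once and then read back once as textures.
 * Hypothetical helper: ignores compression, caching, and overdraw. */
uint64_t gbuffer_bytes_per_frame(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel)
{
    assert(width > 0 && height > 0);
    uint64_t pixels = (uint64_t)width * height;
    return pixels * bytes_per_pixel * 2u; /* one write + one read */
}

/* The same traffic expressed per second at a given frame rate. */
uint64_t gbuffer_bytes_per_second(uint32_t width, uint32_t height,
                                  uint32_t bytes_per_pixel, uint32_t fps)
{
    return gbuffer_bytes_per_frame(width, height, bytes_per_pixel) * fps;
}
```

Under these assumptions, a 1920 x 1080 G-buffer at 16 bytes per pixel moves 66,355,200 bytes per frame, or roughly 3.98 GB per second at 60 frames per second, for the intermediate data alone.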

APIs have methods which allow efficient use of on-chip memory

Both the OpenGL ES (3.x) and Vulkan graphics APIs provide a way for fragment shader invocations that cover the same pixel location to communicate through intermediate on-chip buffers. These buffers can only be read from and written to by shader invocations at the same pixel coordinate.

The OpenGL ES extension shader_pixel_local_storage(2) and Vulkan transient attachments enable applications to store the intermediate per-pixel data in on-chip tile memory. While each method has its own implementation details, they both provide similar functionality and bring the same benefits. For example, the “G-Buffer” attachments in a deferred lighting pass that are only needed once can be stored in tile memory, and then completely discarded when drawing is complete.

These features can potentially reduce the amount of system memory bandwidth used by deferred rendering

Both of the API features described above are extremely beneficial for tile-based renderers such as PowerVR graphics cores. The intermediate frame buffer attachments are never allocated or written out to system memory; they exist only in on-chip tile memory. This matters most on mobile and embedded systems, where memory bandwidth is at a premium.

Used correctly, these features can significantly reduce system memory bandwidth usage. Additionally, most techniques (such as deferred lighting) that write intermediate data out to system memory and then sample from it at the same pixel location can be optimised using these API features.