Separable Kernels

The trade-offs between single-pass and multi-pass approaches

Many post-processing techniques, such as a full-screen Gaussian blur, or gather and scatter operations like motion blur, depth of field, and bloom, can be implemented with multiple separate kernels, one per pass. The downside of a multi-pass algorithm is that it can be very inefficient in terms of memory bandwidth, because every additional pass adds a round trip to system memory: write-out, read-back, write-out, read-back, and so on. The result is significantly higher memory bandwidth usage and power consumption, and potentially poor performance.
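The round-trip cost can be estimated with some simple arithmetic. The sketch below is a rough model, not a measurement: it assumes every intermediate result is a full-screen render target that is written to system memory once and read back once, and it ignores caching, compression, and overdraw. The function names and the 1920x1080 RGBA8 target are illustrative assumptions.

```python
def surface_bytes(width, height, bytes_per_pixel):
    """Size in bytes of one full-screen render target."""
    return width * height * bytes_per_pixel


def intermediate_traffic(width, height, bytes_per_pixel, passes):
    """Extra bytes moved per frame by the intermediate targets of a pass chain.

    Assumption: each pass except the last writes one intermediate target,
    and the following pass reads it back exactly once.
    """
    intermediates = passes - 1
    return 2 * intermediates * surface_bytes(width, height, bytes_per_pixel)


# Hypothetical 1080p RGBA8 target (4 bytes per pixel).
single_pass_extra = intermediate_traffic(1920, 1080, 4, passes=1)
two_pass_extra = intermediate_traffic(1920, 1080, 4, passes=2)

print(single_pass_extra)    # 0
print(two_pass_extra)       # 16588800
print(two_pass_extra * 60)  # 995328000 bytes per second at 60 frames per second
```

Under these assumptions, a single intermediate target adds roughly 16.6 MB of traffic per frame, or close to 1 GB per second at 60 frames per second, before any of the actual filtering samples are counted.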

The alternative solution is to use a single kernel (a single-pass, brute-force approach) to achieve the desired graphical effect. However, condensing the algorithm into a single pass may perform worse than the multi-pass technique, because the single-pass version may need many more samples to achieve the same level of quality. Those extra samples can push system memory bandwidth usage above that of the multi-pass version.
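The difference in sample count is easy to quantify for a square filter. The following sketch assumes a kernel that is `taps` texels wide in each direction, one texture fetch per tap, and no tricks such as using bilinear filtering to combine taps. The function names are illustrative.

```python
def samples_single_pass(taps):
    """A brute-force 2D kernel fetches every texel in a taps x taps window."""
    return taps * taps


def samples_two_pass(taps):
    """A horizontal pass and a vertical pass each fetch `taps` texels."""
    return 2 * taps


for taps in (5, 9, 13):
    print(taps, samples_single_pass(taps), samples_two_pass(taps))
# 5 25 10
# 9 81 18
# 13 169 26
```

The single-pass count grows with the square of the kernel width while the two-pass count grows linearly, so the gap widens quickly as the blur radius increases.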

An example is Gaussian blur, which is commonly implemented as a two-pass technique with a horizontal pass and a vertical pass. This substantially reduces the complexity of the algorithm compared to a single-pass approach, which requires significantly more samples. There are, however, full-screen blur techniques that work in a single pass and have proven to be efficient, such as Epic’s single-pass circular filter algorithm, which can be used instead of a two-pass Gaussian. More information can be found here.
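The two-pass Gaussian works because the 2D Gaussian kernel is the product of two 1D kernels. The sketch below illustrates this on the CPU with a small greyscale image stored as nested lists. It is a reference model under stated assumptions (clamp-to-edge addressing, a 5-tap kernel, hypothetical function names), not shader code and not any particular engine's implementation.

```python
import math


def gaussian_1d(radius, sigma):
    """Normalised 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def clamp(value, low, high):
    return max(low, min(high, value))


def blur_1d(image, kernel, dx, dy):
    """One pass: dx=1, dy=0 is horizontal; dx=0, dy=1 is vertical."""
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i in range(-radius, radius + 1):
                sx = clamp(x + i * dx, 0, width - 1)  # clamp-to-edge
                sy = clamp(y + i * dy, 0, height - 1)
                acc += kernel[i + radius] * image[sy][sx]
            out[y][x] = acc
    return out


def blur_two_pass(image, kernel):
    """Horizontal pass into an intermediate image, then a vertical pass."""
    return blur_1d(blur_1d(image, kernel, 1, 0), kernel, 0, 1)


def blur_single_pass(image, kernel):
    """Brute-force 2D blur using the outer product of the 1D weights."""
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for j in range(-radius, radius + 1):
                for i in range(-radius, radius + 1):
                    sx = clamp(x + i, 0, width - 1)
                    sy = clamp(y + j, 0, height - 1)
                    acc += kernel[j + radius] * kernel[i + radius] * image[sy][sx]
            out[y][x] = acc
    return out


# Small deterministic test image (8 x 6 texels).
image = [[float((x * 7 + y * 13) % 17) for x in range(8)] for y in range(6)]
kernel = gaussian_1d(radius=2, sigma=1.0)

two_pass = blur_two_pass(image, kernel)
single_pass = blur_single_pass(image, kernel)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(two_pass, single_pass)
               for a, b in zip(row_a, row_b))
print(max_diff)  # differs from zero only by floating-point rounding
```

Both versions produce the same image, but the single-pass version reads 25 texels per output pixel while the two-pass version reads 10 in total, at the cost of writing and reading back one intermediate image.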

To choose between a single-pass and a multi-pass algorithm, profile each candidate to determine which makes the most efficient use of system memory bandwidth and the USC.