Texture Sampling

Points to consider when sampling and filtering textures.

Texture filtering

Texture filtering can be used to increase the image quality of textures used in 3D scenes. However, as the complexity of the filtering used increases, so will the associated cost as more samples are required.

There are several common techniques employed for texture filtering. These include, in order of increasing image quality, but also increasing cost:

  1. nearest
  2. bilinear
  3. cubic
  4. tri-linear
  5. anisotropic

Performance can be gained by using an appropriate level of filtering, following the principle of "good enough". For instance, do not use anisotropic filtering if tri-linear is acceptable, and do not use tri-linear filtering if bilinear is acceptable.

Filtering works by either taking a single sample in the case of nearest filtering, or by taking multiple samples involving multiple texture fetch operations. These are then combined (interpolated) in order to produce as good a sampling value as possible to use in fragment calculations.

Retrieving multiple values requires more data to be fetched, possibly from disparate areas of memory, and so cache performance and bandwidth use can be affected. For instance, when tri-linear filtering is used, eight texel fetches are required, compared to only four for bilinear filtering or one for nearest filtering. This means the texture processing unit in the graphics core must spend more time and bandwidth fetching and filtering the required data as the complexity of the filtering increases.
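To make the fetch counts above concrete, the following is a minimal Python sketch of nearest, bilinear, and tri-linear sampling on a small in-memory texture. It is an illustration only, not a description of how the texture processing unit is implemented: the Texture class, its fetch counter, and the clamp-to-edge addressing are all assumptions made for this example.

```python
import math

class Texture:
    """A tiny single-channel texture that counts texel fetches (hypothetical helper)."""
    def __init__(self, rows):
        self.rows = rows
        self.height = len(rows)
        self.width = len(rows[0])
        self.fetches = 0

    def fetch(self, x, y):
        # Clamp-to-edge addressing (an assumption for this sketch).
        x = min(max(x, 0), self.width - 1)
        y = min(max(y, 0), self.height - 1)
        self.fetches += 1
        return self.rows[y][x]

def sample_nearest(tex, u, v):
    # One fetch: the texel whose area contains (u, v).
    return tex.fetch(math.floor(u * tex.width), math.floor(v * tex.height))

def sample_bilinear(tex, u, v):
    # Four fetches: the 2x2 block of texels around (u, v), blended by distance.
    x = u * tex.width - 0.5
    y = v * tex.height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    top = tex.fetch(x0, y0) * (1 - fx) + tex.fetch(x0 + 1, y0) * fx
    bottom = tex.fetch(x0, y0 + 1) * (1 - fx) + tex.fetch(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy

def sample_trilinear(mip_a, mip_b, u, v, blend):
    # Eight fetches: one bilinear sample from each of two adjacent mip levels,
    # followed by a linear blend between the two results.
    a = sample_bilinear(mip_a, u, v)
    b = sample_bilinear(mip_b, u, v)
    return a * (1 - blend) + b * blend
```

Reading the fetches counter after each call shows one fetch for nearest, four for bilinear, and eight in total across the two mip levels for tri-linear, which is the scaling described above.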

The graphics core will attempt to hide memory access latency by scheduling other Unified Shading Cluster (USC) tasks. If there is not enough work to hide the memory latency, then the texture fetches may cause the processing of a fragment to stall while the data is fetched from system memory. If the data is already in cache, then memory latency is much less of an issue. More complex filtering techniques will result in additional data being transferred across the system memory bus in order to render a frame.

Note: When performing independent texture reads, texture sampling can begin before the execution of a shader. Therefore the latency of the texture fetch can be avoided, as the data is ready before shader execution.

On PowerVR hardware, bilinear filtering is always hardware accelerated. This includes shadow sampling, that is, sampling a texture with depth comparison enabled (sampler2DShadow). In the case of shadow sampling, the depth comparison itself is performed in software by USC instructions that are appended (patched) to the fragment shader. The cost of the comparison varies with the hardware the application is deployed to. It can be determined by using PVRShaderEditor and setting the appropriate GLSL compiler.
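As a rough illustration of what a depth comparison computes during shadow sampling, the sketch below compares a reference depth against a 2x2 block of stored depths and then blends the results bilinearly. This is the generic technique usually called percentage-closer filtering. It is not a description of the instructions the PowerVR compiler actually patches into a shader. The "less than or equal" comparison, the compare-then-filter order, and both function names are assumptions made for this example.

```python
def shadow_compare(stored_depth, reference_depth):
    # Assumed "less than or equal" test: 1.0 means lit, 0.0 means in shadow.
    return 1.0 if reference_depth <= stored_depth else 0.0

def shadow_sample_bilinear(depths_2x2, fx, fy, reference_depth):
    # depths_2x2 holds the four stored depths as [[top_left, top_right],
    # [bottom_left, bottom_right]]; fx and fy are the bilinear weights in [0, 1].
    (tl, tr), (bl, br) = depths_2x2
    # Compare each stored depth first, then blend the 0/1 results.
    top = shadow_compare(tl, reference_depth) * (1 - fx) + shadow_compare(tr, reference_depth) * fx
    bottom = shadow_compare(bl, reference_depth) * (1 - fx) + shadow_compare(br, reference_depth) * fx
    return top * (1 - fy) + bottom * fy
```

The result is a fraction between 0.0 and 1.0 rather than a filtered depth value, which is why the comparison adds work on top of ordinary bilinear filtering.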

Texel fetch

In certain cases, performing a texelFetch operation can be considerably faster than calling the texture function, because texelFetch reads a single unfiltered texel at integer coordinates. For example, take the case of an application whose sampler is configured for an expensive operation such as anisotropic filtering. If filtering is not actually needed for a given lookup, texelFetch will likely be faster than texture, although this should be verified through profiling.

On PowerVR hardware, both operations are driven by dedicated hardware known as the Texture Processing Unit (TPU). In some special cases texelFetch may translate to a DMA operation.

Dependent texture read

A dependent texture read is a texture read in which the texture coordinates depend on some calculation within the shader instead of on a varying. As the values of this calculation cannot be known ahead of time, it is not possible to pre-fetch texture data, and so stalls in shader processing occur.

Vertex shader texture lookups always count as dependent texture reads, as do texture reads in fragment shaders where the texture read is based on the .zw channels of a varying. On some driver and platform revisions, texture2DProj() also qualifies as a dependent texture read if given a vec3, or a vec4 with an invalid w.

The cost associated with a dependent texture read can be hidden to some extent by hardware thread scheduling, particularly if the shader in question involves a lot of mathematical calculations. The thread scheduler suspends the current thread and swaps in another thread to process on the USC. The swapped-in thread processes as much as possible, and the original thread is swapped back in once the texture fetch is complete.

Note: While the hardware will do its best to hide memory latency, dependent texture reads should be avoided wherever possible for good performance.

Dependent texture reads are significantly more efficient on PowerVR Rogue Graphics Cores than on SGX. However, there are still small performance gains to be had by avoiding them. For this reason, applications should always calculate texture coordinates before fragment shader execution, unless the algorithm relies on dependent reads.

Wide floating point textures

For textures that exceed 32 bits per texel, every additional 32 bits is counted as a separate texture read. This applies to half float textures with three or four components, as well as float textures with two or more components. These larger formats should be avoided unless necessary for a particular effect.
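The rule above can be expressed as a small calculation. The following Python sketch assumes that the texel size is rounded up to whole 32-bit units, which is consistent with the formats listed in the text. The function name is hypothetical, and the exact accounting on a given graphics core may differ.

```python
import math

def texture_reads_per_texel(components, bits_per_component):
    """Texture reads counted for one texel, assuming one read per 32 bits, rounded up."""
    bits_per_texel = components * bits_per_component
    return max(1, math.ceil(bits_per_texel / 32))
```

Under this assumption, an 8-bit RGBA texel (32 bits) or a two-component half float texel (32 bits) counts as one read, a three- or four-component half float texel (48 or 64 bits) counts as two, and a four-component float texel (128 bits) counts as four.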