This section compares Vulkan to OpenGL ES, the current leading mobile API, and describes the problems associated with it. OpenGL ES is over 12 years old, and OpenGL, the API it is based on, was designed over two decades ago. The hardware that OpenGL was designed for is far outdated by modern standards.

The State Machine

OpenGL is a large global state machine, and every operation takes into account various pieces of the current state, such as blend modes, current shaders, depth test information, and so forth. Because each piece of state is set through a simple function call, it looks like a switch that can be flipped at will with little consequence, but this is simply not true on modern hardware. For example, a lot of the state is translated into shader code.

As discussed in Scaling to Multiple Threads, shader patching causes hitching and increases render-time CPU usage. There is another issue that has not been raised yet, however: inefficiencies in the shader itself. If a state must be patched onto a shader, this happens after optimised compilation, meaning the patch is effectively tacked onto the rest of the shader. Had the state been known at compilation time, it could have been compiled in as part of the shader, avoiding a few instructions. A driver might do a background recompile to combat this, though this is a problem in and of itself as it costs additional CPU time.
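The cost of patching can be illustrated with a toy model. The following is a minimal sketch, not how any real driver or shader compiler works: the "instructions" are plain strings, and the names `compile_shader`, `patch_blend_state`, `export_colour`, and `import_colour` are hypothetical. It assumes that a patched epilogue needs a hand-over step between the compiled shader and the patch, whereas a shader compiled with the blend state known up front does not.

```python
# Toy model only: instructions are strings, not a real GPU instruction set.
# All function and instruction names here are hypothetical.

def compile_shader(body, blend_enabled=None):
    """Compile `body`; blend_enabled=None means the state is not yet known."""
    instructions = list(body)
    if blend_enabled is None:
        # State unknown: leave the colour in a hand-over slot for a later patch.
        instructions.append("export_colour")
    else:
        # State known: emit only the output path that is actually needed.
        if blend_enabled:
            instructions += ["load_dst", "blend"]
        instructions.append("store_colour")
    return instructions


def patch_blend_state(compiled, blend_enabled):
    """Tack a blend epilogue onto an already-compiled shader."""
    epilogue = ["import_colour"]  # pick the colour back up from the hand-over slot
    if blend_enabled:
        epilogue += ["load_dst", "blend"]
    epilogue.append("store_colour")
    return compiled + epilogue


body = ["sample_texture", "mul_colour"]

# Path 1: state known at compile time.
specialised = compile_shader(body, blend_enabled=True)

# Path 2: state arrives at draw time and is patched on afterwards.
patched = patch_blend_state(compile_shader(body), blend_enabled=True)

extra_instructions = len(patched) - len(specialised)
print("specialised:", specialised)
print("patched:    ", patched)
print("extra instructions from patching:", extra_instructions)
```

In this model the patched shader always carries the hand-over instructions that the specialised shader avoids. The numbers themselves are meaningless; the point is only that the patch sits outside the optimised code.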

Implicit Synchronisation

OpenGL ES tends to assume that a wide variety of things implicitly synchronise with each other. Only recently has anything been assumed to work asynchronously in any form, thanks to the introduction of incoherent memory accesses, fences, compute shaders, and the associated side effects. Large portions of the API appear to "just work", when in fact this comes down to a lot of resource tracking, cache flushing, and dependency chain construction behind the scenes.

A driver is unlikely to be able to detect with pinpoint accuracy which dependencies are actually being used, so it has to be quite conservative in order to remain a functional OpenGL ES implementation. As a result, caches will inevitably be flushed unnecessarily or work will be serialised unnecessarily, and the hardware does more work than it needs to.
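The effect of conservative tracking can be sketched with a small simulation. This is an illustrative model under stated assumptions, not a description of any real driver: commands are hypothetical `(op, buffer, start, end)` tuples over byte ranges, a "flush" makes all pending writes visible, and the conservative tracker only remembers that a buffer has been written, not which part of it.

```python
def count_flushes(commands, exact):
    """Count the flushes needed for a list of (op, buffer, start, end) commands.

    exact=True  : flush only when a read overlaps a pending written range.
    exact=False : flush whenever a read touches a buffer with any pending write.
    """
    pending = {}  # buffer name -> list of (start, end) ranges written, not yet flushed
    flushes = 0
    for op, buf, start, end in commands:
        if op == "write":
            pending.setdefault(buf, []).append((start, end))
        elif op == "read":
            ranges = pending.get(buf, [])
            if exact:
                # Half-open ranges overlap when each starts before the other ends.
                hazard = any(start < w_end and w_start < end
                             for w_start, w_end in ranges)
            else:
                hazard = bool(ranges)
            if hazard:
                flushes += 1
                pending.clear()  # a flush makes every pending write visible
    return flushes


# Hypothetical command stream: only the final read truly depends on a write.
commands = [
    ("write", "vbo", 0, 64),
    ("read", "vbo", 64, 128),   # touches a different part of the buffer
    ("write", "vbo", 128, 192),
    ("read", "vbo", 0, 64),     # genuinely depends on the first write
]

conservative_flushes = count_flushes(commands, exact=False)
exact_flushes = count_flushes(commands, exact=True)
print("conservative:", conservative_flushes, "exact:", exact_flushes)
```

The conservative tracker flushes on the first read even though it does not overlap any pending write. That is the unnecessary work described above.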

Immediate Mode

Commands specified in OpenGL ES are assumed to execute, start to finish, in the exact order they were submitted. A single command, such as a draw call, is treated as a single whole unit of work, with each unit being queued up on the GPU as it is specified. This behaviour is roughly known as immediate mode execution, which means that every piece of work specified is, in some form, immediately sent to the GPU for processing.

Immediate Mode Rendering (IMR) architectures have mapped to this way of thinking well in the past, but modern IMRs tend to batch work together in order to increase performance. In contrast, Tile-Based Renderers (TBRs) and Tile-Based Deferred Renderers (TBDRs) have never really functioned this way, and they are by far the most prolific GPU architecture types in use today. For these kinds of architecture, the main unit of work is much larger and is best described as a render pass: a collection of draw calls targeted at the same framebuffer.

Figure 1: Tile-Based Deferred Rendering Architecture

TBRs and TBDRs both feature two-stage rendering, with an early phase that processes geometry and sorts it into screen-space tiles. The second phase rasterizes each of these tiles, keeping an entire tile's worth of framebuffer completely on-chip, saving an enormous amount of bandwidth.
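The first phase can be sketched as a simple binning step. The following is a minimal illustration that assumes bounding-box binning and a 32-pixel tile; real hardware uses its own tile sizes and more precise tests, and the names `bin_triangles` and `TILE_SIZE` are hypothetical.

```python
TILE_SIZE = 32  # pixels per tile edge (an assumed value for illustration)


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Return {(tile_x, tile_y): [triangle indices]} for a list of triangles.

    Each triangle is three (x, y) screen-space vertices. Binning uses the
    triangle's bounding box, so a tile may be listed even if the triangle
    only comes close to it.
    """
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the tiles that exist on screen.
        first_tx = max(int(min(xs)) // tile_size, 0)
        last_tx = min(int(max(xs)) // tile_size, tiles_x - 1)
        first_ty = max(int(min(ys)) // tile_size, 0)
        last_ty = min(int(max(ys)) // tile_size, tiles_y - 1)
        for ty in range(first_ty, last_ty + 1):
            for tx in range(first_tx, last_tx + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


# A 64x64 target gives a 2x2 grid of tiles.
triangles = [
    [(2, 2), (20, 4), (8, 25)],       # small: fits inside tile (0, 0)
    [(10, 10), (60, 12), (30, 58)],   # large: bounding box spans all four tiles
]
bins = bin_triangles(triangles, width=64, height=64)
for tile in sorted(bins):
    print(tile, bins[tile])
```

Here one triangle lands in every tile, and tile (0, 0) holds work from both triangles. Once geometry has been sorted this way, rasterization proceeds per tile rather than per draw.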

The key point is that during the rasterization phase, a draw call is no longer a meaningful unit: a single draw call may result in rasterization tasks across many tiles, and each tile consists of work from multiple draw calls. If something causes a flush between draws, it splits the entire render in two, requiring the full set of tiles to be rendered twice. At the start and end of each tiling pass, framebuffer data must be loaded into the tiles and subsequently stored back to memory. When this happens often enough, the benefit of a tile-based architecture begins to be lost, because the bandwidth it was designed to save is spent on these extra loads and stores.
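A rough worked example shows how quickly this adds up. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a measurement: it counts only a single colour buffer, assumes the first tiling pass starts from a clear and so needs no load, and assumes every mid-frame flush forces one store followed by one load of the whole framebuffer. The function name `framebuffer_traffic_bytes` is hypothetical.

```python
def framebuffer_traffic_bytes(width, height, bytes_per_pixel, mid_frame_flushes):
    """Estimate external-memory traffic for one frame's colour buffer.

    Assumes every tiling pass ends with a full store, and every pass after
    the first begins with a full load (the first pass starts from a clear).
    """
    framebuffer_bytes = width * height * bytes_per_pixel
    passes = mid_frame_flushes + 1
    stores = passes
    loads = passes - 1
    return (stores + loads) * framebuffer_bytes


# 1080p target with 4 bytes per pixel (for example, 8-bit RGBA).
ideal = framebuffer_traffic_bytes(1920, 1080, 4, mid_frame_flushes=0)
split = framebuffer_traffic_bytes(1920, 1080, 4, mid_frame_flushes=3)

print("no flushes:   ", ideal, "bytes")
print("three flushes:", split, "bytes")
print("ratio:        ", split // ideal)
```

Under these assumptions, three mid-frame flushes multiply the colour buffer traffic by seven, which at 60 frames per second is several gigabytes per second for one render target.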

Ultimately, modern hardware prefers to batch work together, and submitting each draw individually leads to inefficiencies. There are several operations in OpenGL ES that force a driver to submit work in this way, and not all of them are obvious. In the case of a TBR or TBDR, this leads to unnecessary bandwidth usage that the driver can do little about.