Reducing Overhead with Draw*Indirect and MultiDraw*IndirectEXT


A standard OpenGL ES draw call takes its parameters through the function’s arguments. The Draw*Indirect calls instead read the draw parameters from a structure stored in a buffer object. An important benefit of this structure is that it does not have to be populated by the CPU: the graphics core can write it, for example from a compute shader through an SSBO. This enables the application to issue a draw whose parameters were generated without any CPU-side involvement.
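As a minimal sketch of what such a structure looks like, the two command layouts below follow the OpenGL ES 3.1 definitions for glDrawArraysIndirect and glDrawElementsIndirect. The helper make_arrays_command is a hypothetical name used only for illustration, and the GL calls are shown in a comment because they need a live context.

```c
#include <assert.h>
#include <stdint.h>

/* Command read by glDrawArraysIndirect (OpenGL ES 3.1 layout). */
typedef struct {
    uint32_t count;              /* number of vertices to draw */
    uint32_t instanceCount;      /* number of instances to draw */
    uint32_t first;              /* index of the first vertex */
    uint32_t reservedMustBeZero; /* must be zero in OpenGL ES */
} DrawArraysIndirectCommand;

/* Command read by glDrawElementsIndirect (OpenGL ES 3.1 layout). */
typedef struct {
    uint32_t count;              /* number of indices to draw */
    uint32_t instanceCount;      /* number of instances to draw */
    uint32_t firstIndex;         /* offset into the index buffer, in indices */
    int32_t  baseVertex;         /* value added to each index */
    uint32_t reservedMustBeZero; /* must be zero in OpenGL ES */
} DrawElementsIndirectCommand;

/* The driver reads these as tightly packed 32-bit words. */
static_assert(sizeof(DrawArraysIndirectCommand) == 16, "unexpected padding");
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "unexpected padding");

/* Hypothetical helper: fills a command on the CPU. A compute shader
 * could write the same four words through an SSBO instead. */
static DrawArraysIndirectCommand make_arrays_command(uint32_t first,
                                                     uint32_t count,
                                                     uint32_t instances)
{
    DrawArraysIndirectCommand cmd = { count, instances, first, 0u };
    return cmd;
}

/* With a GL context current, the command would be consumed roughly so:
 *
 *   glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
 *   glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof cmd, &cmd, GL_DYNAMIC_DRAW);
 *   glDrawArraysIndirect(GL_TRIANGLES, (const void *)0);  // byte offset 0
 */
```

The last argument of glDrawArraysIndirect is a byte offset into the buffer bound to GL_DRAW_INDIRECT_BUFFER, not a pointer to client memory.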

Example use case: Batched draws

For optimal performance, applications should batch draws by state to reduce the number of API calls. However, a separate draw call still has to be issued for each object in the batch, and every draw call carries CPU overhead that is better avoided. With Draw*Indirect, an SSBO can be populated with the vertex data of all draws that share the same state, so that only a single Draw*Indirect call needs to be made for the whole batch.
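The bookkeeping behind such a batch can be sketched as follows. This is only an illustration under stated assumptions: the objects are drawn as independent triangles, their vertices are already in a common space, and they are packed back to back into one shared buffer. The name pack_batch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t reservedMustBeZero;
} DrawArraysIndirectCommand;

/* Hypothetical helper: records where each object starts in the shared
 * vertex buffer and returns one command that covers the whole batch. */
static DrawArraysIndirectCommand pack_batch(const uint32_t *vertexCounts,
                                            size_t objectCount,
                                            uint32_t *firstVertexOut)
{
    uint32_t total = 0;
    assert(vertexCounts != NULL && firstVertexOut != NULL);
    for (size_t i = 0; i < objectCount; ++i) {
        firstVertexOut[i] = total;   /* object i starts after its predecessors */
        total += vertexCounts[i];
    }
    /* One instance, starting at vertex 0, spanning every packed object. */
    DrawArraysIndirectCommand cmd = { total, 1u, 0u, 0u };
    return cmd;
}
```

The per-object start offsets tell whichever stage fills the buffer (CPU or compute shader) where to write each object's vertices.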

Example use case: Particle systems

Another use case is a particle system where the developer does not want to allocate a big array for particles up front. Instead, a compute shader can determine how many particles need to be rendered each frame. In complex particle systems, particles that are obscured by opaque objects could also be removed from the render.
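The logic such a compute shader might run can be sketched on the CPU as below. This is a reference model, not shader code: the Particle type, the life field, the one-quad-per-particle layout and the name build_particle_draw are all assumptions made for illustration. On the graphics core, the increment of instanceCount would typically be an atomic add on the SSBO that backs the indirect buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t reservedMustBeZero;
} DrawArraysIndirectCommand;

/* Hypothetical particle record: position plus remaining lifetime. */
typedef struct {
    float x, y, z;
    float life;
} Particle;

/* Copies live particles to `out` and writes a command that draws one
 * instance per live particle. */
static void build_particle_draw(const Particle *in, size_t n,
                                Particle *out,
                                DrawArraysIndirectCommand *cmd)
{
    assert(in != NULL && out != NULL && cmd != NULL);
    cmd->count = 4u;             /* assumed: one 4-vertex quad per particle */
    cmd->instanceCount = 0u;     /* grows as live particles are found */
    cmd->first = 0u;
    cmd->reservedMustBeZero = 0u;
    for (size_t i = 0; i < n; ++i) {
        if (in[i].life > 0.0f) {
            out[cmd->instanceCount] = in[i];
            cmd->instanceCount += 1u;
        }
    }
}
```

Because the instance count lives in the command buffer, the CPU never needs to read it back before issuing the draw.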


The MultiDraw*IndirectEXT calls are very similar to Draw*Indirect. The key difference is that an array of Draw*IndirectCommand structures can be passed into each draw call.
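How the array is walked can be sketched as follows. In the EXT_multi_draw_indirect extension, glMultiDrawArraysIndirectEXT takes a mode, a byte offset into the indirect buffer, a drawcount and a stride, where a stride of zero means the commands are tightly packed. The helpers command_at and total_vertices are hypothetical names that only model that addressing on the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t reservedMustBeZero;
} DrawArraysIndirectCommand;

/* Hypothetical helper: locates the i-th command the way the multi-draw
 * call does. A stride of 0 means "tightly packed". */
static const DrawArraysIndirectCommand *command_at(const void *indirect,
                                                   size_t stride, size_t i)
{
    if (stride == 0) {
        stride = sizeof(DrawArraysIndirectCommand);
    }
    assert(stride % 4 == 0);  /* the extension requires a multiple of 4 */
    return (const DrawArraysIndirectCommand *)
        ((const unsigned char *)indirect + i * stride);
}

/* Sums the vertices the whole array would submit; each element behaves
 * like one separate glDrawArraysIndirect call. */
static uint64_t total_vertices(const void *indirect, size_t drawcount,
                               size_t stride)
{
    uint64_t total = 0;
    for (size_t i = 0; i < drawcount; ++i) {
        const DrawArraysIndirectCommand *c = command_at(indirect, stride, i);
        total += (uint64_t)c->count * c->instanceCount;
    }
    return total;
}

/* With a GL context current and the extension available:
 *
 *   glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
 *   glMultiDrawArraysIndirectEXT(GL_TRIANGLES, (const void *)0,
 *                                drawcount, 0);
 */
```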

Example use case: Occlusion culling

In complex 3D navigation systems, draw calls tend to be grouped by map tiles. If a map tile intersects the view frustum, all draw calls within the tile are issued to the graphics core. This can be optimised with occlusion queries to further reduce the number of draw calls that are issued. MultiDraw*IndirectEXT offers a better option than occlusion queries: a compute shader can populate an array of Draw*IndirectCommand structures, which can then be used to issue a single draw call for many objects sitting in many different tiles.
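One common way to build that array, sketched here on the CPU, is to keep one command per object and set instanceCount to zero for objects that fail the visibility test, so the draw count passed by the application never changes. This approach is an assumption, since the text above does not say how the array is populated. The visibility flags stand in for whatever test the compute shader performs, and cull_commands is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t reservedMustBeZero;
} DrawArraysIndirectCommand;

/* Hypothetical helper: disables the commands of hidden objects in place
 * and returns how many objects remain visible. visible[i] is non-zero
 * when object i passed the (unspecified) occlusion test. */
static size_t cull_commands(DrawArraysIndirectCommand *cmds,
                            const uint8_t *visible, size_t objectCount)
{
    size_t kept = 0;
    assert(cmds != NULL && visible != NULL);
    for (size_t i = 0; i < objectCount; ++i) {
        if (visible[i]) {
            cmds[i].instanceCount = 1u;  /* drawn as usual */
            ++kept;
        } else {
            cmds[i].instanceCount = 0u;  /* command becomes a no-op */
        }
    }
    return kept;
}
```

A command with zero instances draws nothing, so a single multi-draw call over the full array renders only the visible objects.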


Instancing is extremely useful for drawing many hundreds or thousands of objects that share the same vertex data but have different world transformations.

Consider the example of drawing many thousands of leaf objects that are very simple in terms of geometry. With the non-instanced approach, the application would need to loop X times, calling glDraw* on the same object each time. This is extremely expensive in terms of API overhead, even though the geometry is simple. Every time a draw call is issued, the CPU must spend time instructing the graphics core about how to draw the object. The actual rendering may be extremely fast, but the API overhead cripples performance.

In the same scenario using the instanced approach, the application needs to call glDraw*Instanced only once to draw the object X times. The instanced function behaves almost identically to glDraw* but takes an extra parameter, primcount, which tells the graphics core how many instances of the object it should render. This approach results in significantly less API overhead.
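The instanced path can be sketched as follows. The per-instance data here is a simple position offset for each leaf; the layout, the spacing parameter and the name fill_leaf_offsets are assumptions for illustration. The GL calls are shown in a comment because they need a live context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes one xyz offset per leaf instance, placing
 * the leaves in a row along the x axis. */
static void fill_leaf_offsets(float *xyz, size_t instances, float spacing)
{
    assert(xyz != NULL);
    for (size_t i = 0; i < instances; ++i) {
        xyz[3 * i + 0] = (float)i * spacing;
        xyz[3 * i + 1] = 0.0f;
        xyz[3 * i + 2] = 0.0f;
    }
}

/* With a GL context current, the offsets would be uploaded to a vertex
 * buffer and consumed once per instance:
 *
 *   glVertexAttribPointer(offsetAttrib, 3, GL_FLOAT, GL_FALSE, 0, 0);
 *   glEnableVertexAttribArray(offsetAttrib);
 *   glVertexAttribDivisor(offsetAttrib, 1);   // advance once per instance
 *   glDrawArraysInstanced(GL_TRIANGLES, 0, verticesPerLeaf, instances);
 *
 * instead of the non-instanced loop:
 *
 *   for (i = 0; i < instances; ++i) {
 *       // set per-leaf uniform, then:
 *       glDrawArrays(GL_TRIANGLES, 0, verticesPerLeaf);
 *   }
 */
```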

To achieve optimal performance when implementing instancing, use a power-of-two instance divisor wherever possible. Doing so reduces the number of instructions required to stream the data to the unified shader cores (USCs), removing a potential bottleneck.
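A debug-time check for this recommendation could look like the sketch below. The function name is hypothetical; the divisor it checks is the value an application would pass to glVertexAttribDivisor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the instance divisor is a power
 * of two (1, 2, 4, 8, ...), 0 otherwise. A divisor of 0 means the
 * attribute is not instanced, so it is reported as 0 here. */
static int is_power_of_two_divisor(uint32_t divisor)
{
    return divisor != 0u && (divisor & (divisor - 1u)) == 0u;
}
```

A power of two has exactly one bit set, so clearing the lowest set bit with divisor & (divisor - 1) leaves zero.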