Vulkan Mechanisms for Scaling

Vulkan does not automatically scale to multiple cores. As an explicit API, it instead provides the mechanisms that applications need to scale across cores themselves. Some of these are API design choices; others are objects that explicitly cater to multi-threading.

No Global State

In OpenGL ES, the driver often has to look up the context tied to the current thread whenever a function is called. This lookup uses a form of thread-local storage, which adds a cost to every such call. With OpenGL ES's bind-to-edit model, the majority of functions then require further lookups within the context to obtain the currently bound object, adding yet more overhead.

Vulkan has no mutable global state. Instead, whenever a function operates on an object, that object is passed as an argument to the function. This way, Vulkan avoids global lookups and locks on global state.
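The difference between the two calling styles can be illustrated with a minimal sketch in plain C++. This is an analogy only: Buffer, Context, bind_buffer, set_size_bound and set_size_explicit are hypothetical names invented for the example, not functions from OpenGL ES or Vulkan.

```cpp
#include <cassert>

// Hypothetical stand-in for a driver-side object; not a real API type.
struct Buffer {
    int size = 0;
};

// Bind-to-edit style: a per-thread "current context" holds the bound object.
struct Context {
    Buffer* bound = nullptr;
};
thread_local Context g_current_context;

void bind_buffer(Buffer* buffer) { g_current_context.bound = buffer; }

// Every call first finds the thread's context, then the bound object.
void set_size_bound(int size) {
    Context& ctx = g_current_context;  // lookup 1: thread-local context
    assert(ctx.bound != nullptr);      // lookup 2: currently bound object
    ctx.bound->size = size;
}

// Explicit style: the object is an argument, so no lookup is needed.
void set_size_explicit(Buffer& buffer, int size) { buffer.size = size; }
```

In the explicit version the function touches nothing except the object it was given, which is why two threads working on two different objects cannot interfere with each other.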

External Synchronisation

Modifying anything at any point in OpenGL ES is thread-safe, at least as far as the CPU is concerned. Though this sounds useful, it means that the driver is forced to jump through hoops to avoid race conditions, whether the application is multi-threaded or not. There is often a mutex lock around any function that causes such modifications, which is one of the more subtle causes of high overhead in OpenGL ES. Since the driver has little information about how an application is going to access state, at least without resorting to heuristics, it can do little other than run conservatively and lock everything down.

On the other hand, Vulkan only guarantees concurrent read access to objects and state. Should any thread modify an object, the application has to ensure that no other access to the same object occurs concurrently. Thanks to careful API design choices, it is usually unnecessary to modify the same object from multiple threads at any given time. Should an application need to modify the same object on multiple threads, it has to use its own synchronisation mechanisms to avoid race conditions and data hazards.

By passing the synchronisation issue to developers, Vulkan gives applications the opportunity to do better than a driver could. Synchronisation can commonly be done between threads at set communication points, possibly without the need for a mutex, which saves a lot of potential idle time. In other words, because it is handled by the application, the expensive work of synchronisation is only done when required rather than conservatively on every call.
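The two situations described above can be sketched in plain C++. This is an analogy under stated assumptions rather than Vulkan code: Recorder, record_per_thread and record_shared are hypothetical names, and Recorder simply stands in for any object that does no locking of its own.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical object with no internal locking, like an externally
// synchronised API object: concurrent writes to one instance would race.
struct Recorder {
    std::vector<int> entries;
    void record(int value) { entries.push_back(value); }
};

// Pattern 1: one Recorder per thread. No thread touches another thread's
// object, so no mutex is needed; join() is the only synchronisation point.
std::vector<Recorder> record_per_thread(int thread_count, int per_thread) {
    assert(thread_count > 0);
    std::vector<Recorder> recorders(thread_count);
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&recorders, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) recorders[t].record(i);
        });
    }
    for (auto& worker : workers) worker.join();
    return recorders;
}

// Pattern 2: the threads really do share one Recorder, so the application
// supplies the lock itself.
Recorder record_shared(int thread_count, int per_thread) {
    assert(thread_count > 0);
    Recorder shared;
    std::mutex guard;  // owned by the application, not by Recorder
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&shared, &guard, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(guard);
                shared.record(i);
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return shared;
}
```

Only the second pattern pays for a lock on every call. An object that locked internally would impose that cost on the first pattern as well, even though it never needs it.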

Multi-Threaded Command Generation

It is important to keep in mind that OpenGL ES makes no distinction between command generation and command submission. Whenever a glDraw* call (such as glDrawArrays or glDrawElements) is made, it both translates all the current state into something suitable for the hardware to consume and submits that information for execution. Generating commands is an expensive operation, worsened by the various inefficiencies present in OpenGL ES. Since all submissions need to be serialised, all the command generation has to be done one thread at a time.

In contrast, Vulkan completely separates generation from submission. Commands are first recorded into command buffer objects and then submitted to hardware queues at a later time. Thanks to this, applications can record command buffers on worker threads, with submission being a relatively cheap CPU operation that can be performed on a single thread (normally the main rendering thread) with low impact. As a result, Vulkan can achieve a much more efficient division of work, with recording designed to scale across multiple threads without incurring additional processing costs.
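The record-on-workers, submit-on-one-thread pattern can be sketched in plain C++ as follows. This is a simplified model, not Vulkan code: Command, CommandBuffer, Queue, record_draws, submit and render_frame are hypothetical names chosen for illustration, and a real application would record into Vulkan command buffers and submit them to a Vulkan queue instead.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-ins; these are not Vulkan types.
struct Command { int draw_id; };
struct CommandBuffer { std::vector<Command> commands; };
struct Queue { std::vector<Command> executed; };

// Expensive step: turn a range of draws into commands. Each worker gets
// its own CommandBuffer, so recording needs no locking.
void record_draws(CommandBuffer& buffer, int first, int count) {
    for (int i = 0; i < count; ++i) buffer.commands.push_back({first + i});
}

// Cheap step: hand the finished buffers to the queue, in order, from a
// single thread.
void submit(Queue& queue, const std::vector<CommandBuffer>& buffers) {
    for (const auto& buffer : buffers) {
        queue.executed.insert(queue.executed.end(),
                              buffer.commands.begin(), buffer.commands.end());
    }
}

Queue render_frame(int worker_count, int draws_per_worker) {
    assert(worker_count > 0);
    std::vector<CommandBuffer> buffers(worker_count);
    std::vector<std::thread> workers;
    for (int w = 0; w < worker_count; ++w) {
        workers.emplace_back(record_draws, std::ref(buffers[w]),
                             w * draws_per_worker, draws_per_worker);
    }
    for (auto& worker : workers) worker.join();  // recording is complete

    Queue queue;
    submit(queue, buffers);  // single-threaded submission
    return queue;
}
```

Because the buffers are submitted in index order after all workers have joined, the queue sees the draws in a deterministic order regardless of which worker finished first.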

Figure 1: Vulkan Multi-Threading

Recording commands has one cost that could have made it difficult to scale. Command buffers require memory to record into, and it is not practical to know in advance how much memory is required. Allocating memory at a base level is a global operation, which requires a lock of some kind to block other threads. To alleviate this, command buffers employ a few strategies to avoid going to the system for memory:

  • Command buffers can be reset, allowing them to be re-recorded without freeing any memory that has been allocated. If the allocation size in a single command buffer remains stable, there is no need for memory to be allocated from the system every frame; only the first frame pays the cost.
  • Command pools allow a group of command buffers to share a larger allocation. If the workload per command buffer varies from frame to frame but the workload per pool doesn't, then a level of stability is provided without every buffer enormously over-allocating.
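The effect of these two strategies can be modelled with a small C++ sketch. It is only an illustration of the idea: this CommandPool class and the record_frame helper are hypothetical, a std::vector stands in for driver-managed memory, and growing that vector stands in for a request to the system allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a command pool: one growable arena shared by all
// command buffers allocated from it. Not a Vulkan type.
class CommandPool {
public:
    // Reserve space for `count` commands and return the offset of the first.
    std::size_t allocate(std::size_t count) {
        std::size_t offset = used_;
        used_ += count;
        if (used_ > arena_.size()) {
            arena_.resize(used_);  // stands in for a system allocation
            ++system_allocations_;
        }
        assert(used_ <= arena_.size());
        return offset;
    }

    // Reset: mark everything free but keep the arena for the next frame.
    void reset() { used_ = 0; }

    std::size_t system_allocations() const { return system_allocations_; }
    std::size_t capacity() const { return arena_.size(); }

private:
    std::vector<int> arena_;
    std::size_t used_ = 0;
    std::size_t system_allocations_ = 0;
};

// Record one frame: each entry in `sizes` is one command buffer's workload.
void record_frame(CommandPool& pool, const std::vector<std::size_t>& sizes) {
    for (std::size_t size : sizes) pool.allocate(size);
}
```

If one frame records buffers of 10, 30 and 20 commands and the next, after a reset, records 40, 5 and 15, the per-buffer sizes differ but the pool total stays the same. In this model the second frame therefore triggers no further growth of the arena.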

The strategies highlighted here permeate the entire API. Vulkan offers greater scalability than OpenGL ES, and always offers the option of using multiple threads. In practice, this means that on devices with more cores, developers can manage threads better and use rendering strategies that were not previously possible. Together, these lead to improved efficiency and better performance for applications that would otherwise max out a single core.