Single Instruction, Multiple Data
Typical CPUs are optimised to execute large, heavily branched tasks on a few pieces of data at a time. A thread running on a CPU is often unique and is executed largely independently of all others: any given processing element will process just a single thread. For a specific program, the number of threads on a CPU is commonly between one and eight, and rarely more than a few tens at any given time.
Graphics Cores work on the principle that the exact same piece of code will be executed in multiple threads, often numbering into the millions, to handle the large screen resolutions of modern devices. These threads differ only in their input and normally follow the exact same steps, instruction for instruction. To do this efficiently, each processor executes the same instruction on multiple threads concurrently, in a form of Single Instruction, Multiple Data (SIMD) processing.
This should not be confused with vector processing, which is another form of SIMD: SIMD processors may be either scalar, operating on one element at a time, or vector, operating on multiple elements at a time.
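As a rough illustration (sketched here in CUDA, though the same idea applies to any SIMD-style GPU programming model), the hypothetical kernel below is written once and executed by every thread; only the per-thread index, and therefore the data element, differs:

```
// Every thread runs this same code in lockstep; only the data differs.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique per-thread index
    if (i < n)                      // guard threads beyond the end of the data
        data[i] *= factor;          // same instruction, different element
}
```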
Parallelism
The main advantage of this type of architecture is that, for a correctly structured application, significant numbers of threads can be run in parallel, and this parallelism is achieved with extremely high efficiency. SIMD architectures are usually capable of running orders of magnitude more threads at once than a typical CPU.
They perform exceptionally well when operating on large, coherent datasets, which is why they are traditionally used for graphics and image processing. Algorithms that operate independently on each element of such data are well suited to this sort of processor.
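To give a sense of scale, here is a minimal launch sketch for the hypothetical scale() kernel above, assuming d_pixels already holds a 1920×1080 greyscale image in device memory. This launches roughly two million threads, one per pixel, all executing identical code:

```
#include <cuda_runtime.h>

void brighten(float* d_pixels, int width, int height)
{
    int n = width * height;                    // ~2 million for 1920x1080
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // ceiling division
    scale<<<blocks, threadsPerBlock>>>(d_pixels, 1.2f, n);    // one thread per pixel
}
```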
Divergence – Dynamic Branching
With each thread executing on one processor core, thread divergence must be handled differently to how it works on a typical CPU. A CPU is designed with the assumption of a sequential pipeline that will be branching, looping, doubling back, and otherwise following extremely complex flow control within a very deep call stack. Modern CPUs are heavily optimised for this kind of work, with sophisticated branch prediction and other features aimed at accelerating exactly this kind of program.
A typical Graphics Core is optimised with the assumption that every thread in a “batch” will execute the exact same code, instruction for instruction (in lockstep), without any divergence whatsoever. Additionally, it is expected, though not required, that the input will be coherent (i.e. only slightly different from one thread to the next).
Optimising the design for these assumptions leads to the aforementioned efficiency in these cases. The downside is that when branches are encountered (if, while, dynamic for), each thread cannot simply execute a different path. The typical implementation is to have the processing core execute every taken path of a branch in every thread, disabling the threads that did not take the current path; any path taken by no thread can be skipped entirely. The actual mechanisms for this may differ between architectures, but the net effect is the same.
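A hypothetical CUDA sketch of this effect (pathA() and pathB() are stand-in device functions, not real API calls): when threads in the same batch disagree on the condition, the hardware runs both paths, masking off the threads that did not take the current one:

```
__device__ float pathA(float x) { return x * x + 1.0f; }      // stand-in work
__device__ float pathB(float x) { return 1.0f / (x + 1.0f); } // stand-in work

__global__ void diverge(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // If the batch disagrees here, it executes pathA with the pathB
    // threads disabled, then pathB with the pathA threads disabled,
    // paying for both. If all threads agree, only one path runs.
    if (in[i] > 0.5f)
        out[i] = pathA(in[i]);
    else
        out[i] = pathB(in[i]);
}
```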
In terms of computation, the cost of executing branching code is the cost of executing every taken path (paths taken by no thread can be skipped) plus the branching instructions themselves. It is therefore necessary to be wary of excessive branching when executing on a SIMD processor. There are exceptions to this: one of the most common and notable cases is conditionally loading or storing data from main memory, where it is almost always worthwhile to branch, as the bandwidth saved usually vastly outweighs the branch cost.
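A sketch of that exception, again in CUDA and with hypothetical names: threads that fail the test never issue the global-memory write, so the branch trades a few instructions for a potentially large bandwidth saving:

```
__global__ void conditionalStore(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i] * 2.0f;

    // Threads where v <= 1.0f skip the store entirely; the memory
    // bandwidth saved usually far outweighs the branch overhead.
    if (v > 1.0f)
        out[i] = v;
}
```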