Measuring Peak Performance
Throughput, in operations per second, is one of the most important factors in performance. The theoretical maximum throughput for an arithmetic operation can be expressed with the following equation:
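The equation itself appears to have been lost in extraction. A standard formulation of theoretical peak throughput, consistent with the per-pipe and per-USC figures given below (this reconstruction is an assumption, not the original formula), is:

```latex
\text{Peak ops/s} = \text{ops per pipe per cycle}
  \times \text{ALU pipes per USC}
  \times \text{number of USCs}
  \times f_{\text{clock}}
```

Here "ops per pipe per cycle" depends on which ALU path is used; for the dual-issue main unit described below it is 2.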
Type 1 & Type 2 cores:
Each USC has 16 ALU Pipes that run in parallel. Imagination USCs are optimised to maximise the number of scalar floating-point operations that can be processed on every cycle. Each ALU Pipe can use one of the following paths at any time:

- The main unit, which can execute up to two instructions in parallel, one of which must be a 32-bit float, the other either a 32-bit float or any integer. The supported native operations include multiply-and-add, multiply, and addition. This unit can also perform packing and test operations on the results of these, and a final move operation.
- A 16-bit float Sum-of-Products unit, which can execute two operations of the form (a * b) OP (c * d), where OP is either add, subtract, min, or max.
- A bitwise unit. The supported native operations include logic operations, logical shifts, and arithmetic shifts. In one cycle it is possible to perform up to two shifts with a logical operation in between.
- A 32-bit complex floating-point unit that can execute a single operation. The supported native operations include reciprocal, logarithm, exponent, (cardinal) sine, and arc tangent. The complex unit may require more than a single instruction to execute these - for example, requiring range reduction before executing a sine. In general, these are many times more efficient than a full software implementation.
The compiler implements other operations as routines on top of these, consuming multiple cycles.
Type 3 cores:
Each USC has 40 ALU Pipes that run in parallel. Imagination USCs are optimised to maximise the number of scalar floating-point operations that can be processed on every cycle. Each ALU Pipe can use one of the following paths at any time:

- The main unit, which can execute up to two instructions in parallel, one of which must be a 32-bit float, the other either a 32-bit float or any integer. The supported native operations include multiply-and-add, multiply, and addition. This unit can also perform packing and test operations on the results of these, and a final move operation.
- A bitwise unit. The supported native operations include logic operations, logical shifts, and arithmetic shifts. In one cycle it is possible to perform up to two shifts with a logical operation in between.
- A 32-bit complex floating-point unit that can execute a single operation. The supported native operations include reciprocal, logarithm, exponent, (cardinal) sine, and arc tangent. The complex unit may require more than a single instruction to execute these - for example, requiring range reduction before executing a sine. In general, these are many times more efficient than a full software implementation.
The compiler implements other operations as routines on top of these, consuming multiple cycles.
Type 4 cores:
Each USC has 128 ALU Pipes that run in parallel. Imagination USCs are optimised to maximise the number of scalar floating-point operations that can be processed on every cycle. Each ALU Pipe can use one of the following paths at any time:

- The main unit, which can execute up to two instructions in parallel, one of which must be a 32-bit float, the other either a 32-bit float or any integer. The supported native operations include multiply-and-add, multiply, and addition. This unit can also perform packing and test operations on the results of these, and a final move operation.
- A bitwise unit. The supported native operations include logic operations, logical shifts, and arithmetic shifts. In one cycle it is possible to perform up to two shifts with a logical operation in between.
- A 32-bit complex floating-point unit that can execute a single operation. The supported native operations include reciprocal, logarithm, exponent, (cardinal) sine, and arc tangent. The complex unit may require more than a single instruction to execute these - for example, requiring range reduction before executing a sine. In general, these are many times more efficient than a full software implementation.
The compiler implements other operations as routines on top of these, consuming multiple cycles.