Measuring Peak Performance

Throughput, measured in operations per second, is one of the most important factors for performance. The theoretical maximum throughput for an arithmetic operation can be expressed as the following equation:

[Equation image: ../_images/measuring-peak-performance.svg]
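
The equation itself is only available as an image, so the following is a minimal sketch of one plausible form of it rather than the documented formula. It assumes that peak throughput is the product of the number of USCs, the ALU pipes per USC, the operations each pipe can issue per cycle, and the clock frequency. The function name, the parameter names, and the example values (USC count and clock speed in particular) are hypothetical.

```python
def peak_ops_per_second(usc_count, pipes_per_usc, ops_per_pipe_per_cycle, clock_hz):
    """Upper bound on operations per second, assuming every ALU pipe
    issues its maximum number of operations on every clock cycle."""
    return usc_count * pipes_per_usc * ops_per_pipe_per_cycle * clock_hz


# Hypothetical configuration, chosen only for illustration:
# 2 USCs, 16 ALU pipes each, 2 operations per pipe per cycle, 500 MHz clock.
example = peak_ops_per_second(
    usc_count=2,
    pipes_per_usc=16,
    ops_per_pipe_per_cycle=2,
    clock_hz=500_000_000,
)
print(f"{example / 1e9:.1f} G operations per second")
```

A real workload is unlikely to reach this bound, because it assumes that no pipe ever stalls and that every cycle issues the maximum number of operations.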

Type 1 & Type 2 cores:

Each USC has 16 ALU Pipes that run in parallel. Imagination USCs are optimised to maximise the number of scalar floating-point operations that can be processed on every cycle. Each ALU Pipe can use one of the following paths at any time:

  • The main unit, which can execute up to two instructions in parallel: one must be a 32-bit float operation, and the other either a 32-bit float or any integer operation. The supported native operations include multiply-and-add, multiply, and addition. This unit can also perform packing and test operations on these results, as well as a final move operation.

  • A 16-bit float Sum-of-Products unit, which can execute two operations of the form (a * b) OP (c * d), where OP is either add, subtract, min, or max.

  • A bitwise unit. The supported native operations include logic operations, logical shifts and arithmetic shifts. In one cycle it is possible to perform up to two shifts with a logical operation in between.

  • A 32-bit complex floating-point unit that can execute a single operation. The supported native operations include reciprocal, logarithm, exponent, (cardinal) sine, and arc tangent. Some of these may need more than one instruction on the complex unit; for example, a sine requires range reduction before it executes. In general, they are still many times more efficient than a full software implementation.

The compiler implements other operations as routines on top of these, consuming multiple cycles.
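
The operation shape handled by the 16-bit float Sum-of-Products unit, (a * b) OP (c * d), can be sketched as follows. This only models the arithmetic form. It uses ordinary Python floats, so it does not reproduce 16-bit precision or the hardware's rounding behaviour, and the function and table names are hypothetical.

```python
import operator

# The four combining operations listed for the Sum-of-Products unit.
SOP_COMBINERS = {
    "add": operator.add,
    "sub": operator.sub,
    "min": min,
    "max": max,
}


def sum_of_products(a, b, c, d, op):
    """Evaluate (a * b) OP (c * d) for OP in add, sub, min, max."""
    return SOP_COMBINERS[op](a * b, c * d)


# Each call corresponds to one operation of the unit; the text states that
# the unit can execute two such operations.
dot2 = sum_of_products(2.0, 3.0, 4.0, 5.0, "add")    # (2*3) + (4*5)
smaller = sum_of_products(2.0, 3.0, 4.0, 5.0, "min")  # min(2*3, 4*5)
```

With "add", the form is a two-component dot product, which is one reason shader code that can tolerate 16-bit precision may benefit from this path on Type 1 and Type 2 cores.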

Type 3 cores:

Each USC has 40 ALU Pipes that run in parallel. Imagination USCs are optimised to maximise the number of scalar floating-point operations that can be processed on every cycle. Each ALU Pipe can use one of the following paths at any time:

  • The main unit, which can execute up to two instructions in parallel: one must be a 32-bit float operation, and the other either a 32-bit float or any integer operation. The supported native operations include multiply-and-add, multiply, and addition. This unit can also perform packing and test operations on these results, as well as a final move operation.

  • A bitwise unit. The supported native operations include logic operations, logical shifts and arithmetic shifts. In one cycle it is possible to perform up to two shifts with a logical operation in between.

  • A 32-bit complex floating-point unit that can execute a single operation. The supported native operations include reciprocal, logarithm, exponent, (cardinal) sine, and arc tangent. Some of these may need more than one instruction on the complex unit; for example, a sine requires range reduction before it executes. In general, they are still many times more efficient than a full software implementation.

The compiler implements other operations as routines on top of these, consuming multiple cycles.
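
The bitwise unit's pattern of "two shifts with a logical operation in between" can be illustrated with a small sketch. It models 32-bit unsigned values in Python and shows only the shape of the expression. Whether a compiler actually maps a given source expression onto a single bitwise-unit cycle is an assumption, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF  # keep intermediate values within 32 bits


def shift_logic_shift(x, left_shift, and_mask, right_shift):
    """Shift left, apply a logical AND, then shift right (logical)."""
    shifted = (x << left_shift) & MASK32  # first shift
    combined = shifted & and_mask         # logical operation in between
    return combined >> right_shift        # second shift


# Example: move a byte up by 8 bits, keep one nibble, bring it back down.
nibble = shift_logic_shift(0x000000FF, 8, 0x0000F000, 12)
```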

Type 4 cores:

Each USC has 128 ALU Pipes that run in parallel. Imagination USCs are optimised to maximise the number of scalar floating-point operations that can be processed on every cycle. Each ALU Pipe can use one of the following paths at any time:

  • The main unit, which can execute up to two instructions in parallel: one must be a 32-bit float operation, and the other either a 32-bit float or any integer operation. The supported native operations include multiply-and-add, multiply, and addition. This unit can also perform packing and test operations on these results, as well as a final move operation.

  • A bitwise unit. The supported native operations include logic operations, logical shifts and arithmetic shifts. In one cycle it is possible to perform up to two shifts with a logical operation in between.

  • A 32-bit complex floating-point unit that can execute a single operation. The supported native operations include reciprocal, logarithm, exponent, (cardinal) sine, and arc tangent. Some of these may need more than one instruction on the complex unit; for example, a sine requires range reduction before it executes. In general, they are still many times more efficient than a full software implementation.

The compiler implements other operations as routines on top of these, consuming multiple cycles.
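
To compare the core types, the ALU pipe counts given above can be combined with the main unit's limit of two instructions per pipe per cycle. The sketch below computes the resulting per-USC upper bound on main-unit instructions per cycle. The dictionary and function names are hypothetical, and the figure is a ceiling that assumes every pipe uses the main unit and dual-issues on every cycle.

```python
# ALU pipes per USC, as stated for each core type.
ALU_PIPES_PER_USC = {
    "Type 1 & Type 2": 16,
    "Type 3": 40,
    "Type 4": 128,
}

# The main unit can execute up to two instructions in parallel.
MAIN_UNIT_MAX_INSTRUCTIONS = 2


def main_unit_instructions_per_cycle(core_type):
    """Upper bound on main-unit instructions per USC per cycle."""
    return ALU_PIPES_PER_USC[core_type] * MAIN_UNIT_MAX_INSTRUCTIONS


for core_type in ALU_PIPES_PER_USC:
    print(core_type, main_unit_instructions_per_cycle(core_type))
```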