Example Device: 500MHz G6400
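The table below shows the theoretical throughput achievable on a 500MHz G6400 device with four USCs for various simple operations. Each figure is the clock speed in GHz, multiplied by the number of USCs, by 16, and by the operations per instruction, then divided by the cycles per instruction.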
| Data type | Operation | Operations per instruction | Cycles per instruction | Theoretical throughput of 0.5GHz, 4xUSC G6400 |
|---|---|---|---|---|
| Floating point | Sum-Of-Products | 6 | 1 | (0.5 x 4 x 16 x 6) ÷ 1 = 192 GFLOPS |
| Floating point | Multiply-and-Add | 4 | 1 | (0.5 x 4 x 16 x 4) ÷ 1 = 128 GFLOPS |
| Floating point | Multiply | 2 | 1 | (0.5 x 4 x 16 x 2) ÷ 1 = 64 GFLOPS |
| Floating point | Add | 2 | 1 | (0.5 x 4 x 16 x 2) ÷ 1 = 64 GFLOPS |
| Floating point | Divide A | 1 | 4 | (0.5 x 4 x 16 x 1) ÷ 4 = 8 GFLOPS |
| Floating point | Divide B | 1 | 2 | (0.5 x 4 x 16 x 1) ÷ 2 = 16 GFLOPS |
| Integer | Multiply-and-Add | 2 | 1 | (0.5 x 4 x 16 x 2) ÷ 1 = 64 GILOPS |
| Integer | Multiply | 1 | 1 | (0.5 x 4 x 16 x 1) ÷ 1 = 32 GILOPS |
| Integer | Add | 1 | 1 | (0.5 x 4 x 16 x 1) ÷ 1 = 32 GILOPS |
| Integer | Divide | 1 | 30 | (0.5 x 4 x 16 x 1) ÷ 30 = 1.07 GILOPS |
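
The arithmetic behind each table row can be reproduced with a short script. This is a minimal sketch, not vendor code: the function name `theoretical_throughput` and the parameter `lanes_per_usc` are hypothetical, and the value 16 is taken directly from the factor that appears in every row of the table (this section does not name it, so treating it as a per-USC lane count is an assumption).

```python
def theoretical_throughput(clock_ghz, usc_count, ops_per_instruction,
                           cycles_per_instruction, lanes_per_usc=16):
    """Return peak giga-operations per second for one operation type.

    lanes_per_usc=16 mirrors the constant factor in the table; its
    interpretation as a lane count is an assumption of this sketch.
    """
    peak_ops_per_cycle = usc_count * lanes_per_usc * ops_per_instruction
    return clock_ghz * peak_ops_per_cycle / cycles_per_instruction


# (operation, operations per instruction, cycles per instruction)
ROWS = [
    ("Sum-Of-Products", 6, 1),
    ("Divide A", 1, 4),
    ("Divide B", 1, 2),
    ("Integer Divide", 1, 30),
]

if __name__ == "__main__":
    for name, ops, cycles in ROWS:
        rate = theoretical_throughput(0.5, 4, ops, cycles)
        print(f"{name}: {rate:.2f} G ops/s")
```

Because the cycle count is a divisor, a multi-cycle instruction such as the integer divide lowers throughput in direct proportion to its cycle count.
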
Note
By default, the compiler implements float division as two range reductions followed by reciprocal and multiplication instructions, requiring four cycles (Divide A in the table).
Note
For OpenGL GLSL shaders, or OpenCL kernels built with -cl-finite-math-only or -cl-fast-relaxed-math, the compiler omits the range reductions, requiring only two cycles (Divide B in the table).
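
The two division costs described in the notes can be modelled as instruction sequences. The sketch below is illustrative only: the step names are hypothetical labels rather than real instruction mnemonics, and the one-cycle-per-instruction cost is an assumption inferred from the four-cycle and two-cycle figures quoted above.

```python
# Hypothetical step labels; these are not real instruction mnemonics.
DEFAULT_DIVIDE = ["range_reduction", "range_reduction", "reciprocal", "multiply"]
RELAXED_DIVIDE = ["reciprocal", "multiply"]  # range reductions omitted


def cycles(sequence, cycles_per_step=1):
    """Total cycles, assuming every step costs the same (an assumption)."""
    return len(sequence) * cycles_per_step


def divide_gflops(sequence, clock_ghz=0.5, usc_count=4, lanes_per_usc=16):
    """Peak divide rate using the same formula as the table rows."""
    return clock_ghz * usc_count * lanes_per_usc / cycles(sequence)


if __name__ == "__main__":
    for label, seq in (("default", DEFAULT_DIVIDE), ("relaxed", RELAXED_DIVIDE)):
        print(f"{label}: {cycles(seq)} cycles, {divide_gflops(seq):.0f} GFLOPS")
```

Under these assumptions, dropping the two range reductions halves the cycle count and therefore doubles the theoretical divide throughput.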