Example Device: 500MHz G6400

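The table below shows the theoretical throughput of a 500MHz G6400 device with four USCs for various simple operations. Each figure is calculated as clock speed (GHz) x number of USCs x 16 x operations per instruction ÷ cycles per instruction.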

Data type	Operation	Operations per instruction	Cycles per instruction	Theoretical throughput of 0.5GHz, 4xUSC G6400
`16-bit float`	Sum-Of-Products	6	1	(0.5 x 4 x 16 x 6) ÷ 1 = 192 GFLOPS
`float`	Multiply-and-Add	4	1	(0.5 x 4 x 16 x 4) ÷ 1 = 128 GFLOPS
`float`	Multiply	2	1	(0.5 x 4 x 16 x 2) ÷ 1 = 64 GFLOPS
`float`	Add	2	1	(0.5 x 4 x 16 x 2) ÷ 1 = 64 GFLOPS
`float`	DivideA	1	4	(0.5 x 4 x 16 x 1) ÷ 4 = 8 GFLOPS
`float`	DivideB	1	2	(0.5 x 4 x 16 x 1) ÷ 2 = 16 GFLOPS
`int`	Multiply-and-Add	2	1	(0.5 x 4 x 16 x 2) ÷ 1 = 64 GILOPS
`int`	Multiply	1	1	(0.5 x 4 x 16 x 1) ÷ 1 = 32 GILOPS
`int`	Add	1	1	(0.5 x 4 x 16 x 1) ÷ 1 = 32 GILOPS
`int`	Divide	1	30	(0.5 x 4 x 16 x 1) ÷ 30 = 1.07 GILOPS

Note

DivideA: By default, the compiler implements float division as two range reductions followed by reciprocal and multiplication instructions, which takes four cycles.

Note

DivideB: For OpenGL GLSL shaders, or OpenCL kernels compiled with `-cl-finite-math-only` or `-cl-fast-relaxed-math`, the compiler omits the range reductions, so the division takes only two cycles.
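
The throughput figures in the table can be reproduced with a few lines of Python. This is a minimal sketch, not vendor tooling: the function name `theoretical_throughput_gops` and the parameter `lanes_per_usc` are hypothetical, and the value 16 is taken directly from the constant factor that appears in every row of the table.

```python
def theoretical_throughput_gops(clock_ghz, usc_count, ops_per_instruction,
                                cycles_per_instruction, lanes_per_usc=16):
    """Return peak throughput in giga-operations per second.

    lanes_per_usc is an assumed name for the constant 16 used in the table.
    """
    return (clock_ghz * usc_count * lanes_per_usc * ops_per_instruction
            / cycles_per_instruction)


# (data type, operation, operations per instruction, cycles per instruction)
ROWS = [
    ("16-bit float", "Sum-Of-Products", 6, 1),
    ("float", "Multiply-and-Add", 4, 1),
    ("float", "Multiply", 2, 1),
    ("float", "Add", 2, 1),
    ("float", "DivideA", 1, 4),   # default: with range reductions
    ("float", "DivideB", 1, 2),   # range reductions omitted
    ("int", "Multiply-and-Add", 2, 1),
    ("int", "Multiply", 1, 1),
    ("int", "Add", 1, 1),
    ("int", "Divide", 1, 30),
]

if __name__ == "__main__":
    # 0.5 GHz clock and four USCs, as in the example device.
    for data_type, operation, ops, cycles in ROWS:
        gops = theoretical_throughput_gops(0.5, 4, ops, cycles)
        unit = "GILOPS" if data_type == "int" else "GFLOPS"
        print(f"{data_type:<13} {operation:<17} {gops:7.2f} {unit}")
```

Changing `clock_ghz` or `usc_count` gives the equivalent peak figures for a differently configured device, under the same assumption that the per-USC factor stays at 16. These are upper bounds; real workloads are also limited by memory bandwidth, register pressure, and scheduling.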