Example Device: 500MHz G6400

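The table below shows the theoretical throughput of a 500MHz G6400 device with four USCs for various simple operations. Each figure is calculated as clock frequency (GHz) x number of USCs x 16 x operations per instruction ÷ cycles per instruction.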

| Data type | Operation | Operations per instruction | Cycles per instruction | Theoretical throughput of 0.5GHz, 4xUSC G6400 |
|---|---|---|---|---|
| 16-bit float | Sum-Of-Products | 6 | 1 | (0.5 x 4 x 16 x 6) ÷ 1 = 192 GFLOPS |
| float | Multiply-and-Add | 4 | 1 | (0.5 x 4 x 16 x 4) ÷ 1 = 128 GFLOPS |
| float | Multiply | 2 | 1 | (0.5 x 4 x 16 x 2) ÷ 1 = 64 GFLOPS |
| float | Add | 2 | 1 | (0.5 x 4 x 16 x 2) ÷ 1 = 64 GFLOPS |
| float | DivideA | 1 | 4 | (0.5 x 4 x 16 x 1) ÷ 4 = 8 GFLOPS |
| float | DivideB | 1 | 2 | (0.5 x 4 x 16 x 1) ÷ 2 = 16 GFLOPS |
| int | Multiply-and-Add | 2 | 1 | (0.5 x 4 x 16 x 2) ÷ 1 = 64 GILOPS |
| int | Multiply | 1 | 1 | (0.5 x 4 x 16 x 1) ÷ 1 = 32 GILOPS |
| int | Add | 1 | 1 | (0.5 x 4 x 16 x 1) ÷ 1 = 32 GILOPS |
| int | Divide | 1 | 30 | (0.5 x 4 x 16 x 1) ÷ 30 ≈ 1.07 GILOPS |
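The arithmetic in the table can be reproduced with a short script, which may be useful when estimating figures for a different clock speed or USC count. This is a minimal sketch: the function name and the `lanes_per_usc` parameter are illustrative labels (the table uses the factor of 16 without naming it), and the results are theoretical peaks, not measured performance.

```python
def theoretical_throughput_gops(clock_ghz, num_uscs, ops_per_instruction,
                                cycles_per_instruction, lanes_per_usc=16):
    """Return theoretical giga-operations per second.

    lanes_per_usc is an assumed name for the constant 16 that appears
    in every formula in the table.
    """
    return (clock_ghz * num_uscs * lanes_per_usc * ops_per_instruction
            / cycles_per_instruction)


# (data type, operation, operations per instruction, cycles per instruction)
ROWS = [
    ("16-bit float", "Sum-Of-Products", 6, 1),
    ("float", "Multiply-and-Add", 4, 1),
    ("float", "Multiply", 2, 1),
    ("float", "Add", 2, 1),
    ("float", "DivideA", 1, 4),
    ("float", "DivideB", 1, 2),
    ("int", "Multiply-and-Add", 2, 1),
    ("int", "Multiply", 1, 1),
    ("int", "Add", 1, 1),
    ("int", "Divide", 1, 30),
]

if __name__ == "__main__":
    # 0.5 GHz clock and four USCs, as in the example device.
    for data_type, operation, ops, cycles in ROWS:
        rate = theoretical_throughput_gops(0.5, 4, ops, cycles)
        print(f"{data_type:12} {operation:16} {rate:7.2f}")
```

Changing the first two arguments gives the equivalent figures for another configuration, under the same assumption that the per-USC factor stays at 16.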

Note

By default, the compiler implements float division as two range reductions followed by a reciprocal instruction and a multiplication instruction, which takes four cycles in total (DivideA in the table).

Note

For OpenGL GLSL shaders, or OpenCL kernels built with -cl-finite-math-only or -cl-fast-relaxed-math, the compiler omits the range reductions, so the division takes only two cycles (DivideB in the table).
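The effect of the two division paths on throughput can be sketched as follows. The instruction names in the lists are hypothetical placeholders, not real ISA mnemonics, and the model assumes each instruction in the sequence costs one cycle, which matches the four-cycle and two-cycle totals given in the notes.

```python
# Hypothetical instruction names, assumed to cost one cycle each.
DEFAULT_DIVIDE = ["range_reduce", "range_reduce", "reciprocal", "multiply"]
RELAXED_DIVIDE = ["reciprocal", "multiply"]  # range reductions omitted


def divide_throughput_gflops(sequence, clock_ghz=0.5, num_uscs=4, width=16):
    """Theoretical divides per second (in GFLOPS) for the example device.

    One division is counted as one operation; the cycle count is the
    length of the instruction sequence (an assumption of this sketch).
    """
    cycles = len(sequence)
    return clock_ghz * num_uscs * width / cycles


if __name__ == "__main__":
    print("default path:", divide_throughput_gflops(DEFAULT_DIVIDE), "GFLOPS")
    print("relaxed path:", divide_throughput_gflops(RELAXED_DIVIDE), "GFLOPS")
```

Under these assumptions, dropping the range reductions halves the cycle count and doubles the theoretical division rate.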