Avoid Barrier Directly after Local or Constant Memory Access

The hardware may have a latency of up to two cycles for calculating an array index and using it to fetch shared memory from the Common Store (see the Memory sections for Rogue and Volcanic). The compiler can generally hide this latency by rearranging instructions so that arithmetic is carried out during this time. If a barrier immediately follows the access, as is often the case, the compiler typically cannot do this. To allow the compiler to hide the latency, perform as much arithmetic as possible that does not depend on the result of the access before inserting the barrier.
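The ordering described above can be sketched as an OpenCL C kernel. This is a minimal illustration, not code from this guide: the kernel name `scale_neighbour`, its parameters, and the arithmetic are hypothetical, and the shim block at the top is an assumption added only so the sketch also compiles as plain C and runs as a single work-item on the host.

```c
/* Shims (assumption, for illustration only): let the sketch build as plain C
   and run as one work-item. An OpenCL C compiler defines __OPENCL_VERSION__
   and skips this block entirely. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __local
#define CLK_LOCAL_MEM_FENCE 1
static void barrier(int flags) { (void)flags; }
static size_t get_local_id(unsigned dim) { (void)dim; return 0; }
static size_t get_local_size(unsigned dim) { (void)dim; return 1; }
static size_t get_global_id(unsigned dim) { (void)dim; return 0; }
#endif

/* Hypothetical kernel: each work-item reads its neighbour's value from local
   memory, scales it, and passes the result back through local memory. */
__kernel void scale_neighbour(__global const float *in,
                              __global float *out,
                              __local float *tile,
                              float gain, float bias)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t n = get_local_size(0);

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE); /* tile must be fully written before reads */

    /* Local memory access using a calculated index. */
    float neighbour = tile[(lid + 1) % n];

    /* A barrier placed here, directly after the access, would leave the
       compiler nothing to schedule while the fetch is in flight. */

    /* Arithmetic that does not depend on 'neighbour' goes before the barrier. */
    float scale = gain * (float)(lid + 1) + bias;

    barrier(CLK_LOCAL_MEM_FENCE); /* all reads finish before tile is reused */

    tile[lid] = neighbour * scale;
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = tile[(lid + n - 1) % n];
}
```

Whether the reordering gives a measurable gain depends on the compiler and the kernel, so it is worth profiling both orderings on the target device.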

When the size of the workgroup is the same as the slot size, barriers are free.