Architecture Overview

A detailed explanation of the PowerVR Rogue architecture.

PowerVR's Rogue architecture is a programmable architecture capable of executing general purpose computations as well as rendering graphics.

As shown in the following figure, Rogue provides different hardware for transferring vertex, pixel, and compute data between memory and the Graphics Core, and programmable processors for performing the actual operations. Unlike the Series5 architecture, Rogue has a dedicated path for compute tasks via the Compute Data Master (CDM), with the programmable arithmetic handled by Unified Shading Clusters (USCs).

Figure 1: Rogue architecture and data flow overview

Compute Data Master

Any compute jobs that have been dispatched by the host will need to be first translated into individual tasks for the Graphics Core to consume. The job of the Compute Data Master (CDM) is to take input from the host and generate these tasks.

Coarse Grain Scheduler

Once tasks have been generated by the CDM, they have to be scheduled for execution. The Coarse Grain Scheduler (CGS) takes the generated tasks and schedules them across the available Unified Shading Clusters (USCs) on a broad scale.

Unified Shading Cluster

Details on the Unified Shading Cluster component.

USCs are the programmable cores at the heart of the Rogue architecture's computing power and are where the code in compute kernels is executed, as illustrated below. The USC is a scalar SIMD processor, as described in the A Modern System on Chip section. Each USC normally has 16 ALUs.

Figure 2: Rogue Unified Shading Cluster

Execution

To execute the code within a compute kernel, the compiler has to translate the developer's intentions into something the USC can understand. It is equally useful for the kernel writer to know how the API maps to the hardware in order to get the best performance.

Threads

Each work item of OpenCL, thread of OpenGL ES, or kernel invocation of RenderScript is handled as a thread of instructions in the hardware. These threads are executed together in groups of up to 32, known as tasks. In any given cycle, 16 threads (one per ALU Pipe) execute the same instruction. This means that half of a task runs in lockstep, with the other half staggered so that it executes one cycle behind.

Tasks

Tasks are the grouping used by Rogue hardware for SIMD execution of threads, with each task treated as a unit of work. Each task is executed by all ALU Pipes, one at a time, with half of the threads executing in lockstep in one cycle and the other half executing in the next cycle.

In RenderScript, the number of tasks used by a given execution is managed by the RenderScript compiler. If the global work size is a multiple of 32, the compiler can divide the threads into complete tasks, where every thread in a task is actively doing useful work. Any other number of work items will leave threads vacant in some tasks, with tasks filled until every kernel invocation is attached to a thread. The actual number of tasks and vacant threads per task will vary.

For OpenCL and OpenGL, work-groups do not map directly to tasks, but their size dictates task allocation and is very important for performance. Work-groups are split into tasks according to their size, as detailed in the following points:

  • Work-group sizes that are a multiple of 32 allow each task generated to be fully used. Every thread in the task is actively doing useful work, with multiple tasks created until all work items are attached to a thread.
  • Work-group sizes of four, eight, and 16 typically execute with multiple work-groups per task – eight, four, or two respectively. However, if there is barrier synchronisation in the kernel and the work-group size was not specified at compile-time, tasks will execute one work-group per task. Unused threads in the task will be left vacant, causing a dramatic drop in utilisation.
Note: The maximum work-group size of an execution is 512. Higher sizes are unsupported.
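As a rough sketch, the packing rules above can be written out in C. The function and constants here are purely illustrative, not part of any PowerVR API:

```c
#include <assert.h>

enum { TASK_SIZE = 32 };  /* threads per task, as described above */

/* How many of a task's 32 threads do useful work for a given work-group
 * size, following the rules above. Illustrative only. */
static int active_threads(int wg_size, int has_barrier, int size_known_at_compile)
{
    if (wg_size % TASK_SIZE == 0)
        return TASK_SIZE;              /* tasks fully utilised */
    if (has_barrier && !size_known_at_compile)
        return wg_size % TASK_SIZE;    /* one work-group per task, rest vacant */
    if (TASK_SIZE % wg_size == 0)
        return TASK_SIZE;              /* sizes 4/8/16 pack 8/4/2 groups per task */
    return wg_size % TASK_SIZE;        /* remaining threads left vacant */
}
```

For example, a work-group size of 16 with a barrier and no compile-time size yields only 16 active threads per task, halving utilisation.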

Scheduling

The Fine Grained Scheduler (FGS) is responsible for scheduling and descheduling work on a per-task basis from a pool of resident tasks. The size of this task pool will vary from device to device, but can typically be expected to be 48 or higher. The more tasks that are in this pool, the better any latency can be hidden.

One very important aspect of this task switching is that it is zero-cost for the USC. Tasks that need to wait for any reason are switched out and a ready-to-execute task is scheduled in its place immediately. This means that having large datasets scheduled at the same time generally allows greater efficiency without incurring scheduling overhead. This is unlike, for example, CPU threads, where scheduling cost is considerable.

Data Dependencies

When a data transfer dependency is encountered in a kernel, a task will need to wait for the data to be transferred before it can continue execution. The FGS will schedule out this task at no cost and schedule in another if one is available, until the task is ready to resume.

Data dependencies are not just data reads and writes but occur when the result of a read or write is required by another instruction. Any memory fences or barriers that occur within a kernel or implicit fences caused by using the result of a read will cause a dependency.

Memory

Memory space inside the USC consists largely of two banks of registers: the Common Store, which is shared across the entire USC, and the Unified Stores, which are allocated per four ALU Pipelines.

Unified Stores

The Unified Stores are four small banks of registers in the USC, each shared equally between four ALU Pipes. Any local or temporary variables used in a kernel (OpenCL private memory) are allocated in this space.

Each Unified Store consists of 1280 128-bit registers. The actual number of registers available to a thread will vary in practice depending on a number of factors. Generally, using a maximum of 10 registers per thread allows the minimum recommended occupancy of 512 threads per USC.

These 128-bit registers are effectively accessed by a kernel as if they were a vector of four 32-bit registers, so a kernel can use up to 40 32-bit values to hit the recommended utilisation. If a given kernel requires more than this, either the number of resident tasks goes down or the number of threads executed per task does, so that each thread gets more register space. Imagination's compiler only reduces utilisation, as high occupancy is important to hide any latency in the threads.
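To see why 10 registers per thread lines up with 512 resident threads, consider this back-of-envelope sketch. The constants come from the figures above; the functions themselves are illustrative, not a hardware query:

```c
#include <assert.h>

/* Four Unified Stores per USC (one per four ALU Pipes), each holding
 * 1280 128-bit registers, as described above. */
enum { STORES_PER_USC = 4, REGS_PER_STORE = 1280, WORDS_PER_REG = 4 };

/* Resident threads a USC can hold when each thread needs a given
 * number of 128-bit registers. */
static int resident_threads(int regs_per_thread)
{
    return (STORES_PER_USC * REGS_PER_STORE) / regs_per_thread;
}

/* 32-bit values available per thread at that register budget. */
static int words_per_thread(int regs_per_thread)
{
    return regs_per_thread * WORDS_PER_REG;
}
```

At 10 registers per thread this yields 512 resident threads and 40 32-bit values per thread, matching the recommendation above; doubling the register budget halves the occupancy.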

For very large allocations above this threshold, utilisation may reach a point where it cannot go any lower, and allocations will start spilling to main memory. This has high bandwidth and latency costs, though most applications will never hit this limit. The exact point at which this happens is under software control and will vary from device to device.

Note: The compiler will try to reduce Unified Store allocations where possible, though it must balance this against instruction count as well. For OpenCL, the exact amount of register usage can be determined with a disassembling compiler, which Imagination can provide under NDA.

Common Store

The Common Store is shared between all ALU Pipes in a given USC and is used for any data that is coherent between threads. Any shared memory (shared in OpenGL, local in OpenCL) is stored here. Constant memory allocations (const or uniform in OpenGL, constant in OpenCL) that have known sizes at compile time are also stored here. It is also used to store object handles such as image, texture, and sampler states, or pointers to main memory.

Data in the Common Store is allocated across four banks, each representing a column of data that is 128 bits wide, as shown below. Data transfers can be served by multiple banks at once and an entire row of data can be read in a single cycle or written in four cycles. Up to 16 threads in a task can each fetch a 32-bit value in one cycle if they fetch consecutive data from the same row, shown in green below.

Figure 3: Good (16 values per cycle) in green, and worst (one value per cycle) in red, shown on Common Store access.

However, if more than one thread tries to access data from the same column as another thread, as shown in red above, it will cause a bank clash as a bank can only serve one such request per cycle. Instead, it will serialise these fetches and it will take one cycle per row to access the same amount of data, slowing down the pipeline. To avoid bank clashes, all data in the common store should be accessed across rows. Maximum throughput is therefore 512 bits/cycle: four 128-bit registers, consisting of four 32-bit values for example, from different banks each cycle.

For instance, imagine a kernel that uses a block of shared memory as a linear array of float. In the best case, if the stride between each thread's access is exactly 32 bits, with each thread fetching a subsequent float variable, all 16 threads can fetch one value each in a single cycle. In the worst case, if the stride between threads is 512 bits, with each thread trying to access every 16th float in the array, the fetch would need 16 cycles, one for each value. Formally, the rules can be expressed as:

  • Four banks of 128-bit registers
  • Each read can fetch an entire 128-bit register
  • Each 128-bit register can hold up to four 32-bit values
  • Each read takes one cycle
  • Registers from the same bank cannot be read from in the same cycle

Some examples are shown below:

//OpenGL compute
shared float common[SIZE];

int gid = int(gl_GlobalInvocationID.x);

//Best scenario : 32-bit stride(one float)
//100% throughput (512 bit/cycle), no clashes. One 32-bit value read/thread in one cycle. 
float temp = common[gid];

//Worst scenario: 512 bit (16 float or four vec4) stride
//6.25% throughput(32 bit/cycle), 16-way clash. One 32-bit value read/thread in 16 cycles 
float temp = common[gid * 16]; 

//Normal scenario : 128-bit stride (four float or one vec4), requesting 128 bits/thread
//100% throughput (512 bit/cycle), four-way clash for 16 threads because we have exceeded max
//throughput. Four 32-bit values read/thread in four cycles
vec4 temp = vec4(common[gid*4], common[gid*4+1], common[gid*4+2], common[gid*4+3]);

//OpenCL
local float common[SIZE];
int gid = get_global_id(0);

//Best scenario : 32-bit stride (one float)
//100% throughput (512 bit/cycle), no clashes. One 32-bit value read/thread in one cycle. 
float temp = common[gid];

//Worst scenario: 512 bit stride (16 float or four float4)
//6.25% throughput(32 bit/cycle), 16-way clash. One 32-bit value read/thread in 16 cycles 
float temp = common[gid * 16]; 

//Normal scenario : 128-bit stride (four floats or one float4), requesting 128 bits/thread
//100% throughput (512 bit/cycle), four-way clash for 16 threads because we have exceeded max
//throughput. Four 32-bit values read/thread in four cycles
float4 temp = (float4)(common[gid*4], common[gid*4+1], common[gid*4+2], common[gid*4+3]);

The rule to avoid these scenarios is simple: at any given time, threads should read consecutive values without any gaps. The most common way to enter the bad scenario accidentally is to lay out blocks of data per thread instead of interleaving the data.

A good way to iterate an array that contains multiple values for each thread would be:

float x = common[index * stride_per_index + thread_id];

Conversely, the following is very likely to cause multi-way bank clashes and waste bandwidth:

float x = common[thread_id * stride_per_thread + index];
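Under the four-bank model listed earlier (four banks of 128-bit registers, one register read per bank per cycle), the cost of the two indexing patterns can be sketched in C. This is an illustrative model of the rules above, not a hardware-accurate simulator:

```c
#include <assert.h>

enum { BANKS = 4, WORDS_PER_REG = 4, THREADS = 16 };

/* Cycles needed for 16 threads to each read one 32-bit word from the
 * Common Store, given each thread's word index. A 128-bit register
 * holds four consecutive words; consecutive registers sit in
 * consecutive banks; each bank serves one register read per cycle. */
static int read_cycles(const int idx[THREADS])
{
    int per_bank[BANKS] = {0};
    int seen[THREADS];
    int nseen = 0;
    for (int t = 0; t < THREADS; ++t) {
        int reg = idx[t] / WORDS_PER_REG;   /* which 128-bit register */
        int dup = 0;
        for (int s = 0; s < nseen; ++s)
            if (seen[s] == reg) { dup = 1; break; }
        if (dup)
            continue;                       /* same register: one read serves all */
        seen[nseen++] = reg;
        per_bank[reg % BANKS]++;            /* which bank the register lives in */
    }
    int cycles = 0;                         /* worst-hit bank dictates the cost */
    for (int b = 0; b < BANKS; ++b)
        if (per_bank[b] > cycles)
            cycles = per_bank[b];
    return cycles;
}
```

With interleaved indices (thread t reads word t) the 16 reads touch one register in each of the four banks and complete in one cycle; with a 16-word stride per thread every register falls in the same bank and the reads serialise into 16 cycles.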

System Memory Access

System memory is the Random Access Memory (RAM) on the device outside of the Graphics Core, used by all system components. This is the type of memory that is referred to when discussing "memory bandwidth" in general. As many different subsystems need to access this memory, accesses have to be scheduled between them, increasing the latency of memory fetches.

OpenCL memory marked as global, OpenCL constant memory with a size not known at compile time, images, OpenGL Shader Storage Buffer Objects, and RenderScript data in Allocators are stored in System Memory.

It is important to note that the System Memory budget is limited and shared between all resources. The performance of many applications will be limited by this resource, making it a prime target for optimisation.

Rogue hardware can perform burst transfers when moving data to and from system memory. This allows multiple bytes of data to be transferred in one go without waiting for other devices. The more bytes that can be transferred in one burst, the lower the total latency of all memory accesses.

System Memory transactions are cached. A cache line in the L1 or System Level Cache is 128 bytes and maps to a 64-byte aligned segment in system memory. Transactions larger than 128 bytes are broken down into multiple cache-line requests that are issued independently. A cache line request is serviced at either the throughput of the L1 or L2 cache, in case of a cache hit, or otherwise at the throughput of system memory.

The hardware resolves System Memory accesses in row order for a given task's threads, so accessing data in row order will increase cache locality. Consider the instance of a task containing 16 threads that all perform an 8-byte memory transfer, and the transfers resolve to accesses that all fall within the same aligned 128-byte cache line. The access latency in that case would be the sum of the time taken to service a single cache miss, plus the time to service a further 15 cache hits.
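The example above can be expressed as a simple cost model. Only the miss-plus-hits structure comes from the text; the latency constants used in any call are hypothetical placeholders:

```c
#include <assert.h>

enum { LINE_BYTES = 128 };  /* L1/System Level Cache line size, per the text */

/* Distinct cache lines touched when 'threads' threads each transfer
 * 'bytes_per_thread' consecutive bytes starting at an aligned address.
 * Best-case arithmetic for row-order access, as described above. */
static int lines_touched(int threads, int bytes_per_thread)
{
    int total = threads * bytes_per_thread;
    return (total + LINE_BYTES - 1) / LINE_BYTES;
}

/* Total latency: one miss per distinct line, a hit for every other
 * access. 'miss' and 'hit' are hypothetical per-access latencies. */
static int access_latency(int threads, int lines, int miss, int hit)
{
    return lines * miss + (threads - lines) * hit;
}
```

For the 16-thread, 8-byte example above, all accesses fall in one line, so the cost is one miss plus 15 hits.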

This memory can be read and written in transactions of one, two, four, or multiples of four bytes, with matching alignment: one-byte transactions can be at any address, two-byte transactions must be 2-byte aligned, and all larger transactions must be 4-byte aligned. For best performance, data should generally be accessed from the kernel in this way. The compiler will split an unaligned transfer into multiple aligned transfers, which will be slower.
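A minimal C sketch of the transaction-alignment rule described above (the function name and shape are illustrative, not a driver API):

```c
#include <assert.h>

/* Whether a transaction of 'bytes' starting at 'addr' satisfies the
 * alignment rules above; anything else is split by the compiler into
 * multiple aligned transfers. Illustrative only. */
static int aligned_transaction(unsigned long addr, int bytes)
{
    if (bytes == 1)
        return 1;                    /* byte transactions: any address */
    if (bytes == 2)
        return addr % 2 == 0;        /* must be 2-byte aligned */
    if (bytes == 4 || (bytes > 4 && bytes % 4 == 0))
        return addr % 4 == 0;        /* must be 4-byte aligned */
    return 0;                        /* not a supported transaction size */
}
```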

If a kernel performs all of its reads and writes at the same point in the code, the compiler may combine multiple reads and writes into a single burst transaction. A burst transfer between the Unified Store and global memory can move a maximum of 64 bytes at once. This includes any reads or writes within the shader or kernel between system memory and temporaries or OpenCL private memory. To perform these burst transfers, global memory accesses must be aligned to four bytes, which normally includes all data types or vectors equal to or larger than 32 bits, such as int, int2, int4, float, float2, float4, double, short2, and char4, but not char, char2, bool, or short.

Burst transfers between System Memory and the Common Store can move larger amounts of data when all of the following conditions are met (for OpenCL, this depends on the size of the work-group):

  • Memory is transferred between local and global memory, in either direction
  • The work-group's size is at least 17 and specified at compile time
  • The OpenCL function async_work_group_copy is used

In that case, a different burst transfer to the Common Store can be executed, which moves a maximum of 4096 bytes across the work-group with a single instruction. For the typical case of preloading data into shared memory, this path should always be preferred. Otherwise, burst transfers to the Common Store (shared memory) can still be used, but less efficiently than with this method.

USC Pair

Each USC pair contains two USCs and a Texture Processing Unit (TPU) to allow the most efficient balance between texture access requirements and compute density. The USC is the main processing element in the Rogue architecture.

Texture Processing Unit

The TPU is a specialised piece of hardware used to accelerate access to images and textures from within kernel code. It handles image reads directly with its own cache to ensure that image data transfers are as fast as possible. This hardware exists to better handle the typically huge size of images and the specialised access required by most image processing algorithms.

L1 Mixed Cache

The L1 Mixed Cache is the main cache used by the Rogue architecture and all data transfers to and from memory go through here first. If this cache cannot serve a data fetch request, the request is passed to the System Level Cache.

System Level Cache

The System Level Cache interacts directly with the System RAM and is the last chance for a data fetch to hit a cache. Data fetches that miss this cache are fetched from System RAM.