PowerPC: A singular architecture

There are many processor architectures, but few of them are well known: the common idea is that computers are x86 based and that mobile phones are ARM based. That's true, but what else? Few people know that some set-top boxes and Sony's PSP have a MIPS inside. Some people remember that the previous generation of Apple computers had PowerPC processors, but are unaware that they power all current-generation consoles and are used in aeronautics, defense, networking, servers, ... Even if these examples prove that the PowerPC has strong arguments for serious use, we know that in industry and consumer markets, technologies are not always chosen because they are the best but because they are cheaper or better marketed. In companies, there are also historical reasons that make engineers stay with old, modernized technologies, for fear of a technology jump.

Anyway, the PowerPC deserves to be known because, beyond its qualities, it also has specific features that will be described in this article. I would like to highlight the PowerPC's strengths and what makes it different. First I will show how it is modern, clean, ... and then the instruction set will show what is singular, with technical facts, of course. Comparisons will be made with other RISC architectures like ARM and MIPS. Another goal is to give keys to program the PowerPC better by understanding how the processor works and what compilers do.

Basics of the architecture

Even if it was inherited from the POWER architecture, it was designed as a new and clean architecture at the beginning of the nineties, with the benefit of decades of processor evolution. The PowerPC is a RISC processor whose major concepts were all specified and implemented from the beginning:

fixed 32-bit wide instructions to ease their decoding
load-store model, all operations are done within registers
large number of registers (32 general purpose registers and 32 floating point registers)
atomic (or exclusive) load-store instructions for use in a multicore context
big endian order with the possibility to work in little endian
64-bit architecture with the behaviour of instructions specified for this mode
no specific role for general purpose registers (r1 used as the stack pointer is an ABI choice, not an architecture one)
MMU model not defined, it is implementation specific with globally 2 models for use in embedded devices or servers

Compared to other architectures, we see that the important choices made 20 years ago are still there. However, the architecture is not frozen: it evolves and is now governed by the Power.org committee. It is described in the PowerISA documents, whose latest version (2.06) was published in January with advanced features like virtualization for the embedded model.

Let's compare some points with another so-called RISC architecture, ARM, which is very popular: it is notably embedded in all mobile phones, chosen at the beginning for its low power consumption, which it tends to keep while adding new features. In 2010, there are still no 64-bit ARM CPUs and multicore is coming with the Cortex-A9 (even if there was one earlier multicore implementation, the ARM11MP). To give some examples and prove the validity of the initial PowerPC choices, ARMv5 added an instruction to count the leading zeros (clz) but many other features were added with ARMv6 in 2002: endianness management ("setend" to switch from one mode to the other), exclusive data access (ldrex/strex), a first SIMD implementation (VFP, enhanced with NEON on ARMv7), ASIDs for processes, a cycle counter (CCNT) that is only 32-bit, ... The L2 cache became standard with ARMv7. For all these examples, we see that the PowerPC has been living with these features from the beginning!

So, what makes the PowerPC singular?

We will focus on the instruction set, which is the most visible and important part for application programmers. We will also explain how these properties are handled by compilers, with examples to illustrate how these instructions can be used. Lastly, when possible and because it is always interesting to compare, a note will be given about other architectures.

# Static branch prediction

In programs, branches should be avoided when possible because they are a nightmare for the pipeline. This is why processors often have mechanisms to improve dynamic prediction, with specific caches, etc. For static branch prediction, the PowerPC has a few basic rules:

a forward branch is assumed to be not taken
a backward branch is assumed to be taken

The goal is to improve the efficiency of loops (which, moreover, execute instructions that are already in the cache). Sometimes, these default behaviours are not wanted. This is why the PowerPC architecture introduced a bit in the branch instruction opcodes to reverse the default prediction. In assembly, this hint is written as a suffix on the branch mnemonic: + to predict the branch taken, - to predict it not taken.

It is difficult to know what the policy of a compiler is and when the reverse bit is used. Here is a simple example in the main function:

    if (argc == 1){
        printf("No arguments\n");
    }else{
        printf("%d arguments\n", argc - 1);
    }

In this case, with both blocks placed after the comparison, argc is evaluated with the prediction that it is not equal to 1. A second example shows (in the assembly code generated by the compiler) that the block is predicted as taken, but the reason may simply be that there is only one block:

    handle = fopen(...);
    if (handle){
        /* Some code here */
    }

Except in loops, it is difficult for a compiler to know which block of an if/then statement should be given priority. To give it a hint, GCC provides a builtin called __builtin_expect, which is often wrapped in well-known macros:

/* !!(x) normalizes any non-zero value (or pointer) to 1 before the hint is applied */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

Thus, the compiler can reorganize the code by moving blocks or reversing the branch condition (for example "eq" becoming "ne"). It could also use the reverse bit.
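
As a minimal sketch of how such hints are typically used, the function below marks a division by zero as the rare path. The name checked_div and its error convention are assumptions made for the illustration, not something taken from a library. Whether the compiler actually moves the block or sets the hint bit depends on its version and on the optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* !!(x) normalizes any non-zero value to 1 before the hint is applied */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: stores num / den in *out and returns 0,
   or returns -1 when den is zero (hinted as the unlikely case). */
static int checked_div(unsigned num, unsigned den, unsigned *out)
{
    assert(out != NULL);        /* caller bug, not a runtime condition */
    if (unlikely(den == 0)) {
        return -1;              /* rare path: the compiler may move it out of line */
    }
    *out = num / den;           /* common path: kept on the fall-through side */
    return 0;
}
```

Comparing the assembly generated with and without the hint (for example with gcc -O2 -S) is the simplest way to see what a given compiler does with it.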

On ARM, a nice feature exists to avoid branches: conditional instructions. Instructions are suffixed with a condition and are executed or skipped according to the evaluation of this condition:

    cmp   r0, r2       @ the condition flags are updated here
    moveq r2, #0       @ if r0 is equal to r2 then r2 is cleared
    addgt r3, r3, r0   @ if r0 is greater than r2, r0 is added to r3

# Condition Register (CR)

The condition register is in fact a set of eight 4-bit fields that act independently as 8 condition registers, named cr0 to cr7! When doing a comparison, the instruction encodes which field it wants to update, for example:

    cmpwi cr7, r30, 0
    ble   cr7, .L4

Here cr7 is used, so the default field cr0 is preserved. This can be very useful in nested loops.

More than that, it is possible to apply logical operations (and, or, nand, eqv, ...) between the condition register bits! For example, the expression "do ... while ((a == 6) || (b == c))" could be written like this:

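    cmpwi cr7, r3, 6                      # cr7[eq] = (a == 6)
    cmpw  cr0, r4, r5                     # cr0[eq] = (b == c)
    cror  4*cr7+eq, 4*cr0+eq, 4*cr7+eq    # cr7[eq] = cr0[eq] | cr7[eq]
    beq   cr7, loop_start                 # loop again while the condition is true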

Thus, one conditional branch is saved, and a conditional branch always has a cost due to branch prediction. With the comparisons and "cror" executed early enough, the condition is already resolved when the branch is executed, which saves cycles again.
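
From C, the same idea can be suggested to the compiler by combining the two comparisons without short-circuit evaluation. The sketch below is only an illustration: the function name matching_prefix and its arguments are assumptions, and nothing guarantees that a given compiler will emit "cror" for it rather than two branches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns how many leading positions satisfy
   (a[i] == 6) || (b[i] == c), scanning at most n elements. */
static size_t matching_prefix(const int *a, const int *b, int c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    size_t i = 0;
    /* Bitwise | instead of ||: both comparisons are always evaluated,
       so there is no short-circuit branch between them. */
    while (i < n && ((a[i] == 6) | (b[i] == c))) {
        i++;
    }
    return i;
}
```

This is only valid because both operands are free of side effects and safe to evaluate unconditionally; with a pointer dereference that may be invalid, the short-circuit form has to stay.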

Here is an example of a switch written with a minimum number of conditional branches: http://wall.riscom.net/books/proc/ppc/cwg/code1.html#15978 There is only one conditional branch for a switch with 4 cases going to the same code.

Another point that is not surprising given the RISC philosophy ("don't do what is not necessary"): the condition register is only updated when asked. The syntax is a dot appended to the instruction name, for example "add." instead of "add".

On MIPS, there is no condition register! Branches are done by comparing register values directly; for example, "bge $4, $5, label" jumps to the label if the content of register $4 is greater than or equal to that of register $5.

# Operations on bits

This is one of the biggest strengths of the PowerPC, which provides many powerful instructions to work on bits. For example, "andc" performs, in one cycle, an AND where the mask is the complement of a value, which is a common operation in programs:

    res = value & ~mask;

Otherwise, operations on bits are done with a few instructions of the "rotate left and mask" family. "rlwinm" has this syntax:

    rlwinm rA, rB, n, mB, mE

It rotates the value of rB left by n bits and creates a mask going from bit mB to bit mE (bit 0 being the most significant one). Then, it applies an AND operation between the rotated value and the mask and stores the result in rA. That makes it possible to simply shift or rotate a value, but also to keep or clear a set of bits. Another instruction is "rlwimi", whose name describes what it does: rotate left word immediate then mask insert. With a logic similar to the previous example, it inserts a group of bits from a word into a destination word! A typical example is the management of a word containing an RGB value: each component can be updated independently.
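
To make the semantics concrete, here is a small C model of both instructions. It is a sketch meant for reasoning about the operands, not a description of how any compiler or processor implements them; the names rotl32, ppc_mask, model_rlwinm, model_rlwimi and set_green are invented for the illustration, and the 0x00RRGGBB pixel layout is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is reduced modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Mask with ones from bit mb to bit me, bit 0 being the most significant.
   When mb > me, the mask wraps around the end of the word. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    assert(mb < 32u && me < 32u);
    uint32_t from_mb  = 0xFFFFFFFFu >> mb;          /* ones from bit mb to bit 31 */
    uint32_t up_to_me = 0xFFFFFFFFu << (31u - me);  /* ones from bit 0 to bit me  */
    return (mb <= me) ? (from_mb & up_to_me) : (from_mb | up_to_me);
}

/* rlwinm rA, rS, sh, mb, me: rotate, then AND with the mask. */
static uint32_t model_rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* rlwimi rA, rS, sh, mb, me: rotate, then insert under the mask into rA. */
static uint32_t model_rlwimi(uint32_t ra, uint32_t rs,
                             unsigned sh, unsigned mb, unsigned me)
{
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* Replace the green component of a pixel stored as 0x00RRGGBB (assumed layout). */
static uint32_t set_green(uint32_t pixel, uint32_t green)
{
    return model_rlwimi(pixel, green, 8, 16, 23);
}
```

With this model, set_green(0x00112233, 0xAB) gives 0x0011AB33: the value is rotated left by 8 bits to line up with the green byte, and only bits 16 to 23 of the destination are replaced.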

The instruction "rlwimi" is also used by compilers to write bits in a bitfield (in 1 cycle). Even if this data type is known to cause portability problems and must be avoided for external interfaces, it can be very useful and avoids many operations: creating an RGB value only requires 3 instructions.

As usual for a RISC processor, rotate right has no specific instruction because there is no need for one: it is in fact a rotate left operation, which is quite logical. To rotate a value by 5 bits to the right, use:

    rlwinm r3, r4, 27, 0, 31

And of course, the mnemonic "rotrwi" exists for convenience. To do the same thing as in the previous case with a more explicit syntax, just write:

    rotrwi r3, r4, 5

# Load-store with byte reversed

Even if the PowerPC can be used in little endian mode, in practice this almost never happens. But it provides powerful instructions to load and store 16-bit and 32-bit data while swapping the bytes. That avoids, for example, loading a value in a register and then applying additional operations to reach the same result. If the compiler doesn't provide a builtin, these instructions must be used in assembly. An up-to-date function (usable with gcc > 4.3.0) is given on a great site at http://hardwarebug.org/2008/10/25/gcc-inline-asm-annoyance/ :

static inline uint32_t load_le32(const uint32_t *p)
{
    uint32_t v;
    asm ("lwbrx %0, %y1" : "=r"(v) : "Z"(*p));
    return v;
}

As the "Z" constraint is recent, another syntax could be:

static inline uint32_t load_le32(const uint32_t *p)
{
    uint32_t v;
    /* "m"(*p) is not used in the template: it tells the compiler that the asm reads *p */
    asm ("lwbrx %0, 0, %1" : "=r"(v) : "r"(p), "m"(*p));
    return v;
}

Otherwise, if a byte swap has to be made without a memory access, recent versions of gcc provide a builtin called __builtin_bswap32. If it is not available, it can be coded like this to swap the bytes of register r3, with 4 instructions:

    rotlwi  r0, r3, 8
    rlwimi  r0, r3, 24, 0, 7
    rlwimi  r0, r3, 24, 16, 23
    mr      r3, r0
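
When neither the builtin nor inline assembly is an option, a portable C fallback can be written as below. This is a sketch with invented names (swap32, load_le32_portable); a compiler may or may not recognize the pattern and turn it into the instructions shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable byte swap: 0xAABBCCDD becomes 0xDDCCBBAA. */
static uint32_t swap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Portable little-endian load: the value is assembled byte by byte,
   so the result does not depend on the endianness of the host. */
static uint32_t load_le32_portable(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The byte-wise load also works on unaligned buffers, which a cast to uint32_t * would not guarantee.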

On ARM, a swap instruction (rev) appeared late (ARMv6) and on MIPS, it is simply missing.

# Count Register (CTR)

This counter makes it possible to loop a given number of times without using a GPR as the loop counter or modifying the condition register (CR). For example, for a loop that is to be executed 100 times:

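    li    r3, 100      # number of iterations
    mtctr r3           # CTR = 100 (mtctr takes a single register operand)
label:
    # loop body goes here
    bdnz  label        # decrement CTR, branch if CTR is not zero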

This kind of loop can benefit from better branch prediction. It is also easy to read and easy for compilers to use.
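
In C, the loops that map most naturally onto CTR are those whose trip count is known on entry and whose counter is not needed in the body. The function below is a sketch of such a loop; sum_words is an invented name, and whether the compiler really uses mtctr/bdnz depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n 32-bit words starting at p. */
static uint32_t sum_words(const uint32_t *p, size_t n)
{
    assert(n == 0 || p != NULL);
    uint32_t sum = 0;
    /* The counter is only decremented and compared with zero, never read
       in the body: the same shape as a mtctr/bdnz loop. */
    for (size_t remaining = n; remaining != 0; remaining--) {
        sum += *p++;
    }
    return sum;
}
```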

Another curious but interesting property is the possibility to use CTR to branch to the address it contains. This is done by the instruction "bcctr", just as "bclr" uses LR. There is no visible PC register (also called instruction pointer) on PowerPC, so it is not possible to play with it directly. That may preserve the branch prediction. But a jump to a computed address can be done with CTR, which can be used in a switch statement with a table containing the pointers to jump to.
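
The table-of-pointers idea looks like this in C. The names op_fn, op_add, op_sub, op_mul and dispatch are invented for the example. On PowerPC, a compiler typically implements such an indirect call by moving the address into CTR and branching through it, although the exact sequence depends on the compiler and the ABI.

```c
#include <assert.h>

typedef int (*op_fn)(int, int);

static int op_add(int x, int y) { return x + y; }
static int op_sub(int x, int y) { return x - y; }
static int op_mul(int x, int y) { return x * y; }

/* Hypothetical dispatcher: the target address is read from a table
   and reached with an indirect call instead of a chain of comparisons. */
static int dispatch(unsigned op, int x, int y)
{
    static const op_fn table[] = { op_add, op_sub, op_mul };
    assert(op < sizeof table / sizeof table[0]);
    return table[op](x, y);
}
```

A dense switch statement is often compiled the same way, with the table holding code addresses inside the function rather than function pointers.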

In this article, we've seen 5 features of the PowerPC instruction set that make it different from other architectures. Even if your code is not written in assembly, we now understand how these instructions and mechanisms can be used from a higher-level language like C, with the help of the compiler. As the PowerPC is modern, research about optimization was maybe less intensive compared to other architectures, but in past years, work on some other great features like AltiVec and cache instructions (see http://www.freevec.org) showed that the PowerPC may have been judged at one time without knowing its full potential. Fortunately, the PowerPC is still being developed, with new processors from IBM (Power7, 476FP, ...) and Freescale (QorIQ family) that are new chances to spread this powerful and singular architecture and make it known.