Product category: Microprocessors, Microcontrollers and DSPs
News Release from: Atmel Corporation | Subject: AVR32
Edited by the Electronicstalk Editorial Team on 14 February 2006

Novel 32bit core claims benchmark supremacy

A new 32bit embedded CPU architecture features an integrated DSP and SIMD instruction set for computationally intensive power-constrained embedded systems.

Atmel has announced a new 32bit embedded CPU architecture, with an integrated DSP and SIMD instruction set for computationally intensive, power-constrained embedded systems. The company plans to use the core in new families of 32bit standard product AVR controllers targeted at wireless, battery-powered applications that include consumer infotainment, point of sales terminals, biometric scanners, voice recognition and motion detection.

The AVR32 core consistently outperforms competing 32bit cores in every EEMBC benchmark for performance and code density, allowing it to execute the same functionality in fewer clock cycles and substantially reducing power consumption.

For example, comparable processors, frequently used in portable applications, require up to 266MHz clock frequencies to execute quarter-VGA MPEG4 decoding at 30 frames per second.

The AVR32 can execute the same application with a 100MHz clock; the comparable processors therefore need a clock rate 166% higher, with obvious implications for power consumption.

Atmel's AVR group has achieved the AVR32 core's exceptional computational throughput with a number of cycle-saving features that: reduce the number of load/store cycles; maximise the utilisation of computational resources; provide zero-penalty branches; and reduce the number of cache "misses".

In addition, the AVR32 core is architected specifically to minimise both active power consumption and current leakage.

On average, 30% of a processor's cycles are spent not on operations but on load/store instructions.

The AVR32 reduces the required number of load/store instructions with byte (8bit), half-word (16bit), word (32bit) and double word (64bit) load/store instructions that are combined with extensive pointer arithmetic to efficiently access tables, data structures and random data in the fewest cycles.

For example, block cipher algorithms used in cryptography require a special array addressing operation with a long instruction sequence that can take 14 cycles to execute on a conventional processor.

The AVR32 has a novel "load with extracted index" (ld.w) instruction that reduces this operation to just seven cycles by performing all four memory accesses in four cycles, while keeping all four offsets in one register.
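
As a rough illustration of the kind of array addressing involved, the following C sketch performs four table lookups whose offsets are all packed into a single 32bit word. The function name lookup4, the single shared table and the XOR combination are assumptions made for this example; they are not taken from Atmel's material and say nothing about how the instruction itself is encoded or used.

```c
#include <stdint.h>

/* Hypothetical helper: the four bytes packed into `packed` are used as
 * four separate indices into `table`, and the four entries are XORed
 * together, in the style of a table-driven block cipher round.
 * Real ciphers often use a different table for each byte position. */
uint32_t lookup4(const uint32_t table[256], uint32_t packed)
{
    uint32_t i3 = (packed >> 24) & 0xFFu; /* most significant byte  */
    uint32_t i2 = (packed >> 16) & 0xFFu;
    uint32_t i1 = (packed >> 8) & 0xFFu;
    uint32_t i0 = packed & 0xFFu;         /* least significant byte */

    /* Four memory accesses, one per extracted index. */
    return table[i3] ^ table[i2] ^ table[i1] ^ table[i0];
}
```

On a processor without dedicated support, each lookup costs a shift, a mask and a load; this is the sequence the article says the new instruction shortens.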

Another cycle-saving load/store instruction, "load multiple register" (ldm), can be used in combination with a "store multiple" (stm) instruction to fetch registers two at a time from the instruction cache.

The ldm instruction can also be used to return from a subroutine: because the last register written is the program counter, there is no need to execute a separate return instruction at the end of the subroutine.

By reducing the number of load/store instructions to be executed, the AVR32 core increases the throughput per cycle.

Altogether the AVR32 core has 28 instructions that increase the efficiency of load/store operations.

The AVR32 CPU has a seven-stage pipeline with three subpipelines (multiplication/MAC, load/store, and ALU) that allow arithmetic operations on nondependent data to be executed out of order and in parallel.

A conventional architecture has a single pipeline that stalls the code until each instruction is completed.

This can waste valuable computational resources during multiple-cycle instructions.

Logic in the AVR32 pipeline allows nondependent instructions to be executed simultaneously, using available pipeline resources.

Out of order execution can increase the throughput per cycle.

Hazard detection logic detects and holds dependent instructions at the beginning of the pipeline until the operation on which they depend is complete.
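
The dependency check behind such logic can be sketched in a few lines of C. This is a toy model only: the struct insn layout and the must_hold function are hypothetical names invented for the illustration, it considers only a read of a register that an earlier instruction has yet to write, and it does not describe Atmel's actual hardware.

```c
#include <stdbool.h>

/* Toy instruction record: one destination register and two source
 * registers, identified by number. Hypothetical layout. */
struct insn {
    int dest;
    int src1;
    int src2;
};

/* Returns true if `later` reads a register that `earlier` has not yet
 * written, so `later` would have to be held until `earlier` completes. */
bool must_hold(const struct insn *earlier, const struct insn *later)
{
    return later->src1 == earlier->dest || later->src2 == earlier->dest;
}
```

For instance, a multiply writing r3 followed by an add reading r3 gives a dependency, whereas a following subtract that touches only r6, r7 and r8 does not and could proceed in another subpipeline.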

The AVR32 eliminates many of the cycles used to write to and read from register files by forwarding data between the pipeline stages.

The results of instructions that finish execution before the writeback stage are immediately forwarded to the beginning of the pipelines, where instructions waiting for those results can use them.

By minimising the number of register file accesses, this feature saves both cycles and power.

All AVR32 results are forwarded as they are finished.

SIMD instructions in the AVR32 architecture can quadruple the throughput of certain DSP algorithms that require the same operation to be executed on a stream of data (eg motion estimation for MPEG decoding).

An 8bit sum of absolute differences (SAD) calculation is executed by loading four 8bit pixels from memory in a single load operation, then executing a packed subtraction of unsigned bytes with saturation, adding together the high and low pair of packed bytes and unpacking them into packed half-words.

These are then added together to get the SAD value.
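
For reference, the value being computed can be written as a plain scalar C routine. This sketch assumes four 8bit pixels packed into each 32bit word and uses an ordinary loop; the function name sad4 is made up for the example, and the code does not reproduce the AVR32's packed instructions, only the result they are described as producing.

```c
#include <stdint.h>

/* Sum of absolute differences over the four 8bit pixels packed into
 * each 32bit word. Scalar reference version, one pixel per iteration. */
uint32_t sad4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t pa = (a >> (8 * i)) & 0xFFu; /* pixel i of block a */
        uint32_t pb = (b >> (8 * i)) & 0xFFu; /* pixel i of block b */
        sum += (pa > pb) ? (pa - pb) : (pb - pa);
    }
    return sum;
}
```

The scalar loop handles one pixel per pass, which is where operating on all four packed bytes at once would save cycles.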

Although deep pipelines enable higher clock frequencies, they introduce significant cycle penalties whenever there are jumps in the program flow.

These branch penalties are particularly harsh for small inner loops.

To address this problem, the AVR32 pipeline has branch prediction logic that can accurately predict all change-of-flow instructions.

In addition, branches are "folded" with the target instruction, resulting in a zero-cycle branch penalty.

The AVR32 instruction set evolved from extensive benchmarking and refinement using advanced compiler technology.

The result is code that is 5 to 50% denser than that of competing cores, as measured with the EEMBC benchmark suite.

Denser code allows more instructions to be stored in the processor cache, thereby reducing the number of cache misses and increasing overall processor throughput per cycle.

The majority of CPU architectures were developed before operating system (OS) use became as pervasive as it is today.

As a result, CPU cores tend to waste cycles calling the OS or external applications.

The AVR32 architecture specifically supports the use of operating systems, in particular the Linux OS, with cycle-saving instructions that include an application call (Acall) to a subroutine and a system call (Scall) that calls an operating system routine.

The AVR32's advanced MMU and security modes also support advanced operating systems such as Linux.

The superior throughput of the AVR32 core reduces the number of clock cycles required and hence reduces power consumption.

In addition, the AVR32 is designed to minimise active power consumption at any clock rate by keeping data close to the CPU and minimising the unnecessary movement of data on buses that consume a lot of power.

For example, older MCU architectures copy the return address of a subroutine call to a memory stack, consuming unnecessary power.

The AVR32 eliminates this need by including a link register in the register file.

Another power-saving feature is to keep the status register and the return address for interrupts and exceptions in system registers, rather than moving data to and from the stack.

The high code density of the AVR32 also helps reduce power consumption by reducing the number of cycles and external memory accesses wasted on cache misses.

Atmel is developing families of ultra-low-power 32bit standard product MCUs based on the AVR32 core architecture, which the company expects to announce in the second quarter of 2006.

Prices for the 32bit MCUs are expected to be comparable to those of high-end 32bit MCUs.

The AVR32 core is also available for implementation in custom ICs.
