Power-efficient compiler technology and processor architectures

Power-efficient processors and compiler technology

In the domain of mobile, battery-powered, multi-mode, multimedia and wireless appliances, the architectural requirements are severe. An architecture must be able to run multiple applications, including future versions of those applications, and must be suitable for integration in multiple products, at the required performance and quality level and within given constraints for area, cost and power consumption. This requires the architecture cores as well as the memory hierarchy to be optimized. Furthermore, application code should be optimized to make the most efficient use of the available resources.

Coarse-grain reconfigurable array application-specific instruction-set processors (ASIPs)

To meet the stringent energy requirements of embedded devices, IMEC is developing a low-power ASIP in the scope of its Apollo program. A compiler framework, the dynamically reconfigurable embedded system compiler (DRESC), and an architecture template, the architecture for dynamically reconfigurable embedded systems (ADRES), have been developed together for the low-power ASIP, enabling efficient and effective programming and architecture exploration.

ADRES

ADRES is a power-efficient flexible architecture template that combines a very long instruction word (VLIW) DSP with a coarse-grain reconfigurable array. The VLIW DSP efficiently executes control-flow code by exploiting instruction-level parallelism. The array, containing many functional units, accelerates data-flow loops by exploiting high degrees of loop-level parallelism.

Two instances of the ADRES template have been designed to demonstrate its efficiency in two domains: multimedia and wireless communications. This efficiency is obtained through both the inherent power efficiency of ADRES and the addition of domain-specific instruction set extensions. These include, for example, 64-bit SIMD operations for the wireless domain. Layout experiments for the wireless ADRES instance, which features 16 functional units in array mode and 3 functional units in VLIW mode, have shown that the instance can be implemented in 6 mm² in a 90 nm commercial standard-cell technology, including a 32 kB instruction cache (I$), a 128 kB L1 data memory, and a 128 kB array configuration memory. Furthermore, a clock speed of 370 MHz was achieved.

DRESC

For a complex architecture like ADRES, an automatic design methodology and tools are essential. The DRESC retargetable C-compiler framework targets both the VLIW processor and the array, and includes the compiler, the binary utilities (assembler and linker), several simulators, and register-transfer level (RTL) generators. During 2007, the DRESC framework underwent a major rewrite, resulting in significantly faster compilation for a much wider range of ADRES instances.

To facilitate verification of both processor design and application mapping, an instruction-set simulator and a cycle-true simulator of the ADRES processor have been designed.

Benchmarks

To demonstrate the efficiency of the ADRES architecture and the effectiveness of the DRESC tool chain, several demanding wireless standards have been mapped onto the wireless ADRES instance, including 802.11a, 802.11n, 3GPP-LTE and DL16e. An evaluation of the results has shown that the ADRES/DRESC combination is competitive with other ASIPs in terms of power consumption and performance. At the same time, an ADRES ASIP has the advantage of being much easier to program than its competitors, while also allowing much more fine-grained customization.

Multiple loop nest locality optimization

To optimize the exploitation of power-efficient memory hierarchies, a technique was proposed to improve the spatial and temporal locality of data in the level 1 (L1) cache memory. This technique scales to large examples that involve multiple data arrays accessed in multiple loop nests. It consists of two steps: access locality optimization followed by layout locality optimization. The access locality optimization improves the temporal locality of data accessed from the L1 memory. This step passes constraints to the subsequent layout locality step, where the data layout is explored. The proposed technique can reduce the cache miss rate by 15% to 40% compared to state-of-the-art techniques C14381.
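The two kinds of locality mentioned above can be illustrated with a toy experiment. The sketch below is not the proposed technique; it only counts misses in a small direct-mapped cache for one loop nest, once after changing the access order (a loop interchange) and once after changing the data layout. The cache geometry, the array size and all function names are assumptions chosen for illustration.

```python
def count_misses(addresses, num_lines=64, line_words=8):
    """Count misses in a direct-mapped cache for a sequence of word addresses.

    The cache geometry (64 lines of 8 words) is an assumed toy configuration.
    """
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        block = addr // line_words      # memory block holding this word
        index = block % num_lines       # cache line the block maps to
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses


def row_major(i, j, rows, cols):
    """Word address of element (i, j) when rows are stored contiguously."""
    return i * cols + j


def column_major(i, j, rows, cols):
    """Word address of element (i, j) when columns are stored contiguously."""
    return j * rows + i


def column_walk(rows, cols, layout):
    """Addresses touched by a loop nest that visits the array column by column."""
    return [layout(i, j, rows, cols) for j in range(cols) for i in range(rows)]


def row_walk(rows, cols, layout):
    """Addresses touched by the interchanged loop nest (row by row)."""
    return [layout(i, j, rows, cols) for i in range(rows) for j in range(cols)]


ROWS, COLS = 128, 128

# Original code: column-by-column traversal of a row-major array.
baseline = count_misses(column_walk(ROWS, COLS, row_major))
# Changing the access order: interchange the two loops.
interchanged = count_misses(row_walk(ROWS, COLS, row_major))
# Changing the data layout: keep the loops, store the array column-major.
relaid = count_misses(column_walk(ROWS, COLS, column_major))

print(baseline, interchanged, relaid)
```

In this toy setting either transformation alone makes the accesses sequential. With multiple arrays shared by multiple loop nests, the best access order for one nest can conflict with the best layout for another, which is why the text describes the first step passing constraints to the second.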

Strength reduction for multiplications

To avoid the use of expensive core components such as multipliers, techniques have been explored that reduce the number of multiplications by constants. A multiplication is a more expensive operation than an addition or a shift. The proposed technique converts multiplications by constants in the source code into a series of shifts and additions, exploring various number representation systems and conversion techniques, among others. It weighs the cost of a multi-cycle multiply operation against that of the equivalent series of lower-strength additions and shifts. On larger examples such as MPEG-2 motion compensation, strength reduction was shown to improve performance (by over 15%) and to reduce energy consumption substantially on VLIW architectures.
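As a minimal sketch of the idea, the code below rewrites a multiplication by a non-negative constant as shifts, additions and subtractions. It compares the plain binary expansion of the constant with the canonical signed-digit (CSD) form. CSD is used here only as one well-known example of an alternative number representation; the text does not say which representations the proposed technique explores. All function names are hypothetical.

```python
def csd_digits(c):
    """Canonical signed-digit (non-adjacent form) digits of c >= 0.

    Digits are returned least significant first; each is -1, 0 or +1.
    """
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)   # +1 if c % 4 == 1, -1 if c % 4 == 3
            c -= d            # clears the low bit (may carry upwards)
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def multiply_by_constant(x, c):
    """Compute x * c using only shifts, additions and subtractions."""
    result = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            result += x << shift
        elif d == -1:
            result -= x << shift
    return result


def operation_count(c):
    """Add/subtract operations needed for the (binary, CSD) forms of c."""
    binary_terms = bin(c).count("1")
    csd_terms = sum(1 for d in csd_digits(c) if d != 0)
    return max(binary_terms - 1, 0), max(csd_terms - 1, 0)


# 7 = 4 + 2 + 1 needs two additions; 7 = 8 - 1 needs one subtraction.
print(multiply_by_constant(13, 7), operation_count(7))
```

Whether the shift-and-add sequence actually beats a multiply depends on the target: a compiler would compare the operation count against the latency and energy of the multiplier, which is the trade-off described above.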

Stream registers and compilation towards stream registers

Stream registers are an attractive alternative to regular register files in terms of both performance and energy consumption. These stream registers C12852 can be up to 60% more energy efficient. Their interface between the register file and the memory is asymmetric with respect to the interface between the register file and the datapath. They are especially useful for applications with high spatial and temporal locality, such as wireless baseband processing and wireless forward error correction. A transformation technique was also proposed to improve locality in these stream register files. In addition, a stream register allocator was proposed, based on a geometrical model in which data accesses in space and time are represented in one single space.
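A toy model can make the asymmetric interface more concrete. The sketch below assumes that the memory side of the register is wide (a whole line of words is transferred at once) while the datapath side is narrow (one word per read). This reading of "asymmetric", the register width of 8 words and the class name are all assumptions, not details taken from the text. The model only counts wide memory transfers and makes no energy claim.

```python
class StreamRegister:
    """Toy stream register: wide port to memory, narrow port to the datapath.

    Both the port widths and the refill policy are illustrative assumptions.
    """

    def __init__(self, memory, width_words=8):
        self.memory = memory
        self.width = width_words
        self.base = None            # address of the first word currently held
        self.words = []
        self.memory_accesses = 0    # number of wide transfers from memory

    def read(self, addr):
        """Return one word; refill the whole register when addr is not held."""
        base = addr - addr % self.width
        if base != self.base:
            self.words = self.memory[base:base + self.width]
            self.base = base
            self.memory_accesses += 1
        return self.words[addr - base]


memory = list(range(100, 164))      # 64 words of example data
reg = StreamRegister(memory)

# A streaming access pattern: 64 word reads served by 8 wide transfers.
total = sum(reg.read(a) for a in range(64))
print(total, reg.memory_accesses)
```

With a sequential pattern like this one, each wide transfer is fully consumed before the next refill, which is the kind of spatial locality the text associates with baseband and forward error correction kernels.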
