PrePrint: Resilient High-Performance Processors with Spare RIBs
Resilience to defects and parametric variations is of utmost concern for future technology generations. Traditional redundancy to repair defects, however, can incur performance penalties due to multiplexing. In this work, we present a design incorporating bitsliced redundancy along the datapath. This approach allows us to tolerate defects without hurting performance, since we leave the same bit offset unused throughout the execution core. In addition, we can use this approach to enhance performance by avoiding excessively slow critical paths created by random delay variations. By adding a single bitslice, for instance, we can reduce the delay overhead of random process variations by 10% while providing fault tolerance for 15% of the execution core.
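The spare-bitslice idea can be illustrated with a minimal sketch. It assumes a datapath of `width` logical bits built from `width + 1` physical slices, with the same slice left idle everywhere. The function names and the per-slice delay numbers are hypothetical and are not taken from the paper; cross-slice paths such as carry chains are ignored.

```python
def map_logical_to_physical(width, unused_slice):
    """Map each logical bit to a physical slice, skipping the idle slice.

    There are width + 1 physical slices (one spare); unused_slice is idle.
    """
    if not 0 <= unused_slice <= width:
        raise ValueError("unused_slice must be in 0..width")
    return [bit if bit < unused_slice else bit + 1 for bit in range(width)]


def pick_unused_slice(slice_delays, defective=None):
    """Leave a defective slice idle if there is one, otherwise the slowest."""
    if defective is not None:
        return defective
    return max(range(len(slice_delays)), key=lambda i: slice_delays[i])


def critical_delay(slice_delays, unused_slice):
    """Worst-case delay over the slices that remain in use."""
    return max(d for i, d in enumerate(slice_delays) if i != unused_slice)


# Hypothetical normalized delays for 5 physical slices (4 logical bits + 1 spare).
delays = [1.00, 1.02, 1.15, 0.98, 1.01]
idle = pick_unused_slice(delays)
mapping = map_logical_to_physical(4, idle)
```

In this toy setting the slowest slice (index 2) is left idle, so the critical delay falls from 1.15 to 1.02; if a slice is defective, the same mapping routes around it instead.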
PrePrint: Evaluating the Overheads of Soft Error Protection Mechanisms in the Context of Multi-bit Errors at the Scope of a Processor Core
As circuit feature sizes shrink, multi-bit errors become more significant, while previously unprotected combinational logic becomes more vulnerable, requiring a reevaluation of the resiliency design space within a processor core. We present Svalinn, a framework that provides comprehensive analysis of multi-bit error protection overheads, to facilitate better architecture-level design choices. Supported protection techniques include hardening, parity, ECC, parity prediction, residue codes, as well as spatial and temporal redundancy. The overheads of these are characterized via synthesis and, as a case study, presented here in the context of a simple OpenRISC core. The analysis provided by Svalinn shows the difference in protection overheads per component and circuit category in terms of area, delay and energy. We show that the contribution of logic components to the area of a simple core increases from 35% to as much as 54% with comprehensive multi-bit error protection. We also observe that the overhead of protection could increase from 29% to as much as 97% when transitioning from single-bit to multi-bit protection. Svalinn analysis also suggests that storage components will continue benefiting from the use of ECC, while products requiring comprehensive coverage of logic components might use redundancy and potentially residue codes. Optimal core-level protection will require novel combinations of these.
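To make the single-bit versus multi-bit distinction concrete, the following minimal sketch compares two of the listed techniques, a single parity bit and a mod-3 residue code, on a stored word. It is not part of Svalinn; the function names, the choice of modulus 3, and the example word are assumptions made for illustration.

```python
def parity(word):
    """Parity of the word: 1 if the number of set bits is odd."""
    return bin(word).count("1") % 2


def parity_detects(word, error_mask):
    """True if flipping the bits in error_mask changes the parity."""
    return parity(word) != parity(word ^ error_mask)


def residue_detects(word, error_mask, modulus=3):
    """True if flipping the bits in error_mask changes the residue."""
    return word % modulus != (word ^ error_mask) % modulus


word = 42  # 0b101010, chosen arbitrarily for the example

single_bit = (parity_detects(word, 0b001), residue_detects(word, 0b001))
double_bit = (parity_detects(word, 0b011), residue_detects(word, 0b011))
triple_bit = (parity_detects(word, 0b111), residue_detects(word, 0b111))
```

For this word, both checks catch the single-bit flip, parity misses the two-bit flip, and the mod-3 residue misses the three-bit flip because the value changes by a multiple of 3. Neither check alone covers all multi-bit errors, which is consistent with the abstract's remark that comprehensive protection needs combinations of techniques.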
PrePrint: Parallelism and Boosting Limits of Dim Silicon
Supply voltage scaling has stagnated in recent technology nodes, leading to so-called “dark silicon.” To increase overall CMP performance, it is necessary to improve the energy efficiency of individual tasks so that more of them can be executed simultaneously within thermal limits. In this paper, we investigate the limit of voltage scaling together with task parallelization to maintain task completion latency while reducing energy consumption. Additionally, we examine improvements in energy efficiency and parallelism when serial portions of code can be overcome by quickly boosting the operating voltage of a core. When accounting for parallelization overheads, minimum task energy is obtained at “near threshold” supply voltages across six commercial technology nodes and provides a 4× improvement in overall CMP performance. Boosting is most effective when the task is modestly parallelizable, but not highly parallelizable.
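The trade-off between slower cores and more parallelism can be sketched with a simple Amdahl's-law calculation. This is an illustrative model only, not the paper's: it assumes a task with a fixed serial fraction, cores running at a relative frequency `rel_freq` (1.0 is the full-speed baseline), and no parallelization overhead, which the paper does account for. The function name and parameters are hypothetical.

```python
import math


def cores_needed(serial_fraction, rel_freq, boost_serial=False):
    """Smallest core count that keeps latency at or below the baseline of
    one core at full speed. Returns None if no core count is sufficient.

    Without boosting: (s + (1 - s) / n) / r <= 1, so n >= (1 - s) / (r - s).
    With the serial part boosted to full speed:
    s + (1 - s) / (n * r) <= 1, so n >= 1 / r.
    """
    s, r = serial_fraction, rel_freq
    if not (0 <= s < 1 and 0 < r <= 1):
        raise ValueError("need 0 <= serial_fraction < 1 and 0 < rel_freq <= 1")
    eps = 1e-12  # guard against floating-point noise just above an integer
    if boost_serial:
        return math.ceil(1.0 / r - eps)
    if r <= s:
        return None  # the slowed serial part alone already exceeds the baseline
    return math.ceil((1.0 - s) / (r - s) - eps)


# Cores running at half speed, with and without boosting the serial part.
highly_parallel = (cores_needed(0.1, 0.5), cores_needed(0.1, 0.5, True))
modestly_parallel = (cores_needed(0.3, 0.5), cores_needed(0.3, 0.5, True))
```

Under these assumptions, a task that is 10% serial needs 3 cores without boosting and 2 with it, while a task that is 30% serial needs 4 cores without boosting and 2 with it. Boosting matters more as the serial fraction grows, at least in this overhead-free model.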
PrePrint: Automating Stressmark Generation for Testing Processor Voltage Fluctuations
Rapid current changes (large di/dt) can lead to significant power supply voltage droops and timing errors in modern microprocessors. To test a processor’s resilience to such errors and to determine appropriate operating conditions, engineers generally create manual di/dt stressmarks that have large current variations at close to the resonance frequency of the power distribution network (PDN) to induce large voltage droops. Although this process can uncover potential timing errors and be used to determine processor design margins for voltage and frequency, it is time-consuming and may need to be repeated several times to generate appropriate stressmarks for different system conditions (e.g., different frequencies or di/dt throttling mechanisms). Furthermore, generating efficient di/dt stressmarks for multi-core processors is difficult due to their complexity and synchronization issues. In this article, we measure and analyze di/dt issues on state-of-the-art multi-core x86 systems. We present an AUtomated DI/dT stressmark generation framework, referred to as AUDIT, to generate di/dt stressmarks quickly and effectively for multi-core systems. We showcase AUDIT's capabilities to adjust to microarchitectural and architectural changes. We also present a dithering algorithm to address thread alignment issues on multi-core processors. We compare standard benchmarks, existing di/dt stressmarks, and AUDIT-generated stressmarks executing on multi-threaded, multi-core systems with complex out-of-order pipelines.
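The timing arithmetic behind a resonance-targeted stressmark can be sketched as follows. This does not describe how AUDIT works; it assumes a first-order LC model of the PDN, with resonance at 1/(2π√(LC)), and a loop that alternates equal high-power and low-power phases. The function names and the component values are hypothetical.

```python
import math


def resonance_frequency(inductance_h, capacitance_f):
    """First-order LC resonance frequency in Hz: 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def phase_lengths(clock_hz, resonance_hz):
    """Split one resonance period into high-power and low-power phases,
    both expressed in core clock cycles."""
    cycles_per_period = round(clock_hz / resonance_hz)
    high = cycles_per_period // 2
    low = cycles_per_period - high
    return high, low


# Hypothetical PDN values giving a resonance close to 100 MHz.
f_res = resonance_frequency(1e-11, 2.533e-7)
high_cycles, low_cycles = phase_lengths(3.0e9, 1.0e8)
```

With a 3 GHz clock and a 100 MHz resonance, one period spans 30 cycles, so the loop would alternate roughly 15 cycles of high activity with 15 cycles of low activity. Changing the clock frequency changes these counts, which is one reason manually written stressmarks need to be regenerated for different system conditions.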
PrePrint: Improving Throughput of Many-core Processors Based on Unreliable Emerging Devices under Power Constraint
Devices based on emerging technologies such as carbon nanotubes (CNTs) are faster and consume much less power than CMOS devices of the same size. However, current CNT devices exhibit a much higher defect rate than CMOS devices. To reduce the defect rate of CNT devices, a device-level redundancy technique can be adopted, but more redundancy in turn increases area, delay, and power consumption. In this article, we propose to use slightly less device-level redundancy than would be required for every core in a CNT processor to be defect-free at a given yield target. This leaves some defective cores on a chip, but cores become smaller, faster, and less power-hungry, and many-core processors can tolerate some defective cores by design (i.e., core-level redundancy). We show that a CNT processor with such cores can provide up to 75% higher throughput while occupying 46% less area than a CNT processor designed to operate all the cores with more device-level redundancy, for the same power consumption and yield.
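The effect of core-level redundancy on yield can be sketched with a binomial model. This is an illustration under stated assumptions, not the paper's yield model: core defects are treated as independent, every core has the same probability `core_yield` of being defect-free, and a chip is usable if at least `min_good` cores work. All names and numbers are hypothetical.

```python
from math import comb


def chip_yield(n_cores, min_good, core_yield):
    """Probability that at least min_good of n_cores are defect-free,
    assuming independent and identically distributed core defects."""
    return sum(
        comb(n_cores, k) * core_yield**k * (1.0 - core_yield) ** (n_cores - k)
        for k in range(min_good, n_cores + 1)
    )


# A 4-core chip whose cores are each defect-free with probability 0.9.
all_cores_required = chip_yield(4, 4, 0.9)
one_defect_tolerated = chip_yield(4, 3, 0.9)
```

In this example, requiring all four cores to work gives a chip yield of about 0.656, while tolerating one defective core raises it to about 0.948. That slack is what allows each core to carry less device-level redundancy for the same chip-level yield target.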