QED Post-silicon Validation and Debug

During post-silicon validation and debug, manufactured integrated circuits (ICs) are tested in actual system environments to detect and fix design flaws (bugs). Traditional pre-silicon verification is inadequate; as a result, many critical bugs are detected only after ICs are manufactured (i.e., during post-silicon validation and debug). However, post-silicon validation and debug is challenging because traditional techniques are ad hoc (e.g., insertion of various Design for Debug structures based on various heuristics), and the associated costs are rising faster than design costs. These challenges are further magnified by the slowdown of silicon CMOS scaling, as ICs incorporate tremendous complexity to meet increasing demands for improvements in performance and energy efficiency. Examples include the use of multiple processor cores, co-processors, hardware accelerators, uncore components (defined as components in an SoC that are neither the processor cores nor the co-processors / accelerators; examples of uncore components include cache controllers, memory controllers, and interconnection networks), and power management units. This dissertation presents the Quick Error Detection (QED) technique to overcome post-silicon validation and debug challenges. QED is essential because long error detection latency, the time elapsed between the occurrence of an error caused by a bug and its manifestation as an observable failure, severely limits the effectiveness of existing post-silicon validation and debug approaches. Experimental results collected using several state-of-the-art commercial hardware platforms, as well as results obtained from simulations of various bug scenarios that occurred in commercial multi-core System-on-Chips (SoCs), demonstrate the effectiveness and practicality of QED: 1. QED improves error detection latencies by up to 9 orders of magnitude, from billions of clock cycles to very few clock cycles (generally fewer than 1,000 clock cycles for most bug scenarios). 2. QED enables up to 4-fold improvement in bug coverage (i.e., QED detects bugs that may be missed by traditional post-silicon validation approaches). 3. Symbolic Quick Error Detection (Symbolic QED) localizes difficult logic bugs automatically in a few hours (less than 7 hours for most bug scenarios), without requiring any additional hardware. Localizing a bug involves identifying a bug trace (defined as a sequence of inputs, e.g., instructions, that activates and detects the bug) and identifying the hardware design block where the bug is (possibly) located. This was demonstrated for an open-source multi-core SoC consisting of 500 millions transistors. In contrast, it might take days or weeks (or even months) of manual work, per bug, when traditional techniques are used. QED is effective for bugs inside processor cores, co-processors / software-programmable accelerators (which are components in an SoC that can be programmed using software to perform a specific set of functions, examples include graphic processing unit and digital signal processor), non-programmable hardware accelerators (which are components in a SoC that are designed to perform a pre-defined set of functions, but cannot be programmed using software, examples include accelerators for video or audio compression), and uncore components such as cache controllers, memory controllers, and interconnection networks. QED has been successfully used in industry during post-silicon validation and debug of a commercial multi-core SoC.

Download Now

Release Date
Pages pages
Rating 4/5 (18 users)