SUBTHRESHOLD QUASI-DELAY-INSENSITIVE FILTER BANK DESIGNS
CHANG XIAOFEI
(B.Eng. (1st Hons.), NUS)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2010
ACKNOWLEDGEMENT
First, I want to thank my supervisor, Dr Lian Yong. He is always patient and helpful in my research. Thanks to his trust, I could initiate and develop my ideas gradually along the way. Besides the research topic itself, I have learnt how to define and solve problems, which is valuable for my future research career. In addition, whenever I was stopped by difficulties, he could always enlighten me with great ideas and approaches. Without him, this topic and thesis could never have been done.
Second, I want to thank all my colleagues in the lab, especially Tan Jun, Zhang Jinghua and Wei Ying. Thanks for their patience; I know I sometimes appeared annoying, asking silly questions. Without them, the project would not have gone so smoothly.
Finally, I want to thank my wife, Lulu. Thanks for always being supportive and for bearing a disrupted life with me.
TABLE OF CONTENTS
Chapter 1 Introduction
1.1 Challenges and Opportunities
1.2 Handshake Protocols
1.3 Advantages and Disadvantages
1.4 State-of-the-art Asynchronous Systems
1.5 Subthreshold Design
1.6 Objectives
Chapter 2 Adders
2.1 Circuit Design
2.2 Simulation Results
2.3 Improvement
Chapter 3 FIR Filters
3.1 Latch-Based FIFO Implementation
3.2 Register-Based FIFO Implementation
3.3 Coefficients Implementation
3.4 Power Reduction Techniques
3.5 Simulation Results
Chapter 4 Filter Banks
4.1 ECG Signal Processing
4.2 Frequency Masking Techniques
4.3 Implementation
Chapter 5 Conclusion
SUMMARY
With increasing operating speed and circuit complexity and decreasing feature size of digital circuits, it becomes difficult to design traditional synchronous circuits due to clock skew and jitter in clock distribution. In addition, considerable power is consumed by the clock tree itself. Asynchronous circuits, which synchronize data communication with local handshake signals, are believed to be one of the candidates suitable for large systems.
Quasi-delay-insensitive (QDI) circuits are a member of the asynchronous family. Their functionality is independent of the delays between different components in the circuit. Therefore they are more robust than synchronous circuits against delay variation caused by Process, Voltage and Temperature (PVT) variations. This delay variation may violate a synchronous design’s timing assumptions; otherwise, a large timing margin must be built into the design to guarantee functionality in all conditions, which greatly degrades the synchronous circuit’s performance. QDI circuits, on the other hand, are able to run as fast as the physical environment allows.
The effect of PVT variation is more significant in the subthreshold region, where delay changes dramatically with it. Operating circuits in the subthreshold region is beneficial in terms of ultra-low power consumption and energy efficiency. Therefore operating QDI circuits in the subthreshold region reduces power consumption on one hand and suppresses the susceptibility to PVT variations on the other.
In this thesis, we study subthreshold QDI circuit design for both combinational and sequential logic circuits. We first show, by means of examples, that a subthreshold QDI full adder exhibits a quite competitive Power-Delay Product (PDP). Then we move on to explore the design of a QDI FIR filter. Finally, a complete system, i.e. a filter bank for Electrocardiograph (ECG) sensors in Body Sensor Networks, is demonstrated.
LIST OF FIGURES
Figure 1: Handshake Solution Design Flow
Figure 2: Four-Phase Dual-rail Protocol System-Level Schematic
Figure 3: Four-phase Signal Transition
Figure 4: Muller C element Symbol and Truth Table
Figure 5: Implementation of Asynchronous AND Gate
Figure 6: Propagation Signal Generation Circuit
Figure 7: Carry-Out Signal Generation Circuit
Figure 8: Sum Signal Generation Circuit
Figure 9: System Level Implementation
Figure 10: Die Photo of Adder Chip
Figure 11: Layout of Adder Chip
Figure 12: Testing Results of 1-Bit QDI-DR Full Adder
Figure 13: Delays, Power and PDP under 200mV
Figure 14: Delays, Power and PDP under 250mV
Figure 15: Adder Output for Mismatched Input Data
Figure 16: Functional Error Due to Long Chain
Figure 17: Circuit Improvement to Balance Output - PUN
Figure 18: Circuit Improvement to Balance Output - PDN
Figure 19: Circuit Improvement to Reduce Competition
Figure 20: Output through Improved Adder for Mismatched Data
Figure 21: Layout of Improved Full Adder
Figure 22: Full Adder Functionality Demonstration
Figure 23: Lowest Operating Voltage for Improved Full Adder
Figure 24: Propagation Delays against Temperatures
Figure 25: Energy against Temperatures
Figure 26: Monte Carlo Analysis of Adder under 100mV
Figure 27: Monte Carlo Analysis of Adder under 150mV
Figure 28: Monte Carlo Analysis of Adder under 200mV
Figure 29: Monte Carlo Analysis of Adder under 250mV
Figure 30: General Filter System-Level Schematic
Figure 31: Asynchronous FIFO System-Level Schematic
Figure 32: FIFO Circuit Level Schematic
Figure 33: Post-Layout Simulation Result of FIFO Working under 200mV
Figure 34: Post-Layout Simulation Result of FIFO Working under 300mV
Figure 35: General Dual-Rail Filter System Level Schematic
Figure 36: Outputs of Latch-based Filter
Figure 37: Register-Based Dual-Rail Filter System Level Schematic
Figure 38: Modified Register-Based Dual-Rail Filter System Level Schematic
Figure 39: Basic Implementation of Rising-Edge-Triggered Register
Figure 40: Current for Basic Implementation of 1-bit FIFO
Figure 41: Stacking Implementation of Inverter
Figure 42: Stacking Implementation of Rising-Edge-Triggered Register
Figure 43: Current for Stacking Implementation of 1-bit FIFO
Figure 44: MTCMOS Implementation of Rising-Edge-Triggered Register
Figure 45: Current for MTCMOS Implementation of 1-bit FIFO
Figure 46: Power Gating Implementation of Rising-Edge-Triggered Register
Figure 47: Current for Basic Implementation of 11-bit FIFO
Figure 48: Current for Stacking Implementation of 11-bit FIFO
Figure 49: Current for MTCMOS Implementation of 11-bit FIFO
Figure 50: Current for Power Gating Implementation of 11-bit FIFO
Figure 51: Power Gating and MTCMOS Implementation of Rising-Edge-Triggered Register
Figure 52: Current for Power Gating with MTCMOS Implementation of 11-bit FIFO
Figure 53: Frequency Responses of Filter Prototype
Figure 54: Impulse Response of FIR Filter (1)
Figure 55: Impulse Response of FIR Filter (2)
Figure 56: Current for Filter Prototype w/o Power Gating
Figure 57: Current for Filter Prototype with Power Gating
Figure 58: Delays vs Temperatures for 6th-Order FIR Filter
Figure 59: Power vs Temperatures for 6th-Order FIR Filter
Figure 60: Monte Carlo Simulation of Filter at 170mV
Figure 61: Monte Carlo Simulation of Filter at 200mV
Figure 62: ECG Wave
Figure 63: Interpolation Implementation
Figure 64: Graph Demonstration of Interpolation Techniques
Figure 65: Concept of Complementary Filters
Figure 66: System Level Implementation of Filter Bank
Figure 67: Matlab Simulation of B1-B8
Figure 68: Matlab Simulation of b3
Figure 69: Register-Sharing Scheme
Figure 71: Coefficient-Sharing Scheme
LIST OF TABLES
Table 1: Different States in Dual-rail Protocol
Table 2: Truth Table of Asynchronous AND Gate
Table 3: Comparison of PDP between Different Subthreshold Adders
Table 4: Monte Carlo Analysis of Improved Adder
Table 5: Implementation of Coefficient ‘-2’
Table 6: Incorrect Implementation of Coefficient ‘-2’
Table 7: Comparison among Basic, Stacking and MTCMOS Implementations
Table 8: Comparison among Different Power Reducing Techniques
Table 9: Impulse Response of FIR Filter
Table 10: Monte Carlo Simulation of 1b Filter
Table 11: Sub-bands of Filter Bank
Table 12: Coefficients Positions for Band-Edge Shaping Filter
Table 13: Comparison of Power between RS and CS
Table 14: Power of Masking Filter
LIST OF ABBREVIATIONS
CAM: Caltech Asynchronous Microprocessor
CS: Coefficient Sharing
CSP: Communicating Sequential Processes
DCVSL: Differential Cascode Voltage Switch Logic
DI: Delay Insensitive
DPA: Differential Power Analysis
DR: Dual Rail
DTCMOS: Dynamic Threshold CMOS
ECG: Electrocardiogram
EDA: Electronic Design Automation
EMI: Emission of Electromagnetic Interference
FIFO: First-In-First-Out
FIR: Finite Impulse Response
HTCMOS: High-Threshold CMOS
IIR: Infinite Impulse Response
ITRS: International Technology Roadmap for Semiconductors
LTCMOS: Low-Threshold CMOS
MIPS: Million Instructions per Second
MTCMOS: Multiple-Threshold CMOS
PDN: Pull-Down Network
PUN: Pull-Up Network
PVT: Process, Voltage and Temperature
QDI: Quasi Delay Insensitive
RISC: Reduced Instruction Set Computer
RFID: Radio Frequency Identification
RS: Register Sharing
SDI: Scalable Delay Insensitive
SRAM: Static Random Access Memory
STG: State Transition Graph
UDVS: Ultra-Dynamic Voltage Scaling
1.1 Challenges and Opportunities
With the continuous demand for higher performance, higher complexity and smaller feature size, the challenges associated with synchronous circuits become severe. For example, as devices move into the 90nm range, it is becoming extraordinarily difficult to find and predict the critical-path delays. As technology scales down to 45nm and below, conditions worsen: shot noise, charge sharing, thermal effects, supply-voltage noise and process variations conspire to confound delay calculations. Unpredictable delay may cause the circuit to fail under some physical conditions. Otherwise, a large timing margin must be set to make the circuit ‘safe’, which in turn degrades circuit performance.
In addition to unpredictable delays, large power dissipation is another problem associated with synchronous design, because all components are active according to the rhythm of the global clock even if some of them are not involved in the current computation. Even worse, systems-on-chip with many millions of transistors involve large clock-current surges, which tax a circuit’s power distribution nets as well as its thermal stability.
Smart designs have been proposed to deal temporarily with these issues in the synchronous domain. For example, power is reduced using clock gating [2, 3, 4], Multiple Threshold CMOS (MTCMOS) [5, 6, 7], Dynamic Threshold CMOS (DTCMOS) [8] and substrate biasing [9]. However, such techniques become difficult or even impossible to implement as the scale and complexity of the design increase. For example, the control logic for clock gating dissipates power itself and is becoming difficult to implement due to the clock skew problem. In addition, these techniques are coarse-grained: they cannot pinpoint a single device that should be turned off or assigned a high threshold voltage.
In this circumstance, asynchronous design comes into view. Its inherent communication style may solve the problems associated with synchronous design. Predictions for and developments in this area are currently encouraging, as evidenced by the 2007 International Technology Roadmap for Semiconductors’ (ITRS) prediction of a likely shift from synchronous to asynchronous design styles in order to increase circuit robustness, decrease power, and alleviate many clock-related issues. The 2007 ITRS predicts that asynchronous circuits will account for 20% of chip area within the next 5 years, and 25% of chip area within the next 8 years [10].
As mentioned in Bernard Cole’s article [11], the outlook for asynchronous design is bright.
First, companies active in developing asynchronous logic are shifting from selling a particular IP approach to becoming fabless IC companies. Examples are Handshake Solutions, Camgian Microsystems and Fulcrum Microsystems. Handshake Solutions designed the Timeless Design Environment (TiDE) with HASTE, a high-level design language based on Communicating Sequential Processes (CSP) [12]. Together with third-party EDA flows (Cadence, Synopsys, Magma and Mentor Graphics), it supports the complete design process from entry to layout, as shown in Figure 1 [13].
Figure 1: Handshake Solution Design Flow
Camgian Microsystems specializes in low-power wireless-enabled sensor networks, which help monitor remote assets and environments. Its services include battlefield situational awareness, border security, cargo tracking, industrial monitoring and physical security [14].
Fulcrum Microsystems has become a leading provider of highly integrated, high-performance switch chips using its patented technology [15].
Fourth, in-house EDA tools and design flows have been developed. Starting from the mid-1980s, the Caltech program-transformation approach [16] and Chu’s state-transition-graph (STG) approach [17] came into view. Soon after, Burns and Martin [18], Brunvand, and van Berkel [19] proposed similar methods for the syntax-directed compilation of high-level descriptions into asynchronous circuits. Meanwhile, another approach is to implement asynchronous standard cells in order to make use of existing EDA tools. The University of Southern California announced an asynchronous standard library that supports Cadence DFII files for automatic place and route using Silicon Ensemble [20]. Asynchronous design is expected to gain market share once more efficient EDA tools become available.
Figure 2 shows the schematic of the four-phase dual-rail protocol. As mentioned before, handshake signals comprise two components, a forward-moving Request signal and a backward-moving Acknowledge signal. However, Request does not appear in Figure 2. This is due to the characteristics of the four-phase dual-rail protocol, as explained shortly.
Figure 2: Four-Phase Dual-rail Protocol System-Level Schematic
Four-phase means there are four steps in each data communication process, as indicated by the dotted arrows in Figure 3.
Figure 3: Four-phase Signal Transition
In Phase 1, when the data is ready and Acknowledge is low, the sender puts valid data on the data rails and sets Request high to inform the receiver that data is ready. In Phase 2, once the receiver detects the high Request, it absorbs the data and then sets Acknowledge high. In Phase 3, the sender responds by taking Request low, from which point the data is no longer valid. In Phase 4, the receiver drives Acknowledge low, which completes one full cycle of data communication. The circuit is now ready for a new round of handshaking and data communication.
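The four phases can be traced with a minimal Python sketch. The function name `four_phase_cycle` and the tuple layout are illustrative assumptions, not part of the thesis design; the sketch only replays the order of the Request and Acknowledge transitions described above.

```python
def four_phase_cycle(data):
    """Replay one four-phase transfer.

    Returns the value the receiver captured and a trace of
    (phase, request, acknowledge, data_valid) tuples.
    """
    trace = []
    req, ack = 0, 0  # idle link: both handshake wires low

    # Phase 1: sender places valid data and raises Request.
    req = 1
    trace.append((1, req, ack, True))

    # Phase 2: receiver absorbs the data, then raises Acknowledge.
    received = data
    ack = 1
    trace.append((2, req, ack, True))

    # Phase 3: sender lowers Request; data is no longer valid.
    req = 0
    trace.append((3, req, ack, False))

    # Phase 4: receiver lowers Acknowledge; the link is idle again.
    ack = 0
    trace.append((4, req, ack, False))

    return received, trace


if __name__ == "__main__":
    value, trace = four_phase_cycle(0b1011)
    for phase, req, ack, valid in trace:
        print(f"Phase {phase}: Request={req} Acknowledge={ack} data_valid={valid}")
```

Each transfer costs two transitions on each handshake wire, which is the return-to-zero overhead of a four-phase protocol.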
Dual-rail means each bit’s information is encoded on two rails. As shown in Table 1 below, bit A is encoded by A_t (true rail) and A_f (false rail). There are four combinations.
Table 1: Different States in Dual-rail Protocol
As shown in Figure 2, there is no explicit rail for Request, since data validity is implied by the dual-rail encoding. Once the data enters a valid state, Request is taken to be set; otherwise, Request is reset. Simply put, the implicit Request is equivalent to (A_t OR A_f).
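A small Python sketch of the encoding and the implicit Request follows. The helper names are hypothetical. Treating the combination with both rails high as illegal is an assumption based on conventional dual-rail encoding, since the contents of Table 1 are not reproduced here.

```python
EMPTY = (0, 0)  # both rails low: no data present


def encode(bit):
    """Encode a Boolean as (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)


def implicit_request(rails):
    """Request is implied by the data itself: A_t OR A_f."""
    a_t, a_f = rails
    return a_t | a_f


def decode(rails):
    """Return True/False for valid data, None for the empty state."""
    a_t, a_f = rails
    if (a_t, a_f) == EMPTY:
        return None
    if a_t and a_f:
        # Assumption: both rails high is an unused, illegal state.
        raise ValueError("both rails high is not a valid dual-rail state")
    return bool(a_t)


if __name__ == "__main__":
    for rails in (EMPTY, encode(0), encode(1)):
        print(rails, "request =", implicit_request(rails), "value =", decode(rails))
```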
However, for Delay-Insensitive (DI) circuits, functionality is independent of delays. They can sense completion of computation and run as quickly as the current physical environment allows [21]. However, pure DI circuits can use only the Muller C element as a computational component, which makes the set of DI circuits quite small and impractical for implementing actual circuits [22]. The Muller C element is shown in Figure 4. When both inputs change to 1 or both change to 0, the output changes to 1 or 0 accordingly; otherwise, the output retains its previous value. The circuits designed in this project are Quasi-Delay-Insensitive (QDI) circuits. QDI circuits differ from DI circuits by the additional timing assumption of the isochronic fork. In an isochronic fork, when a transition on one output is acknowledged, and thus completed, the transitions on all outputs are acknowledged and thus completed [23].
Figure 4: Muller C element Symbol and Truth Table
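The behaviour of the Muller C element can be written as a short behavioural model in Python. The class name `CElement` is an illustrative assumption, and the model ignores all circuit-level detail.

```python
class CElement:
    """Behavioural model of a two-input Muller C element."""

    def __init__(self, initial=0):
        self.out = initial  # the element is state-holding

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # otherwise it keeps its previous value.
        if a == b:
            self.out = a
        return self.out


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(f"a={a} b={b} -> out={c.update(a, b)}")
```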
One simple dual-rail AND gate is used to illustrate the idea of delay insensitivity. For a synchronous AND gate, input delays need to be balanced; otherwise, glitches are generated and power is wasted. For a dual-rail AND gate, delays are taken care of automatically. Table 2 shows the truth table of the dual-rail asynchronous AND gate, where A and B are inputs and Y is the output.
The truth table can be categorized into three groups, as shown in its left column. In Group 1, data A and B are both in the empty state: neither has arrived at the gate. The output Y is also in the empty state, which indicates that no computation takes place when there is no data. In Group 2, either A or B is in the empty state: one has arrived while the other lags behind. The output Y remains in the empty state. The QDI gate does not generate a wrong output or glitches but simply keeps waiting until all valid data has arrived. In Group 3, both A and B are valid: both have arrived at the gate. The AND gate outputs valid logic values accordingly.
The implementation of the asynchronous AND gate [24] is shown below.
Figure 5: Implementation of Asynchronous AND Gate
It is assumed that the initial state is empty, that is, all data rails in Figure 5 carry 0s. Only when both A and B become valid will one of the C elements fire and the output become valid; otherwise the output stays in the empty state. For Y_t to be 1, both A and B must be valid true, i.e. A_t and B_t are both 1. For Y_f to be 1, both A and B must be valid and at least one of them must be valid false, i.e. at least one of A_f and B_f is 1.
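The gate behaviour described above can be sketched in Python. The structure used here, one C element per valid input combination with the three "false" combinations OR-ed onto Y_f, is an assumption in line with the text; the exact schematic of Figure 5 is not reproduced, and all names are illustrative.

```python
class CElement:
    """State-holding element: output changes only when inputs agree."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class DualRailAnd:
    """Dual-rail AND gate built from four C elements (assumed structure)."""

    def __init__(self):
        self.c_tt = CElement()  # A true,  B true  -> Y true
        self.c_tf = CElement()  # A true,  B false -> Y false
        self.c_ft = CElement()  # A false, B true  -> Y false
        self.c_ff = CElement()  # A false, B false -> Y false

    def update(self, a, b):
        (a_t, a_f), (b_t, b_f) = a, b
        y_t = self.c_tt.update(a_t, b_t)
        y_f = (self.c_tf.update(a_t, b_f)
               | self.c_ft.update(a_f, b_t)
               | self.c_ff.update(a_f, b_f))
        return (y_t, y_f)


if __name__ == "__main__":
    gate = DualRailAnd()
    empty, true, false = (0, 0), (1, 0), (0, 1)
    print(gate.update(true, empty))   # B lags behind: output stays empty
    print(gate.update(true, false))   # both valid: output becomes valid false
    print(gate.update(empty, empty))  # return to zero
```

Because the C elements hold their state, the output in this model also stays valid until both inputs have returned to the empty state.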
1.3.2 Average-Case Performance
Asynchronous designs exhibit average-case performance instead of worst-case performance. Synchronous designs cannot avoid worst-case performance since all possible computations must complete before results can be latched, so the clock speed must accommodate the critical path. Asynchronous design, on the other hand, is capable of sensing the completion of computations and running as fast as the given environment allows.
1.3.3 Reduced Power Consumption
Standard synchronous circuits have to toggle clock lines, and possibly charge and discharge circuit components unused in the current computation. Asynchronous circuits, on the other hand, have transitions only in components involved in the current computation. Therefore asynchronous circuits are able to reduce standby power consumption and extend battery lifetime. Several examples illustrate this low-power advantage. In 1994, Amulet1 [25] achieved a high standard of power saving compared to synchronous design. In 1997, Intel developed an asynchronous, Pentium-compatible test chip that ran three times as fast, on half the power, as its synchronous equivalent. In 1998, Philips introduced a commercial clockless chip, the first to market, which let the company’s pagers last nearly twice as long on the same battery power [26]. One of many convincing comparisons between synchronous and asynchronous design approaches came out in 1998 [27]. Cogency, in co-operation with LG Semicon, developed two processors, one synchronous and the other asynchronous. With the same hardware features, similar organization and pipeline structure, and under the same operating conditions, the asynchronous processor showed a 47% reduction in current consumption running fax/modem application code. In 1999, an asynchronous IFIR filter bank implementation resulted in a fivefold power reduction when processing typical data compared to a synchronous counterpart [28].
1.3.4 Handling of Metastable Conditions
Metastable signals cause trouble in synchronous systems. In order to protect the system from metastability, a synchronous designer must ensure mutual exclusion of independent signals and synchronization of external signals through additional buffer circuitry. Even with this, it is sometimes impossible to guarantee protection against incoming unstable signal states.
Asynchronous systems naturally handle mutual exclusivity since they can wait an arbitrarily long time for such a condition to stabilize.
1.3.5 Less Noise
Two types of noise are better dealt with in asynchronous circuits. The first is substrate coupling. According to [29], in a synchronous design, simultaneous switching of components at the clock edge and the high-frequency components of the global clock signal itself inject current with a high-frequency component. In asynchronous design, on the other hand, only the currently active components switch, along with the handshake signals. This lack of coherent switching results in an injected current that is lower in both magnitude and high-frequency content.
The second is Emission of Electromagnetic Interference (EMI). Current flowing along loops formed by conductors generates radiation in the digital part. In addition, these loops act as antennas that spread the radiation. According to a comparison between a synchronous processor and an asynchronous processor [29], the synchronous processor shows peaks due to clock harmonics while the asynchronous processor mainly shows white noise.
1.3.6 Security
Unprotected cryptographic hardware is vulnerable to a side-channel attack known as Differential Power Analysis (DPA) [30]. Since the power consumed by a computation is data-dependent, different data is input to the system while its power is observed. In this way, the secret key can be detected. Circuits implemented in the four-phase dual-rail protocol with balanced cells consume almost equal power for different data combinations and therefore provide security against this attack.
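The data independence of dual-rail switching activity can be illustrated with a rough Python sketch. It uses the number of rail transitions as a proxy for dynamic power, which is a simplifying assumption (real power also depends on the load on each rail, hence the need for balanced cells). Function names are illustrative.

```python
def dual_rail_transitions(word, width):
    """Rail transitions over one four-phase cycle (empty -> valid -> empty)."""
    count = 0
    for i in range(width):
        bit = (word >> i) & 1
        rails = (1, 0) if bit else (0, 1)
        # Exactly one rail per bit rises, then the same rail falls.
        count += 2 * sum(rails)
    return count


def single_rail_transitions(previous, word, width):
    """Wire transitions when a single-rail bus changes from previous to word."""
    mask = (1 << width) - 1
    return bin((previous ^ word) & mask).count("1")


if __name__ == "__main__":
    width = 4
    for word in range(1 << width):
        print(f"{word:04b}: dual-rail={dual_rail_transitions(word, width)} "
              f"single-rail from 0000={single_rail_transitions(0, word, width)}")
```

In this model the dual-rail count is the same for every data word, while the single-rail count follows the Hamming distance between consecutive words, which is the kind of dependence DPA exploits.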
1.3.7 Disadvantages
There are some disadvantages associated with asynchronous design. First, as there is no global clock to eliminate the effects of glitches, a glitch due to any cause may be interpreted as a valid signal change and cause the system to malfunction. Second, control circuits, including latch controllers and delay-matching blocks, contribute power overhead as well as extra silicon area. Third, although institutes and universities strive to build EDA tools, most commercially available EDA software is currently designed for synchronous systems. Event-driven simulation, logic synthesis, partitioning, placement, routing and most other EDA offerings need modifications for asynchronous circuits, or are simply not applicable.
1.4 State-of-the-art Asynchronous Systems
Due to limitations of cost and design tools, most existing asynchronous systems are research-based. Some well-known asynchronous microcontrollers and microprocessors are introduced below [31].
1.4.1 Caltech Asynchronous Microprocessor (CAM)
CAM was the earliest known asynchronous processor, designed by Alain Martin’s group in the late 1980s. CAM used a four-phase dual-rail protocol. According to [32], the processor was first designed as a set of concurrent programs. Each program was then compiled into a circuit by a series of program transformations. Control and data path were first designed separately and then combined in a mechanical way.
Test results were released in a later publication [33]. The chip was functional on first silicon. At room temperature, the 2µm version ran from 7V down to 0.35V, and it reached 12 Million Instructions per Second (MIPS) at 7V. When the chip was tested in liquid nitrogen, the 2µm version reached 20MIPS at 5V and 30MIPS at 12V. Meanwhile, the 1.6µm version reached 30MIPS at 5V.
1.4.2 FAM
FAM was designed in the early 1990s using the four-phase dual-rail protocol [34]. Its unique feature is that it was based on Differential Cascode Voltage Switch Logic (DCVSL). It was claimed to be much more advanced than CAM since it comprised one 32-bit data path and thirty-two 32-bit general registers. The estimated performance was about 300MIPS. In addition, the average instruction cycle time in 0.5µm CMOS process technology was only 3.5ns. What is more important, FAM
In 1996, an improved version, AMULET2e, achieved 42MIPS based on the Dhrystone 2.1 benchmark when running at 3.3V [36]. Running at peak rate, it consumed under 150mW. On similar process technologies, the ARM710 delivers 23MIPS at 120mW and the ARM810 delivers 86MIPS at 500mW.
1.4.4 Tokyo Institute of Technology Asynchronous Computer Chip (TITAC)
TITAC was a QDI general-purpose 8-bit microprocessor chip designed and implemented as a CMOS gate array using a two-phase protocol [37]. Three instruction sets were implemented: memory reference instructions, branch instructions and miscellaneous instructions.
Four years later, an advanced version, TITAC2, was developed. As mentioned in [38], TITAC2 was a 32-bit asynchronous microprocessor whose architecture was based on the MIPS R2000, including a five-stage pipeline, on-chip cache, precise exception handling, external interrupts and memory protection. A significant feature of this processor was the introduction of a new delay model, called the Scalable-Delay-Insensitive (SDI) model. TITAC2 was fabricated in a 0.5µm CMOS process with three metal layers. It could run as fast as 52.3 VAX MIPS using the Dhrystone V2.1 benchmark, consuming 2.11W at 3.3V and 20 °C.
1.4.5 Philips 80C51 Microcontroller
This Tangram-compiled 80C51 microcontroller was developed by Philips in 1995 and implemented in a four-phase bundled-data protocol. It was designed for low power consumption [39]. In addition, a first comparison of power consumption was carried out by visualizing photon emission. Fabricated in a 0.5μm CMOS process with three metal layers, the chip showed in measurements a power benefit of a factor of 4 over a recently designed synchronous implementation with the same function.
1.4.6 MiniMIPS
In 1997, the Caltech group designed MiniMIPS, an asynchronous version of the 32-bit MIPS R3000 microprocessor. With a performance close to four times that of a clocked version in the same technology, the MiniMIPS was the fastest complete asynchronous processor fabricated until then. MiniMIPS was implemented using the four-phase dual-rail and 1-of-N protocols. According to [40], at 3.3V and 75 ºC, MiniMIPS dissipated 7W and ran at 280MIPS. At 2V and 75 ºC, it dissipated 1W and ran at 150MIPS. It was claimed that the performance exceeded that of AMULET2e by almost an order of magnitude.
1.4.7 Lutonium
Lutonium was the third generation of asynchronous microprocessors implemented by the Caltech group [41]. This time, power efficiency was emphasized. When the supply voltage was reduced from 1.8V to 0.5V, power dissipation fell from 100mW to 0.17mW and energy per instruction fell from 500pJ to 43pJ, while energy efficiency rose from 1800MIPS/W to 23000MIPS/W. Lutonium outperformed its closest competitor, the Dallas DS89C420, by a factor of 25 in the figure of merit Et^2.
1.4.8 Achronix
Achronix embeds the clock signal information into data tokens with its picoPIPE technology [78], which is one kind of encoding scheme in asynchronous design. Based on this technology, its most recent FPGA has achieved 1.5GHz performance.
1.5 Subthreshold Design
Subthreshold circuit design refers to design with an operating voltage smaller than the threshold voltage of the device. According to [42], there are three potential areas for subthreshold circuit design. First, it is suitable for energy-constrained applications that permit low performance, such as micro-sensors, implants and RFIDs, since the optimal energy efficiency point usually resides in the subthreshold region [43, 44]. Second, some energy-constrained portable devices must occasionally support high performance. Ultra-Dynamic Voltage Scaling (UDVS) from strong inversion to subthreshold supports this type of burst operation [45]. Third, subthreshold circuits support high-performance applications in roles such as standby management. For example, a subthreshold controller and sensors in [46] implement a closed-loop VDD scaling system to aggressively reduce SRAM leakage. The filter bank to be designed in this project belongs to the first category.
Subthreshold design shows increased sensitivity to PVT variations due to the exponential dependence of subthreshold drive current on Vth variation. According to [47], variations in gate delay can be as high as 300% from nominal. In order to meet timing requirements, synchronous overdesign is necessary to guarantee the functionality of the circuits. However, this sacrifices energy efficiency to a large extent.
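The exponential sensitivity can be illustrated with the textbook first-order subthreshold current model, I proportional to exp((Vgs - Vth) / (n * vT)), with gate delay taken as inversely proportional to drive current. The Python sketch below is not taken from the thesis: the slope factor `SLOPE_FACTOR` and the thermal voltage `THERMAL_VOLTAGE` are assumed, typical values, and effects such as DIBL are ignored.

```python
import math

THERMAL_VOLTAGE = 0.026  # V, kT/q near room temperature (assumed)
SLOPE_FACTOR = 1.5       # subthreshold slope factor n (assumed, process dependent)


def relative_current(delta_vth):
    """Drive current relative to nominal for a threshold shift delta_vth (V)."""
    return math.exp(-delta_vth / (SLOPE_FACTOR * THERMAL_VOLTAGE))


def relative_delay(delta_vth):
    """Gate delay relative to nominal, assuming delay is proportional to 1/current."""
    return 1.0 / relative_current(delta_vth)


if __name__ == "__main__":
    for shift_mv in (-30, -15, 0, 15, 30):
        ratio = relative_delay(shift_mv / 1000.0)
        print(f"Vth shift {shift_mv:+d} mV -> delay x{ratio:.2f}")
```

Under these assumptions a threshold shift of a few tens of millivolts already changes the delay by a factor of more than two, which is why fixed timing margins become expensive in this region.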
Another challenge is the Ion/Ioff ratio. Vth variation exponentially modifies the Ion and Ioff currents, which can induce bad output logic levels. Sometimes a bad output logic level cannot be recognized by the next-stage circuit, which results in functional breakdown of the gates. In addition, some other factors worsen the situation. The complementary nature of static CMOS gates pits the on-current of the PMOS against the off-current of the NMOS and vice versa. Topological features like stacking or parallel leakage paths further degrade the gate Ion/Ioff ratio.
1.6 Objectives
The objective of the thesis is to design an ultra-low-power filter bank for QRS detection in implantable sensor applications. The sampling rate of the whole system is 360Hz. The requirement for this filter bank is to achieve the lowest possible power at the given operating speed; ideally the power should stay within the sub-µW range. As mentioned previously, subthreshold circuits, which are suitable for ultra-low-power applications, are sensitive to PVT variations. On the other hand, the functionality of QDI circuits is independent of delay variations. Therefore, a circuit topology combining asynchronous circuits and subthreshold circuits is proposed. On one hand, it can achieve ultra-low power consumption; on the other hand, QDI circuits are immune to PVT variations. The feasibility of this topology is verified by circuits including adders [48], FIFOs and filters [49].
The low-power advantage of asynchronous design is challenged where leakage starts to dominate. This is more obvious in slow applications such as implantable biomedical devices, where leakage occupies a larger time window than dynamic power. Whether or not components are involved in the current computation, they consume power through leakage. Another objective of this thesis is to study the individual contributions of leakage and dynamic power to the total power consumption. After that, leakage control measures are proposed [50], taking advantage of the empty states in the four-phase dual-rail protocol.
With the proposed circuit topology and leakage control methods, a sub-μW filter bank [51] has been implemented for QRS detection in an implantable ECG sensor application.
1.7 Contents
Chapter 1 introduces asynchronous design and subthreshold circuit design. A circuit topology combining these two design styles is proposed to achieve ultra-low power consumption while taking care of the sensitivity to PVT variations.
Chapter 2 implements asynchronous subthreshold adders. Robustness against PVT variations has been tested. Since input data in the filter bank application is largely mismatched, an improved version is proposed.
Chapter 3 implements an asynchronous subthreshold FIR filter prototype. Dynamic power and leakage are analyzed. Several leakage control measures are proposed.
Chapter 4 implements the filter bank for QRS detection in ECG sensor applications. The register-sharing scheme is compared with the coefficient-sharing scheme. The lower-power register-sharing scheme is chosen for the system implementation. A system-level power gating technique is proposed to further reduce the power consumption.
Chapter 5 makes some concluding remarks about the thesis.
2.1 Circuit Design
In order to avoid a high transistor stack in the PUN, the proposed adder is split into two stages, each of which handles two inputs. One internal signal called Propagation (P) is introduced. The proposed adder consists of three components, i.e., the Propagation generator, the Sum generator, and the Carry-out generator. All these blocks share the same circuit topology; different inputs to the circuit produce different intermediate and output signals. The logic equations of Propagation, Sum, and Carry-out are given below, where the subscripts t and f denote the true and false rails of a dual-rail signal:
P_t = A_t * B_f + A_f * B_t,  P_f = A_t * B_t + A_f * B_f
Sum_t = P_t * Cin_f + P_f * Cin_t,  Sum_f = P_t * Cin_t + P_f * Cin_f
Cout_t = P_t * Cin_t + P_f * A_t,  Cout_f = P_t * Cin_f + P_f * A_f
The circuit implementation is shown below. The P_f signal generator shown in Figure 6 is chosen to demonstrate the idea. The Pull-Up Network (PUN) has two paths that detect the empty states of the input data. The Pull-Down Network (PDN) is formed by two parallel pass-transistor logic gates implementing the A_t * B_t and A_f * B_f terms mentioned above. The P_t signal generator is similar to that of P_f, except that the B_t and B_f signals in Figure 6 are swapped to implement A_t * B_f and A_f * B_t.
Figure 6: Propagation Signal Generation Circuit
In the four-phase dual-rail protocol, data alternates between valid and empty states. For example, assume that valid data A and B have both arrived. The PUN is off, since one of A_t and A_f and one of B_t and B_f is 1. A_t or A_f turns on either M1 or M2, and the value of B_t or B_f is passed to P_f or P_t. In this case, either P_t or P_f is 1, which represents a valid output state corresponding to the valid states of the inputs.
After that, both A and B change to the empty state, i.e., A_t = A_f = B_t = B_f = 0, and the PUN is turned on. Through the inverter, both P_f and P_t are pulled down to 0 and wait for the next evaluation, which starts after all valid input data has arrived again.
The circuit is QDI because it remains functional regardless of whether the input data paths are balanced. For example, suppose A has arrived at the circuit while B, due to some delay, has not arrived yet. The outputs are expected to remain in the empty state at this time. Assume A_t is 1 while A_f, B_t and B_f are all 0. As can be seen from Figure 6, M1 is turned on and B_t propagates to P_f, pulling P_f down to 0; meanwhile, the pull-up path controlled by B_t and B_f is on and drives P_f to 0 as well. The same applies to P_t, which is also driven to 0. This meets the expectation and satisfies the definition of delay insensitivity. Only when all valid data has arrived is the PUN turned off, and the PDN starts to evaluate. Assume B_t finally arrives while A_f and B_f are still 0. In this case, A and B are both valid 1s, and P_f, as the result of (A XNOR B), should output 1. As can be seen from Figure 6, A_t is 1 and M1 is on, so B_t is passed to P_f, driving P_f to 1. All other paths are cut off in this situation.
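The late-arrival scenario above can be replayed with a behavioral sketch. It assumes, as the text describes, that the PUN conducts whenever either input is empty and that the inverter then holds the output at 0; the function `p_f_output` and the event list are hypothetical illustrations, not a circuit simulation.

```python
# Behavioral sketch of the P_f generator under the four-phase dual-rail protocol.
# Assumption: the PUN conducts when A or B is empty, forcing the output to 0.

EMPTY, ONE, ZERO = (0, 0), (1, 0), (0, 1)

def p_f_output(a, b):
    """Return P_f: 0 while the PUN conducts, otherwise the PDN result."""
    pun_on = a == EMPTY or b == EMPTY
    if pun_on:
        return 0
    a_t, a_f = a
    b_t, b_f = b
    return (a_t & b_t) | (a_f & b_f)  # A XNOR B

# A arrives first, B arrives late, then both return to empty.
sequence = [
    ("both empty", EMPTY, EMPTY),
    ("A valid, B not yet arrived", ONE, EMPTY),
    ("both valid", ONE, ONE),
    ("both empty again", EMPTY, EMPTY),
]
trace = [(label, p_f_output(a, b)) for label, a, b in sequence]
for label, value in trace:
    print(f"{label}: P_f = {value}")
```

In this model P_f becomes 1 only in the step where both inputs are valid, which mirrors the behavior described for Figure 6.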
Carry-out can be generated by replacing several inputs of the Propagation generator, as shown in Figure 7. The A and B inputs in the PUN are replaced by Propagation and Carry-in. In the PDN, when P_t is true, Cin_t is passed to Cout_t; when P_f is true, Cout_t is the same as A_t. Similarly, Cout_f can be generated by replacing A_t with A_f and Cin_t with Cin_f. As explained previously, Propagation is generated from the two addends through the same circuit topology, so Propagation is delay-insensitive with respect to A and B. Carry-out, which has Carry-in and Propagation as inputs, is delay-insensitive with respect to Carry-in and Propagation, and thus with respect to all three inputs of the adder.
Figure 7: Carry-Out Signal Generation Circuit
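As a sanity check of the Carry-out selection rule (Cin is passed when P_t is true, A is passed when P_f is true), the following sketch compares it with the usual majority function for all valid inputs. The helper names are illustrative, and using the Cin_f rail for Cout_f is an assumption made so that the two output rails are complementary.

```python
from itertools import product

def encode(bit):
    """Dual-rail pair (x_t, x_f) for a valid bit."""
    return (1, 0) if bit else (0, 1)

def propagation(a, b):
    (a_t, a_f), (b_t, b_f) = a, b
    return ((a_t & b_f) | (a_f & b_t), (a_t & b_t) | (a_f & b_f))

def carry_out(p, cin, a):
    """Cout follows Cin when P_t is true and follows A when P_f is true."""
    (p_t, p_f), (cin_t, cin_f), (a_t, a_f) = p, cin, a
    cout_t = (p_t & cin_t) | (p_f & a_t)
    cout_f = (p_t & cin_f) | (p_f & a_f)  # assumed complementary rail
    return (cout_t, cout_f)

mismatches = []
for a_bit, b_bit, c_bit in product((0, 1), repeat=3):
    a, b, cin = encode(a_bit), encode(b_bit), encode(c_bit)
    expected = encode(a_bit + b_bit + c_bit >= 2)  # majority of the three bits
    if carry_out(propagation(a, b), cin, a) != expected:
        mismatches.append((a_bit, b_bit, c_bit))
print("mismatches:", mismatches)
```

This logic-level check covers valid inputs only; the empty-state behavior comes from the PUN and is not modeled here.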
Similarly, Sum is implemented as shown in Figure 8.
Figure 8: Sum Signal Generation Circuit
All these blocks are combined to implement the full adder. As shown in Figure 9, the two blocks on the left are Propagation generators, and the four blocks on the right generate Sum and Carry-out. When A and B are both valid, a valid Propagation is generated. Cout and Sum are generated when both Propagation and Cin are valid. If one signal arrives later than the others due to unbalanced path delays, the Sum and Carry-out signals remain in the empty state.
Figure 9: System Level Implementation
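The two-stage composition can be sketched behaviorally as follows. This is a simplified model under stated assumptions: each stage outputs the empty state while either of its inputs is empty (standing in for the PUN), and the function `full_adder` is a hypothetical name, not taken from the thesis.

```python
from itertools import product

EMPTY = (0, 0)

def encode(bit):
    """Dual-rail pair (x_t, x_f) for a valid bit."""
    return (1, 0) if bit else (0, 1)

def full_adder(a, b, cin):
    """Two-stage dual-rail full adder; returns (Sum, Cout) as dual-rail pairs."""
    a_t, a_f = a
    b_t, b_f = b
    # Stage 1: Propagation stays empty until A and B are both valid.
    if a == EMPTY or b == EMPTY:
        p = EMPTY
    else:
        p = ((a_t & b_f) | (a_f & b_t), (a_t & b_t) | (a_f & b_f))
    # Stage 2: Sum and Cout stay empty until P and Cin are both valid.
    if p == EMPTY or cin == EMPTY:
        return EMPTY, EMPTY
    p_t, p_f = p
    cin_t, cin_f = cin
    total = ((p_t & cin_f) | (p_f & cin_t), (p_t & cin_t) | (p_f & cin_f))
    cout = ((p_t & cin_t) | (p_f & a_t), (p_t & cin_f) | (p_f & a_f))
    return total, cout

for a_bit, b_bit, c_bit in product((0, 1), repeat=3):
    total, cout = full_adder(encode(a_bit), encode(b_bit), encode(c_bit))
    print(a_bit, b_bit, c_bit, "Sum =", total, "Cout =", cout)

# A late Carry-in keeps both outputs empty.
print(full_adder(encode(1), encode(1), EMPTY))
```

Splitting the logic into two two-input stages is what keeps each PUN short, at the cost of one extra internal signal.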
2.2 Simulation Results
The adder is functional on first silicon. The die photo and the layout of the whole adder chip are shown in Figures 10 and 11, respectively.
Figure 10: Die Photo of Adder Chip
Figure 11: Layout of Adder Chip
In Figure 12, the bottom waveform is the input and the top waveform is the output of the adder (Cout_t here).
Figure 12: Testing Results of 1-Bit QDI-DR Full Adder
Power-Delay Product (PDP) is used as the Figure of Merit (FoM) to compare the proposed adder with other state-of-the-art adders in the literature. Since power and delay depend on the input data pattern, all patterns are tested and the average power and delay are calculated.
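A minimal sketch of this averaging is given below. The numbers are hypothetical placeholders chosen only to make the arithmetic easy to follow; they are not measurements from this thesis. Taking PDP as the product of the average power and the average delay is also an assumption about how the averaging is combined.

```python
from statistics import mean

# Hypothetical placeholder values, NOT data from the thesis:
# pattern label -> (power in nW, delay in us)
measurements = {
    "pattern 1": (2.0, 1.0),
    "pattern 2": (4.0, 2.0),
    "pattern 3": (3.0, 3.0),
}

avg_power_nw = mean(power for power, _ in measurements.values())
avg_delay_us = mean(delay for _, delay in measurements.values())

# nW * us = 1e-9 W * 1e-6 s = 1e-15 J, so the product is directly in fJ.
pdp_fj = avg_power_nw * avg_delay_us
print(f"average power = {avg_power_nw} nW")
print(f"average delay = {avg_delay_us} us")
print(f"PDP = {pdp_fj} fJ")
```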
Table 3: Comparison of PDP between Different Subthreshold Adders
Technology (µm) | Frequency (MHz) | Lowest Power (nW) | Best PDP (fJ)
The PDP of the proposed adder is also comparable with that of another recently developed adder [53], which is implemented in 65 nm technology and operates at 1 MHz. Despite its operating-speed disadvantage, which may be caused by the technology, the proposed adder has a PDP that is 74.7% of the PDP of that adder.
As mentioned previously, the proposed adder is QDI; it should therefore be robust against the delay variations caused by PVT variations. The proposed adder is simulated at temperatures ranging from 10 °C to 100 °C in steps of 5 °C, at supply voltages of 200 mV and 250 mV, using the worst-case input pattern. It is functional at all temperatures under both voltages. In Figures 13 and 14, power is plotted against the primary Y-axis on the left, while delay and PDP are plotted against the secondary Y-axis on the right.
Figure 13: Delays, Power and PDP under 200mV
Figure 14: Delays, Power and PDP under 250mV
Delay decreases and power increases with increasing temperature. The delays at 10 °C and at 100 °C differ by about an order of magnitude. As argued previously, such a large delay variation may violate the timing assumptions of synchronous designs and asynchronous matched-delay designs; otherwise, so much timing margin must be reserved that performance is largely compromised.
Mismatched input data is shown in Figure 15. (Here the high and low levels refer to the valid and empty states in the four-phase dual-rail protocol, not to logic high and logic low.) The rising and falling edges of the input data divide each four-phase cycle into four regions, T1 to T4. With the previous circuit topology, the circuit does not output valid data until all valid inputs have arrived. Therefore, the circuit outputs valid data only in T3. Empty states are generated in all the other regions due to the lack of valid inputs. This produces an output with a narrower data window than the inputs.
[Timing diagram: waveforms A, B and Out, with regions T1 to T4 and T5 marked]
Figure 15: Adder Output for Mismatched Input Data
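The narrowing of the data window can be illustrated by treating each valid phase as a time interval and taking the overlap. The times below are hypothetical, in arbitrary units, and `valid_window` is an illustrative helper rather than part of the design.

```python
def valid_window(a, b):
    """Return the interval in which both inputs are valid, or None if none."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start < end else None

def width(window):
    return 0.0 if window is None else window[1] - window[0]

# Hypothetical valid phases (start, end) of two mismatched inputs.
a_valid = (0.0, 6.0)
b_valid = (4.0, 10.0)

out_valid = valid_window(a_valid, b_valid)
print("A window:", width(a_valid))
print("B window:", width(b_valid))
print("Out window:", out_valid, "width", width(out_valid))
```

With these example times the output is valid for 2 time units while each input is valid for 6, so the more the inputs are skewed, the narrower the output window becomes.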
The output of the adder mentioned above is fed into the second adder. B in Figure 17 is the output with the narrow data window from the previous stage (Out in Figure 15). A is the other input of the current stage. It can be seen that, due to the narrow data window, there is no overlap between two