Subthreshold quasi delay insensitive circuit designs


SUBTHRESHOLD QUASI-DELAY-INSENSITIVE FILTER BANK DESIGNS

CHANG XIAOFEI

(B.Eng. (1st Hons.), NUS)

A THESIS SUBMITTED

FOR THE DEGREE OF MASTER OF ENGINEERING

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

NATIONAL UNIVERSITY OF SINGAPORE

2010


ACKNOWLEDGEMENT

First, I want to thank my supervisor, Dr Lian Yong. He is always patient and helpful in my research. Thanks to his trust, I could initiate and develop my ideas gradually along the way. Besides the research topic itself, I have learnt how to define and solve problems, which is valuable for my future research career. In addition, whenever I was stopped by difficulties, he could always enlighten me with great ideas and approaches. Without him, this topic and thesis could never have been done.

Second, I want to thank all my colleagues in the lab, especially Tan Jun, Zhang Jinghua and Wei Ying, for their patience. I know I sometimes appeared annoying, asking silly questions. Without them, the project would not have gone so smoothly.

Finally, I want to thank my wife, Lulu, for always being supportive and for bearing a disrupted life with me.


TABLE OF CONTENTS

Chapter 1 Introduction 1

1.1 Challenges and Opportunities 1

1.2 Handshake Protocols 5

1.3 Advantages and Disadvantages 7

1.4 State-of-the-art Asynchronous Systems 12

1.5 Subthreshold Design 16

1.6 Objectives 17

Chapter 2 Adders 19

2.1 Circuit Design 19

2.2 Simulation Results 24

2.3 Improvement 27

Chapter 3 FIR Filters 36

3.1 Latch-Based FIFO Implementation 38

3.2 Register-Based FIFO Implementation 44

3.3 Coefficients Implementation 47

3.4 Power Reduction Techniques 49

3.5 Simulation Results 60

Chapter 4 Filter Banks 66

4.1 ECG Signal Processing 68

4.2 Frequency Masking Techniques 69

4.3 Implementation 72

Chapter 5 Conclusion 79


SUMMARY

With increasing operating speed and circuit complexity and decreasing feature size of digital circuits, it becomes difficult to design traditional synchronous circuits because of clock skew and jitter in clock distribution. In addition, the clock tree itself consumes substantial power. Asynchronous circuits, which synchronize data communication with local handshake signals, are believed to be one of the candidates suitable for large systems.

Quasi-delay-insensitive (QDI) circuits are a member of the asynchronous family. Their functionality is independent of the delays between different components in the circuit. They are therefore more robust than synchronous circuits against delay variation caused by Process, Voltage and Temperature (PVT) variations. Such delay variation may violate the timing assumptions of a synchronous design unless a large timing margin is built in to guarantee functionality under all conditions, and that margin greatly degrades performance. QDI circuits, on the other hand, are able to run as fast as the physical environment allows.

The effect of PVT variation is more significant in the subthreshold region, where delay changes dramatically with it. Operating circuits in the subthreshold region is beneficial in terms of ultra-low power consumption and energy efficiency. Operating QDI circuits in the subthreshold region therefore reduces power consumption on the one hand and suppresses the susceptibility to PVT variations on the other.

In this thesis, we study subthreshold QDI circuit design for both combinational and sequential logic circuits. We first show, by means of examples, that a subthreshold QDI full adder exhibits a quite competitive Power-Delay Product (PDP). We then explore the design of a QDI FIR filter. Finally, a complete system, i.e. a filter bank for Electrocardiograph (ECG) sensors in Body Sensor Networks, is demonstrated.


LIST OF FIGURES

Figure 1: Handshake Solution Design Flow 3

Figure 2: Four-Phase Dual-rail Protocol System-Level Schematic 5

Figure 3: Four-phase Signal Transition 5

Figure 4: Muller C element Symbol and Truth Table 7

Figure 5: Implementation of Asynchronous AND Gate 8

Figure 6: Propagation Signal Generation Circuit 21

Figure 7: Carry-Out Signal Generation Circuit 22

Figure 8: Sum Signal Generation Circuit 23

Figure 9: System Level Implementation 23

Figure 10: Die Photo of Adder Chip 24

Figure 11: Layout of Adder Chip 24

Figure 12: Testing Results of 1-Bit QDI-DR Full Adder 25

Figure 13: Delays, Power and PDP under 200mV 26

Figure 14: Delays, Power and PDP under 250mV 26

Figure 15: Adder Output for Mismatched Input Data 27

Figure 16: Functional Error Due to Long Chain 28

Figure 17: Circuit Improvement to Balance Output - PUN 29

Figure 18: Circuit Improvement to Balance Output - PDN 29

Figure 19: Circuit Improvement to Reduce Competition 30

Figure 20: Output through Improved Adder for Mismatched Data 30

Figure 21: Layout of Improved Full Adder 31


Figure 22: Full Adder Functionality Demonstration 32

Figure 23: Lowest Operating Voltage for Improved Full Adder 32

Figure 24: Propagation Delays against Temperatures 33

Figure 25: Energy against Temperatures 34

Figure 26: Monte Carlo Analysis of Adder under 100mV 34

Figure 30: General Filter System-Level Schematic 37

Figure 31: Asynchronous FIFO System-Level Schematic 38

Figure 32: FIFO Circuit Level Schematic 39

Figure 33: Post-Layout Simulation Result of FIFO Working under 200mV 41

Figure 34: Post-Layout Simulation Result of FIFO Working under 300mV 41

Figure 35: General Dual-Rail Filter System Level Schematic 42

Figure 36: Outputs of Latch-based Filter 43

Figure 37: Register-Based Dual-Rail Filter System Level Schematic 44

Figure 38: Modified Register-Based Dual-Rail Filter System Level Schematic 45

Figure 39: Basic Implementation of Rising-Edge-Triggered Register 50

Figure 40: Current for Basic Implementation of 1-bit FIFO 51

Figure 41: Stacking Implementation of Inverter 51

Figure 42: Stacking Implementation of Rising-Edge-Triggered Register 52

Figure 43: Current for Stacking Implementation of 1-bit FIFO 52


Figure 44: MTCMOS Implementation of Rising-Edge-Triggered Register 53

Figure 45: Current for MTCMOS Implementation of 1-bit FIFO 54

Figure 46: Power Gating Implementation of Rising-Edge-Triggered Register 55

Figure 47: Current for Basic Implementation of 11-bit FIFO 56

Figure 48: Current for Stacking Implementation of 11-bit FIFO 56

Figure 49: Current for MTCMOS Implementation of 11-bit FIFO 57

Figure 50: Current for Power Gating Implementation of 11-bit FIFO 57

Figure 51: Power Gating and MTCMOS Implementation of Rising-Edge-Triggered Register 58

Figure 52: Current for Power Gating with MTCMOS Implementation of 11-bit FIFO 58

Figure 53: Frequency Responses of Filter Prototype 60

Figure 54: Impulse Response of FIR Filter (1) 61

Figure 55: Impulse Response of FIR Filter (2) 61

Figure 56: Current for Filter Prototype w/o Power Gating 63

Figure 57: Current for Filter Prototype with Power Gating 63

Figure 58: Delays VS Temperatures for 6th-Order FIR Filter 64

Figure 59: Power VS Temperatures for 6th-Order FIR Filter 64

Figure 60: Monte Carlo Simulation of Filter at 170mV 65

Figure 61: Monte Carlo Simulation of Filter at 200mV 65

Figure 62: ECG Wave 66

Figure 63: Interpolation Implementation 69

Figure 64: Graph Demonstration of Interpolation Techniques 69

Figure 65: Concept of Complementary Filters 71


Figure 66: System Level Implementation of Filter Bank 72

Figure 67: Matlab Simulation of B1-B8 73

Figure 68: Matlab Simulation of b3 74

Figure 69: Register-Sharing Scheme 75

Figure 71: Coefficient-Sharing Scheme 77


LIST OF TABLES

Table 1: Different States in Dual-rail Protocol 6

Table 2: Truth Table of Asynchronous AND Gate 8

Table 3: Comparison of PDP between Different Subthreshold Adders 25

Table 4: Monte Carlo Analysis of Improved Adder 34

Table 5: Implementation of Coefficient -2 47

Table 6: Incorrect Implementation of Coefficient ‘-2’ 48

Table 7: Comparison among Basic, Stacking and MTCMOS Implementations 54

Table 8: Comparison among Different Power Reducing Techniques 59

Table 9: Impulse Response of FIR Filter 62

Table 10: Monte Carlo Simulation of 1b Filter 65

Table 11: Sub-bands of Filter Bank 73

Table 12: Coefficients Positions for Band-Edge Shaping Filter 76

Table 13: Comparison of Power between RS and CS 77

Table 14: Power of Masking Filter 78


LIST OF ABBREVIATIONS

CAM: Caltech Asynchronous Microprocessor

CS: Coefficient Sharing

CSP: Communicating Sequential Processes

DCVSL: Differential Cascode Voltage Switch Logic

DI: Delay Insensitive

DPA: Differential Power Analysis

DR: Dual Rail

DTCMOS: Dynamic Threshold CMOS

ECG: Electrocardiogram

EDA: Electronic Design Automation

EMI: Emission of Electromagnetic Interference

FIFO: First-In-First-Out

FIR: Finite Impulse Response

HTCMOS: High-Threshold CMOS

IIR: Infinite Impulse Response

ITRS: International Technology Roadmap for Semiconductors

LTCMOS: Low-Threshold CMOS

MIPS: Million Instructions per Second

MTCMOS: Multiple-Threshold CMOS

PDN: Pull-Down Network

PUN: Pull-Up Network


PVT: Process, Voltage and Temperature

QDI: Quasi Delay Insensitive

RISC: Reduced Instruction Set Computer

RFID: Radio Frequency Identification

RS: Register Sharing

SDI: Scalable Delay Insensitive

SRAM: Static Random Access Memory

STG: State Transition Graph

UDVS: Ultra-Dynamic Voltage Scaling


1.1 Challenges and Opportunities

With the continuous demand for higher performance, higher complexity and smaller feature size, the challenges associated with synchronous circuits become severe. For example, as devices move into the 90nm range, it is becoming extraordinarily difficult to find and predict the critical-path delays. As technology scales down to 45nm and below, conditions worsen. Shot noise, charge sharing, thermal effects, supply-voltage noise and process variations conspire to confound delay calculations. Unpredictable delay may cause the circuit to fail under some physical conditions. Otherwise, a large timing margin must be set to make the circuit 'safe', which in turn degrades circuit performance.

In addition to unpredictable delays, large power dissipation is another problem associated with synchronous design, because all components are active according to the rhythm of the global clock even if some of them are not involved in the current computation. Even worse, systems-on-chip with many millions of transistors involve large clock-current surges, which tax a circuit's power distribution nets as well as its thermal stability.

Smart designs have been proposed to deal temporarily with these issues in the synchronous domain. For example, power is reduced using clock gating [2, 3, 4], Multiple-Threshold CMOS (MTCMOS) [5, 6, 7], Dynamic Threshold CMOS (DTCMOS) [8] and substrate biasing [9]. However, such smart designs become difficult or even impossible to implement as the scale and complexity of the design increase. For example, the control logic for clock gating dissipates power itself and becomes difficult to implement because of the clock skew problem. In addition, these techniques are coarse-grained: they cannot pinpoint a single device that should be turned off or assigned a high threshold.

In this circumstance, asynchronous design comes into view. Its inherent communication style may solve the problems associated with synchronous design, and the outlook for this area is encouraging. The 2007 International Technology Roadmap for Semiconductors (ITRS) predicts a likely shift from synchronous to asynchronous design styles in order to increase circuit robustness, decrease power, and alleviate many clock-related issues. It predicts that asynchronous circuits will account for 20% of chip area within the next 5 years, and 25% of chip area within the next 8 years [10].

As mentioned in Bernard Cole's article [11], the prospects for asynchronous design are bright.

First, companies active in developing asynchronous logic are shifting from selling a particular IP approach to becoming fabless IC companies. Examples are Handshake Solutions, Camgian Microsystems and Fulcrum Microsystems. Handshake Solutions designed the Timeless Design Environment (TiDE) with the HASTE language, a high-level design language based on Communicating Sequential Processes (CSP) [12]. Together with third-party EDA flows (Cadence, Synopsys, Magma and Mentor Graphics), it supports the complete design process from entry to layout, as shown in Figure 1 [13].

Figure 1: Handshake Solution Design Flow

Camgian Microsystems specializes in low-power, wireless-enabled sensor networks that help monitor remote assets and environments. It provides services including battlefield situational awareness, border security, cargo tracking, industrial monitoring and physical security [14].

Fulcrum Microsystems has become a leading provider of highly integrated, high-performance switch chips using its patented technology [15].


Fourth, in-house EDA tools and design flows have been developed. Starting from the mid-1980s, the Caltech program-transformation approach [16] and Chu's state-transition-graph (STG) approach [17] came into view. Soon after, Burns and Martin [18], Brunvand, and van Berkel [19] proposed similar methods for the syntax-directed compilation of high-level descriptions into asynchronous circuits. Meanwhile, another approach is to implement asynchronous standard cells in order to make use of existing EDA tools. The University of Southern California announced an asynchronous standard-cell library that supports Cadence DFII files for automatic place and route using Silicon Ensemble [20]. Asynchronous design is expected to gain market share once more efficient EDA tools are available.


Figure 2 shows the schematic of the four-phase dual-rail protocol. As mentioned before, handshake signals comprise two components: a forward-moving Request signal and a backward-moving Acknowledge signal. In Figure 2, however, Request does not appear. This is due to the characteristics of the four-phase dual-rail protocol and is explained below.

Figure 2: Four-Phase Dual-rail Protocol System-Level Schematic

Four-phase means that each data communication process takes four steps, as indicated by the dotted arrows in Figure 3.

Figure 3: Four-phase Signal Transition

In Phase 1, when the data is ready and Acknowledge is low, the sender puts valid data on the data rails and sets Request high to inform the receiver that data is ready. In Phase 2, once the receiver detects the high Request, it absorbs the data and then sets Acknowledge high. In Phase 3, the sender responds by taking Request low, from which point the data is no longer valid. In Phase 4, the receiver drives Acknowledge low, which completes one full cycle of data communication. The circuit is now ready for a new round of handshaking and data communication.
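The four phases can be traced with a small behavioural sketch. It is only an illustration of the signal ordering described above, not a model of the thesis circuits; the function name and the trace format are assumptions made for this example.

```python
def four_phase_cycle(data):
    """Return the received data and a (phase, request, acknowledge, data_valid) trace."""
    trace = []
    req, ack = 0, 0  # idle link: both handshake signals low

    # Phase 1: sender places valid data and raises Request.
    req = 1
    trace.append(("phase1", req, ack, True))

    # Phase 2: receiver absorbs the data, then raises Acknowledge.
    received = data
    ack = 1
    trace.append(("phase2", req, ack, True))

    # Phase 3: sender lowers Request; the data is no longer valid.
    req = 0
    trace.append(("phase3", req, ack, False))

    # Phase 4: receiver lowers Acknowledge; the link is idle again.
    ack = 0
    trace.append(("phase4", req, ack, False))
    return received, trace


received, trace = four_phase_cycle(0b1011)
for step in trace:
    print(step)
```

Request and Acknowledge each rise once and fall once per transfer, so the link always returns to the idle state before the next data item is sent.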

Dual-rail means that each bit of information is encoded on two rails. As shown in Table 1 below, bit A is encoded by A_t (true rail) and A_f (false rail), giving four combinations.

Table 1: Different States in Dual-rail Protocol

As shown in Figure 2, there is no explicit rail for Request, since the validity of data is implied by the dual-rail encoding. Once the data enters a valid state, Request is taken to be set; otherwise, Request is reset. Simply put, the implicit Request is equivalent to (A_t OR A_f).
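As a minimal sketch of this encoding, the helpers below map a logic value to a pair of rails and derive the implicit Request. The state assignment used here (both rails low for empty, exactly one rail high for valid data, both rails high treated as illegal) is the conventional dual-rail assignment and is an assumption of this example; the function names are hypothetical.

```python
EMPTY = (0, 0)  # assumed empty state: both rails low


def encode(bit):
    """Map a logic value to (true rail, false rail); None stands for 'no data'."""
    if bit is None:
        return EMPTY
    return (1, 0) if bit else (0, 1)


def request(a_t, a_f):
    """Implicit Request: set as soon as either rail is high."""
    return a_t | a_f


def decode(a_t, a_f):
    """Recover the logic value, or None while the rails are empty."""
    if a_t and a_f:
        raise ValueError("both rails high is not a legal codeword in this sketch")
    return a_t if request(a_t, a_f) else None


for value in (None, 0, 1):
    rails = encode(value)
    print(value, rails, request(*rails), decode(*rails))
```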


For Delay-Insensitive (DI) circuits, however, functionality is independent of delays. They can sense the completion of a computation and run as quickly as the current physical environment allows [21]. However, pure DI circuits can use only the Muller C element as a computational component, which makes the set of DI circuits quite small and impractical for implementing actual circuits [22]. The Muller C element is shown in Figure 4. When both inputs change to 1 or to 0, the output changes to 1 or 0 accordingly; otherwise, the output keeps its previous value. The circuits designed in this project are Quasi-Delay-Insensitive (QDI) circuits. QDI circuits differ from DI circuits by the additional timing assumption of the isochronic fork. In an isochronic fork, when a transition on one output is acknowledged, and thus completed, the transitions on all outputs are acknowledged and thus completed [23].

Figure 4: Muller C element Symbol and Truth Table
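The behaviour of the C element described above can be written as a small state-holding model. This is a behavioural sketch only; the class name and the initial output value of 0 are assumptions made for illustration.

```python
class CElement:
    """Two-input Muller C element: follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:          # both inputs 1 or both inputs 0
            self.out = a    # output follows the inputs
        return self.out     # inputs differ: output keeps its previous value


c = CElement()
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    print((a, b), c.update(a, b))
```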

A simple dual-rail AND gate is used to illustrate the idea of delay insensitivity. For a synchronous AND gate, the delays of the inputs need to be balanced; otherwise, glitches are generated and power is wasted. For a dual-rail AND gate, delays are taken care of automatically. Table 2 shows the truth table of the dual-rail asynchronous AND gate, where A and B are the inputs and Y is the output.

The truth table can be categorized into three groups, as shown in its left column.

In Group 1, data A and B are both in the empty state: neither has arrived at the gate. The output Y is also in the empty state, which indicates that there is no computation if there is no data. In Group 2, either A or B is in the empty state: one has arrived while the other lags behind. The output Y remains in the empty state. The QDI gate does not generate a wrong output or glitches, but simply keeps waiting until all valid data has arrived. In Group 3, both A and B are valid: both have arrived at the gate. The AND gate then outputs the valid logic value accordingly.

The implementation of the asynchronous AND gate [24] is shown in Figure 5.

Figure 5: Implementation of Asynchronous AND Gate

It is assumed that the initial state is empty, that is, all data rails in Figure 5 carry 0s. Only when both A and B become valid does one of the C elements fire, making the output valid; otherwise the output stays in the empty state. For Y_t to be 1, both A and B must be valid true, i.e. A_t and B_t are both 1. For Y_f to be 1, both A and B must be valid and at least one of them must be valid false, i.e. at least one of A_f and B_f is 1.
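The three groups of the truth table can be reproduced with a behavioural sketch. Figure 5 is not available in this text, so the structure below is an assumption: four C elements, one per combination of input rails, with Y_t taken from the (A_t, B_t) element and Y_f from the OR of the other three. That arrangement is consistent with the conditions stated above, but it is not claimed to be the exact circuit of [24].

```python
def c_element(a, b, prev):
    """Muller C element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else prev


class DualRailAnd:
    """Dual-rail AND gate built from four C elements (assumed structure)."""

    def __init__(self):
        self.state = [0, 0, 0, 0]  # all C elements start in the empty state

    def update(self, a_t, a_f, b_t, b_f):
        pairs = [(a_t, b_t), (a_t, b_f), (a_f, b_t), (a_f, b_f)]
        self.state = [c_element(x, y, p) for (x, y), p in zip(pairs, self.state)]
        y_t = self.state[0]                                   # A true and B true
        y_f = self.state[1] | self.state[2] | self.state[3]   # any input false
        return y_t, y_f


gate = DualRailAnd()
print(gate.update(0, 0, 0, 0))  # Group 1: both inputs empty
print(gate.update(1, 0, 0, 0))  # Group 2: A has arrived, B still empty
print(gate.update(1, 0, 0, 1))  # Group 3: A = 1, B = 0
```

Because each C element holds its value until both of its inputs return to 0, the output also stays valid until both inputs have gone back to the empty state.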

1.3.2 Average-Case Performance

Asynchronous designs exhibit average-case performance instead of worst-case performance. Synchronous designs cannot avoid worst-case performance, since all possible computations must complete before results can be latched, so the clock speed must accommodate the critical path. Asynchronous design, on the other hand, is capable of sensing the completion of a computation and running as fast as the given environment allows.

1.3.3 Reduced Power Consumption

Standard synchronous circuits have to toggle clock lines, and possibly charge and discharge circuit components unused in the current computation. Asynchronous circuits, on the other hand, have transitions only in the components involved in the current computation. They are therefore able to reduce standby power consumption and extend battery lifetime. Several examples illustrate this low-power advantage. In 1994, Amulet1 [25] achieved a high standard of power saving compared with synchronous design. In 1997, Intel developed an asynchronous, Pentium-compatible test chip that ran three times as fast, on half the power, as its synchronous equivalent. In 1998, Philips introduced a commercial clockless chip, the first to market, which let the company's pagers last nearly twice as long on the same battery power [26]. One of many convincing comparisons between synchronous and asynchronous design approaches came out in 1998 [27]: Cogency, in co-operation with LG Semicon, developed two processors, one synchronous and the other asynchronous. With the same hardware features, similar organization and pipeline structure, and under the same operating conditions, the asynchronous processor showed a 47% reduction in current consumption when running fax/modem application code. In 1999, asynchronous IFIR filter bank implementations achieved a fivefold power reduction when processing typical data compared with a synchronous counterpart [28].

1.3.4 Handling of Metastable Conditions

Metastable signals cause trouble in synchronous systems. In order to protect the system from metastability, a synchronous designer must ensure mutual exclusion of independent signals and synchronization of external signals through additional buffer circuitry. Even then, it is sometimes impossible to guarantee protection against incoming unstable signal states.

Asynchronous systems naturally handle mutual exclusion, since they can wait an arbitrarily long time for such a condition to stabilize.

1.3.5 Less Noise

Two types of noise are better dealt with in asynchronous circuits. The first is substrate coupling. According to [29], in a synchronous design, the simultaneous switching of components at the clock edge and the high-frequency components of the global clock signal itself inject current with a high-frequency component. In an asynchronous design, on the other hand, only the currently active components switch, along with the handshake signals. This lack of coherent switching results in an injected current that is lower in magnitude and in high-frequency content.

The second is Emission of Electromagnetic Interference (EMI). Current flowing along the loops formed by conductors generates radiation in the digital part, and these loops act as antennas that spread the radiation. According to a comparison between a synchronous processor and an asynchronous processor [29], the synchronous processor shows peaks due to clock harmonics, while the asynchronous processor mainly shows white noise.

1.3.6 Security

Unprotected cryptographic hardware is vulnerable to a side-channel attack known as Differential Power Analysis (DPA) [30]. Since computation is data-dependent, an attacker can feed different data to the system and observe the power consumption; in this way, the secret key can be detected. Circuits implemented in the four-phase dual-rail protocol with balanced cells consume almost equal power for different data combinations and therefore provide security against this attack.
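A rough way to see why dual-rail data is harder to attack is to count how many rails switch per data word. The sketch below assumes the conventional encoding (exactly one rail high per valid bit) and uses the number of rising rails as a first-order proxy for switching activity. Real power also depends on how well the rail loads are balanced, which this sketch ignores; the function names are hypothetical.

```python
def dual_rail_word(word, width):
    """Encode an integer as (true rail, false rail) pairs, least significant bit first."""
    return [((word >> i) & 1, 1 - ((word >> i) & 1)) for i in range(width)]


def rising_rails(word, width):
    """Rails that switch 0 -> 1 when the word replaces an all-empty state."""
    return sum(t + f for t, f in dual_rail_word(word, width))


def single_rail_ones(word):
    """Rails that rise from an all-zero state on a conventional single-rail bus."""
    return bin(word).count("1")


for word in (0b0000, 0b0101, 0b1111):
    print(f"{word:04b}", rising_rails(word, 4), single_rail_ones(word))
```

In the dual-rail case the count equals the word width for every data value, whereas the single-rail count follows the number of 1s in the data.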

1.3.7 Disadvantages

There are some disadvantages associated with asynchronous design. First, as there is no global clock to filter out the effects of glitches, a glitch of any origin may be interpreted as a valid signal change and cause the system to malfunction. Second, control circuits, including latch controllers and delay-matching blocks, contribute power overhead as well as extra silicon area. Third, although institutes and universities strive to build EDA tools, most commercially available EDA software is currently designed for synchronous systems. Event-driven simulation, logic synthesis, partitioning, placement, routing and most other EDA offerings need modification for asynchronous circuits, or are simply not applicable.


1.4 State-of-the-art Asynchronous Systems

Owing to limitations of cost and design tools, most existing asynchronous systems are research-based. Some well-known asynchronous microcontrollers and microprocessors are introduced below [31].

1.4.1 Caltech Asynchronous Microprocessor (CAM)

CAM was the earliest known asynchronous processor, designed by Alain Martin's group in the late 1980s. CAM used a four-phase dual-rail protocol. According to [32], the processor was first designed as a set of concurrent programs. Each program was then compiled into a circuit by a series of program transformations. Control and data path were first designed separately and then combined in a mechanical way.

Test results were released in a later publication [33]. The chip was functional on first silicon. At room temperature, the 2µm version ran from 7V down to 0.35V and reached 12 Million Instructions per Second (MIPS) at 7V. When the chip was tested in a liquid nitrogen environment, the 2µm version reached 20MIPS at 5V and 30MIPS at 12V. Meanwhile, the 1.6µm version reached 30MIPS at 5V.

1.4.2 FAM

FAM was designed in the early 1990s using a four-phase dual-rail protocol [34]. Its unique feature is that it was based on Differential Cascode Voltage Switch Logic (DCVSL). It was claimed to be much more advanced than CAM, since it comprised one 32-bit data path and thirty-two 32-bit general registers. The estimated performance was about 300MIPS, and the average instruction cycle time in a 0.5µm CMOS process technology was only 3.5ns. What is more important, FAM


In 1996, an improved version, AMULET2e, achieved 42MIPS on the Dhrystone 2.1 benchmark when running at 3.3V [36]. Running at peak rate, it consumed under 150mW. On similar process technologies, the ARM710 delivers 23MIPS at 120mW and the ARM810 delivers 86MIPS at 500mW.

1.4.4 Tokyo Institute of Technology Asynchronous Computer Chip (TITAC)

TITAC was a QDI general-purpose 8-bit microprocessor chip designed and implemented as a CMOS gate array using a two-phase protocol [37]. Three instruction sets were implemented, comprising memory reference instructions, branch instructions and miscellaneous instructions.

Four years later, an advanced version, TITAC2, was developed. As mentioned in [38], TITAC2 was a 32-bit asynchronous microprocessor whose architecture was based on the MIPS R2000, including a five-stage pipeline, on-chip cache, precise exception handling, external interrupts and memory protection. A significant feature of this processor was the introduction of a new delay model, called the Scalable-Delay-Insensitive (SDI) model. TITAC2 was fabricated in a 0.5µm CMOS process with three metal layers. It could run as fast as 52.3 VAX MIPS on the Dhrystone V2.1 benchmark, consuming 2.11W at 3.3V and 20 °C.

1.4.5 Philips 80C51 Microcontroller

This Tangram-compiled 80C51 microcontroller was developed by Philips in 1995 and implemented in a four-phase bundled-data protocol. It was designed for low power consumption [39]. In addition, a comparison of power consumption was carried out for the first time by visualizing photon emission. For a chip fabricated in a 0.5μm CMOS process with three metal layers, measurements showed a power benefit of a factor of 4 over a recently designed synchronous implementation with the same function.

1.4.6 MiniMIPS

In 1997, the Caltech group designed MiniMIPS, an asynchronous version of the 32-bit MIPS R3000 microprocessor. With a performance close to four times that of a clocked version in the same technology, the MiniMIPS was the fastest complete asynchronous processor fabricated up to that time. MiniMIPS was implemented in four-phase dual-rail and 1-of-N protocols. According to [40], at 3.3V and 75 °C, MiniMIPS dissipated 7W and ran at 280MIPS; at 2V and 75 °C, it dissipated 1W and ran at 150MIPS. It was claimed that its performance exceeded AMULET2e's by almost an order of magnitude.


1.4.7 Lutonium

Lutonium was the third generation of asynchronous microprocessor implemented by the Caltech group [41]. This time, power efficiency was emphasized. When the supply voltage was reduced from 1.8V to 0.5V, power dissipation fell from 100mW to 0.17mW, energy per instruction fell from 500pJ to 43pJ, and energy efficiency rose from 1800MIPS/W to 23000MIPS/W. Lutonium outperformed its closest competitor, the Dallas DS89C420, by a factor of 25 on the figure of merit Et².

1.4.8 Achronix

Achronix embedded the clock signal information into data tokens with its picoPIPE technology [78], which is one kind of encoding scheme in asynchronous design. Based on this technology, its most recent FPGA has achieved 1.5GHz performance.


1.5 Subthreshold Design

Subthreshold circuit design refers to design with an operating voltage below the threshold voltage of the device. According to [42], there are three potential application areas for subthreshold circuit design. First, it is suitable for energy-constrained applications that permit low performance, such as micro-sensors, implants and RFIDs, since the optimal energy efficiency point usually resides in the subthreshold region [43, 44]. Second, some energy-constrained portable devices must occasionally support high performance; Ultra-Dynamic Voltage Scaling (UDVS) from strong inversion to subthreshold supports this type of burst operation [45]. Third, subthreshold circuits support high-performance applications through functions such as standby management. For example, a subthreshold controller and sensors in [46] implement a closed-loop VDD scaling system to aggressively reduce SRAM leakage. The filter bank designed in this project belongs to the first category.

Subthreshold design shows increased sensitivity to PVT variations because of the exponential dependence of subthreshold drive current on Vth. According to [47], variations in gate delay can be as high as 300% from nominal. In order to meet timing requirements, a synchronous design must be over-designed to guarantee the functionality of the circuits, which wastes energy efficiency to a large extent.

Another challenge is the Ion/Ioff ratio. Vth variation exponentially modifies the Ion and Ioff currents, which can produce degraded output logic levels. Sometimes a degraded output logic level cannot be recognized by the next-stage circuit, which results in functional breakdown of the gates. Other factors worsen the situation. The complementary nature of static CMOS gates pits the on-current of the PMOS against the off-current of the NMOS and vice versa. Topological features such as stacking or parallel leakage paths further degrade the gate Ion/Ioff ratio.
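The exponential sensitivity can be illustrated with the standard first-order subthreshold current model, I = I0 * exp((Vgs - Vth) / (n * kT/q)). The sketch below is not taken from the thesis: the prefactor `i0`, the slope factor `n` and the voltages are hypothetical values chosen for illustration, and the drain-voltage term of the full model is omitted.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """kT/q in volts."""
    return K_B * temp_k / Q_E


def subthreshold_current(vgs, vth, i0=1e-7, n=1.5, temp_k=300.0):
    """First-order model; i0 and n are hypothetical fitting parameters."""
    return i0 * math.exp((vgs - vth) / (n * thermal_voltage(temp_k)))


def drive_ratio_after_vth_shift(delta_vth, n=1.5, temp_k=300.0):
    """Factor by which the drive current changes when Vth rises by delta_vth."""
    return math.exp(-delta_vth / (n * thermal_voltage(temp_k)))


def on_off_ratio(vdd, vth_on, vth_off, n=1.5, temp_k=300.0):
    """On-current of one device (Vgs = vdd) over the off-current (Vgs = 0) of the opposing device."""
    i_on = subthreshold_current(vdd, vth_on, n=n, temp_k=temp_k)
    i_off = subthreshold_current(0.0, vth_off, n=n, temp_k=temp_k)
    return i_on / i_off


print(drive_ratio_after_vth_shift(0.030))   # effect of a 30 mV rise in Vth
print(on_off_ratio(0.2, 0.40, 0.40))        # matched devices at VDD = 200 mV
print(on_off_ratio(0.2, 0.45, 0.35))        # weak pull-up against a leaky pull-down
```

With these assumed parameters, a 30 mV rise in Vth cuts the modelled drive current to a little under half. The last two lines show how opposite Vth shifts in the pulling and the leaking device shrink the ratio that sets the output logic level.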


1.6 Objectives

The objective of the thesis is to design an ultra-low-power filter bank for QRS detection in implantable sensor applications. The sampling rate of the whole system is 360Hz. The requirement for this filter bank is to achieve as low a power as possible at the given operating speed, ideally within the sub-µW range. As mentioned previously, subthreshold circuits, which are suitable for ultra-low-power applications, are sensitive to PVT variations. On the other hand, the functionality of QDI circuits is independent of delay variations. Therefore, a circuit topology combining asynchronous circuits and subthreshold circuits is proposed: it achieves ultra-low power consumption, while its QDI operation keeps it functionally immune to PVT variations. The feasibility of this topology is verified by circuits including adders [48], FIFOs and filters [49].

Asynchronous design's low-power advantage is challenged where leakage starts to dominate. This is more obvious in slow applications such as implantable biomedical devices, where leakage occupies a much larger time window than dynamic switching. Whether or not components are involved in the current computation, they consume power through leakage. Another objective of this thesis is to study the individual contributions of leakage and dynamic power to the total power consumption. Leakage control measures that take advantage of the empty states in the four-phase dual-rail protocol are then proposed [50].
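The reason leakage dominates at a low sample rate can be seen from a simple energy budget per sample period. The sketch below is illustrative only: the 360Hz rate comes from the text, while the dynamic energy, the leakage power and the idealised `gated_fraction` parameter are hypothetical values, not measurements from this work.

```python
def energy_per_sample(e_dynamic_j, p_leak_w, sample_rate_hz, gated_fraction=0.0):
    """Return (total energy per sample, leakage share of that energy).

    gated_fraction is the share of the period during which leakage is
    assumed to be fully suppressed, which is an idealisation.
    """
    period_s = 1.0 / sample_rate_hz
    e_leak_j = p_leak_w * period_s * (1.0 - gated_fraction)
    total_j = e_dynamic_j + e_leak_j
    return total_j, e_leak_j / total_j


# Hypothetical figures: 100 pJ of switching energy per sample, 100 nW of leakage.
for gated in (0.0, 0.9):
    total, share = energy_per_sample(100e-12, 100e-9, 360.0, gated_fraction=gated)
    print(f"gated {gated:.0%}: total {total * 1e12:.1f} pJ, leakage share {share:.0%}")
```

Because the sample period is long compared with the computation, even a small leakage power integrates to more energy than the switching itself, which is why suppressing leakage during idle time pays off.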

With the proposed circuit topology and leakage control methods, a sub-μW filter bank [51] has been implemented for QRS detection in an implantable ECG sensor application.


1.7 Contents

Chapter1 introduces asynchronous design and subthreshold circuit design respectively Circuit

topology combining these two circuit designs is proposed to achieve ultra-low-power consumption as well as taking care of sensitivity to PVT variations

Chapter2 implements asynchronous subthreshold adders Robustness against PVT variations has

been tested Since input data in filter bank application is largely mismatched, the improved version is proposed

Chapter 3 implements an asynchronous subthreshold FIR filter prototype. Dynamic power and leakage are analyzed, and several leakage control measures are proposed.

Chapter 4 implements the filter bank for QRS detection in ECG sensor applications. A register-sharing scheme is compared with a coefficients-sharing scheme. The lower-power register-sharing scheme is chosen for the system implementation. A system-level power gating technique is proposed to further reduce the power consumption.

Chapter 5 makes some concluding remarks about the thesis.

2.1 Circuit Design

To avoid a high transistor stack in the PUN, the proposed adder is split into two stages, each of which handles two inputs. An internal signal called Propagation is introduced. The proposed adder consists of three components: the Propagation generator, the Sum generator, and the Carry-out generator. All these blocks share the same circuit topology; different inputs to the circuit produce different intermediate and output signals. The logic equations of Propagation, Sum, and Carry-out are given below.

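$$Sum = P \cdot \overline{Cin} + \overline{P} \cdot Cin$$

$$Sum_t = P_t \cdot Cin_f + P_f \cdot Cin_t, \qquad Sum_f = P_t \cdot Cin_t + P_f \cdot Cin_f$$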
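For reference, the Propagation and Carry-out relations can be written in the same dual-rail notation by reading them off the descriptions of Figures 6 and 7 in this section. This is a reconstruction from those descriptions, with subscripts t and f denoting the true and false rails, and should be checked against the figures:

```latex
\[
\begin{aligned}
P_t    &= A_t \cdot B_f + A_f \cdot B_t,     & P_f    &= A_t \cdot B_t + A_f \cdot B_f,\\
Cout_t &= P_t \cdot Cin_t + P_f \cdot A_t,   & Cout_f &= P_t \cdot Cin_f + P_f \cdot A_f.
\end{aligned}
\]
```

These relations hold for valid inputs; in the empty state all rails are 0.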

The circuit implementation is shown below. The P_f signal generator in Figure 6 is used to demonstrate the idea. The Pull-Up Network (PUN) has two paths that detect the empty states of the input data. The Pull-Down Network (PDN) is formed by two parallel pass-transistor logic gates implementing the A_t * B_t and A_f * B_f terms mentioned above. The P_t signal generator is similar to that of P_f, except that the B_t and B_f signals in Figure 6 are swapped to implement A_t * B_f and A_f * B_t.

Figure 6: Propagation Signal Generation Circuit

In the four-phase dual-rail protocol, data alternates between valid and empty states. For example, assume that valid data for both A and B has arrived. The PUN is off, since one of A_t and A_f and one of B_t and B_f is 1. A_t or A_f turns on either M1 or M2, and the value of B_t or B_f is passed to P_f or P_t. In this case, either P_t or P_f is 1, which represents the valid state of the output corresponding to the valid states of the inputs.

After that, both A and B change to the empty state, i.e., A_t = A_f = B_t = B_f = 0, and the PUN is turned on. Through the inverter, both P_f and P_t are pulled down to 0 and wait for the next evaluation, which starts after all valid input data has arrived again.

The circuit is QDI because it remains functional whether or not the input data paths are balanced. For example, suppose A has arrived at the circuit while B, due to some delay, has not. The outputs are expected to remain in the empty state at this time. Assume A_t is 1 while A_f, B_t, and B_f are all 0. As can be seen from Figure 6, M1 is turned on and B_t propagates to P_f, pulling P_f down to 0; meanwhile, the pull-up path controlled by B_t and B_f is on and drives P_f to 0 as well. The same applies to P_t, which is also driven to 0. This meets the expectation and satisfies the definition of delay insensitivity. Only when all valid data has arrived will the PUN turn off and the PDN start to evaluate. Assume B_t finally arrives while A_f and B_f are still 0. In this case, A and B are both valid 1s. P_f, as the result of (A XNOR B), should output 1. As can be seen from Figure 6, A_t is 1 and M1 is on. Thus B_t is passed to P_f, driving P_f to 1. All other paths are cut off in this situation.
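The behaviour described above can be captured in a small behavioural model. The sketch below is a logic-level abstraction, not a transistor-level simulation: the PUN is modelled as forcing the output to the empty state whenever A or B is empty, and the PDN as the two pass-transistor terms. The function names and the tuple encoding of a dual-rail signal are assumptions made for illustration.

```python
from itertools import product

EMPTY = (0, 0)  # dual-rail empty state: (true rail, false rail)


def is_empty(sig):
    """A dual-rail signal is empty when both rails are 0."""
    return sig == EMPTY


def encode(bit):
    """Encode a Boolean value as a valid dual-rail pair."""
    return (1, 0) if bit else (0, 1)


def propagation(a, b):
    """Behavioural model of the P_t / P_f generators.

    a, b are (true_rail, false_rail) tuples; returns (P_t, P_f).
    """
    a_t, a_f = a
    b_t, b_f = b
    # PUN: while A or B is still empty, both outputs are held at 0.
    if is_empty(a) or is_empty(b):
        return EMPTY
    # PDN: pass-transistor terms selected by A_t / A_f.
    p_t = (a_t & b_f) | (a_f & b_t)  # A XOR B
    p_f = (a_t & b_t) | (a_f & b_f)  # A XNOR B
    return (p_t, p_f)


# Case from the text: A is valid 1, B has not arrived yet.
assert propagation((1, 0), EMPTY) == EMPTY
# B arrives as valid 1: P_f (XNOR) becomes 1.
assert propagation((1, 0), (1, 0)) == (0, 1)

# All valid input pairs give a valid output that matches XOR.
for a_bit, b_bit in product((0, 1), repeat=2):
    assert propagation(encode(a_bit), encode(b_bit)) == encode(a_bit ^ b_bit)
```

The assertions replay the two cases walked through in the text and then check every valid input pair.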

Carry-out can be generated by replacing several inputs of the Propagation generator, as shown in Figure 7. The A and B inputs of the PUN are replaced by Propagation and Carry-in. In the PDN, when P_t is true, Cin_t is passed to Cout_t; when P_f is true, Cout_t is the same as A_t. Similarly, Cout_f can be generated by replacing A_t with A_f (and Cin_t with Cin_f). As explained previously, Propagation is generated from the two addends through the same circuit topology. Propagation is therefore delay-insensitive with respect to A and B. Carry-out, which takes Carry-in and Propagation as inputs, is delay-insensitive with respect to Carry-in and Propagation, and thus with respect to all three inputs of the adder.
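A matching behavioural sketch for the Carry-out generator follows the same pattern, with Propagation and Carry-in in the empty-detection role. This is a logic-level abstraction under assumed names and encoding (a dual-rail signal is a (true rail, false rail) tuple), not the transistor netlist of Figure 7.

```python
from itertools import product

EMPTY = (0, 0)  # dual-rail empty state


def encode(bit):
    """Encode a Boolean value as a valid dual-rail pair."""
    return (1, 0) if bit else (0, 1)


def carry_out(p, cin, a):
    """Return (Cout_t, Cout_f) from Propagation p, Carry-in cin and addend a."""
    p_t, p_f = p
    cin_t, cin_f = cin
    a_t, a_f = a
    # PUN: output held empty until both Propagation and Carry-in are valid.
    if p == EMPTY or cin == EMPTY:
        return EMPTY
    # PDN: P_t passes Carry-in, P_f passes A.
    cout_t = (p_t & cin_t) | (p_f & a_t)
    cout_f = (p_t & cin_f) | (p_f & a_f)
    return (cout_t, cout_f)


# Check against the usual carry function for every valid input combination.
for a_bit, b_bit, c_bit in product((0, 1), repeat=3):
    p = encode(a_bit ^ b_bit)  # valid Propagation from the first stage
    expected = (a_bit & b_bit) | (c_bit & (a_bit ^ b_bit))
    assert carry_out(p, encode(c_bit), encode(a_bit)) == encode(expected)

# A late Carry-in keeps Carry-out empty even though A and B are valid.
assert carry_out(encode(0), EMPTY, encode(1)) == EMPTY
```

Note that without the empty detection on Carry-in, the P_f path alone would already produce a valid Carry-out from A; the model keeps the output empty in that case, as the text requires.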

Figure 7: Carry-Out Signal Generation Circuit

Similarly, Sum is implemented as shown in Figure 8.

Figure 8: Sum Signal Generation Circuit

All these blocks are combined to implement the full adder. As shown in Figure 9, the two blocks on the left are Propagation generators, and the four blocks on the right generate Sum and Carry-out. When A and B are both valid, a valid Propagation signal is generated. Cout and Sum are generated when both Propagation and Cin are valid. If one signal arrives later than the others due to unbalanced path delays, the Sum and Carry-out signals remain in the empty state.
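Putting the six blocks together, the sketch below models the full adder with one shared block function, reflecting the statement that all blocks share one topology and differ only in their inputs. It then checks two properties: valid inputs give the arithmetic sum, and any missing input leaves both outputs empty. The block function, its argument names, and the assumption that the Sum generator uses Propagation and Carry-in for empty detection (as the Carry-out generator does) are illustrative.

```python
from itertools import product

EMPTY = (0, 0)  # dual-rail empty state: (true rail, false rail)


def encode(bit):
    """Encode a Boolean value as a valid dual-rail pair."""
    return (1, 0) if bit else (0, 1)


def block(x, y, pass_if_xt, pass_if_xf):
    """One rail of the shared topology.

    x, y: dual-rail signals watched by the PUN for the empty state.
    pass_if_xt / pass_if_xf: rail values passed by the PDN when the
    true / false rail of x is 1.
    """
    if x == EMPTY or y == EMPTY:
        return 0
    x_t, x_f = x
    return (x_t & pass_if_xt) | (x_f & pass_if_xf)


def full_adder(a, b, cin):
    """Return (sum, cout) as dual-rail tuples."""
    a_t, a_f = a
    b_t, b_f = b
    c_t, c_f = cin
    # Stage 1: two Propagation generators (P_t, P_f).
    p = (block(a, b, b_f, b_t), block(a, b, b_t, b_f))
    # Stage 2: two Sum generators and two Carry-out generators.
    s = (block(p, cin, c_f, c_t), block(p, cin, c_t, c_f))
    cout = (block(p, cin, c_t, a_t), block(p, cin, c_f, a_f))
    return s, cout


# Valid inputs: outputs encode the arithmetic result.
for a_bit, b_bit, c_bit in product((0, 1), repeat=3):
    total = a_bit + b_bit + c_bit
    s, cout = full_adder(encode(a_bit), encode(b_bit), encode(c_bit))
    assert s == encode(total & 1) and cout == encode(total >> 1)

# Any late (still empty) input keeps Sum and Carry-out empty.
states = [EMPTY, encode(0), encode(1)]
for a, b, cin in product(states, repeat=3):
    if EMPTY in (a, b, cin):
        assert full_adder(a, b, cin) == (EMPTY, EMPTY)
```

Using a single `block` function for all six rails keeps the model close to the shared-topology idea: only the wiring of the inputs changes between Propagation, Sum, and Carry-out.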

Figure 9: System Level Implementation

2.2 Simulation Results

The adder is functional on first silicon. The die photo and layout of the whole adder chip are shown in Figures 10 and 11.

Figure 10: Die Photo of Adder Chip

Figure 11: Layout of Adder Chip

In Figure 12, the bottom waveform is the input and the top waveform is the output of the adder (Cout_t here).

Figure 12: Testing Results of 1-Bit QDI-DR Full Adder

Power-Delay Product (PDP) is used as the Figure of Merit (FoM) to compare the proposed adder with other state-of-the-art adders in the literature. Since power and delay depend on the input data pattern, all patterns are tested and the average power and delay are calculated.
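The averaging procedure can be sketched as follows. The sketch assumes that PDP is taken as the product of the pattern-averaged power and the pattern-averaged delay; the function name and the sample values are hypothetical and are not measurements from this work.

```python
def average_pdp(measurements):
    """Return (avg_power_w, avg_delay_s, pdp_j) over all input patterns.

    measurements: iterable of (power_w, delay_s), one pair per pattern.
    """
    samples = list(measurements)
    if not samples:
        raise ValueError("at least one measurement is required")
    avg_power = sum(p for p, _ in samples) / len(samples)
    avg_delay = sum(d for _, d in samples) / len(samples)
    return avg_power, avg_delay, avg_power * avg_delay


# Hypothetical (power in W, delay in s) pairs for four input patterns.
patterns = [(2e-9, 1e-6), (4e-9, 3e-6), (3e-9, 2e-6), (3e-9, 2e-6)]
avg_power, avg_delay, pdp = average_pdp(patterns)
print(f"PDP = {pdp * 1e15:.2f} fJ")
```

Averaging the per-pattern products instead would generally give a different number, so the convention should be stated whenever PDP values are compared across designs.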

Table 3: Comparison of PDP between Different Subthreshold Adders

Technology (µm) | Frequency (MHz) | Lowest Power (nW) | Best PDP (fJ)

The PDP of the proposed adder is also comparable with that of another recently developed adder [53], which is implemented in 65 nm technology and operates at 1 MHz. Despite its operating speed disadvantage, which may be caused by the technology, the proposed adder has a PDP that is 74.7% of that adder's PDP.

As mentioned previously, the proposed adder is QDI; thus it should be robust against delay variations caused by PVT variations. The proposed adder is simulated at temperatures ranging from 10 °C to 100 °C in steps of 5 °C, at supply voltages of 200 mV and 250 mV. The worst-case input pattern is simulated. The adder is functional at all temperatures under both voltages. In Figures 13 and 14, power is plotted against the primary Y-axis on the left, while delay and PDP are plotted against the secondary Y-axis on the right.

Figure 13: Delays, Power and PDP under 200mV

Figure 14: Delays, Power and PDP under 250mV

Delay decreases and power increases with increasing temperature. The delays at 10 °C and at 100 °C differ by about an order of magnitude. As argued, such a large delay variation may violate the timing assumptions of synchronous designs and asynchronous matched-delay designs; otherwise, those designs need large timing margins, which largely compromises performance.

Mismatched input data is shown in Figure 15. (Here the high and low levels refer to the valid and empty states of the four-phase dual-rail protocol, not to logic high and logic low.) The rising and falling edges of the input data divide each four-phase cycle into four regions, T1 to T4. In the previous circuit topology, the circuit does not output valid data until all valid inputs have arrived. Therefore, the circuit outputs valid data only in T3. Empty states are generated in all the other regions due to the lack of valid inputs. This produces an output with a narrower data window than the inputs.
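The narrowing of the data window can be illustrated by treating each input's valid state as a time interval: the output can only be valid while both inputs are valid, so its window is at most the intersection of the two. The interval representation, function name, and time values below are assumptions for illustration, and circuit delay is ignored.

```python
def output_valid_window(a_valid, b_valid):
    """Return the (start, end) interval where both inputs are valid.

    a_valid, b_valid: (start, end) intervals of the valid state.
    Returns None when the two valid windows do not overlap.
    """
    start = max(a_valid[0], b_valid[0])
    end = min(a_valid[1], b_valid[1])
    return (start, end) if start < end else None


# Hypothetical timing in arbitrary units: B arrives later than A.
a_valid = (0, 10)
b_valid = (6, 16)
window = output_valid_window(a_valid, b_valid)
print(window)  # (6, 10): narrower than either input window

# If the narrowed output feeds a stage whose other input is valid at a
# different time, the two windows may not overlap at all.
print(output_valid_window(window, (12, 20)))  # None
```

The second call shows how the effect compounds across cascaded stages: a window that has already shrunk is more likely to miss the valid window of the next stage's other input.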

[Timing diagram: inputs A and B and output Out; regions T1 to T4 and T5]

Figure 15: Adder Output for Mismatched Input Data

The output of the adder mentioned above is fed into a second adder. B in Figure 16 is the output with the narrow data window from the previous stage (Out in Figure 15). A is the other input of the current stage. It can be seen that, due to the narrow data window, there is no overlap between two
