University of Tennessee, Knoxville
TRACE: Tennessee Research and Creative Exchange
Junjie Lu, University of Tennessee - Knoxville, jlu9@vols.utk.edu
Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss
Part of the Electrical and Electronics Commons, and the VLSI and Circuits, Embedded and Hardware Systems Commons
Recommended Citation
Lu, Junjie, "An Analog VLSI Deep Machine Learning Implementation." PhD diss., University of Tennessee, 2014.
https://trace.tennessee.edu/utk_graddiss/2709
This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact trace@utk.edu.
To the Graduate Council:
I am submitting herewith a dissertation written by Junjie Lu entitled "An Analog VLSI Deep Machine Learning Implementation." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy, with a major in Electrical Engineering.
Jeremy Holleman, Major Professor
We have read this dissertation and recommend its acceptance:
Benjamin J. Blalock, Itamar Arel, Xiaopeng Zhao
Accepted for the Council: Carolyn R. Hodges, Vice Provost and Dean of the Graduate School (Original signatures are on file with official student records.)
An Analog VLSI Deep Machine Learning Implementation
A Dissertation Presented for the Doctor of Philosophy Degree
The University of Tennessee, Knoxville
Junjie Lu
May 2014
Acknowledgement
I would like to express my sincere gratitude to my advisor, Dr. Jeremy Holleman, for his support, guidance and encouragement. His profound knowledge and rigorous attitude toward research inspire me to grow and will benefit me in my future professional and personal life.
I am also deeply grateful to Dr. Benjamin J. Blalock, Dr. Itamar Arel and Dr. Xiaopeng Zhao for serving as my Ph.D. committee members. Their valuable suggestions helped me improve my research and dissertation.
I would like to thank Dr. Itamar Arel and Mr. Steven Young for their great help and support in the analog machine learning project. Their expertise in machine learning was essential to this project, from architecture definition to testing and data processing.
I would like to thank my colleagues in the ISiS lab at the University of Tennessee, Mr. Tan Yang and Mr. M. Shahriar Jahan, for their help and friendship.
Last but most important, I offer my deepest gratitude and love to my parents, Minghua Lu and Huijun Wang, and my wife, Yang Xue, for their unconditional love, support and confidence in me.
Abstract
Machine learning systems provide automated data processing and see a wide range of applications. Direct processing of raw high-dimensional data such as images and videos by machine learning systems is impractical, both due to prohibitive power consumption and due to the "curse of dimensionality," which makes learning tasks exponentially more difficult as the dimension increases. Deep machine learning (DML) mimics the hierarchical presentation of information in the human brain to achieve robust automated feature extraction, reducing the dimension of such data. However, the computational complexity of DML systems limits large-scale implementations in standard digital computers. Custom analog signal processing (ASP) can yield much higher energy efficiency than digital signal processing (DSP), presenting a means of overcoming these limitations.

The purpose of this work is to develop an analog implementation of a DML system.

First, an analog memory is proposed as an essential component of the learning system. It uses the charge trapped on a floating gate to store analog values in a non-volatile way. The memory is compatible with standard digital CMOS processes and allows random-accessible bidirectional updates without the need for an on-chip charge pump or high-voltage switch.

Second, architecture and circuits are developed to realize an online k-means clustering algorithm in analog signal processing. It achieves automatic recognition of underlying data patterns and online extraction of data statistical parameters. This unsupervised learning system constitutes the computation node in the deep machine learning hierarchy.

Third, a 3-layer, 7-node analog deep machine learning engine is designed featuring online unsupervised trainability and non-volatile floating-gate analog storage. It utilizes a massively parallel reconfigurable current-mode analog architecture to realize efficient computation, and algorithm-level feedback is leveraged to provide robustness to circuit imperfections in analog signal processing. At a processing speed of 8300 input vectors per second, it achieves a peak energy efficiency of 1×10^12 operations per second per watt.

In addition, an ultra-low-power tunable bump circuit is presented to provide similarity measures in analog signal processing. It incorporates a novel wide-input-range tunable pseudo-differential transconductor. The circuit demonstrates tunability of bump center, width and height with a power consumption significantly lower than previous works.

Keywords: analog signal processing, deep machine learning, floating gate memory, current mode computation, k-means clustering, power efficiency
Table of Contents
Chapter 1 Introduction
1.1 Introduction to Machine Learning
1.1.1 Machine Learning: Concepts and Applications
1.1.2 Three Types of Machine Learning
1.1.3 DeSTIN - A Deep Learning Architecture
1.2 Analog Deep Machine Learning Engine - the Motivation
1.2.1 Analog versus Digital - the Neuromorphic Arguments
1.2.2 Analog Advantages
1.2.3 Inaccuracies in Analog Computation
1.2.4 Analog versus Digital - Parallel Computation
1.3 Original Contributions
1.4 Dissertation Organization
Chapter 2 A Floating-Gate Analog Memory with Random-Accessible Bidirectional Sigmoid Updates
2.1 Overview of Floating Gate Device
2.1.1 Principles of Operation
2.1.2 Fowler-Nordheim Tunneling
2.1.3 Hot Electron Injection
2.2 Literature Review on Floating Gate Analog Memory
2.3 Proposed Floating Gate Analog Memory
2.3.1 Floating-Gate Analog Memory Cell
2.3.2 Floating Gate Memory Array
2.3.3 Measurement Results
Chapter 3 An Analog Online Clustering Circuit in 0.13 µm CMOS
3.1 Introduction and Literature Review of Clustering Circuit
3.2 Architecture and Algorithm
3.3 Circuit Implementation
3.3.1 Floating-Gate Analog Memory
3.3.2 Distance Computation (D3) Block
3.3.3 Time-Domain Loser-Take-All (TD-LTA) Circuit
3.3.4 Memory Adaptation (MA) Circuit
3.4 Measurement Results
Chapter 4 Analog Deep Machine Learning Engine
4.1 Introduction and Literature Review
4.2 Architecture and Algorithm
4.3 Circuit Implementation
4.3.1 Floating-Gate Analog Memory (FGM)
4.3.2 Reconfigurable Analog Computation (RAC)
4.3.3 Distance Processing Unit (DPU)
4.3.4 Training Control (TC)
4.3.5 Biasing and Layout Design
4.4 Measurement Results
4.4.1 Input Referred Noise
4.4.2 Clustering Test
4.4.3 Feature Extraction Test
4.4.4 Performance Summary and Comparison
Chapter 5 A Nano-Power Tunable Bump Circuit
5.1 Introduction and Literature Review
5.2 Circuit Design
5.3 Measurement Results
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References
Vita
List of Tables
Table I: Performance Summary of the Floating Gate Memory
Table II: Performance Summary of the Clustering Circuit
Table III: Performance Summary and Comparison of the Improved FG Memory
Table IV: Performance Summary of the Analog Deep Learning Engine
Table V: Comparison to Previous Works
Table VI: Performance Summary and Comparison of the Bump Circuit
List of Figures
Figure 1-1: The DeSTIN hierarchical architecture [6]
Figure 1-2: Bump circuit, which computes tanh(V1−V2) and its derivative simultaneously [15]
Figure 2-1: Cross-section of a typical FG NFET in a bulk CMOS process [28]
Figure 2-2: Energy band diagram of the Si/SiO2 interface (a) with and (b) without applied field [31]
Figure 2-3: Hot electron injection in a PFET
Figure 2-4: Schematic of the proposed floating-gate analog memory cell
Figure 2-5: (a) Schematic of the transconductor and (b) its transfer function
Figure 2-6: (a) Tunneling current versus oxide voltage Vox. (b) Injection current versus drain-to-source voltage of the injection transistor
Figure 2-7: Simplified schematics and typical nodal voltages of memory cells (a) not selected and (b) selected for tunneling
Figure 2-8: Block diagram of the FG analog memory array, and a table showing control signal settings for different operation modes of the cells
Figure 2-9: (a) Chip micrograph of the memory array together with on-chip adaptation circuitry and (b) layout view of a single memory cell
Figure 2-10: Analog memory programming accuracy of 30 linearly spaced values
Figure 2-11: Ramping of the memory value, showing the update rules
Figure 2-12: Crosstalk among the 31 unselected cells when a selected cell is injected or tunneled with a magnitude of 10 nA
Figure 3-1: The architecture of the proposed analog online clustering circuit, with the details of the memory and distance computation cell
Figure 3-2: The schematic of the D3 block
Figure 3-3: The simplified schematic of (a) the LTA network, (b) one cell of the LTA, and (c) typical timing diagrams
Figure 3-4: (a) The simplified schematic and (b) timing diagram of the MA circuit
Figure 3-5: Classification test results
Figure 3-6: Clustering test results
Figure 4-1: The architecture of the analog deep machine learning engine and possible application scenarios
Figure 4-2: (a) The node architecture. The clustering algorithm implemented by the node is illustrated in (b)-(e)
Figure 4-3: Timing diagram of the intra-cycle power gating
Figure 4-4: The schematic of the improved floating-gate analog memory
Figure 4-5: The layout of the new FGM
Figure 4-6: The schematic of the reconfigurable analog computation cell and the switch positions for three operation modes
Figure 4-7: The measured transfer functions with the RAC configured in belief construction mode
Figure 4-8: (a) Behavioral model of the RAC with gain errors. (b) System's classification error rate as a function of each error
Figure 4-9: The schematic of one channel of the distance processing unit
Figure 4-10: Timing diagram of data sampling across the hierarchy to enable pipelined operation
Figure 4-11: (a) The schematic of the sample and hold and (b) simulated charge injection and droop errors
Figure 4-12: The schematic and timing diagram of the starvation trace circuit
Figure 4-13: Biasing schemes. (a) Voltage distribution. (b) Current distribution. (c) Proposed hybrid biasing. (d) Measured mismatch of biasing
Figure 4-14: Conceptual diagram showing how the RAC array is assembled from the RAC cells
Figure 4-15: (a) Chip micrograph and (b) custom test board
Figure 4-16: (a) The system model for noise measurement. (b) Measured classification results and extracted Gaussian distribution
Figure 4-17: The clustering test results
Figure 4-18: The extracted parameters plotted versus their true values
Figure 4-19: Clustering results with a bad initial condition, without and with the starvation trace enabled
Figure 4-20: The feature extraction test setup
Figure 4-21: (a) The convergence of centroids during training. (b) Rich features output from the top layer
Figure 4-22: Measured classification accuracy using the features extracted by the chip
Figure 4-23: The performance and energy breakdown in the training mode
Figure 5-1: Schematic of the proposed tunable bump circuit
Figure 5-2: Bump circuit micrograph, layout, and the test setup
Figure 5-3: (a) Transconductor output, (b) normalized gm (IW=0)
Figure 5-4: The measured bump transfer functions showing (a) variable center, (b) variable width, and (c) variable height
Figure 5-5: The measured 2-D bump output with different widths on x and y dimensions
Chapter 1 Introduction
This chapter introduces the background and motivation of this work. It first discusses some basic ideas of machine learning and deep machine learning systems. Then the advantages of analog signal processing are analyzed, justifying the purpose of the analog deep machine learning implementation. The structure and organization of the dissertation is given in the last part.
1.1 Introduction to Machine Learning

1.1.1 Machine Learning: Concepts and Applications

Learning covers a broad range of activities and processes and is therefore difficult to define precisely. In general, it involves acquiring new, or modifying and reinforcing existing, knowledge, behaviors, skills, values, or preferences, and may involve synthesizing different types of information [1]. Learning was first studied by psychologists and zoologists as a subject concerning humans and animals, and it is arguable that many techniques in machine learning are derived from the learning processes of humans or animals.
Machine learning is generally concerned with a machine that automatically changes its structure, program, or data based on its inputs or in response to external information in order to improve its performance. The "changes" might be either enhancements to already performing systems or the synthesis of new functions or systems.
In the past, machines were programmed to perform a fixed task from the outset. The reasons behind the need for a "learning machine" are manifold.
First, the environments in which the machines are used are often hard to define at the time of programming and change over time. Machine learning methods can be used for on-the-job improvement of existing designs and adaptation to a changing environment, thereby reducing the need for constant redesign.
Second, machine learning provides a means of automated data analysis, which is especially important in the face of the deluge of data in our era. It is possible that hidden among large piles of data are important relationships and correlations. For example, Wal-Mart handles more than 1M transactions per hour and has databases containing more than 2.5 petabytes (2.5×10^15 bytes) of information [2]. From these data, a machine learning algorithm can extract the purchase patterns of people from different demographic profiles and make customized buying recommendations to them [3]. Machine learning methods used to extract these relationships are called "data mining."
Another reason is that new knowledge about tasks is constantly being discovered. There is a constant stream of new events in the world, and it is impractical to continuously redesign systems to accommodate new knowledge. However, machine learning methods are able to keep track of these new trends. Since the methods are data-driven, learning-based algorithms are often more accurate than stationary algorithms when facing the ever-changing world.
Apart from the above-mentioned reasons, there are many more reasons why machine learning has become a heated area of research in recent years. Moreover, it is not merely a research topic, but has penetrated into people's lives and become a powerful and indispensable tool in a wide variety of applications:
• To predict whether a patient will respond to a particular drug/therapy based on microarray profiles in bioinformatics
• To categorize text to filter out spam emails
• To detect fraud in banking and credit card institutions
• Optical character recognition
• Machine vision, face detection and recognition
• Natural language processing in automatic translation
• Market segmentation
• Robot control
• Classification of stars and galaxies
• Weather or stock market price forecast
• Electric power load prediction
1.1.2 Three Types of Machine Learning

Machine learning is usually divided into three main types [2].
In the supervised learning approach, an output or label is given to each input in the training data set, and the machine learning system learns the mapping from inputs to outputs. The simplest training input can be a multi-dimensional vector of numbers, representing the features of the data being learned. In general, however, the input can have a complex structure, representing an image, a sentence, a time sequence, etc. The output of the system can be either a categorical label or a real-valued variable. When the output is a label, the problem is referred to as classification or pattern recognition, and when the output is a real-valued scalar, the problem is called regression.
The second type of machine learning is unsupervised learning, where only the inputs are given and the learning system finds underlying patterns in the data. The problem of unsupervised learning is less well-defined compared to supervised learning, because the system is free to look
for any patterns, and there is no obvious error metric to correct the current perception. However, it is arguably more typical of human and animal learning, because we acquire most of our knowledge without being told what the right answers are. Unsupervised learning is also more widely applicable because it does not require a human expert to manually label the data.
There is a third type of machine learning, known as reinforcement learning, in which the machine learns how to act or behave based on occasional external signals. In some applications, the output of the system is a sequence of actions. The machine is given occasional reward or punishment signals based on the goodness of its actions, and the goal is to learn how to act or behave so as to maximize the reward and minimize the punishment. Reinforcement learning is employed in applications where a single move is not as important as the rule or policy of the behavior, for example, game playing or robot navigation.
1.1.3 DeSTIN - A Deep Learning Architecture

1.1.3.1 The Curse of Dimensionality
A machine learning system usually processes observations in a multi-dimensional space. When the dimension of the observations is large, such as that of an image or video, a phenomenon called the "curse of dimensionality" [4] arises. This phenomenon stems from the fact that as the dimensionality increases, the volume of the space increases exponentially and, as a result, the available data become sparse. This sparsity reduces the predictive power of machine learning systems. In order to obtain a statistically sound and reliable result, the amount of data and computational power needed to support the result often grows exponentially with the dimensionality.
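This exponential growth is easy to illustrate numerically (a minimal sketch, not taken from the dissertation): keeping a fixed sampling density of k points per axis requires k^d samples in d dimensions.

```python
# Minimal sketch (not from the dissertation): the number of samples needed to keep
# a fixed resolution of k points per axis grows as k**d with the dimensionality d.
def samples_for_fixed_density(k: int, d: int) -> int:
    """Number of grid cells when each of d axes is split into k bins."""
    return k ** d

for d in (1, 2, 3, 10, 100):
    print(f"d = {d:3d}: {samples_for_fixed_density(10, d):.3e} cells")
# d = 100 already requires 1e100 cells, far more data than could ever be collected.
```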
1.1.3.2 Deep Machine Learning
When dealing with high-dimensional data such as images or videos, it is often necessary to pre-process the data to reduce its dimensionality to what can be efficiently processed, while still preserving the "essence" of the data. Such dimensionality reduction schemes are often referred to as feature extraction techniques.
The most effective feature extraction engine we know of might be our brain. The human brain can process information with an efficiency and robustness that no machine can match. It is exposed to a sea of sensory data every second and is able to capture the critical aspects of these data in a way that allows for future use in a concise manner. Therefore, mimicking the performance of the human brain has been a core goal and challenge in machine learning research. Recent neuroscience findings have provided insight into information representation in the human brain. One of the key findings has been that sensory signals propagate through a complex hierarchy of modules that, over time, learn to represent observations based on the regularities they exhibit. This discovery motivated the emergence of the subfield of deep machine learning, which focuses on computational models for information representation that exhibit characteristics similar to those of the neocortex [5].
1.1.3.3 Deep Spatiotemporal Inference Network (DeSTIN)
Figure 1-1: The DeSTIN hierarchical architecture [6]

The deep learning architecture adopted in this work is based on the Deep Spatiotemporal Inference Network (DeSTIN) architecture, first introduced in [6]. DeSTIN consists of multiple instantiations of an identical functional unit called the cortical circuit (node); each node is a parameterized model which learns by means of an unsupervised learning process. These nodes are arranged in layers, and each node is assigned children nodes from the layer below and a parent node from the layer above, as shown in Figure 1-1. Nodes at the lowest layer receive raw sensory data, while nodes at all other layers receive the belief states, or outputs, of their children nodes as input. Each node attempts to capture the salient spatiotemporal regularities contained in its input and continuously updates a belief state meant to characterize the input and the sequences thereof. The beliefs formed throughout the architecture can then be used as rich
features for a classifier that can be trained using supervised learning. Beliefs extracted from the lower layers will characterize local features, and beliefs from higher layers will characterize global features. Thus, DeSTIN can be viewed as an unsupervised feature extraction engine that forms features from data based on the regularities it observes. In this framework, a common cortical circuit populates the entire hierarchy, and each of these nodes operates independently and in parallel with all other nodes. This solution is not constrained to a layer-by-layer training procedure, making it highly attractive for implementation on parallel processing platforms. Its simplicity and repetitive structure facilitate parallel processing and straightforward training [5].
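To make the structure concrete, the following sketch is illustrative only: the node update shown is a plain online k-means step, not the exact DeSTIN belief update of [6], and all sizes and learning rates are arbitrary. It shows a small hierarchy in which every node clusters the concatenated outputs of its children.

```python
import numpy as np

# Illustrative sketch of a DeSTIN-like hierarchy (not the exact algorithm from [6]):
# every node runs the same unsupervised learner on the concatenated outputs
# (beliefs) of its children; the lowest layer sees raw input patches.
class Node:
    def __init__(self, input_dim, n_centroids=4, lr=0.05, rng=None):
        rng = rng or np.random.default_rng(0)
        self.centroids = rng.normal(size=(n_centroids, input_dim))
        self.lr = lr

    def step(self, x):
        """One online k-means update; returns a soft 'belief' over centroids."""
        d = np.linalg.norm(self.centroids - x, axis=1)
        w = int(np.argmin(d))                      # winning centroid
        self.centroids[w] += self.lr * (x - self.centroids[w])
        belief = np.exp(-d) / np.exp(-d).sum()     # similarity-based belief
        return belief

# A 2-layer, 5-node toy hierarchy: 4 leaf nodes feed 1 parent node.
leaves = [Node(input_dim=4) for _ in range(4)]
parent = Node(input_dim=4 * 4)

patches = np.random.default_rng(1).normal(size=(100, 4, 4))  # 100 frames, 4 patches each
for frame in patches:
    child_beliefs = np.concatenate([n.step(p) for n, p in zip(leaves, frame)])
    top_belief = parent.step(child_beliefs)        # rich feature for a classifier
```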
1.2 Analog Deep Machine Learning Engine - the Motivation

Deep layered architectures offer excellent performance attributes. However, the computation requirements involved grow dramatically as the dimensionality of the input space increases. Compositional deep layered architectures compose multiple instantiations of a common cell, and the computation is performed concurrently. In CPU-based platforms, however, processing is performed sequentially, thereby greatly increasing execution time [7]. Therefore, many recent research efforts focus on implementing DML systems on GPUs. While GPUs have advantages over CPU-based realizations in computation time and cost/performance ratio, they are power hungry, making such schemes impractical in energy-constrained environments and limiting the scale of these systems.
Custom analog circuitry presents a means of overcoming the limitations of digital VLSI technology. By fully leveraging the computational power of transistors, exploiting the inherent tolerance to inaccuracies of the learning algorithm, and performing computation in a slow but massively parallel fashion, the proposed analog deep machine learning engine promises to greatly improve on the power efficiency of digital DML systems and to take full advantage of the scaling potential of DeSTIN.
1.2.1 Analog versus Digital - the Neuromorphic Arguments

It is meaningful to compare the energy efficiency of the brain and the digital computer. The human brain is estimated to perform roughly 10^15 synapse operations at about 10 impulses/sec, and its total energy consumption is about 25 watts [8]. This yields an energy efficiency of about 10^15 operations per joule. Today's supercomputers can perform 8.2 billion megaflops with a power consumption of 9.9 million watts, enough to power 10,000 houses [9]. Their energy efficiency is thus about 8.3×10^8 operations per joule, more than 6 orders of magnitude lower than that of the human brain.
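These back-of-envelope numbers can be checked directly (a quick sketch; the brain-side figures are the rough estimates quoted from [8], not measurements).

```python
# Rough energy-efficiency comparison quoted in the text (order-of-magnitude only).
brain_ops_per_s = 1e15 * 10          # ~1e15 synapses firing ~10 times per second
brain_power_w = 25.0                 # ~25 W total
brain_ops_per_j = brain_ops_per_s / brain_power_w            # ~4e14 op/J, i.e. ~1e15

super_ops_per_s = 8.2e9 * 1e6        # 8.2 billion megaflops = 8.2e15 FLOPS
super_power_w = 9.9e6                # 9.9 MW
super_ops_per_j = super_ops_per_s / super_power_w            # ~8.3e8 op/J

print(f"brain: {brain_ops_per_j:.1e} op/J, supercomputer: {super_ops_per_j:.1e} op/J")
print(f"ratio: {brain_ops_per_j / super_ops_per_j:.1e}")     # ~5e5, i.e. ~6 orders of magnitude
```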
The great discrepancy between the energy efficiencies of neurobiology and electronics suggests that there are fundamental differences in the ways they perform computation. One significant difference is the state variables they use. Digital computers employ only two state variables, ignoring all the values in between, to achieve noise immunity at the expense of dynamic range. Neurons, on the other hand, represent and process information in the analog domain: the firing rates are continuous variables, and each neuron resembles a lossy integrator with the leakage controlled by a fluctuating number of ion channels. Analog signaling allows a single wire to carry multi-bit information, thereby greatly increasing power and area efficiency. It also interfaces naturally with the analog computation primitives discussed below.

The other important trait leading to the enormous efficiency of neurological systems is their clever exploitation of the physics they are built with. The nervous system does basic aggregation of information using the conservation of charge: Kirchhoff's current law implements current summing, and this current is integrated with respect to time by the node capacitance. In the
neural tissue, ions are in thermal equilibrium and their energies are Boltzmann distributed. If an energy barrier exists and is modulated by an applied voltage, the current through the barrier will be an exponential function of that applied voltage [10]. This principle is used to create active devices and compute complex nonlinear functions in neural computation. The principle of operation of the transistors in an integrated circuit can be surprisingly similar to that of the nervous system: in weak inversion, the energy barrier for a carrier to travel from source to drain is modulated by the gate voltage; therefore the drain current is exponentially dependent on the gate voltage. However, digital computers completely disregard these computation primitives inherent in the device physics and use only two extremes of the operating points, the on and off states, thereby representing information with 0 or 1 only. This also confines us to a set of very limited elementary operations: NOT, NOR, OR or their equivalents. This is in contrast with how the neuron does computation and can cause a factor of 10^4 efficiency penalty [10]. Analog circuits provide a means to reclaim this efficiency loss: by exploiting the computational primitives inherent in the device physics, as our brains do, operations can be carried out naturally with much higher efficiency.
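As a concrete illustration of this exponential law, the sketch below uses the standard first-order subthreshold MOS expression; the parameter values are typical textbook numbers and are assumptions, not values from this work.

```python
import math

# Standard weak-inversion (subthreshold) drain-current model in saturation:
#   I_D = I_0 * exp(V_GS / (n * U_T))
# Parameter values below are typical textbook numbers, not from this design.
I0 = 1e-15      # A, process-dependent prefactor (assumed)
n = 1.5         # subthreshold slope factor (assumed)
UT = 0.0259     # V, thermal voltage at room temperature

def id_weak_inversion(vgs: float) -> float:
    return I0 * math.exp(vgs / (n * UT))

# Every n*UT*ln(10) ~= 90 mV increase in V_GS multiplies I_D by 10x.
for vgs in (0.20, 0.30, 0.40):
    print(f"V_GS = {vgs:.2f} V -> I_D = {id_weak_inversion(vgs):.2e} A")
```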
The scaling of CMOS technology reduces the power of digital systems. However, this scaling trend is slowing down and approaching its end due to physical limitations such as the thickness of the gate oxide [11]. In addition, the power of digital systems does not scale as fast as the feature size, because threshold and supply voltage scaling saturate in order to keep down subthreshold leakage [12]. On the other hand, analog systems can also benefit from technology scaling. The improved subthreshold slope of FinFETs improves the transconductance efficiency in weak inversion, and thus the computation efficiency [13], while the reduced wiring parasitic capacitance improves computation throughput.
1.2.2 Analog Advantages
Analog signal processing makes use of the physics of the devices: the physical relations of transistors, capacitors and resistors, Kirchhoff's current and voltage laws, and so on. It also represents information with multi-bit encoding. Therefore it can be far more efficient than digital signal processing. For example, the addition of two numbers takes only one wire in an analog circuit by using Kirchhoff's current law, whereas it takes about 240 transistors in static CMOS digital circuits to implement an 8-bit adder. Similarly, an 8-bit multiplication in the analog domain using current-mode operation takes 4 to 8 transistors, whereas a parallel 8-bit digital multiplier has approximately 3000 transistors [14]. Another example is the bump circuit, shown in Figure 1-2, which computes the derivative of tanh(·). The bump circuit simultaneously provides a measure of similarity between two inputs and the tanh(·) of their difference. The bump function can also be used as a probability distribution, as it peaks at zero difference and saturates to zero for large differences. The bump circuit illustrates the power advantage that analog computation holds over digital methods. A bump circuit biased at 200 fA can evaluate the similarity to a stored value at about 200 observations per second, according to simulations using 0.24 µm transistors. A single inverter consumes about four times that much when switching at 200 Hz, and about half that much statically, without switching at all. To perform a comparable computation digitally would require dozens more transistors and one to two orders of magnitude more current [15].

Figure 1-2: Bump circuit, which computes tanh(V1−V2) and its derivative simultaneously [15]
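A behavioral sketch of the similarity measure provided by the bump of Figure 1-2 is given below. It is illustrative only: the sech^2 form is the ideal derivative of tanh, the real circuit only approximates it, and the scaling constant is an assumption.

```python
import math

# Behavioral model of a bump-type similarity measure (sketch, not the circuit):
# the bump output is the derivative of tanh of the input difference, so it
# peaks when V1 == V2 and falls toward zero for large differences.
def tanh_and_bump(v1: float, v2: float, vl: float = 0.1):
    """Return (tanh, bump) for inputs v1, v2; vl sets the width (assumed 100 mV)."""
    x = (v1 - v2) / vl
    t = math.tanh(x)
    bump = 1.0 - t * t          # d/dx tanh(x) = sech^2(x) = 1 - tanh^2(x)
    return t, bump

for dv in (0.0, 0.05, 0.1, 0.3):
    t, b = tanh_and_bump(0.5 + dv, 0.5)
    print(f"dV = {dv*1000:4.0f} mV -> tanh = {t:+.2f}, similarity = {b:.2f}")
```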
1.2.3 Inaccuracies in Analog Computation
However, analog computation does have some disadvantages when compared to digital, and they are caused by the very same properties that make it more efficient. Analog systems are much more sensitive to noise and offset than digital systems. While digital systems use restoring logic at every computational step to obtain good noise immunity, the use of continuous signal variables prevents analog systems from having any restoring mechanism. Thus, noise accumulation in analog systems becomes severe as the system scales up. It is found in [14] that the cost of precision increases faster for analog systems than for digital ones: the power consumption is a polynomial function of the required signal-to-noise ratio (SNR) in an analog system, whereas in a digital system it is a logarithmic function. Therefore, analog computation is cheaper at low accuracy but more expensive at high accuracy.
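The crossover can be sketched with simple cost models. This is an illustration only: the analog power is taken as the kT-limited, SNR-proportional form and the constants are arbitrary; it is not the analysis of [14].

```python
import math

# Toy cost models for the precision argument (constants are arbitrary):
# analog power grows polynomially with SNR (here linearly, the kT/C limit),
# digital power grows with the word length, i.e. logarithmically with SNR.
def p_analog(snr: float, k: float = 1e-15) -> float:
    return k * snr                      # ~ proportional to SNR (polynomial)

def p_digital(snr: float, e_bit: float = 1e-12) -> float:
    bits = math.log2(snr) / 2           # SNR ~ 2**(2*bits) for an ideal quantizer
    return e_bit * bits                 # ~ proportional to word length (logarithmic)

for bits in (4, 8, 12, 16):
    snr = 2 ** (2 * bits)
    print(f"{bits:2d} bits: analog ~ {p_analog(snr):.1e} W, digital ~ {p_digital(snr):.1e} W")
# With these arbitrary constants the crossover falls around 6-7 bits;
# only the qualitative trend (cheap at low accuracy, expensive at high) matters.
```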
However, in certain cases the feedback inherent to the learning algorithms naturally compensates for inaccuracies introduced by the analog circuits. Similarly, this lack of accuracy in analog signal processing can also be found in neural computers. The brain is known to be built from noisy, inaccurate neurons. For example, many behavioral responses, such as a fly making a course correction after a disturbance, occur over a period of around 30 ms [16]. Neural signals
integrated over comparable time windows typically exhibit a signal-to-noise ratio (SNR) in the range of 1-10 [16], [17], [18], much lower than what can be easily achieved in moderate-precision analog electronics. Comparisons between the noise and power tradeoffs in analog and digital circuits and biological systems have also been explored in [14]. The low SNR and outstanding power efficiency of neural systems suggest that relaxed accuracy requirements for electronic computational primitives could allow aggressive optimization for area and power consumption.
1.2.4 Analog versus Digital - Parallel Computation

The power efficiency of a computational system can be expressed by its delay-power product. The delay-power product of a single stage can be approximated as

P · t_d ≈ V_DD · C_P · (I_D / g_m)    (1.1)

where V_DD is the supply voltage, C_P is the equivalent parasitic capacitance associated with the internal nodes, I_D is the current consumption, and g_m is the equivalent transconductance of the transistors. V_DD and C_P can be scaled down with the technology, and I_D/g_m is minimized when the transistors are biased in weak inversion. Therefore (1.1) indicates that an efficient computational system can be built from slow but massively parallel computational elements biased in weak inversion (or sub-threshold).
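The weak-inversion advantage in (1.1) comes from the I_D/g_m term; the following sketch compares that term in the two operating regions using the standard first-order expressions, with assumed parameter values rather than data from this work.

```python
# I_D/g_m for the two MOS operating regions (standard first-order expressions):
#   weak inversion:    g_m = I_D / (n*U_T)   -> I_D/g_m = n*U_T   (~40 mV)
#   strong inversion:  g_m = 2*I_D / V_ov    -> I_D/g_m = V_ov/2  (hundreds of mV)
n, UT = 1.5, 0.0259          # slope factor and thermal voltage (assumed typical)
V_ov = 0.3                   # overdrive voltage in strong inversion (assumed)

id_over_gm_weak = n * UT           # ~39 mV
id_over_gm_strong = V_ov / 2       # 150 mV

VDD, CP = 1.0, 10e-15              # assumed supply voltage and node capacitance
for name, ratio in [("weak inversion", id_over_gm_weak),
                    ("strong inversion", id_over_gm_strong)]:
    print(f"{name}: delay-power product ~ {VDD * CP * ratio:.2e} J")
```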
Subthreshold digital design is difficult because the high susceptibility to process variability in the subthreshold region causes timing errors [19]. For high-performance applications, low-threshold devices must be used, and leakage becomes a significant problem [20]. In a massively parallel system, the subthreshold leakage can consume a large portion of the total power without contributing to the computation throughput.
Errors in analog systems also behave more benignly than those in digital systems: an error in a digital system causes the complete loss of information (unless error correction is implemented), while errors in analog systems have much smaller magnitude, cause graceful degradation of performance and, if static, can be compensated by the feedback inherent to the learning algorithms. Moreover, leakage is no longer a problem: the subthreshold channel current in an analog circuit is used to carry information and perform operations, instead of being regarded as waste as in a digital computer.
1.3 Original Contributions

In this work, an analog signal processing system implementing DeSTIN, a state-of-the-art deep machine learning algorithm, is proposed. The original contributions of this work are summarized below:

• Characterized a floating gate device in a 0.13 µm standard digital CMOS process.
• Designed and tested a novel floating gate analog memory with random-accessible bidirectional sigmoid updates in 0.13 µm standard digital CMOS.
• Proposed a novel architecture and circuits to realize an analog online k-means clustering circuit with non-volatile storage, the first reported in the literature.
• Designed an analog deep machine learning engine to implement DeSTIN, the first reported in the literature, and proposed techniques to greatly increase its power and area efficiency.
1.4 Dissertation Organization
The remaining chapters of this dissertation cover the design of the components, circuits and architectures that implement the analog deep machine learning engine, in a bottom-up fashion.

Chapter 2 presents the implementation of the analog non-volatile memory, which is an essential component of the learning system.

Chapter 3 describes the design of an analog k-means clustering circuit, the key building block of the deep machine learning engine.

Chapter 4 presents the proposed analog deep machine learning engine, including its architecture and circuit designs, along with the techniques used to greatly improve its energy and area efficiency.

Chapter 5 develops the ultra-low-power tunable bump circuit, which can have wide application in analog signal processing systems.

Chapter 6 concludes the dissertation and proposes potential future work.
Chapter 2 A Floating-Gate Analog Memory with Random-Accessible Bidirectional Sigmoid Updates
Memory is an essential component of a computation system. Modern digital memory can afford very high read/write speed and density [21]. However, most digital memories are volatile: DRAM requires constant refreshing, and SRAM requires a minimum V_DD for state retention. This volatility precludes their use in intermittently powered devices such as those utilizing harvested energy.
Non-volatile digital memories such as flash memory [22] require special processing. FRAM (ferroelectric RAM) is reported to be embeddable using two additional mask steps in a conventional CMOS process [23] and has been proven to be commercially viable [24]. Recent research in this area has proposed other types of memory such as ReRAM (resistive RAM) [25] and MRAM (magnetoresistive RAM) [26]. However, these technologies are still new and not commercially available, and all require special processing.
Another major challenge in using digital memory in analog signal processing systems is that A/D/A conversion is needed to interface the memories to other circuits. This is especially problematic in distributed-memory architectures, where the A/D/A converters cannot be shared among the memory cells, leading to prohibitive area and power overhead.
In this work, I propose a floating-gate current-output analog memory which interfaces naturally with current-mode analog computation systems and allows random-accessible control of bidirectional updates, as described in [27]. The update scheme avoids the use of a charge pump, minimizes interconnection and pin count, and is compatible with standard digital processes. The update rule is sigmoid-shaped, which is a smooth, monotonic and bounded function. Implemented in a commercially available 0.13 µm single-poly digital CMOS process using thick-oxide I/O FETs, the memory cell achieves small area and low power consumption, and is suitable for integration into systems that exploit the high-density digital logic available in modern CMOS technology.
2.1 Overview of Floating Gate Device

2.1.1 Principles of Operation
A floating gate (FG) device uses the charge trapped on an isolated gate to store analog or digital values in a non-volatile way. The cross-section of a typical FG NFET in a bulk CMOS process is shown in Figure 2-1 [28]; note that a double-poly process is used to obtain the control gate. The earliest research on this device dates back to the 1960s [29], and modern EEPROM and Flash memories are both based on FG devices. Due to the excellent insulation provided by the thermally grown SiO2 surrounding the floating gate, the electrons trapped on the gate can have a retention time of more than 10 years [30]. The memory can be programmed by two mechanisms: Fowler–Nordheim tunneling and hot-electron injection.
Figure 2-1: Cross-section of a typical FG NFET in a bulk CMOS process [28]
2.1.2 Fowler–Nordheim Tunneling
Fowler–Nordheim (FN) tunneling is used to remove electrons from the floating gate. The potential difference applied across the poly-SiO2-Si structure reduces the effective thickness of the gate-oxide barrier, facilitating electron tunneling from the floating gate, through the SiO2 barrier, into the oxide conduction band. This is illustrated in Figure 2-2, which shows the energy band diagrams of the Si/SiO2 interface with and without an applied field [31]. At sufficiently high field, the width of the barrier becomes small enough for electrons to tunnel from the silicon conduction band into the oxide conduction band. This phenomenon was first described by Fowler and Nordheim for electrons tunneling through a vacuum barrier, and FN tunneling in SiO2 was reported in 1969 [32].

Figure 2-2: Energy band diagram of the Si/SiO2 interface (a) with and (b) without applied field [31]
2.1.3 Hot Electron Injection
Hot-electron injection is used to add electrons to the floating gate. In a PFET, the carrier holes are accelerated by the lateral field applied between its drain and source. Near the drain
terminal, the kinetic energy of the holes is large enough to liberate electron-hole pairs when they collide with the silicon lattice. Electrons scattered upward with energy larger than 3.2 eV are able to surmount the Si–SiO2 work-function barrier into the oxide conduction band. These electrons are then swept over to the floating gate by the oxide electric field. This process is illustrated in Figure 2-3.

Figure 2-3: Hot electron injection in a PFET
2.2 Literature Review on Floating Gate Analog Memory

Although FG devices are usually associated with digital memories such as EEPROM or Flash memory that store binary values, they are intrinsically analog devices, because the charge on the FG can be modified in a continuous way. A floating-gate analog memory uses the charge trapped on the isolated gate to store analog variables in a non-volatile way. It has been widely used in analog reconfigurable, adaptive and neuromorphic systems, such as electronic potentiometers [33], precision voltage references [34], offset-trimmed opamps [30], pattern classifiers [35], silicon learning networks [36], and adaptive filters [37].
Without direct electrical connections, the stored value of the memory is updated by depositing electrons onto the floating gate by hot-electron injection, or removing them by Fowler–Nordheim tunneling. Compared to injection, tunneling selectivity is harder to obtain because it
often involves controlling a high voltage (HV) on chip. Therefore, many previous works [35], [36] use tunneling as a global erase and injection to program each individual memory to its target value. However, in an online adaptive system such as this work, a bidirectional update is preferable because the stored values need to vary with the inputs. Previous works have proposed approaches to achieve selective tunneling. In [38], the selected memory is tunneled by pulling up the tunneling voltage and pulling down the control gate voltage simultaneously. This approach requires a number of tunneling control pins equal to the number of rows in the memory array, which is not desirable for large-scale systems. In [33], an HV switch is built with lightly-doped-drain nFETs. This device is not compatible with standard digital processes and consumes static power because it cannot be completely turned off. In [37], a charge pump is used to generate a local HV for the selected memory. A simple charge pump provides only limited voltage boost, while a more complex one consumes larger area and/or requires multi-phase clocks.

Another important performance metric of an analog memory is its update rule. The dynamics of the single-transistor FG memory [38] lead to an exponential and value-dependent update, which, in general, affects the stability of the adaptation [37]. A linear update can be obtained by fixing the FG node voltage during the update with a capacitive feedback loop around a differential [33] or single-ended amplifier [37].
2.3 Proposed Floating Gate Analog Memory

2.3.1 Floating-Gate Analog Memory Cell
2.3.1.1 Circuit Description
The schematic of the proposed FG analog memory cell is shown in Figure 2-4. The gates of MP1-MP3 and the top plate of Cf form the FG. The stored charge can be modified through the injection transistor MP2 and the tunneling transistor MP3. The two MUXes at the sources of MP1 and MP2 control the tunneling and injection of the FG, as discussed later. The transconductor gm converts the voltage V_out to the output current I_out, and V_ref determines the nominal voltage of V_out during operation.

Figure 2-4: Schematic of the proposed floating-gate analog memory cell

The negative feedback loop comprising the inverting amplifier MP1/MN1 and Cf keeps the FG voltage V_fg constant, ensuring a linear update of V_out. Tunneling or injection at the FG node changes the charge stored on Cf, and therefore changes the output of the amplifier by ΔV_out = ΔQ / C_f.

The transconductor is implemented with a differential pair MN2/MN3 and a cascode current mirror MP4-MP7, as depicted in Figure 2-5(a). Biased in the deep sub-threshold region, the
transconductor exhibits a transfer curve resembling a tanh function, plotted in Figure 2-5(b). The mild nonlinearity is smooth, monotonic and bounded. From Figure 2-5(b), a ΔV_in of 0.2 V is enough to cause a change of I_out from 0.1·I_B to 0.9·I_B; this reduced swing requirement further improves the update linearity and enables the selective tunneling.
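A behavioral sketch of this update mechanism is shown below. It is illustrative only: the linear ΔV_out = ΔQ/C_f step and a tanh-shaped transconductor are modeled with assumed parameter values, not the extracted ones.

```python
import math

# Behavioral model of the memory cell update (sketch with assumed values):
# each update pulse moves V_out linearly (dV = dQ/Cf), and the transconductor
# maps V_out to I_out through a tanh-like curve, giving a sigmoid update rule.
CF = 100e-15        # F, feedback capacitor (assumed)
DQ = 2e-15          # C, charge moved per update pulse (assumed)
IB = 10e-9          # A, transconductor tail current (output range 0-10 nA)
VLIN = 0.1          # V, transconductor linear-range scale (assumed)

def i_out(v_out: float, v_ref: float = 0.0) -> float:
    return 0.5 * IB * (1.0 + math.tanh((v_out - v_ref) / VLIN))

v = -0.3
for pulse in range(60):
    v += DQ / CF if pulse < 30 else -DQ / CF      # ramp up, then ramp down
    # i_out(v) traces the smooth, bounded sigmoid seen in Figure 2-11
print(f"final V_out = {v:.2f} V, I_out = {i_out(v)*1e9:.2f} nA")
```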
2.3.1.2 Floating Gate Charge Modification Modeling
The proposed analog memory uses Fowler–Nordheim tunneling to remove electrons from the FG and decrease the memory value. The tunneling current I_tun can be expressed as

I_tun = I_tun0 · exp(−V_f / V_ox)    (2.1)

where V_ox is the voltage across the tunneling transistor's gate oxide, and I_tun0 and V_f are process-dependent constants determined by measurements. Figure 2-6(a) shows the measured tunneling current I_tun versus the oxide voltage V_ox, together with the fitted model.
Hot-electron injection is employed to increase the stored value of the memory. The injection current I_inj depends on the source current and the drain-to-source voltage of MP2. A simplified empirical model derived from [39] approximates I_inj as a power law in the source current with an exponential dependence on the drain-to-source voltage,

I_inj = β · I_s^α · exp(V_sd / δ)    (2.2)

where I_s is the injection transistor's source current, V_sd is its drain-to-source voltage, and α, β and δ are fit constants. In our memory cell, I_s is set by the biasing current and the aspect ratios of MP1 and MP2. Figure 2-6(b) shows the measured I_inj versus V_sd, together with the fitted model.

The extracted models above can be used in future designs as well as to improve programming convergence, as will be described in Section 2.3.3.
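For illustration, the two update models can be coded as below. This is a sketch: the fit constants are placeholders chosen only to give currents of a plausible order, not the values extracted from this chip.

```python
import math

# Charge-modification models of Section 2.3.1.2 with placeholder fit constants
# (the real values are extracted from measurement and are not reproduced here).
I_TUN0 = 1e10      # A, tunneling prefactor (placeholder; FN fits often give large prefactors)
V_F    = 500.0     # V, tunneling slope constant (placeholder)
ALPHA, BETA, DELTA = 1.0, 1e-6, 0.25   # injection fit constants (placeholders)

def i_tun(v_ox: float) -> float:
    """Fowler-Nordheim tunneling current, Eq. (2.1)."""
    return I_TUN0 * math.exp(-V_F / v_ox)

def i_inj(i_s: float, v_sd: float) -> float:
    """Hot-electron injection current, Eq. (2.2)."""
    return BETA * (i_s ** ALPHA) * math.exp(v_sd / DELTA)

# The steep V_ox dependence is what makes the selective-tunneling scheme work:
# raising V_ox of only the selected cell changes its tunneling current by many decades.
ratio = i_tun(9.0) / i_tun(7.0)
print(f"2 V more across the oxide -> {ratio:.1e}x more tunneling current")
print(f"I_inj at 10 nA, 3 V: {i_inj(10e-9, 3.0):.2e} A")
```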
2.3.1.3 Selective and Value-Independent Update Scheme
The proposed tunneling scheme exploits the steep change of the tunneling current with respect to V_ox to achieve good isolation between selected and unselected memories. The operation of this scheme is described in Figure 2-7, which shows the memory cell with components irrelevant to the tunneling process omitted. To show how V_ox is changed, typical nodal voltages are annotated. The negative feedback keeps the FG voltage at

V_fg = Vdd − V_SG,P

where V_SG,P is the source-to-gate voltage of MP1. Therefore, reducing the supply voltage of the selected memory effectively reduces V_fg and increases V_ox. In our design, the power supply is switched from the 3 V Vdd to a 1 V Vddt, resulting in isolation of over 7 orders of magnitude according to (2.1). In practice, the isolation may be degraded by direct tunneling at lower V_ox, which is a weaker function of the applied field [40], and by parasitic coupling; isolation of 83.54 dB is observed in measurement. The condition that MN1 stays in saturation during tunneling can be satisfied by choosing a proper V_ref and using the proposed transconductor to reduce the V_out swing.
Injection selectivity is achieved by switching the source voltage of the injection transistor MP2. The source of MP2 in an unselected memory is connected to ground, while in the selected cell it is connected to Vdd, enabling injection. Therefore, the injection is also selective and value-independent.
Figure 2-7: Simplified schematics and typical nodal voltages of memory cells (a) not selected (b) selected for tunneling
2.3.2 Floating Gate Memory Array
Thirty-two of the proposed FG analog memory cells are connected to form a memory array. They are organized in two dimensions and can be randomly accessed (selected) for read and write operations by setting both the column and row inputs high. The block diagram is shown in Figure 2-8, with the cell symbolized. The cells are augmented by digital logic controlling their operation modes; the list of digital control combinations and their corresponding operation modes is also given in Figure 2-8.

Once selected, a transmission gate connects the output of that cell off-chip through the Iout_bus for read-out during programming. The Inj/Tun signal sets the direction of the memory write, and the magnitude of the write is controlled by the pulse width of the Update signal. When a cell is not selected, it maintains its value and can be read or written by on-chip circuits to implement
adaptive algorithms. The proposed architecture is scalable because all signals and interconnections are shared among the cells, and the pin count does not increase with the size of the array.
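The per-cell mode decoding can be sketched as follows. This is a hypothetical rendering of the control table of Figure 2-8; the exact signal encoding on the chip may differ.

```python
# Hypothetical sketch of the per-cell operating-mode decode (Figure 2-8 lists the
# actual control-signal table; the encoding below is illustrative only).
def cell_mode(row_sel: bool, col_sel: bool, update: bool, inj_not_tun: bool) -> str:
    selected = row_sel and col_sel          # random access: both lines high
    if not selected:
        return "hold / on-chip adaptation"  # keeps value, usable by on-chip circuits
    if not update:
        return "read (output on Iout_bus)"
    return "inject (increase value)" if inj_not_tun else "tunnel (decrease value)"

print(cell_mode(True, True, True, False))   # selected cell, tunneling pulse
print(cell_mode(True, False, True, True))   # unselected: holds its value
```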
2.3.3 Measurement Results

The proposed FG memory array has been fabricated in a 0.13 µm single-poly standard digital CMOS process using thick-oxide I/O FETs. The die micrograph is shown in Figure 2-9. Because of the extensive metal fill in this process, details of the circuits cannot be seen, so the Virtuoso layout view is also presented.
The area of a single memory cell is 35 × 14 µm². It operates from a 3 V power supply and consumes 15 nA, with an output range of 0-10 nA. The biasing current is tunable and allows the designer to balance range, speed and power consumption.

Figure 2-9: (a) Chip micrograph of the memory array together with on-chip adaptation circuitry and (b) layout view of a single memory cell

Figure 2-10: Analog memory programming accuracy of 30 linearly spaced values
The test setup is built around a National Instruments data acquisition (DAQ) card and a host PC. The programming procedure is controlled by a LabVIEW program on the host PC and is based on the models in Section 2.3.1.2 to achieve fast convergence; the average number of iterations required to achieve a 0.5% error is 5-6. Figure 2-10 demonstrates 30 memory cells programmed to values between 1 and 9 nA. The standard deviation of the programming error is 76 pA, limited by the external circuits and equipment, indicating a 7-bit programming resolution. The memory output noise is 20.5 pArms over a 10 kHz bandwidth from simulation, indicating a 53.8 dB dynamic range.
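The model-based programming loop can be sketched as below. This is hypothetical: the LabVIEW routine is not reproduced here, and the update gain and helper names are placeholders rather than the extracted model constants.

```python
# Hypothetical sketch of the model-based programming loop used to converge on a
# target current in a few iterations (the actual LabVIEW routine is not shown).
def program_cell(read_cell, apply_pulse, target_na: float,
                 gain_na_per_ms: float = 0.5, tol: float = 0.005, max_iter: int = 10):
    """Iteratively pulse the selected cell until it is within tol of target_na.

    read_cell():           measure the cell output in nA (DAQ readback)
    apply_pulse(dir, ms):  one update pulse; dir is 'inj' or 'tun', ms its width
    gain_na_per_ms:        small-signal update gain predicted by the fitted models
    """
    for it in range(max_iter):
        error = target_na - read_cell()
        if abs(error) <= tol * target_na:
            return it                                        # converged
        direction = "inj" if error > 0 else "tun"
        width_ms = min(abs(error) / gain_na_per_ms, 10.0)    # model-predicted pulse width
        apply_pulse(direction, width_ms)
    return max_iter

# Example with an idealized fake cell standing in for the hardware (real cells
# take the typical 5-6 iterations because the update gain is only approximate):
cell = {"i_na": 2.0}
read = lambda: cell["i_na"]
pulse = lambda d, ms: cell.update(i_na=cell["i_na"] + (ms if d == "inj" else -ms) * 0.5)
print("iterations:", program_cell(read, pulse, target_na=5.0))
```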
To show the update rule, a memory cell is first ramped up and then ramped down with a fixed pulse width of 1 ms. The corresponding V_out and I_out are plotted in Figure 2-11. Both injection and tunneling are linear in V_out, and the current output follows a smooth sigmoid update rule. During the same test, the stored values of the other 31 unselected cells are monitored to measure the writing crosstalk. The crosstalk from the injection and tunneling of the selected cell to the unselected ones is plotted in Figure 2-12. There is no observable injection crosstalk.

Figure 2-11: Ramping of the memory value, showing the update rules

Tunneling crosstalk is