University of Tennessee, Knoxville
TRACE: Tennessee Research and Creative Exchange
Junjie Lu, University of Tennessee - Knoxville, jlu9@vols.utk.edu
Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss
Part of the Electrical and Electronics Commons, and the VLSI and Circuits, Embedded and Hardware Systems Commons
Recommended Citation
Lu, Junjie, "An Analog VLSI Deep Machine Learning Implementation." PhD diss., University of Tennessee, 2014.
https://trace.tennessee.edu/utk_graddiss/2709
This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact trace@utk.edu.
To the Graduate Council:
I am submitting herewith a dissertation written by Junjie Lu entitled "An Analog VLSI Deep Machine Learning Implementation." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy, with a major in Electrical Engineering.
Jeremy Holleman, Major Professor
We have read this dissertation and recommend its acceptance:
Benjamin J. Blalock, Itamar Arel, Xiaopeng Zhao
Accepted for the Council: Carolyn R. Hodges, Vice Provost and Dean of the Graduate School (Original signatures are on file with official student records.)
An Analog VLSI Deep Machine Learning Implementation
A Dissertation Presented for the Doctor of Philosophy Degree
The University of Tennessee, Knoxville
Junjie Lu
May 2014
Acknowledgement
I would like to express my sincere gratitude to my advisor, Dr. Jeremy Holleman, for his support, guidance and encouragement. His profound knowledge and rigorous attitude toward research inspire me to grow and will benefit me in my future professional and personal life.
I am also deeply grateful to Dr. Benjamin J. Blalock, Dr. Itamar Arel and Dr. Xiaopeng Zhao for serving as my Ph.D. committee members. Their valuable suggestions helped me improve my research and dissertation.
I would like to thank Dr. Itamar Arel and Mr. Steven Young for their great help and support in the analog machine learning project. Their expertise in machine learning was essential to this project, from architecture definition to testing and data processing.
I would like to thank my colleagues in the ISiS lab at the University of Tennessee, Mr. Tan Yang and Mr. M. Shahriar Jahan, for their help and friendship.
Last but most important, I offer my deepest gratitude and love to my parents, Minghua Lu and Huijun Wang, and my wife, Yang Xue, for their unconditional love, support and confidence in me.
Abstract
Machine learning systems provide automated data processing and see a wide range of applications. Direct processing of raw high-dimensional data such as images and videos by machine learning systems is impractical, both due to prohibitive power consumption and due to the "curse of dimensionality," which makes learning tasks exponentially more difficult as the dimension increases. Deep machine learning (DML) mimics the hierarchical presentation of information in the human brain to achieve robust automated feature extraction, reducing the dimension of such data. However, the computational complexity of DML systems limits large-scale implementations in standard digital computers. Custom analog signal processing (ASP) can yield much higher energy efficiency than digital signal processing (DSP), presenting a means of overcoming these limitations.

The purpose of this work is to develop an analog implementation of a DML system.

First, an analog memory is proposed as an essential component of the learning system. It uses the charge trapped on a floating gate to store analog values in a non-volatile way. The memory is compatible with standard digital CMOS processes and allows random-accessible bidirectional updates without the need for an on-chip charge pump or high-voltage switch.

Second, architecture and circuits are developed to realize an online k-means clustering algorithm in analog signal processing. It achieves automatic recognition of underlying data patterns and online extraction of data statistical parameters. This unsupervised learning system constitutes the computation node in the deep machine learning hierarchy.

Third, a 3-layer, 7-node analog deep machine learning engine is designed featuring online unsupervised trainability and non-volatile floating-gate analog storage. It utilizes a massively parallel reconfigurable current-mode analog architecture to realize efficient computation, and algorithm-level feedback is leveraged to provide robustness to circuit imperfections in analog signal processing. At a processing speed of 8300 input vectors per second, it achieves a peak energy efficiency of 1×10^12 operations per second per watt.

In addition, an ultra-low-power tunable bump circuit is presented to provide similarity measures in analog signal processing. It incorporates a novel wide-input-range tunable pseudo-differential transconductor. The circuit demonstrates tunability of bump center, width and height with a power consumption significantly lower than previous works.

Keywords: analog signal processing, deep machine learning, floating gate memory, current mode computation, k-means clustering, power efficiency
Table of Contents
Chapter 1 Introduction
1.1 Introduction to Machine Learning
1.1.1 Machine Learning: Concepts and Applications
1.1.2 Three Types of Machine Learning
1.1.3 DeSTIN - A Deep Learning Architecture
1.2 Analog Deep Machine Learning Engine - the Motivation
1.2.1 Analog versus Digital - the Neuromorphic Arguments
1.2.2 Analog Advantages
1.2.3 Inaccuracies in Analog Computation
1.2.4 Analog versus Digital - Parallel Computation
1.3 Original Contributions
1.4 Dissertation Organization
Chapter 2 A Floating-Gate Analog Memory with Random-Accessible Bidirectional Sigmoid Updates
2.1 Overview of Floating Gate Device
2.1.1 Principles of Operation
2.1.2 Fowler-Nordheim Tunneling
2.1.3 Hot Electron Injection
2.2 Literature Review on Floating Gate Analog Memory
2.3 Proposed Floating Gate Analog Memory
2.3.1 Floating-Gate Analog Memory Cell
2.3.2 Floating Gate Memory Array
2.3.3 Measurement Results
Chapter 3 An Analog Online Clustering Circuit in 0.13 µm CMOS
3.1 Introduction and Literature Review of Clustering Circuit
3.2 Architecture and Algorithm
3.3 Circuit Implementation
3.3.1 Floating-Gate Analog Memory
3.3.2 Distance Computation (D3) Block
3.3.3 Time-Domain Loser-Take-All (TD-LTA) Circuit
3.3.4 Memory Adaptation (MA) Circuit
3.4 Measurement Results
Chapter 4 Analog Deep Machine Learning Engine
4.1 Introduction and Literature Review
4.2 Architecture and Algorithm
4.3 Circuit Implementation
4.3.1 Floating-Gate Analog Memory (FGM)
4.3.2 Reconfigurable Analog Computation (RAC)
4.3.3 Distance Processing Unit (DPU)
4.3.4 Training Control (TC)
4.3.5 Biasing and Layout Design
4.4 Measurement Results
4.4.1 Input Referred Noise
4.4.2 Clustering Test
4.4.3 Feature Extraction Test
4.4.4 Performance Summary and Comparison
Chapter 5 A Nano-Power Tunable Bump Circuit
5.1 Introduction and Literature Review
5.2 Circuit Design
5.3 Measurement Results
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References
Vita
List of Tables
Table I: Performance Summary of the Floating Gate Memory
Table II: Performance Summary of the Clustering Circuit
Table III: Performance Summary and Comparison of the Improved FG Memory
Table IV: Performance Summary of the Analog Deep Learning Engine
Table V: Comparison to Previous Works
Table VI: Performance Summary and Comparison of the Bump Circuit
List of Figures
Figure 1-1: The DeSTIN hierarchical architecture [6]
Figure 1-2: Bump circuit, which computes tanh(V1−V2) and its derivative simultaneously [15]
Figure 2-1: Cross-section of a typical FG NFET in a bulk CMOS process [28]
Figure 2-2: Energy band diagram of the Si/SiO2 interface (a) with and (b) without applied field [31]
Figure 2-3: Hot electron injection in a PFET
Figure 2-4: Schematic of the proposed floating-gate analog memory cell
Figure 2-5: (a) Schematic of the transconductor and (b) its transfer function
Figure 2-6: (a) Tunneling current versus oxide voltage Vox. (b) Injection current versus drain-to-source voltage of the injection transistor
Figure 2-7: Simplified schematics and typical nodal voltages of memory cells (a) not selected and (b) selected for tunneling
Figure 2-8: Block diagram of the FG analog memory array, and a table showing control signal settings for different operation modes of the cells
Figure 2-9: (a) Chip micrograph of the memory array together with on-chip adaptation circuitry and (b) layout view of a single memory cell
Figure 2-10: Analog memory programming accuracy of 30 linearly spaced values
Figure 2-11: Ramping of the memory value, showing the update rules
Figure 2-12: Crosstalk among the 31 unselected cells when a selected cell is injected or tunneled with a magnitude of 10 nA
Figure 3-1: The architecture of the proposed analog online clustering circuit, with the details of the memory and distance computation cell
Figure 3-2: The schematic of the D3 block
Figure 3-3: The simplified schematic of (a) the LTA network, (b) one cell of the LTA, and (c) typical timing diagrams
Figure 3-4: (a) The simplified schematic and (b) timing diagram of the MA circuit
Figure 3-5: Classification test results
Figure 3-6: Clustering test results
Figure 4-1: The architecture of the analog deep machine learning engine and possible application scenarios
Figure 4-2: (a) The node architecture. The clustering algorithm implemented by the node is illustrated in (b)-(e)
Figure 4-3: Timing diagram of the intra-cycle power gating
Figure 4-4: The schematic of the improved floating-gate analog memory
Figure 4-5: The layout of the new FGM
Figure 4-6: The schematic of the reconfigurable analog computation cell and the switch positions for three operation modes
Figure 4-7: The measured transfer functions with the RAC configured in belief construction mode
Figure 4-8: (a) Behavioral model of the RAC with gain errors. (b) System's classification error rate as a function of each error
Figure 4-9: The schematic of one channel of the distance processing unit
Figure 4-10: Timing diagram of data sampling across the hierarchy to enable pipelined operation
Figure 4-11: (a) The schematic of the sample and hold and (b) simulated charge injection and droop errors
Figure 4-12: The schematic and timing diagram of the starvation trace circuit
Figure 4-13: Biasing schemes. (a) Voltage distribution. (b) Current distribution. (c) Proposed hybrid biasing. (d) Measured mismatch of biasing
Figure 4-14: Conceptual diagram showing how the RAC array is assembled from the RAC cells
Figure 4-15: (a) Chip micrograph and (b) custom test board
Figure 4-16: (a) The system model for noise measurement. (b) Measured classification results and extracted Gaussian distribution
Figure 4-17: The clustering test results
Figure 4-18: The extracted parameters plotted versus their true values
Figure 4-19: Clustering results with a bad initial condition, without and with the starvation trace enabled
Figure 4-20: The feature extraction test setup
Figure 4-21: (a) The convergence of centroids during training. (b) Rich features output from the top layer
Figure 4-22: Measured classification accuracy using the features extracted by the chip
Figure 4-23: The performance and energy breakdown in the training mode
Figure 5-1: Schematic of the proposed tunable bump circuit
Figure 5-2: Bump circuit micrograph, layout, and the test setup
Figure 5-3: (a) Transconductor output, (b) normalized gm (IW=0)
Figure 5-4: The measured bump transfer functions showing (a) variable center, (b) variable width, and (c) variable height
Figure 5-5: The measured 2-D bump output with different widths on x and y dimensions
Chapter 1 Introduction
This chapter introduces the background and motivation of this work. It first discusses some basic ideas of machine learning and deep machine learning systems. Then the advantages of analog signal processing are analyzed, justifying the purpose of the analog deep machine learning implementation. The structure and organization of the dissertation is given in the last part.
1.1 Introduction to Machine Learning

1.1.1 Machine Learning: Concepts and Applications

Learning covers a broad range of activities and processes and is therefore difficult to define precisely. In general, it involves acquiring new, or modifying and reinforcing existing, knowledge, behaviors, skills, values, or preferences, and may involve synthesizing different types of information [1]. Learning was first studied by psychologists and zoologists as a subject concerning humans and animals, and it is arguable that many techniques in machine learning are derived from the learning processes of humans or animals.
Machine learning is generally concerned with a machine that automatically changes its structure, program, or data based on its inputs or in response to external information in order to improve its performance. The "changes" might be either enhancements to already performing systems or the synthesis of new functions or systems.
In the past, machines were programmed to perform a fixed task from the outset. The reasons behind the need for a "learning machine" are manifold.
First, the environments in which the machines are used are often hard to define at the time of programming and change over time. Machine learning methods can be used for on-the-job improvement of existing designs and adaptation to a changing environment, thereby reducing the need for constant redesign.
Second, machine learning provides a means of automated data analysis, which is especially important in the face of the deluge of data in our era. It is possible that hidden among large piles of data are important relationships and correlations. For example, Wal-Mart handles more than 1M transactions per hour and has databases containing more than 2.5 petabytes (2.5×10^15 bytes) of information [2]. From these data, a machine learning algorithm can extract the purchase patterns of people from different demographic profiles and make customized buying recommendations to them [3]. Machine learning methods used to extract these relationships are called "data mining."
Another reason is that new knowledge about tasks is constantly being discovered. There is a constant stream of new events in the world, and it is impractical to continuously redesign systems to accommodate new knowledge. However, machine learning methods are able to keep track of these new trends. Since the methods are data-driven, learning-based algorithms are often more accurate than stationary algorithms when facing the ever-changing world.
Apart from the above-mentioned reasons, there are many more reasons why machine learning has become a heated area of research in recent years. Moreover, it is not merely a research topic, but has penetrated into people's lives and become a powerful and indispensable tool in a wide variety of applications:
• To predict whether a patient will respond to a particular drug/therapy based on microarray profiles in bioinformatics
• To categorize text to filter out spam emails
• To detect fraud in banking and credit card institutions
• Optical character recognition
• Machine vision, face detection and recognition
• Natural language processing in automatic translation
• Market segmentation
• Robot control
• Classification of stars and galaxies
• Weather or stock market price forecast
• Electric power load prediction
1.1.2 Three Types of Machine Learning

Machine learning is usually divided into three main types [2].
In the supervised learning approach, an output or label is given to each input in the training data set, and the machine learning system learns the mapping from inputs to outputs. The simplest training input can be a multi-dimensional vector of numbers, representing the features of the data being learned. In general, however, the input can have a complex structure, representing an image, a sentence, a time sequence, etc. The output of the system can be either a categorical label or a real-valued variable. When the output is a label, the problem is referred to as classification or pattern recognition, and when the output is a real-valued scalar, the problem is called regression.
The second type of machine learning is unsupervised learning, where only the inputs are given and the learning system finds underlying patterns in the data. The problem of unsupervised learning is less well-defined compared to supervised learning, because the system is free to look
for any patterns, and there is no obvious error metric to correct the current perception. However, it is arguably more typical of human and animal learning, because we acquire most of our knowledge without being told what the right answers are. Unsupervised learning is also more widely applicable because it does not require a human expert to manually label the data.
There is a third type of machine learning, known as reinforcement learning, in which the machine learns how to act or behave based on occasional external signals. In some applications, the output of the system is a sequence of actions. The machine is given occasional reward or punishment signals based on the goodness of its actions, and the goal is to learn how to act or behave so as to maximize the reward and minimize the punishment. Reinforcement learning is employed in applications where a single move is not as important as the rule or policy of the behavior, for example, game playing or robot navigation.
1.1.3 DeSTIN - A Deep Learning Architecture

1.1.3.1 The Curse of Dimensionality
A machine learning system usually processes observations in a multi-dimensional space. When the dimension of the observations is large, such as that of an image or video, a phenomenon called the "curse of dimensionality" [4] arises. This phenomenon stems from the fact that as the dimensionality increases, the volume of the space increases exponentially and, as a result, the available data become sparse. This sparsity reduces the predictive power of machine learning systems. In order to obtain a statistically sound and reliable result, the amount of data and computational power needed to support the result often grows exponentially with the dimensionality.
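This exponential growth is easy to illustrate numerically (a minimal sketch, not taken from the dissertation): keeping a fixed sampling density of k points per axis requires k^d samples in d dimensions.

```python
# Minimal sketch (not from the dissertation): the number of samples needed to keep
# a fixed resolution of k points per axis grows as k**d with the dimensionality d.
def samples_for_fixed_density(k: int, d: int) -> int:
    """Number of grid cells when each of d axes is split into k bins."""
    return k ** d

for d in (1, 2, 3, 10, 100):
    print(f"d = {d:3d}: {samples_for_fixed_density(10, d):.3e} cells")
# d = 100 already requires 1e100 cells, far more data than could ever be collected.
```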
1.1.3.2 Deep Machine Learning
When dealing with high-dimensional data such as images or videos, it is often necessary to pre-process the data to reduce its dimensionality to what can be efficiently processed, while still preserving the "essence" of the data. Such dimensionality reduction schemes are often referred to as feature extraction techniques.
The most effective feature extraction engine we know of might be our brain. The human brain can process information with an efficiency and robustness that no machine can match. It is exposed to a sea of sensory data every second and is able to capture the critical aspects of these data in a way that allows for future use in a concise manner. Therefore, mimicking the performance of the human brain has been a core goal and challenge in machine learning research. Recent neuroscience findings have provided insight into information representation in the human brain. One of the key findings has been that sensory signals propagate through a complex hierarchy of modules that, over time, learn to represent observations based on the regularities they exhibit. This discovery motivated the emergence of the subfield of deep machine learning, which focuses on computational models for information representation that exhibit characteristics similar to those of the neocortex [5].
1.1.3.3 Deep Spatiotemporal Inference Network (DeSTIN)
Figure 1-1: The DeSTIN hierarchical architecture [6]

The deep learning architecture adopted in this work is based on the Deep Spatiotemporal Inference Network (DeSTIN) architecture, first introduced in [6]. DeSTIN consists of multiple instantiations of an identical functional unit called the cortical circuit (node); each node is a parameterized model which learns by means of an unsupervised learning process. These nodes are arranged in layers, and each node is assigned children nodes from the layer below and a parent node from the layer above, as shown in Figure 1-1. Nodes at the lowest layer receive raw sensory data, while nodes at all other layers receive the belief states, or outputs, of their children nodes as input. Each node attempts to capture the salient spatiotemporal regularities contained in its input and continuously updates a belief state meant to characterize the input and the sequences thereof. The beliefs formed throughout the architecture can then be used as rich
features for a classifier that can be trained using supervised learning. Beliefs extracted from the lower layers will characterize local features, and beliefs from higher layers will characterize global features. Thus, DeSTIN can be viewed as an unsupervised feature extraction engine that forms features from data based on the regularities it observes. In this framework, a common cortical circuit populates the entire hierarchy, and each of these nodes operates independently and in parallel with all other nodes. This solution is not constrained to a layer-by-layer training procedure, making it highly attractive for implementation on parallel processing platforms. Its simplicity and repetitive structure facilitate parallel processing and straightforward training [5].
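To make the structure concrete, the following sketch is illustrative only: the node update shown is a plain online k-means step, not the exact DeSTIN belief update of [6], and all sizes and learning rates are arbitrary. It shows a small hierarchy in which every node clusters the concatenated outputs of its children.

```python
import numpy as np

# Illustrative sketch of a DeSTIN-like hierarchy (not the exact algorithm from [6]):
# every node runs the same unsupervised learner on the concatenated outputs
# (beliefs) of its children; the lowest layer sees raw input patches.
class Node:
    def __init__(self, input_dim, n_centroids=4, lr=0.05, rng=None):
        rng = rng or np.random.default_rng(0)
        self.centroids = rng.normal(size=(n_centroids, input_dim))
        self.lr = lr

    def step(self, x):
        """One online k-means update; returns a soft 'belief' over centroids."""
        d = np.linalg.norm(self.centroids - x, axis=1)
        w = int(np.argmin(d))                      # winning centroid
        self.centroids[w] += self.lr * (x - self.centroids[w])
        belief = np.exp(-d) / np.exp(-d).sum()     # similarity-based belief
        return belief

# A 2-layer, 5-node toy hierarchy: 4 leaf nodes feed 1 parent node.
leaves = [Node(input_dim=4) for _ in range(4)]
parent = Node(input_dim=4 * 4)

patches = np.random.default_rng(1).normal(size=(100, 4, 4))  # 100 frames, 4 patches each
for frame in patches:
    child_beliefs = np.concatenate([n.step(p) for n, p in zip(leaves, frame)])
    top_belief = parent.step(child_beliefs)        # rich feature for a classifier
```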
1.2 Analog Deep Machine Learning Engine - the Motivation

Deep layered architectures offer excellent performance attributes. However, the computation requirements involved grow dramatically as the dimensionality of the input space increases. Compositional deep layered architectures compose multiple instantiations of a common cell, and the computation is performed concurrently. In CPU-based platforms, however, processing is performed sequentially, thereby greatly increasing execution time [7]. Therefore, many recent research efforts focus on implementing DML systems on GPUs. While GPUs have advantages over CPU-based realizations in computation time and cost/performance ratio, they are power hungry, making such schemes impractical in energy-constrained environments and limiting the scale of these systems.
Custom analog circuitry presents a means of overcoming the limitations of digital VLSI technology. By fully leveraging the computational power of transistors, exploiting the inherent tolerance to inaccuracies of the learning algorithm, and performing computation in a slow but massively parallel fashion, the proposed analog deep machine learning engine promises to greatly improve on the power efficiency of digital DML systems and to take full advantage of the scaling potential of DeSTIN.
1.2.1 Analog versus Digital - the Neuromorphic Arguments

It is meaningful to compare the energy efficiency of the brain and the digital computer. The human brain is estimated to perform roughly 10^15 synapse operations at about 10 impulses/sec, and its total energy consumption is about 25 watts [8]. This yields an energy efficiency of about 10^15 operations per joule. Today's supercomputers can perform 8.2 billion megaflops with a power consumption of 9.9 million watts, enough to power 10,000 houses [9]. Their energy efficiency is thus about 8.3×10^8 operations per joule, more than 6 orders of magnitude lower than that of the human brain.
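These back-of-envelope numbers can be checked directly (a quick sketch; the brain-side figures are the rough estimates quoted from [8], not measurements).

```python
# Rough energy-efficiency comparison quoted in the text (order-of-magnitude only).
brain_ops_per_s = 1e15 * 10          # ~1e15 synapses firing ~10 times per second
brain_power_w = 25.0                 # ~25 W total
brain_ops_per_j = brain_ops_per_s / brain_power_w            # ~4e14 op/J, i.e. ~1e15

super_ops_per_s = 8.2e9 * 1e6        # 8.2 billion megaflops = 8.2e15 FLOPS
super_power_w = 9.9e6                # 9.9 MW
super_ops_per_j = super_ops_per_s / super_power_w            # ~8.3e8 op/J

print(f"brain: {brain_ops_per_j:.1e} op/J, supercomputer: {super_ops_per_j:.1e} op/J")
print(f"ratio: {brain_ops_per_j / super_ops_per_j:.1e}")     # ~5e5, i.e. ~6 orders of magnitude
```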
The great discrepancy between the energy efficiencies of neurobiology and electronics suggests that there are fundamental differences in the ways they perform computation. One significant difference is the state variables they use. Digital computers employ only two state variables, ignoring all the values in between, to achieve noise immunity at the expense of dynamic range. Neurons, on the other hand, represent and process information in the analog domain: the firing rates are continuous variables, and each neuron resembles a lossy integrator with the leakage controlled by a fluctuating number of ion channels. Analog signaling allows a single wire to carry multi-bit information, thereby greatly increasing power and area efficiency. It also interfaces naturally with the analog computation primitives discussed below.

The other important trait leading to the enormous efficiency of neurological systems is their clever exploitation of the physics they are built with. The nervous system does basic aggregation of information using the conservation of charge: Kirchhoff's current law implements current summing, and this current is integrated with respect to time by the node capacitance. In the
neural tissue, ions are in thermal equilibrium and their energies are Boltzmann distributed. If an energy barrier exists and is modulated by an applied voltage, the current through the barrier will be an exponential function of that applied voltage [10]. This principle is used to create active devices and compute complex nonlinear functions in neural computation. The principle of operation of the transistors in an integrated circuit can be surprisingly similar to that of the nervous system: in weak inversion, the energy barrier for a carrier to travel from source to drain is modulated by the gate voltage; therefore the drain current is exponentially dependent on the gate voltage. However, digital computers completely disregard these computation primitives inherent in the device physics and use only two extremes of the operating points, the on and off states, thereby representing information with 0 or 1 only. This also confines us to a set of very limited elementary operations: NOT, NOR, OR or their equivalents. This is in contrast with how the neuron does computation and can cause a factor of 10^4 efficiency penalty [10]. Analog circuits provide a means to reclaim this efficiency loss: by exploiting the computational primitives inherent in the device physics, as our brains do, operations can be carried out naturally with much higher efficiency.
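As a concrete illustration of this exponential law, the sketch below uses the standard first-order subthreshold MOS expression; the parameter values are typical textbook numbers and are assumptions, not values from this work.

```python
import math

# Standard weak-inversion (subthreshold) drain-current model in saturation:
#   I_D = I_0 * exp(V_GS / (n * U_T))
# Parameter values below are typical textbook numbers, not from this design.
I0 = 1e-15      # A, process-dependent prefactor (assumed)
n = 1.5         # subthreshold slope factor (assumed)
UT = 0.0259     # V, thermal voltage at room temperature

def id_weak_inversion(vgs: float) -> float:
    return I0 * math.exp(vgs / (n * UT))

# Every n*UT*ln(10) ~= 90 mV increase in V_GS multiplies I_D by 10x.
for vgs in (0.20, 0.30, 0.40):
    print(f"V_GS = {vgs:.2f} V -> I_D = {id_weak_inversion(vgs):.2e} A")
```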
The scaling of CMOS technology reduces the power of digital systems. However, this scaling trend is slowing down and approaching its end due to physical limitations such as the thickness of the gate oxide [11]. In addition, the power of digital systems does not scale as fast as the feature size, because threshold and supply voltage scaling saturate in order to keep down subthreshold leakage [12]. On the other hand, analog systems can also benefit from technology scaling. The improved subthreshold slope of FinFETs improves the transconductance efficiency in weak inversion, and thus the computation efficiency [13], while the reduced wiring parasitic capacitance improves computation throughput.
1.2.2 Analog Advantages
Analog signal processing makes use of the physics of the devices: the physical relations of transistors, capacitors and resistors, Kirchhoff's current and voltage laws, and so on. It also represents information with multi-bit encoding. Therefore it can be far more efficient than digital signal processing. For example, the addition of two numbers takes only one wire in an analog circuit by using Kirchhoff's current law, whereas it takes about 240 transistors in static CMOS digital circuits to implement an 8-bit adder. Similarly, an 8-bit multiplication in the analog domain using current-mode operation takes 4 to 8 transistors, whereas a parallel 8-bit digital multiplier has approximately 3000 transistors [14]. Another example is the bump circuit, shown in Figure 1-2, which computes the derivative of tanh(·). The bump circuit simultaneously provides a measure of similarity between two inputs and the tanh(·) of their difference. The bump function can also be used as a probability distribution, as it peaks at zero difference and saturates to zero for large differences. The bump circuit illustrates the power advantage that analog computation holds over digital methods. A bump circuit biased at 200 fA can evaluate the similarity to a stored value at about 200 observations per second, according to simulations using 0.24 µm transistors. A single inverter consumes about four times that much when switching at 200 Hz, and about half that much statically, without switching at all. To perform a comparable computation digitally would require dozens more transistors and one to two orders of magnitude more current [15].

Figure 1-2: Bump circuit, which computes tanh(V1−V2) and its derivative simultaneously [15]
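A behavioral sketch of the similarity measure provided by the bump of Figure 1-2 is given below. It is illustrative only: the sech^2 form is the ideal derivative of tanh, the real circuit only approximates it, and the scaling constant is an assumption.

```python
import math

# Behavioral model of a bump-type similarity measure (sketch, not the circuit):
# the bump output is the derivative of tanh of the input difference, so it
# peaks when V1 == V2 and falls toward zero for large differences.
def tanh_and_bump(v1: float, v2: float, vl: float = 0.1):
    """Return (tanh, bump) for inputs v1, v2; vl sets the width (assumed 100 mV)."""
    x = (v1 - v2) / vl
    t = math.tanh(x)
    bump = 1.0 - t * t          # d/dx tanh(x) = sech^2(x) = 1 - tanh^2(x)
    return t, bump

for dv in (0.0, 0.05, 0.1, 0.3):
    t, b = tanh_and_bump(0.5 + dv, 0.5)
    print(f"dV = {dv*1000:4.0f} mV -> tanh = {t:+.2f}, similarity = {b:.2f}")
```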
1.2.3 Inaccuracies in Analog Computation
However, analog computation does have some disadvantages when compared to digital, and they are caused by the very same properties that make it more efficient. Analog systems are much more sensitive to noise and offset than digital systems. While digital systems use restoring logic at every computational step to obtain good noise immunity, the use of continuous signal variables prevents analog systems from having any restoring mechanism. Thus, noise accumulation in analog systems becomes severe as the system scales up. It is found in [14] that the cost of precision increases faster for analog systems than for digital ones: the power consumption is a polynomial function of the required signal-to-noise ratio (SNR) in an analog system, whereas in a digital system it is a logarithmic function. Therefore, analog computation is cheaper at low accuracy but more expensive at high accuracy.
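The crossover can be sketched with simple cost models. This is an illustration only: the analog power is taken as the kT-limited, SNR-proportional form and the constants are arbitrary; it is not the analysis of [14].

```python
import math

# Toy cost models for the precision argument (constants are arbitrary):
# analog power grows polynomially with SNR (here linearly, the kT/C limit),
# digital power grows with the word length, i.e. logarithmically with SNR.
def p_analog(snr: float, k: float = 1e-15) -> float:
    return k * snr                      # ~ proportional to SNR (polynomial)

def p_digital(snr: float, e_bit: float = 1e-12) -> float:
    bits = math.log2(snr) / 2           # SNR ~ 2**(2*bits) for an ideal quantizer
    return e_bit * bits                 # ~ proportional to word length (logarithmic)

for bits in (4, 8, 12, 16):
    snr = 2 ** (2 * bits)
    print(f"{bits:2d} bits: analog ~ {p_analog(snr):.1e} W, digital ~ {p_digital(snr):.1e} W")
# With these arbitrary constants the crossover falls around 6-7 bits;
# only the qualitative trend (cheap at low accuracy, expensive at high) matters.
```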
However, in certain cases the feedback inherent to the learning algorithms naturally compensates for inaccuracies introduced by the analog circuits. Similarly, this lack of accuracy in analog signal processing can also be found in neural computers. The brain is known to be built from noisy, inaccurate neurons. For example, many behavioral responses, such as a fly making a course correction after a disturbance, occur over a period of around 30 ms [16]. Neural signals
integrated over comparable time windows typically exhibit a signal-to-noise ratio (SNR) in the range of 1-10 [16], [17], [18], much lower than what can be easily achieved in moderate-precision analog electronics. Comparisons between the noise and power tradeoffs in analog and digital circuits and biological systems have also been explored in [14]. The low SNR and outstanding power efficiency of neural systems suggest that relaxed accuracy requirements for electronic computational primitives could allow aggressive optimization for area and power consumption.
1.2.4 Analog versus Digital - Parallel Computation

The power efficiency of a computational system can be expressed by its delay-power product. The delay-power product of a single stage can be approximated as

P · t_d ≈ V_DD · C_P · (I_D / g_m)    (1.1)

where V_DD is the supply voltage, C_P is the equivalent parasitic capacitance associated with the internal nodes, I_D is the current consumption, and g_m is the equivalent transconductance of the transistors. V_DD and C_P can be scaled down with the technology, and I_D/g_m is minimized when the transistors are biased in weak inversion. Therefore (1.1) indicates that an efficient computational system can be built from slow but massively parallel computational elements biased in weak inversion (or sub-threshold).
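The weak-inversion advantage in (1.1) comes from the I_D/g_m term; the following sketch compares that term in the two operating regions using the standard first-order expressions, with assumed parameter values rather than data from this work.

```python
# I_D/g_m for the two MOS operating regions (standard first-order expressions):
#   weak inversion:    g_m = I_D / (n*U_T)   -> I_D/g_m = n*U_T   (~40 mV)
#   strong inversion:  g_m = 2*I_D / V_ov    -> I_D/g_m = V_ov/2  (hundreds of mV)
n, UT = 1.5, 0.0259          # slope factor and thermal voltage (assumed typical)
V_ov = 0.3                   # overdrive voltage in strong inversion (assumed)

id_over_gm_weak = n * UT           # ~39 mV
id_over_gm_strong = V_ov / 2       # 150 mV

VDD, CP = 1.0, 10e-15              # assumed supply voltage and node capacitance
for name, ratio in [("weak inversion", id_over_gm_weak),
                    ("strong inversion", id_over_gm_strong)]:
    print(f"{name}: delay-power product ~ {VDD * CP * ratio:.2e} J")
```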
Subthreshold digital design is difficult because the high susceptibility to process variability in the subthreshold region causes timing errors [19]. For high-performance applications, low-threshold devices must be used, and leakage becomes a significant problem [20]. In a massively parallel system, the subthreshold leakage can consume a large portion of the total power without contributing to the computation throughput.
Errors in analog systems also behave more benignly than those in digital systems: an error in a digital system causes the complete loss of information (unless error correction is implemented), while errors in analog systems have much smaller magnitude, cause graceful degradation of performance and, if static, can be compensated by the feedback inherent to the learning algorithms. Moreover, leakage is no longer a problem: the subthreshold channel current in an analog circuit is used to carry information and perform operations, instead of being regarded as waste as in a digital computer.
1.3 Original Contributions

In this work, an analog signal processing system implementing DeSTIN, a state-of-the-art deep machine learning algorithm, is proposed. The original contributions of this work are summarized below:

• Characterized a floating gate device in a 0.13 µm standard digital CMOS process.
• Designed and tested a novel floating gate analog memory with random-accessible bidirectional sigmoid updates in 0.13 µm standard digital CMOS.
• Proposed a novel architecture and circuits to realize an analog online k-means clustering circuit with non-volatile storage, the first reported in the literature.
• Designed an analog deep machine learning engine to implement DeSTIN, the first reported in the literature, and proposed techniques to greatly increase its power and area efficiency.
1.4 Dissertation Organization
The remaining chapters of this dissertation cover the design of the components, circuits and architectures that implement the analog deep machine learning engine, in a bottom-up fashion.

Chapter 2 presents the implementation of the analog non-volatile memory, which is an essential component of the learning system.

Chapter 3 describes the design of an analog k-means clustering circuit, the key building block of the deep machine learning engine.

Chapter 4 presents the proposed analog deep machine learning engine, including its architecture and circuit designs, along with the techniques used to greatly improve its energy and area efficiency.

Chapter 5 develops the ultra-low-power tunable bump circuit, which can have wide application in analog signal processing systems.

Chapter 6 concludes the dissertation and proposes potential future work.
Chapter 2 A Floating-Gate Analog Memory with Random-Accessible Bidirectional Sigmoid Updates
Memory is an essential component of a computation system. Modern digital memory can afford very high read/write speed and density [21]. However, most digital memories are volatile: DRAM requires constant refreshing, and SRAM requires a minimum V_DD for state retention. This volatility precludes their use in intermittently powered devices such as those utilizing harvested energy.
Non-volatile digital memories such as flash memory [22] require special processing. FRAM (ferroelectric RAM) is reported to be embeddable using two additional mask steps in a conventional CMOS process [23] and has been proven to be commercially viable [24]. Recent research in this area has proposed other types of memory such as ReRAM (resistive RAM) [25] and MRAM (magnetoresistive RAM) [26]. However, these technologies are still new and not commercially available, and all require special processing.
Another major challenge in using digital memory in analog signal processing systems is that A/D/A conversion is needed to interface the memories to other circuits. This is especially problematic in distributed-memory architectures, where the A/D/A converters cannot be shared among the memory cells, leading to prohibitive area and power overhead.
In this work, I propose a floating-gate current-output analog memory which interfaces naturally with current-mode analog computation systems and allows random-accessible control of bidirectional updates, as described in [27]. The update scheme avoids the use of a charge pump, minimizes interconnection and pin count, and is compatible with standard digital processes. The update rule is sigmoid-shaped, which is a smooth, monotonic and bounded function. Implemented in a commercially available 0.13 µm single-poly digital CMOS process using thick-oxide I/O FETs, the memory cell achieves small area and low power consumption, and is suitable for integration into systems that exploit the high-density digital logic available in modern CMOS technology.
2.1 Overview of Floating Gate Device

2.1.1 Principles of Operation
A floating gate (FG) device uses the charge trapped on an isolated gate to store analog or digital values in a non-volatile way. The cross-section of a typical FG NFET in a bulk CMOS process is shown in Figure 2-1 [28]; note that a double-poly process is used to obtain the control gate. The earliest research on this device dates back to the 1960s [29], and modern EEPROM and Flash memories are both based on FG devices. Due to the excellent insulation provided by the thermally grown SiO2 surrounding the floating gate, the electrons trapped on the gate can have a retention time of more than 10 years [30]. The memory can be programmed by two mechanisms: Fowler–Nordheim tunneling and hot-electron injection.
Figure 2-1: Cross-section of a typical FG NFET in a bulk CMOS process [28]
2.1.2 Fowler–Nordheim Tunneling
Fowler–Nordheim (FN) tunneling is used to remove electrons from the floating gate. The potential difference applied across the poly-SiO2-Si structure reduces the effective thickness of the gate-oxide barrier, facilitating electron tunneling from the floating gate, through the SiO2 barrier, into the oxide conduction band. This is illustrated in Figure 2-2, which shows the energy band diagrams of the Si/SiO2 interface with and without an applied field [31]. At sufficiently high field, the width of the barrier becomes small enough for electrons to tunnel from the silicon conduction band into the oxide conduction band. This phenomenon was first described by Fowler and Nordheim for electrons tunneling through a vacuum barrier, and FN tunneling in SiO2 was reported in 1969 [32].

Figure 2-2: Energy band diagram of the Si/SiO2 interface (a) with and (b) without applied field [31]
2.1.3 Hot Electron Injection
Hot-electron injection is used to add electrons to the floating gate. In a PFET, the carrier holes are accelerated by the lateral field applied between its drain and source. Near the drain
terminal, the kinetic energy of the holes is large enough to liberate electron-hole pairs when they collide with the silicon lattice. Electrons scattered upward with energy larger than 3.2 eV are able to surmount the Si–SiO2 work-function barrier into the oxide conduction band. These electrons are then swept over to the floating gate by the oxide electric field. This process is illustrated in Figure 2-3.

Figure 2-3: Hot electron injection in a PFET
2.2 Literature Review on Floating Gate Analog Memory

Although FG devices are usually associated with digital memories such as EEPROM or Flash memory that store binary values, they are intrinsically analog devices, because the charge on the FG can be modified in a continuous way. A floating-gate analog memory uses the charge trapped on the isolated gate to store analog variables in a non-volatile way. It has been widely used in analog reconfigurable, adaptive and neuromorphic systems, such as electronic potentiometers [33], precision voltage references [34], offset-trimmed opamps [30], pattern classifiers [35], silicon learning networks [36], and adaptive filters [37].
Without direct electrical connections, the stored value of the memory is updated by depositing electrons onto the floating gate by hot-electron injection, or removing them by Fowler–Nordheim tunneling. Compared to injection, tunneling selectivity is harder to obtain because it
often involves controlling a high voltage (HV) on chip. Therefore, many previous works [35], [36] use tunneling as a global erase and injection to program each individual memory to its target value. However, in an online adaptive system such as this work, a bidirectional update is preferable because the stored values need to vary with the inputs. Previous works have proposed approaches to achieve selective tunneling. In [38], the selected memory is tunneled by pulling up the tunneling voltage and pulling down the control gate voltage simultaneously. This approach requires a number of tunneling control pins equal to the number of rows in the memory array, which is not desirable for large-scale systems. In [33], an HV switch is built with lightly-doped-drain nFETs. This device is not compatible with standard digital processes and consumes static power because it cannot be completely turned off. In [37], a charge pump is used to generate a local HV for the selected memory. A simple charge pump provides only limited voltage boost, while a more complex one consumes larger area and/or requires multi-phase clocks.

Another important performance metric of an analog memory is its update rule. The dynamics of the single-transistor FG memory [38] lead to an exponential and value-dependent update, which, in general, affects the stability of the adaptation [37]. A linear update can be obtained by fixing the FG node voltage during the update with a capacitive feedback loop around a differential [33] or single-ended amplifier [37].
2.3 Proposed Floating Gate Analog Memory

2.3.1 Floating-Gate Analog Memory Cell
2.3.1.1 Circuit Description
The schematic of the proposed FG analog memory cell is shown in Figure 2-4. The gates of MP1-MP3 and the top plate of Cf form the FG. The stored charge can be modified through the injection transistor MP2 and the tunneling transistor MP3. The two MUXes at the sources of MP1 and MP2 control the tunneling and injection of the FG, as discussed later. The transconductor gm converts the voltage V_out to the output current I_out, and V_ref determines the nominal voltage of V_out during operation.

Figure 2-4: Schematic of the proposed floating-gate analog memory cell

The negative feedback loop comprising the inverting amplifier MP1/MN1 and Cf keeps the FG voltage V_fg constant, ensuring a linear update of V_out. Tunneling or injection at the FG node changes the charge stored on Cf, and therefore changes the output of the amplifier by ΔV_out = ΔQ / C_f.

The transconductor is implemented with a differential pair MN2/MN3 and a cascode current mirror MP4-MP7, as depicted in Figure 2-5(a). Biased in the deep sub-threshold region, the
transconductor exhibits a transfer curve resembling a tanh function, plotted in Figure 2-5(b). The mild nonlinearity is smooth, monotonic and bounded. From Figure 2-5(b), a ΔV_in of 0.2 V is enough to cause a change of I_out from 0.1·I_B to 0.9·I_B; this reduced swing requirement further improves the update linearity and enables the selective tunneling.
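A behavioral sketch of this update mechanism is shown below. It is illustrative only: the linear ΔV_out = ΔQ/C_f step and a tanh-shaped transconductor are modeled with assumed parameter values, not the extracted ones.

```python
import math

# Behavioral model of the memory cell update (sketch with assumed values):
# each update pulse moves V_out linearly (dV = dQ/Cf), and the transconductor
# maps V_out to I_out through a tanh-like curve, giving a sigmoid update rule.
CF = 100e-15        # F, feedback capacitor (assumed)
DQ = 2e-15          # C, charge moved per update pulse (assumed)
IB = 10e-9          # A, transconductor tail current (output range 0-10 nA)
VLIN = 0.1          # V, transconductor linear-range scale (assumed)

def i_out(v_out: float, v_ref: float = 0.0) -> float:
    return 0.5 * IB * (1.0 + math.tanh((v_out - v_ref) / VLIN))

v = -0.3
for pulse in range(60):
    v += DQ / CF if pulse < 30 else -DQ / CF      # ramp up, then ramp down
    # i_out(v) traces the smooth, bounded sigmoid seen in Figure 2-11
print(f"final V_out = {v:.2f} V, I_out = {i_out(v)*1e9:.2f} nA")
```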
2.3.1.2 Floating Gate Charge Modification Modeling
The proposed analog memory uses Fowler–Nordheim tunneling to remove electrons from the FG and decrease the memory value. The tunneling current I_tun can be expressed as

I_tun = I_tun0 · exp(−V_f / V_ox)    (2.1)

where V_ox is the voltage across the tunneling transistor's gate oxide, and I_tun0 and V_f are process-dependent constants determined by measurements. Figure 2-6(a) shows the measured tunneling current I_tun versus the oxide voltage V_ox, together with the fitted model.
Hot-electron injection is employed to increase the stored value of the memory. The injection current I_inj depends on the source current and the drain-to-source voltage of MP2. A simplified empirical model derived from [39] approximates I_inj as a power law in the source current with an exponential dependence on the drain-to-source voltage,

I_inj = β · I_s^α · exp(V_sd / δ)    (2.2)

where I_s is the injection transistor's source current, V_sd is its drain-to-source voltage, and α, β and δ are fit constants. In our memory cell, I_s is set by the biasing current and the aspect ratios of MP1 and MP2. Figure 2-6(b) shows the measured I_inj versus V_sd, together with the fitted model.

The extracted models above can be used in future designs as well as to improve programming convergence, as will be described in Section 2.3.3.
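For illustration, the two update models can be coded as below. This is a sketch: the fit constants are placeholders chosen only to give currents of a plausible order, not the values extracted from this chip.

```python
import math

# Charge-modification models of Section 2.3.1.2 with placeholder fit constants
# (the real values are extracted from measurement and are not reproduced here).
I_TUN0 = 1e10      # A, tunneling prefactor (placeholder; FN fits often give large prefactors)
V_F    = 500.0     # V, tunneling slope constant (placeholder)
ALPHA, BETA, DELTA = 1.0, 1e-6, 0.25   # injection fit constants (placeholders)

def i_tun(v_ox: float) -> float:
    """Fowler-Nordheim tunneling current, Eq. (2.1)."""
    return I_TUN0 * math.exp(-V_F / v_ox)

def i_inj(i_s: float, v_sd: float) -> float:
    """Hot-electron injection current, Eq. (2.2)."""
    return BETA * (i_s ** ALPHA) * math.exp(v_sd / DELTA)

# The steep V_ox dependence is what makes the selective-tunneling scheme work:
# raising V_ox of only the selected cell changes its tunneling current by many decades.
ratio = i_tun(9.0) / i_tun(7.0)
print(f"2 V more across the oxide -> {ratio:.1e}x more tunneling current")
print(f"I_inj at 10 nA, 3 V: {i_inj(10e-9, 3.0):.2e} A")
```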
2.3.1.3 Selective and Value-Independent Update Scheme
The proposed tunneling scheme exploits the steep change of the tunneling current with respect to V_ox to achieve good isolation between selected and unselected memories. The operation of this scheme is described in Figure 2-7, which shows the memory cell with components irrelevant to the tunneling process omitted. To show how V_ox is changed, typical nodal voltages are annotated. The negative feedback keeps the FG voltage at

V_fg = Vdd − V_SG,P

where V_SG,P is the source-to-gate voltage of MP1. Therefore, reducing the supply voltage of the selected memory effectively reduces V_fg and increases V_ox. In our design, the power supply is switched from the 3 V Vdd to a 1 V Vddt, resulting in isolation of over 7 orders of magnitude according to (2.1). In practice, the isolation may be degraded by direct tunneling at lower V_ox, which is a weaker function of the applied field [40], and by parasitic coupling; isolation of 83.54 dB is observed in measurement. The condition that MN1 stays in saturation during tunneling can be satisfied by choosing a proper V_ref and using the proposed transconductor to reduce the V_out swing.
Injection selectivity is achieved by switching the source voltage of the injection transistor MP2. The source of MP2 in an unselected memory is connected to ground, while in the selected cell it is connected to Vdd, enabling injection. Therefore, the injection is also selective and value-independent.
Figure 2-7: Simplified schematics and typical nodal voltages of memory cells (a) not selected (b) selected for tunneling
2.3.2 Floating Gate Memory Array
Thirty-two of the proposed FG analog memory cells are connected to form a memory array. They are organized in two dimensions and can be randomly accessed (selected) for read and write operations by setting both the column and row inputs high. The block diagram is shown in Figure 2-8, with the cell symbolized. The cells are augmented by digital logic controlling their operation modes; the list of digital control combinations and their corresponding operation modes is also given in Figure 2-8.

Once selected, a transmission gate connects the output of that cell off-chip through the Iout_bus for read-out during programming. The Inj/Tun signal sets the direction of the memory write, and the magnitude of the write is controlled by the pulse width of the Update signal. When a cell is not selected, it maintains its value and can be read or written by on-chip circuits to implement
adaptive algorithms. The proposed architecture is scalable because all signals and interconnections are shared among the cells, and the pin count does not increase with the size of the array.
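The per-cell mode decoding can be sketched as follows. This is a hypothetical rendering of the control table of Figure 2-8; the exact signal encoding on the chip may differ.

```python
# Hypothetical sketch of the per-cell operating-mode decode (Figure 2-8 lists the
# actual control-signal table; the encoding below is illustrative only).
def cell_mode(row_sel: bool, col_sel: bool, update: bool, inj_not_tun: bool) -> str:
    selected = row_sel and col_sel          # random access: both lines high
    if not selected:
        return "hold / on-chip adaptation"  # keeps value, usable by on-chip circuits
    if not update:
        return "read (output on Iout_bus)"
    return "inject (increase value)" if inj_not_tun else "tunnel (decrease value)"

print(cell_mode(True, True, True, False))   # selected cell, tunneling pulse
print(cell_mode(True, False, True, True))   # unselected: holds its value
```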
2.3.3 Measurement Results

The proposed FG memory array has been fabricated in a 0.13 µm single-poly standard digital CMOS process using thick-oxide I/O FETs. The die micrograph is shown in Figure 2-9. Because of the extensive metal fill in this process, details of the circuits cannot be seen, so the Virtuoso layout view is also presented.
The area of a single memory cell is 35 × 14 µm². It operates from a 3 V power supply and consumes 15 nA, with an output range of 0-10 nA. The biasing current is tunable and allows the designer to balance range, speed and power consumption.

Figure 2-9: (a) Chip micrograph of the memory array together with on-chip adaptation circuitry and (b) layout view of a single memory cell

Figure 2-10: Analog memory programming accuracy of 30 linearly spaced values
The test setup is built around a National Instruments data acquisition (DAQ) card and a host PC. The programming procedure is controlled by a LabVIEW program on the host PC and is based on the models in Section 2.3.1.2 to achieve fast convergence; the average number of iterations required to achieve a 0.5% error is 5-6. Figure 2-10 demonstrates 30 memory cells programmed to values between 1 and 9 nA. The standard deviation of the programming error is 76 pA, limited by the external circuits and equipment, indicating a 7-bit programming resolution. The memory output noise is 20.5 pArms over a 10 kHz bandwidth from simulation, indicating a 53.8 dB dynamic range.
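The model-based programming loop can be sketched as below. This is hypothetical: the LabVIEW routine is not reproduced here, and the update gain and helper names are placeholders rather than the extracted model constants.

```python
# Hypothetical sketch of the model-based programming loop used to converge on a
# target current in a few iterations (the actual LabVIEW routine is not shown).
def program_cell(read_cell, apply_pulse, target_na: float,
                 gain_na_per_ms: float = 0.5, tol: float = 0.005, max_iter: int = 10):
    """Iteratively pulse the selected cell until it is within tol of target_na.

    read_cell():           measure the cell output in nA (DAQ readback)
    apply_pulse(dir, ms):  one update pulse; dir is 'inj' or 'tun', ms its width
    gain_na_per_ms:        small-signal update gain predicted by the fitted models
    """
    for it in range(max_iter):
        error = target_na - read_cell()
        if abs(error) <= tol * target_na:
            return it                                        # converged
        direction = "inj" if error > 0 else "tun"
        width_ms = min(abs(error) / gain_na_per_ms, 10.0)    # model-predicted pulse width
        apply_pulse(direction, width_ms)
    return max_iter

# Example with an idealized fake cell standing in for the hardware (real cells
# take the typical 5-6 iterations because the update gain is only approximate):
cell = {"i_na": 2.0}
read = lambda: cell["i_na"]
pulse = lambda d, ms: cell.update(i_na=cell["i_na"] + (ms if d == "inj" else -ms) * 0.5)
print("iterations:", program_cell(read, pulse, target_na=5.0))
```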
To show the update rule, a memory cell is first ramped up and then ramped down with a fixed pulse width of 1 ms. The corresponding V_out and I_out are plotted in Figure 2-11. Both injection and tunneling are linear in V_out, and the current output follows a smooth sigmoid update rule. During the same test, the stored values of the other 31 unselected cells are monitored to measure the writing crosstalk. The crosstalk from the injection and tunneling of the selected cell to the unselected ones is plotted in Figure 2-12. There is no observable injection crosstalk.

Figure 2-11: Ramping of the memory value, showing the update rules

Tunneling crosstalk is