
May, Gary S. "Computational Intelligence in Microelectronics Manufacturing"
Computational Intelligence in Manufacturing Handbook
Edited by Jun Wang et al.
Boca Raton: CRC Press LLC, 2001

13 Computational Intelligence in Microelectronics Manufacturing

Gary S. May
Georgia Institute of Technology

13.1 Introduction

A similar trend exists in telecommunications, where the user will soon be employing high-performance, multifunctional, portable units. In the consumer industry, multimedia products capable of voice, image, video, text, and other functions are also expected to be commonplace within the next decade.

The common thread in each of these trends is low-cost electronics. This multi-billion-dollar electronics industry is fundamentally dependent on the manufacture of semiconductor integrated circuits (ICs). However, the fabrication of ICs is extremely expensive. In fact, the last couple of decades have seen semiconductor manufacturing become so capital-intensive that only a few very large companies can participate. A typical state-of-the-art, high-volume manufacturing facility today costs over a billion dollars [Dax, 1996]. As shown in Figure 13.1, this represents a factor of over 1000 increase over the cost of a comparable facility 20 years ago. If this trend continues at its present rate, facility costs will exceed the total annual revenue of any of the four leading U.S. semiconductor companies at the turn of the century [May, 1994].

Because of rising costs, the challenge before semiconductor manufacturers is to offset capital investment with a greater amount of automation and technological innovation in the fabrication process. In other words, the objective is to use the latest developments in computer technology to enhance the manufacturing methods that have become so expensive. In effect, this effort in computer-integrated manufacturing of integrated circuits (IC-CIM) is aimed at optimizing the cost-effectiveness of integrated circuit manufacturing as computer-aided design (CAD) has dramatically affected the economics of circuit design.

Under the overall heading of reducing manufacturing cost, several important subtasks have been identified. These include increasing chip fabrication yield, reducing product cycle time, maintaining consistent levels of product quality and performance, and improving the reliability of processing equipment. Unlike the manufacture of discrete parts such as electrical appliances, where relatively little rework is required and a yield greater than 95% on salable product is often realized, the manufacture of integrated circuits faces unique obstacles. Semiconductor fabrication processes consist of hundreds of sequential steps, and yield loss occurs at every step. Therefore, IC manufacturing processes have yields as low as 20 to 80%. The problem of low yield is particularly severe for new fabrication sequences. Effective IC-CIM systems, however, can alleviate such problems. Table 13.1 summarizes the results of a 1986 Toshiba study that analyzed the use of IC-CIM techniques in producing 256K dynamic RAM memory circuits [Hodges et al., 1989]. This study showed that CIM techniques improved the manufacturing process on each of the four productivity metrics investigated.

Because of the large number of steps involved, maintaining product quality in an IC manufacturing facility requires strict control of literally hundreds or even thousands of process variables. The interdependent issues of high yield, high quality, and low cycle time have been addressed in part by the ongoing development of several critical capabilities in state-of-the-art IC-CIM systems: in situ process monitoring, process/equipment modeling, real-time closed-loop process control, and equipment malfunction diagnosis. Each of these activities increases throughput and reduces yield loss by preventing potential misprocessing, but each presents significant engineering challenges in effective implementation and deployment.

13.2 The Role of Computational Intelligence

Recently, the use of computational intelligence in various manufacturing applications has dramatically increased, and semiconductor manufacturing is no exception to this trend. Artificial neural networks [Dayhoff, 1990], genetic algorithms [Goldberg, 1989], expert systems [Parsaye and Chignell, 1988], and other techniques have emerged as powerful tools for assisting IC-CIM systems in performing various process monitoring, modeling, control, and diagnostic functions. The following is an introduction to various computational intelligence tools in preparation for a more detailed description of the manner in which these tools have been used in IC-CIM systems.

FIGURE 13.1 Graph of rising integrated circuit fabrication costs in thousands of dollars over the last three decades. (Source: May, G., 1994. Manufacturing ICs the Neural Way, IEEE Spectrum, 31(9):47-51. With permission.)

13.2.1 Neural Networks

Because of their inherent learning capability, adaptability, and robustness, artificial neural nets are used to solve problems that have heretofore resisted solution by other, more traditional methods. Although the name "neural network" stems from the fact that these systems crudely mimic the behavior of biological neurons, the neural networks used in microelectronics manufacturing applications actually have little to do with biology. However, they share some of the advantages that biological organisms have over standard computational systems. Neural networks are capable of performing highly complex mappings on noisy and/or nonlinear data, thereby inferring very subtle relationships between diverse sets of input and output parameters. Moreover, these networks can also generalize well enough to learn overall trends in functional relationships from limited training data.

There are several neural network architectures and training algorithms eligible for manufacturing applications. However, the backpropagation (BP) algorithm is the most generally applicable and most popular approach for microelectronics manufacturing. Feedforward neural networks trained by BP consist of several layers of simple processing elements called "neurons" (Figure 13.2). These rudimentary processors are interconnected so that information relevant to input–output mappings is stored in the weights of the connections between them. Each neuron computes the weighted sum of its inputs filtered by a sigmoid transfer function. The layers of neurons in BP networks receive, process, and transmit critical information about the relationships between the input parameters and corresponding responses. In addition to the input and output layers, these networks incorporate one or more "hidden" layers of neurons that do not interact with the outside world, but assist in performing nonlinear feature extraction tasks on information provided by the input and output layers.

In the BP learning algorithm, the network begins with a random set of weights. An input vector is then presented and fed forward through the network, and the output is calculated using this initial weight matrix. Next, the calculated output is compared to the measured output data, and the squared difference between these two vectors determines the system error. The accumulated error for all of the input–output pairs is defined as the Euclidean distance in the weight space that the network attempts to minimize. Minimization is accomplished via the gradient descent approach, in which the network weights are adjusted in the direction of decreasing error. It has been demonstrated that, if a sufficient number of hidden neurons are present, a three-layer BP network can encode any arbitrary input–output relationship [Irie and Miyake, 1988].

The structure of a typical BP network appears in Figure 13.3. Referring to this figure, let w_{i,j,k} = weight between the jth neuron in layer (k−1) and the ith neuron in layer k; in_{i,k} = input to the ith neuron in the kth layer; and out_{i,k} = output of the ith neuron in the kth layer. The input to a given neuron is given by

TABLE 13.1 Results of 1986 Toshiba Study

Source: Hodges, D., Rowe, L., and Spanos, C., 1989. Computer-Integrated Manufacturing of VLSI, Proc. IEEE/CHMT Int. Elec. Manuf. Tech. Symp., 1-3. With permission.

in_{i,k} = Σ_j w_{i,j,k} · out_{j,k−1}        Equation (13.1)

where the summation is taken over all the neurons in the previous layer. The output of a given neuron is a sigmoidal transfer function of the input, expressed as

out_{i,k} = 1/(1 + e^{−in_{i,k}})        Equation (13.2)

Error is calculated for each input–output pair as follows: input neurons are assigned a value, and computation occurs by a forward pass through each layer of the network. Then the computed value at the output is compared to its desired value, and the square of the difference between these two vectors provides a measure of the error (E) using

E = ½ Σ_i (out_{i,n} − d_i)²        Equation (13.3)

where d_i is the desired output of the ith output neuron and n denotes the output layer.

After a forward pass through the network, error is propagated backward from the output layer. Learning occurs by minimizing error through modification of the weights one layer at a time. The weights are modified by calculating the derivative of E and following the gradient that results in a minimum value. From Equations 13.1 and 13.2, the following partial derivatives are computed as

∂out_{i,k}/∂in_{i,k} = out_{i,k}(1 − out_{i,k})        Equation (13.4)

δ_{i,k} = ∂E/∂in_{i,k},   φ_{i,k} = ∂E/∂out_{i,k}        Equation (13.5)

∂E/∂w_{i,j,k} = δ_{i,k} · out_{j,k−1}        Equation (13.6)

In the previous expression, out_{j,k−1} is available from the forward pass. The quantity δ_{i,k} is calculated by propagating the error backward through the network. Consider that for the output layer

δ_{i,n} = (out_{i,n} − d_i) · out_{i,n}(1 − out_{i,n})        Equation (13.7)

where the expressions in Equations 13.3 and 13.4 have been substituted. Likewise, the quantity φ_{i,n} is given by

φ_{i,n} = out_{i,n} − d_i        Equation (13.8)

Consequently, for the inner layers of the network,

φ_{i,k} = Σ_j δ_{j,k+1} · w_{j,i,k+1}        Equation (13.9)

Note that φ_{i,k} depends only on the δ in the (k + 1)th layer. Thus, φ for all neurons in a given layer can be computed in parallel. The gradient of the error with respect to the weights is calculated for one pair of input–output patterns at a time. After each computation, a step is taken in the opposite direction of the error gradient. This procedure is iterated until convergence is achieved.
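For concreteness, the forward pass of Equations 13.1 and 13.2 and the backward recursion of Equations 13.7 through 13.9 can be collected into a short program. The following is a minimal NumPy sketch, not the chapter's implementation; the layer size, learning rate, epoch count, and XOR-style toy data are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        # Sigmoidal transfer function of Equation (13.2)
        return 1.0 / (1.0 + np.exp(-x))

    def train_bp(X, D, hidden=6, eta=0.5, epochs=5000, seed=0):
        """Three-layer BP network trained per Equations 13.1-13.9."""
        rng = np.random.default_rng(seed)
        W1 = rng.uniform(-1.0, 1.0, (X.shape[1], hidden))  # input-to-hidden weights
        W2 = rng.uniform(-1.0, 1.0, (hidden, D.shape[1]))  # hidden-to-output weights
        for _ in range(epochs):
            for x, d in zip(X, D):
                h = sigmoid(x @ W1)                # forward pass (Eqs. 13.1, 13.2)
                o = sigmoid(h @ W2)
                delta_o = (o - d) * o * (1 - o)    # output-layer delta (Eq. 13.7)
                phi_h = delta_o @ W2.T             # inner-layer phi (Eq. 13.9)
                delta_h = phi_h * h * (1 - h)      # sigmoid derivative (Eq. 13.4)
                W2 -= eta * np.outer(h, delta_o)   # step against gradient (Eq. 13.6)
                W1 -= eta * np.outer(x, delta_h)
        return W1, W2

    # Toy usage: learn a small nonlinear input-output mapping (XOR pattern)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    D = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2 = train_bp(X, D)
    print(sigmoid(sigmoid(X @ W1) @ W2).round(2))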

13.2.2 Genetic Algorithms

Neural networks are an extremely useful tool for defining the often complex relationships between controllable process conditions and measurable responses in electronics manufacturing processes. However, in addition to the need to predict the output behavior of a given process given a set of input conditions, one would also like to be able to use such models "in reverse." In other words, given a target response or set of response characteristics, it is often desirable to derive an optimum set of process conditions (or process "recipe") to achieve these targets. Genetic algorithms (GAs) are a method to optimize a given process and define this reverse mapping.

In the 1970s, John Holland introduced GAs as an optimization procedure [Holland, 1975]. Genetic algorithms are guided stochastic search techniques based on the principles of genetics. They use three operations found in natural evolution to guide their trek through the search space: selection, crossover, and mutation. Using these operations, GAs search through large, irregularly shaped spaces quickly, requiring only objective function values (detailing the quality of possible solutions) to guide the search. Furthermore, GAs take a more global view of the search space than many methods currently encountered in engineering optimization. Theoretical analyses suggest that GAs quickly locate high-performance regions in extremely large and complex search spaces and possess some natural insensitivity to noise. These qualities make GAs attractive for optimizing neural network based process models.

In computing terms, a genetic algorithm maps a problem onto a set of binary strings. Each string represents a potential solution. Then the GA manipulates the most promising strings in searching for improved solutions. A GA typically operates through a simple cycle of four stages: (i) creation of a population of strings; (ii) evaluation of each string; (iii) selection of "best" strings; and (iv) genetic manipulation to create the new population of strings. During each computational cycle, a new generation of possible solutions for a given problem is produced. At the first stage, an initial population of potential solutions is created as a starting point for the search process. Each element of the population is encoded into a string (the "chromosome"), to be manipulated by the genetic operators. In the next stage, the performance (or fitness) of each individual of the population is evaluated. Based on each individual string's fitness, a selection mechanism chooses "mates" for the genetic manipulation process. The selection policy is responsible for assuring survival of the most fit individuals.

A common method of coding multiparameter optimization problems is concatenated, multiparameter, mapped, fixed-point coding. Using this procedure, if an unsigned integer x is the decoded parameter of interest, then x is mapped linearly from [0, 2^l] to a specified interval [U_min, U_max] (where l is the length of the binary string). In this way, both the range and precision of the decision variables are controlled. To construct a multiparameter coding, as many single-parameter strings as required are simply concatenated. Each coding has its own sub-length. Figure 13.4 shows an example of a two-parameter coding with four bits in each parameter. The ranges of the first and second parameter are 2-5 and 0-15, respectively.
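A decoding routine for this scheme is sketched below; it is a small illustration, not code from the chapter. It reads each substring as an unsigned integer and maps it linearly onto its interval, taking 2^l − 1 as the largest decodable value (an assumption about the intended linear map); the final line reproduces the two-parameter, four-bit layout of Figure 13.4.

    def decode_chromosome(bits, specs):
        """Decode a concatenated fixed-point chromosome into real parameters.

        specs is a list of (length, Umin, Umax) tuples, one per parameter.
        Each substring is read as an unsigned integer x and mapped linearly
        onto [Umin, Umax]; 2**length - 1 is taken as the largest value.
        """
        params, pos = [], 0
        for length, umin, umax in specs:
            x = int(bits[pos:pos + length], 2)
            params.append(umin + x * (umax - umin) / (2 ** length - 1))
            pos += length
        return params

    # The two-parameter, four-bit layout of Figure 13.4:
    # the first parameter ranges over [2, 5], the second over [0, 15]
    print(decode_chromosome("10110101", [(4, 2.0, 5.0), (4, 0.0, 15.0)]))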

The string manipulation process employs genetic operators to produce a new population of individuals ("offspring") by manipulating the genetic "code" possessed by members ("parents") of the current population. It consists of selection, crossover, and mutation operations. Selection is the process by which strings with high fitness values (i.e., good solutions to the optimization problem under consideration) receive larger numbers of copies in the new population. In one popular method of selection called elitist roulette wheel selection, strings with fitness value F_i are assigned a proportionate probability of survival into the next generation. This probability distribution is determined according to

P_i = F_i / Σ_j F_j        Equation (13.12)

FIGURE 13.4 Example of multiparameter binary coding. Two parameters are coded into binary strings with different ranges and varying precision (π). (Source: Han, S. and May, G., 1997. Using Neural Network Process Models to Perform PECVD Silicon Dioxide Recipe Synthesis via Genetic Algorithms, IEEE Trans. Semi. Manuf., 10(2):279-287. With permission.)

Thus, an individual string whose fitness is n times better than another's will produce n times the number of offspring in the subsequent generation. Once the strings have reproduced, they are stored in a "mating pool" awaiting the actions of the crossover and mutation operators.

The crossover operator takes two chromosomes and interchanges part of their genetic information to produce two new chromosomes (see Figure 13.5). After the crossover point is randomly chosen, portions of the parent strings (P1 and P2) are swapped to produce the new offspring (O1 and O2) based on a specified crossover probability. Mutation is motivated by the possibility that the initially defined population might not contain all of the information necessary to solve the problem. This operation is implemented by randomly changing a fixed number of bits in every generation according to a specified mutation probability (see Figure 13.6). Typical values for the probabilities of crossover and bit mutation range from 0.6 to 0.95 and 0.001 to 0.01, respectively. Higher rates disrupt good string building blocks more often, and for smaller populations, sampling errors tend to wash out the predictions. For this reason, the greater the mutation and crossover rates and the smaller the population size, the less frequently predicted solutions are confirmed.
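The three operators can be written compactly; the sketch below is an illustrative Python rendering using default probabilities inside the ranges just quoted, not the chapter's code.

    import random

    def roulette_select(population, fitness):
        """Fitness-proportionate selection (Equation 13.12)."""
        pick = random.uniform(0.0, sum(fitness))
        acc = 0.0
        for individual, f in zip(population, fitness):
            acc += f
            if acc >= pick:
                return individual
        return population[-1]

    def crossover(p1, p2, pc=0.8):
        """Single-point crossover of two parent strings (Figure 13.5)."""
        if random.random() < pc:
            point = random.randint(1, len(p1) - 1)   # random crossover point
            return p1[:point] + p2[point:], p2[:point] + p1[point:]
        return p1, p2

    def mutate(bits, pm=0.005):
        """Flip each bit with probability pm (Figure 13.6)."""
        return "".join("10"[int(b)] if random.random() < pm else b for b in bits)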

13.2.3 Expert Systems

Computational intelligence has also been introduced into electronics manufacturing in the areas of automated process and equipment diagnosis. When unreliable equipment performance causes operating conditions to vary beyond an acceptable level, overall product quality is jeopardized. Thus, timely and accurate diagnosis is a key to the success of the manufacturing process. Diagnosis involves determining the assignable causes for equipment malfunctions and correcting them quickly to prevent the subsequent occurrence of expensive misprocessing.

FIGURE 13.5 The crossover operation. Two parent strings exchange binary information at a randomly determined crossover point to produce two offspring. (Source: Han, S. and May, G., 1997. Using Neural Network Process Models to Perform PECVD Silicon Dioxide Recipe Synthesis via Genetic Algorithms, IEEE Trans. Semi. Manuf., 10(2):279-287. With permission.)

FIGURE 13.6 The mutation operation. A randomly selected bit in a given binary string is changed according to a given probability. (Source: Han, S. and May, G., 1997. Using Neural Network Process Models to Perform PECVD Silicon Dioxide Recipe Synthesis via Genetic Algorithms, IEEE Trans. Semi. Manuf., 10(2):279-287. With permission.)

Neural networks have recently emerged as an effective tool for fault diagnosis. Diagnostic problem solving using neural networks requires the association of input patterns representing quantitative and qualitative process behavior with fault identification. Robustness to noisy sensor data and high-speed parallel computation make neural networks an attractive alternative for real-time diagnosis. However, the pattern-recognition-based neural network approach suffers from some limitations. First, a complete set of fault signatures is hard to obtain, and representational inadequacy of a limited number of data sets can induce network overtraining, thus increasing the misclassification or "false alarm" rate. Also, approaches such as this, in which diagnostic actions take place following a sequence of several processing steps, are not appropriate, since evidence pertaining to potential equipment malfunctions accumulates at irregular intervals throughout the process sequence. By the end of the process sequence, significant misprocessing and yield loss may have already taken place, making this approach economically undesirable.

Hybrid schemes involving neural networks and traditional expert systems have been employed to circumvent these inadequacies. Hybrid techniques offset the weaknesses of each individual method used by itself. Traditional expert systems excel at reasoning from previously viewed data, whereas neural networks extrapolate analyses and perform generalized classification for new scenarios. One approach to defining a hybrid scheme involves combining neural networks with an inference system based on the Dempster–Shafer theory of evidential reasoning [Shafer, 1976]. This technique allows the combination of various pieces of uncertain evidence obtained at irregular intervals, and its implementation results in time-varying, nonmonotonic belief functions that reflect the current status of diagnostic conclusions at any given point in time.

One of the basic concepts in Dempster–Shafer theory is the frame of discernment (symbolized by Θ), defined as an exhaustive set of mutually exclusive propositions. For the purposes of diagnosis, the frame of discernment is the union of all possible fault hypotheses. Each piece of collected evidence can be mapped to a fault or group of faults within Θ. The likelihood of a fault proposition A is expressed as a bounded interval [s(A), p(A)] which lies in [0, 1]. The parameter s(A) represents the support for A, which measures the weight of evidence in support of A. The other parameter p(A), called the plausibility of A, is defined as the degree to which contradictory evidence is lacking. Plausibility measures the maximum amount of belief that can possibly be assigned to A. The quantity u(A) is the uncertainty of A, which is the difference between the evidential plausibility and support. For example, an evidence interval of [0.3, 0.7] for proposition A indicates that the probability of A is between 0.3 and 0.7, with an uncertainty of 0.4.

In terms of diagnosis, proposition A represents a given fault hypothesis. An evidential interval for fault A is determined from a basic probability mass distribution (BPMD). The BPM m⟨A⟩ indicates the portion of the total belief in evidence assigned exactly to a particular fault hypothesis set. Any residual belief in the frame of discernment that cannot be attributed to any subset of Θ is assigned directly to Θ itself, which introduces uncertainty into the diagnosis. Using this framework, the support and plausibility of proposition A are given by:

s(A) = Σ m(A_i)        Equation (13.13)

p(A) = 1 − Σ m(B_i)        Equation (13.14)

where A_i ⊆ A, B_i ∩ A = ∅, and the summations are taken over all propositions in a given BPM. Thus the total belief in A is the sum of support ascribed to A and all subsets thereof.

Dempster's rules for evidence combination provide a deterministic and unambiguous method of combining BPMDs from separate and distinct sources of evidence contributing varying degrees of belief to several propositions under a common frame of discernment. The rule for combining the observed BPMs of two arbitrary and independent knowledge sources m1 and m2 into a third m3 is

m3(Z) = (1/(1 − k)) Σ m1(X_i) · m2(Y_j)        Equation (13.15)

where Z = X_i ∩ Y_j and

k = Σ m1(X_i) · m2(Y_j)        Equation (13.16)

where X_i ∩ Y_j = ∅. Here X_i and Y_j represent various propositions which consist of fault hypotheses and disjunctions thereof. Thus, the BPM of the intersection of X_i and Y_j is the product of the individual BPMs of X_i and Y_j. The factor (1 − k) is a normalization constant that prevents the total belief from exceeding unity due to attributing portions of belief to the empty set.
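The combination rule and the interval computation of Equations 13.13 through 13.16 are mechanical enough to sketch directly. In the sketch below, only the functions follow the equations; the BPMD values are hypothetical placeholders, since the chapter's actual m1 and m2 for the etcher example are not reproduced here.

    def combine(m1, m2):
        """Dempster's rule of combination (Equations 13.15 and 13.16).

        Each BPMD maps frozensets of fault labels to basic probability masses.
        """
        m3, k = {}, 0.0
        for x, mx in m1.items():
            for y, my in m2.items():
                z = x & y
                if z:
                    m3[z] = m3.get(z, 0.0) + mx * my
                else:
                    k += mx * my            # belief attributed to the empty set
        return {z: v / (1.0 - k) for z, v in m3.items()}

    def support(m, a):
        """Equation (13.13): sum the masses of A and all its subsets."""
        return sum(v for s, v in m.items() if s <= a)

    def plausibility(m, a):
        """Equation (13.14): one minus the mass of everything disjoint from A."""
        return 1.0 - sum(v for s, v in m.items() if not (s & a))

    # Hypothetical BPMDs over the frame {A, B, C, D} of the etcher example;
    # the chapter's actual m1 and m2 values are not reproduced here.
    TH = frozenset("ABCD")
    m1 = {frozenset("A"): 0.4, frozenset("BC"): 0.3, TH: 0.3}
    m2 = {frozenset("AB"): 0.5, frozenset("D"): 0.2, TH: 0.3}
    m3 = combine(m1, m2)
    for fault in "ABCD":
        a = frozenset(fault)
        print(fault, round(support(m3, a), 3), round(plausibility(m3, a), 3))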

To illustrate, consider the combination of m1 and m2 when each contains different evidence concerning the diagnosis of a malfunction in a plasma etcher [Manos and Flamm, 1989]. Such evidence could result from two different sensor readings. In particular, suppose that the sensors have observed that the flow of one of the etch gases into the process chamber is too low. Let the frame of discernment Θ = {A, B, C, D}, where A, …, D symbolically represent the following mutually exclusive equipment faults:

A = mass flow controller miscalibration

B = gas line leak

C = throttle valve malfunction

D = incorrect sensor signal

These components are illustrated graphically in the etcher gas flow system shown in Figure 13.7. Suppose that belief in this frame of discernment is distributed according to the BPMDs m1 and m2.

FIGURE 13.7 Partial schematic of RIE gas delivery system. (Source: Kim, B. and May, G., 1997. Real-Time Diagnosis of Semiconductor Manufacturing Equipment Using Neural Networks, IEEE Trans. Comp. Pack. Manuf. Tech. C, 20(1):39-47. With permission.)

The calculation of the combined BPMD (m3) is shown in Table 13.2. Each cell of the table contains the intersection of the corresponding propositions from m1 and m2, along with the product of their individual beliefs. Note that the intersection of any proposition with Θ is the original proposition. The BPM attributed to the empty set, k, which originates from the presence of various propositions in m1 and m2 whose intersection is empty, is 0.11. BPMs for the remaining propositions are then obtained by applying Equation 13.16.

The plausibilities for propositions in the combined BPM are calculated by applying Equation 13.15. The individual evidential intervals implied by m3 are A[0.225, 0.550], B[0.169, 0.472], C[0.079, 0.235], and D[0.135, 0.269]. Combining the evidence available from knowledge sources m1 and m2 thus leads to the conclusion that the most likely cause of the insufficient gas flow malfunction is a miscalibration of the mass flow controller (proposition A).

13.3 Process Modeling

The ability of neural networks to learn input–output relationships from limited data is beneficial in electronics manufacturing, where a plethora of nonlinear fabrication processes exist and experimental data are expensive to obtain. Several researchers have reported noteworthy successes in using neural networks to model the behavior of a few key fabrication processes. In so doing, the basic strategy is usually to perform a series of statistically designed characterization experiments, and then to train BP neural nets to model the experimental data. The process characterization experiments typically consist of a factorial exploration of the input parameter space, which may be subsequently augmented by a more advanced experimental design. Each set of input conditions in the design corresponds to a particular set of measured process responses. This input–output mapping is what the neural network learns.

13.3.1 Modeling Using Backpropagation Neural Networks

As an example of the neural-network-based process modeling procedure, Himmel and May [1993] used BP neural nets to model plasma etching. Plasma etching removes patterned layers of material using reactive gases in an AC discharge (Figure 13.8). Because this process is popular, considerable effort has been expended in developing reliable models that relate the response of process outputs (such as etch rate or etch uniformity) to variations in input parameters (such as pressure, radio-frequency power, or gas composition). These models are required to predict etch behavior under an exhaustive set of operating conditions with a very high degree of precision. However, plasma processing involves complex and dynamic interactions between reactive particles in an electric field. As a result of this inherent complexity, approaches to plasma etch modeling that preceded the advent of neural networks met with limited success.

TABLE 13.2 Illustration of BPMD Combination

Source: Kim, B. and May, G., 1997. Real-Time Diagnosis of Semiconductor Manufacturing Equipment Using Neural Networks, IEEE Trans. Comp. Pack. Manuf. Tech. C, 20(1):39-47. With permission.

Plasma process modeling efforts have previously focused on statistical response surface methods (RSM) [Box and Draper, 1987]. RSM models can predict etch behavior under a wide range of operating conditions, but they are most efficient when the number of process variables is small (i.e., six or fewer). The large number of experiments required to adequately characterize the many significant variables in processes like plasma etching is costly and usually prohibitive, forcing experimenters to manipulate a reduced set of variables. Because plasma etching is a highly nonlinear process, this simplification reduces the accuracy of the RSM models.

Himmel and May compared RSM to BP neural networks for modeling the etching of polysilicon films in a carbon tetrachloride (CCl4) plasma. To do so, they characterized the process by varying RF power, chamber pressure, electrode spacing, and gas composition in a partial factorial design, and trained the neural nets to model the effect of each combination of these inputs on etch rate, uniformity, and selectivity. Afterward, they found that the neural network models exhibited 40 to 70% better accuracy (as measured by root-mean-square error) than RSM models and required fewer training experiments. Furthermore, the results of this study also indicated that the generalizing capabilities of neural network models were superior to their conventional statistical counterparts. This fact was verified by using both the RSM and "neural" process models to predict previously unobserved experimental data (or test data). Neural networks showed the ability to generalize with an RMS error 40% lower than the statistical models, even when built with less training data.

FIGURE 13.8 Simplified schematic of plasma etching system.

Investigators at DuPont, Bell Laboratories, the University of Texas at Austin, Michigan State University, and Texas Instruments have likewise reported positive results using neural nets for modeling plasma etching. Mocella et al. [1991] also modeled polysilicon etching, and found that BP neural nets consistently produced models exhibiting better fit than second- and third-order polynomial RSM models. Rietman and Lory [1993] modeled tantalum silicide/polysilicon etching of the gate of metal-oxide-semiconductor (MOS) transistors. They successfully used data from an actual production machine to train neural nets to predict the amount of silicon dioxide remaining in the source and drain regions of the devices after etching. Subsequently, they used their neural etch models to analyze the sensitivity of this etch response to several input parameters, which provided much useful information for process designers.

Huang et al. [1994] used neural networks to model the etching of silicon dioxide in a carbon tetrafluoride (CF4)/oxygen plasma. This group found that neural nets consistently outperform RSM models, and they also showed that developing satisfactory models is possible from even fewer experimental data points than there are coefficients in the neural network. Salam et al. [1997] modeled plasma etching in an electron cyclotron resonance (ECR) plasma. This group focused on novel variations of the BP learning algorithm that employed error functions different from the quadratic function described by Equation 13.3. They were able to successfully model ECR plasma responses using neural nets trained with a polynomial error function derived from the statistical properties of the error signal itself.

Other manufacturing processes have also benefited from the neural network approach. Specifically, chemical vapor deposition (CVD) processes, which are also nonlinear, have been modeled effectively. Nadi et al. [1991] combined BP neural nets and influence diagrams for both the modeling and recipe synthesis of low-pressure CVD (LPCVD) of polysilicon. Bose and Lord [1993] demonstrated that neural networks provide appreciably better generalization than regression-based models of silicon CVD. Similarly, Han et al. [1994] developed neural process models for the plasma-enhanced CVD (PECVD) of silicon dioxide films used as interlayer dielectric material in multichip modules.

13.3.2 Modifications to Standard Backpropagation in Process Modeling

In each of the previous examples, standard implementations of the BP algorithm were employed to perform process modeling tasks. However, innovative modifications of standard BP have also been developed for certain other applications. In one case, BP has been combined with simulated annealing to enhance model accuracy. In addition, a second adjustment has been developed that incorporates knowledge of process chemistry and physics into a semi-empirical or hybrid model, with advantages over the purely empirical "black-box" approach previously described. These two variations of BP are described below.

13.3.2.1 Neural Networks and Simulated Annealing in Plasma Etch Modeling

Kim and May [1996] used neural networks to model etch rate, etch anisotropy, etch uniformity, and etch selectivity in a low-pressure form of plasma etching called reactive ion etching (RIE). The RIE process consisted of the removal of silicon dioxide films by a trifluoromethane (CHF3) and oxygen plasma in a Plasma Therm 700 series dual-chamber RIE system operating at 13.56 MHz. The process was initially characterized via a 2^4 factorial experiment with three center-point replications, augmented by a central composite design. The factors varied included pressure, RF power, and the two gas flow rates.

Data from this experiment were used to train modified BP neural networks, which resulted in improved prediction accuracy. The new technique modified the rule used to update network weights. The new rule combined a memory-based weight update scheme with the simulated annealing procedure used in combinatorial optimization. Neural network training rules adjust synapse strengths to satisfy the constraints given to the network. In the standard BP algorithm, the weight update mechanism at the (n + 1)th iteration is given by

w_{ijk}(n + 1) = w_{ijk}(n) + η Δw_{ijk}(n)        Equation (13.17)

Trang 15

where w_{ijk} is the connection strength between the jth neuron in layer (k − 1) and the ith neuron in layer k, Δw_{ijk} is the calculated change in that weight that reduces the error function of the network, and η is the learning rate. Equation 13.17 is called the generalized delta rule. Kim and May's new K-step prediction rule modified the generalized delta rule by using portions of previously stored weights in predicting the next set of weights. The new update scheme is expressed as

w_{ijk}(n + 1) = w_{ijk}(n) + η Δw_{ijk}(n) + γ_K w_{ijk}(n − K)        Equation (13.18)

The last term in this expression provides the network with long-term memory. The integer K determines the number of sets of previous weights stored, and the γ_K factor allows the system to place varying degrees of emphasis on weight sets from different training epochs. Typically, larger values of γ_K are assigned to more recent weight sets.

This memory-based weight update scheme was combined with a variation of simulated annealing. In thermodynamics, annealing is the slow cooling procedure that enables nature to find the minimum energy state. In neural network training, this is analogous to using the following function in place of the usual sigmoidal transfer function:

out_{ik} = 1/(1 + e^{−(net_{ik} − β_{ik})/T})        Equation (13.19)

where net_{ik} is the weighted sum of neural inputs, β_{ik} is the neural threshold, and T is the network "temperature." The temperature gradually decreases from an initial value T0 according to a decay factor λ (where λ < 1), effectively resulting in a time-varying gain for the network transfer function (Figure 13.9). Annealing the network at high temperature early in training leads to rapid location of the general vicinity of the global minimum of the error surface. The training algorithm remains within the attractive basin of the global minimum as the temperature decreases, preventing any significant uphill excursion. When used in conjunction with the K-step weight prediction scheme outlined previously, this approach is termed annealed K-step prediction.
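A sketch of this update scheme appears below; it is an illustrative reading of Equations 13.18 and 13.19, not Kim and May's code. The per-step decay T ← λT, the NumPy representation, and the default constants (taken from the best-result values quoted below) are assumptions.

    import numpy as np
    from collections import deque

    def annealed_sigmoid(net, beta, T):
        """Temperature-scaled transfer function (Equation 13.19)."""
        return 1.0 / (1.0 + np.exp(-(net - beta) / T))

    class AnnealedKStep:
        """Memory-based weight update of Equation 13.18 with annealing."""

        def __init__(self, w0, K=2, eta=0.5, gamma_K=0.08, T0=100.0, lam=0.99):
            self.eta, self.gamma_K = eta, gamma_K
            self.T, self.lam = T0, lam
            self.past = deque([w0.copy()], maxlen=K + 1)  # w(n) ... w(n-K)

        def step(self, delta_w):
            w_n = self.past[0]
            w_n_minus_K = self.past[-1]                   # oldest stored weights
            w_next = w_n + self.eta * delta_w + self.gamma_K * w_n_minus_K
            self.past.appendleft(w_next)                  # store for later reuse
            self.T *= self.lam                            # cool the transfer function
            return w_next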

FIGURE 13.9 Plot of simulated annealing-based transfer function as temperature is decreased. (Source: Kim, B. and May, G., 1996. Reactive Ion Etch Modeling Using Neural Networks and Simulated Annealing, IEEE Trans. Comp. Pack. Manuf. Tech. C, 19(1):3-8. With permission.)

BP neural networks were trained using this procedure with data from the 2^4 factorial array plus the three center-point replications. The remaining axial trials from the central composite characterization experiment were used as test data for the models. The annealed K-step training rule and the generalized delta rule were also compared. The RMS prediction errors are shown in Table 13.3, in which "% Improvement" refers to the improvement obtained with the annealed K-step training rule. Best results were achieved for K = 2, γ1 = 0.9, γ2 = 0.08, T0 = 100, and λ = 0.99. It is clear that annealed K-step prediction improves network predictive ability.

13.3.2.2 Semi-Empirical Process Modeling

Though neural process models offer advantages in accuracy and robustness over statistical models, they offer little insight into the physical understanding of the processes being modeled. This can be alleviated by neural process models that incorporate partial knowledge of the first-principles relationships inherent in the process. Two different approaches to accomplishing this include so-called hybrid neural networks and model transfer techniques.

Nami et al. [1997] developed a semi-empirical model of the metal-organic CVD (MOCVD) process based on hybrid neural networks. Their model was constructed by characterizing the MOCVD of titanium dioxide (TiO2) films by measuring the deposition rate over a range of deposition conditions. This was accomplished by varying susceptor and source temperature, flow rate of the argon carrier gas for the precursor (titanium tetra-iso-propoxide, or TTIP), and chamber pressure. Following characterization, a modified BP (hybrid) neural network was trained to determine the value of three adjustable fitting parameters in an analytical expression for the TiO2 deposition rate.

The first step in this hybrid modeling technique involves developing an analytical model. For TiO2 deposition via MOCVD, this was accomplished by applying the continuity equation to reactant concentration as the reactant of interest is transported from the bulk gas and incorporated into the growing film. Under these conditions and several key assumptions, the average deposition rate R for TiO2 is given by

Equation (13.20)

where R is expressed in micrometers per hour, T_inlet is the inlet gas temperature in degrees Kelvin, P is the chamber pressure (mtorr), P_e is the equilibrium vapor pressure of the precursor (mtorr), P_0 is the total bubbler pressure (mtorr), ν is the carrier gas flow rate (in standard cm3/min), Q is the total flow rate (in standard cm3/min), D is the diffusion coefficient of the reactant gas, δ is the boundary layer thickness, and K_D is the mass transfer coefficient given by K_D = A e^{−ΔE/kT}, where A is a pre-exponential factor related to the molecular "attempt rate" of the growth process, ΔE is the activation energy (cal/mol), k is Boltzmann's constant, and T is the susceptor temperature in degrees Kelvin. To predict R, the three unknown parameters that must be estimated are D, A, and ΔE. Estimating these parameters with hybrid neural networks is explained as follows.

In standard BP learning, gradient descent minimizes the network error E by adjusting the weights by an amount proportional to the derivative of the error with respect to the previous weights. The weight update expression is the generalized delta rule given by Equation 13.17, where

Δw_{ijk}(n) = −∂E/∂w_{ijk}        Equation (13.21)


The gradient of the error with respect to the weights is calculated for one pair of input–output patterns at a time. After each computation, a step is taken in the direction opposite to the error gradient, and the procedure is iterated until convergence is achieved.

In the hybrid approach, the network structure corresponding to the deposition of TiO2 by MOCVD has inputs of temperature, total flow rate, chamber pressure, source pressure, precursor flow rate, and the actual (measured) deposition rate R_a. The outputs are D, A, and ΔE. These are fed into Equation 13.20, the predicted deposition rate R_p is computed, and the result is compared with the actual (measured) deposition rate (see Figure 13.10). In this case, the error signal is defined as E = 0.5(R_p − R_a)². Because the expression for predicted deposition rate is differentiable, the new error gradient is computed by the chain rule as

∂E/∂w_{ijk} = (∂E/∂R_p) · (∂R_p/∂out_{ik}) · (∂out_{ik}/∂w_{ijk})        Equation (13.22)

where out_{ik} is the calculated output of the ith neuron in the kth layer. The first partial derivative in Equation 13.22 is (R_p − R_a), and the third is the same as that of standard BP. The second partial derivative is computed individually for each unknown parameter to be estimated. Referring to Equation 13.20, the partial derivative of R_p with respect to the activation energy is given by Equation (13.23).

… 5622 cal/mol, respectively. Once trained, the hybrid neural network was subsequently used to predict the deposition rate for five additional MOCVD runs, which constituted a test data set that was not part of the original experiment. The RMS error of the deposition rate model predictions using the estimated parameters for the five test vectors was only 0.086 µm/h. The hybrid neural network approach, therefore, represents a general-purpose methodology for deriving semi-empirical neural process models that take into account the underlying process physics.
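The chain-rule bookkeeping of Equation 13.22 can be sketched as follows. Because the full deposition-rate expression of Equation 13.20 is not reproduced here, the analytic model below is a hypothetical Arrhenius-style stand-in with the same fitted parameters (D, A, ΔE), and the middle factor is taken by finite differences; everything in the sketch is illustrative rather than the authors' implementation.

    import numpy as np

    def analytic_rate(params, T=700.0):
        """Hypothetical stand-in for Equation 13.20: an Arrhenius-style
        deposition-rate expression in the fitted parameters D, A, dE."""
        D, A, dE = params
        k = 1.987                          # gas constant in cal/(mol K), assumed
        K_D = A * np.exp(-dE / (k * T))    # mass transfer coefficient K_D = A e^(-dE/kT)
        return K_D * D / (K_D + D)         # series-limited rate, illustrative only

    def hybrid_error_gradient(net_outputs, r_actual, d_out_d_w):
        """Chain rule of Equation 13.22 for the hybrid network.

        net_outputs : current (D, A, dE) estimates from the network
        d_out_d_w   : the standard-BP factors dout_ik/dw_ijk, one per output
        """
        r_pred = analytic_rate(net_outputs)
        dE_dRp = r_pred - r_actual         # first factor: d(0.5(Rp - Ra)^2)/dRp
        grads = []
        for i in range(3):                 # second factor dRp/dout by differences
            eps = 1e-6 * max(abs(net_outputs[i]), 1.0)
            hi = list(net_outputs); hi[i] += eps
            lo = list(net_outputs); lo[i] -= eps
            dRp_dout = (analytic_rate(hi) - analytic_rate(lo)) / (2 * eps)
            grads.append(dE_dRp * dRp_dout * d_out_d_w[i])
        return grads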

Model transfer techniques attempt to modify physically based neural network process models to reflect specific pieces of processing equipment. Marwah and Mahajan [1999] proposed model transfer approaches for modeling a horizontal CVD reactor used in the epitaxial growth of silicon.

TABLE 13.3 Network Prediction Errors

Source: Kim, B. and May, G., 1996. Reactive Ion Etch Modeling Using Neural Networks and Simulated Annealing, IEEE Trans. Comp. Pack. Manuf. Tech. C, 19(1):3-8. With permission.


The goal was to develop an equipment model that incorporated process physics but was economical to build. The techniques investigated included (i) the difference method, in which a neural network was trained on the difference between the existing physical model (or "source" model) and equipment data; (ii) the source weights method, in which the final weights of the source model were used as the initial weights of the modified model; and (iii) the source input method, in which the source model output was used as an additional input to the modified network.

The starting point for model transfer was the development of a physical neural network (PNM) model trained on 98 data points generated from a process simulator utilizing first principles. Training data were obtained by running the simulator for various combinations of input parameters (i.e., inlet silane concentration, inlet velocity, susceptor temperature, and downstream position) using a statistically designed experiment. The numerical data were then split into 73 training vectors and 25 test vectors, and the physical neural network source model was trained using BP to predict silicon growth rate and tested against the validation points for the desired accuracy. The average relative training and testing errors obtained were 1.55% and 1.65%, respectively. The source model was then modified by training a new neural network with 25 extra experimentally derived data points obtained from a central composite experiment.

In the difference method, the modified neural network model was trained on the difference between the source and equipment data (see Figure 13.11(a)). The inherent expectation was that if this difference was a simpler function of the inputs than the pure equipment data, then fewer equipment data points would be required to build an accurate model. In the source weights method, the source model was retrained using the equipment data as test data. The final weights of the source model were then used as the initial weights of the modified model. The rationale for this approach was that training the source network with the experimental data as test data captures the common features of the source and final modified models. For the source input method, the source model is used as an additional input to the modified network (Figure 13.11(b)). Since the source model should be close to the final modified model, the source output should be some internal representation of the input data, which should be useful to the modified network. The expectation once again was that the additional input makes the learning task simpler for the modified network, thereby reducing the number of experimental data points required.
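The source input method amounts to a one-line data transformation before retraining; the sketch below illustrates it with a toy linear source model. The function name, feature layout, and data are assumptions for illustration.

    import numpy as np

    def source_input_features(source_model, X_equipment):
        """Source input method (Figure 13.11(b)): append the source model's
        prediction to each equipment data point as an extra input feature.

        source_model is any callable mapping an (n, d) array to n predicted
        growth rates; it stands in for the physically based source network.
        """
        source_pred = np.asarray(source_model(X_equipment)).reshape(-1, 1)
        return np.hstack([X_equipment, source_pred])

    # Hypothetical usage with a toy linear "source model" over four inputs
    # (inlet silane concentration, inlet velocity, susceptor temperature,
    # and downstream position); the modified network trains on X_aug.
    rng = np.random.default_rng(1)
    X_eq = rng.uniform(0.0, 1.0, (25, 4))          # 25 equipment data points
    source = lambda X: X @ np.array([0.5, -0.2, 1.1, 0.3])
    X_aug = source_input_features(source, X_eq)
    print(X_aug.shape)                             # (25, 5)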

These investigators found that the source input method yielded the most accurate results (an average relative error of only 2.58%, as compared to 14.62% for the difference method and 14.59% for the source weights method), and the amount of training data required to develop the model modified using this technique was approximately 25% of that required to develop a complete equipment model from scratch. Furthermore, the source model can be reused for developing additional models of other similar equipment.

FIGURE 13.10 Illustration of the hybrid neural network process modeling architecture. A BP neural network is trained to model three adjustable parameters (D, A, and ΔE) from an analytical expression for predicted deposition rate (R_p). (Source: Nami, Z., Misman, O., Erbil, A., and May, G., 1997. Semi-Empirical Neural Network Modeling of Metal-Organic Chemical Vapor Deposition, IEEE Trans. Semi. Manuf., 10(2):288-294. With permission.)


13.3.2.3 Process Modeling Using Modular Neural Networks

Natale et al. [1999] applied modular neural networks to develop a model of the atmospheric pressure CVD (APCVD) of doped silicon dioxide films, a critical step in dynamic random access memory (DRAM) chip fabrication at the Texas Instruments fabrication facility in Avezzano, Italy. Modular neural networks consist of a group of subnetworks, or modules, competing to learn different aspects of a problem. As shown in Figure 13.12(a), a "gating" network is applied to control the competition by assigning different regions of the input data space to different local modules. The gating network has as many outputs as

FIGURE 13.11 Schematic of two model modifiers: (a) difference method; and (b) source input method. (Source: Marwah, M. and Mahajan, R., 1999. Building Equipment Models Using Neural Network Models and Model Transfer Techniques, IEEE Trans. Semi. Manuf., 12(3):377-380. With permission.)

the number of modules. Both the modules and the gating network are trained by BP. The modular approach allows multiple networks to cooperate in solving the problem, as each module specializes in learning different regions of the input space. The outputs of each module are weighted by the gating network, thereby selecting a "winner" module whose output is closest to the target.

The deposition of both phosphosilicate glass (PSG) and boron-doped PSG (BPSG) was modeled using this approach. The inputs included 9 gas flows (three injectors each for silane, diborane, and phosphine gas), 3 injector temperatures, 6 nitrogen curtain flow rates, 12 thermocouple temperature readings, the chamber pressure, and a butterfly purge valve position reading. The outputs were the weight percentages of the boron and phosphorus dopants in the grown film, as well as the film thickness. An overall input/output schematic is shown in Figure 13.12(b). Since the data set was not homogeneous, but instead was formed by two classes representing both PSG and BPSG deposition, the modular approach was appropriate for this case. The final modular network developed in this investigation exhibited an excellent average relative error of approximately 1% in predicting the concentration of dopants and the thickness of the oxide.
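The gating computation is easy to sketch; the following toy rendering, with a linear-softmax gate and two stand-in modules, is an assumption-laden illustration rather than the network used at Avezzano.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def modular_forward(x, modules, gate_W):
        """Modular-network forward pass (Figure 13.12(a)).

        Each module produces its own estimate of the response; a gating
        network (here a linear-softmax layer, an assumption) weights the
        module outputs, and the combined output is the weighted sum.
        """
        module_outputs = np.array([m(x) for m in modules])  # one per module
        gate = softmax(gate_W @ x)                          # one weight per module
        return gate @ module_outputs, gate

    # Toy usage with two stand-in modules (e.g., PSG- and BPSG-specialized)
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, 8)                            # 8 process inputs
    modules = [lambda v: v.sum(), lambda v: v.mean()]
    gate_W = rng.normal(size=(2, 8))
    y, g = modular_forward(x, modules, gate_W)
    print(y, g)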

13.4 Optimization

In electronics manufacturing, neural network-based optimization has been undertaken from two different viewpoints. The first uses statistical methods to optimize the neural process models themselves, with the goal of determining the network structure and set of learning parameters that minimize network training error, prediction error, and training time. The second approach focuses on using neural process models to optimize a given semiconductor fabrication process or to determine specific process recipes for a desired response.

FIGURE 13.12 (a) Block diagram of a modular neural network. (b) Schematic of the location of the sensors inside the APCVD equipment. (Source: Natale, C. et al., 1999. Modeling of APCVD-Doped Silicon Dioxide Deposition Process by a Modular Neural Network, IEEE Trans. Semi. Manuf., 12(1):109-115. With permission.)

13.4.1 Network Optimization

The problem of optimizing network structure and learning parameters has been addressed by Kim and May [1994] for plasma etch modeling and by Han and May [1996] in modeling plasma-enhanced CVD. The former study performed a statistically designed experiment in which network structure and learning parameters were varied systematically, and used the results of this experiment to derive the optimal neural process model using the simplex search method. The latter study improved this technique by using genetic algorithms to search for the best combination of learning parameters.

13.4.1.1 Network Optimization Using Statistical Experimental Design and Simplex Search

Although they offer advantages over other methods, neural process models contain adjustable learning parameters whose proper values are unknown before model development. In addition, the structure of the network can be modified by adjusting the number of layers and the number of neurons per layer. As a result, the optimal network structure and values of network parameters for a given modeling application are not always clear. Systematically selecting an optimal set of parameters and network structure is an essential requirement for increasing the benefits of neural process modeling. Among the most critical optimality issues for neural process models are learning capability, predictive (or generalization) capability, and convergence speed.

Neural network architecture is determined by the number of layers and the number of neurons per layer. Usually, the number of input-layer and output-layer neurons is determined by the number of process inputs and responses in the modeling application. However, specifying the number of hidden-layer neurons is less obvious. It is generally understood that an excessively large number of hidden neurons significantly increases training time and gives poorer predictions for unfamiliar facts. Aside from network architecture, several other parameters affect the BP algorithm, including learning rate, initial weight range, momentum, and training tolerance.

A number of efforts to obtain the optimal network structure have been described [Kim and May, 1994]. Other efforts have focused on the effect of variations in learning parameters on network performance. The consideration of interactions between parameters, however, has been lacking. Furthermore, much of the existing effort in this area has focused on improving networks designed to perform classification and pattern recognition. The optimization of networks that model continuous nonlinear processes (such as those in semiconductor manufacturing) has not been addressed as thoroughly. Kim and May, however, presented an experiment designed to comprehensively evaluate all relevant learning and structural network parameters. The goal was to design an optimal neural network for a specific semiconductor manufacturing problem: modeling the etch rate of polysilicon in a CCl4 plasma.

To develop the optimal neural process model, these researchers designed a D-optimal experiment [Galil and Kiefer, 1980] to investigate the effect of six factors: the number of hidden layers, the number of neurons per hidden layer, training tolerance, initial weight range, learning rate, and momentum. This experiment determined how the structural and learning factors affect network performance and provided an optimal set of parameters for a given set of performance metrics. The network responses optimized were learning capability, predictive capability, and training time. The experiment consisted of two stages. In the first stage, statistical experimental design was employed to fully characterize the behavior of the etch process [May et al., 1991]. Etch rate data from these trials were used to train neural process models. Once trained, the models were used to predict the etch rate for 12 test wafers. Prediction error for these wafers was also computed, and these two measures of network performance, along with training time, were used as experimental responses to optimize the neural etch rate model as the structural and learning parameters were varied in the second stage (which consisted of the D-optimal design).

Independent optimization of each performance characteristic was then performed with the objective of minimizing training error, prediction error, and training time. A constrained multicriteria optimization technique based on the Nelder–Mead simplex search algorithm was implemented to do so. The optimal parameter set was first found for each criterion individually, irrespective of the optimal set for the other two. The results of the independent optimization are summarized in Table 13.4.

Several interesting interactions and trade-offs between the various parameters emerged in this study. One such trade-off can be visualized in two-dimensional contour plots such as those in Figures 13.13 and 13.14. Figure 13.13 plots training error against training tolerance and initial weight range with all other parameters set at their optimal values. Learning capability improves with decreased tolerance and wider weight distribution. Intuitively, the first result can be attributed to the increased precision required by a tight tolerance. Figure 13.14 plots network prediction error vs. the same variables as in Figure 13.13. As expected, optimum prediction is observed at high tolerance and narrow initial weight distribution. The latter result implies that the interaction between neurons within the restricted weight space during training is a primary stimulus for improving prediction. Thus, although learning degrades with a wider weight range, generalization is improved.

The parameter sets in Table 13.4 are useful for obtaining optimized performance for a single criterion, but can provide unacceptable results for the others. For example, the parameter set that minimizes training time yields high training and prediction errors. Because it is undesirable to train three different networks corresponding to each performance metric for a given neural process model, it is necessary to optimize all network inputs simultaneously. This is accomplished by implementing a suitable cost function such as

Cost = K1·σt² + K2·σp² + K3·T²        Equation (13.24)

where σt is the network training error, σp is the prediction error, and T is the training time. The constants K1, K2, and K3 represent the relative importance of each performance measure.

Prediction error is the most important quality characteristic. For modeling applications, a network need not be trained frequently, so training time is not a critical consideration. To optimize this cost function, the values chosen by Kim and May were K1 = 10, K2 = 100, and K3 = 1. Optimization was performed on the overall cost function. The results of this collective optimization appear in Table 13.5. The parameter values in this table yield the minimum cost according to Equation 13.24. This combination resulted in a training error of 412 Å/min, a prediction error of 340 Å/min, and a training time of 292 s. Although this represents only marginal performance, these values may be further tuned by adjusting the cost function constants K_i and the optimization constraints until suitable performance is achieved.
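In the same spirit, a Nelder–Mead minimization of Equation 13.24 can be sketched with SciPy. The surrogate evaluate_network below is entirely hypothetical (a real run would train a network at each candidate parameter set), and the unconstrained SciPy simplex stands in for the constrained multicriteria implementation described above.

    import numpy as np
    from scipy.optimize import minimize

    K1, K2, K3 = 10.0, 100.0, 1.0   # study's weights: prediction error dominates

    def evaluate_network(params):
        """Hypothetical surrogate: a real run would train a BP network with
        these learning parameters and measure the three responses."""
        tol, weight_range, rate, momentum = params
        sigma_t = 400.0 + 50.0 * tol + 30.0 * weight_range   # toy training error
        sigma_p = 350.0 - 20.0 * tol + 40.0 * weight_range   # toy prediction error
        t_train = 300.0 + 100.0 * rate + 50.0 * momentum     # toy training time
        return sigma_t, sigma_p, t_train

    def cost(params):
        """Multicriteria cost of Equation (13.24)."""
        sigma_t, sigma_p, t_train = evaluate_network(params)
        return K1 * sigma_t**2 + K2 * sigma_p**2 + K3 * t_train**2

    x0 = np.array([0.1, 1.0, 2.8, 0.35])   # tolerance, weight range, rate, momentum
    result = minimize(cost, x0, method="Nelder-Mead")
    print(result.x, result.fun)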

13.4.1.2 Network Optimization Using Genetic Algorithms

Although Kim and May had success with designed experiments and simplex search to optimize BP neural network learning, the effectiveness of the simplex method depends on its initial search point. With an improper starting point, performance degrades, and the algorithm is likely to be trapped in local optima. Theoretical analyses suggest that genetic algorithms quickly locate high-performance regions in extremely large and complex search spaces and possess some natural insensitivity to noise, which makes GAs potentially attractive for determining optimal neural network structure and learning parameters.

TABLE 13.4 Independently Optimized Network Inputs

Source: Kim, B. and May, G., 1994. An Optimal Neural Network Process Model for Plasma Etching, IEEE Trans. Semi. Manuf., 7(1):12-21. With permission.


Han and May [1996] applied GAs to obtain the optimal neural network structure and learning parameters for modeling PECVD. The goal was to design an optimal model for the PECVD of silicon dioxide as a function of gas flow rates, temperature, pressure, and RF power. The responses included film permittivity, refractive index, residual stress, uniformity, and impurity concentration. To obtain training data for developing the model, an experiment was performed to investigate the effect of the number of hidden-layer neurons, training tolerance, learning rate, and momentum. The network responses were learning and predictive capability.

FIGURE 13.13 Contour plot of training error (in Å/min) vs. training tolerance and initial weight range (learning rate = 2.8, momentum = 0.35, number of hidden neurons = 6, number of hidden layers = 1). Learning capability is shown to improve with decreased tolerance and wider weight distribution. (Source: Kim, B. and May, G., 1994. An Optimal Neural Network Process Model for Plasma Etching, IEEE Trans. Semi. Manuf., 7(1):12-21. With permission.)

The network responses were learning and predictive capability. Optimal parameter sets that minimized learning and prediction error were determined by genetic search, and this technique was compared with the simplex method.

Figure 13.15 shows the neural network optimization scheme. GAs generated possible candidates for neural parameters using an initial population of 50 potential solutions. Each element of the population was encoded into a 10-bit string to be manipulated by the genetic operators. Because four parameters were to be optimized (the number of hidden layer neurons, momentum, learning rate, and training tolerance), the concatenated total string length was 40 bits. The probabilities of crossover and mutation were set to 0.6 and 0.01, respectively.
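The encoding just described can be sketched in Python as follows. The bit lengths, population size, and operator probabilities come from the text; the decode ranges assigned to each parameter are illustrative assumptions, not Han and May's actual values.

```python
import random

# Sketch of the encoding in the text: four parameters, 10 bits each,
# concatenated into a 40-bit chromosome. Decode ranges are assumptions.
BITS = 10
RANGES = {                      # (low, high) for each parameter
    "hidden_neurons": (2, 12),
    "momentum": (0.0, 1.0),
    "learning_rate": (0.1, 3.0),
    "tolerance": (0.01, 0.2),
}
P_CROSSOVER, P_MUTATION = 0.6, 0.01     # probabilities from the text

def decode(chrom):
    """Map each 10-bit field linearly onto its parameter range."""
    params = {}
    for i, (name, (lo, hi)) in enumerate(RANGES.items()):
        field = chrom[i * BITS:(i + 1) * BITS]
        frac = int("".join(map(str, field)), 2) / (2**BITS - 1)
        params[name] = lo + frac * (hi - lo)
    params["hidden_neurons"] = int(round(params["hidden_neurons"]))
    return params

def crossover(a, b):
    """Single-point crossover applied with probability 0.6."""
    if random.random() < P_CROSSOVER:
        pt = random.randrange(1, len(a))
        return a[:pt] + b[pt:], b[:pt] + a[pt:]
    return a[:], b[:]

def mutate(chrom):
    """Flip each bit independently with probability 0.01."""
    return [bit ^ 1 if random.random() < P_MUTATION else bit for bit in chrom]

population = [[random.randint(0, 1) for _ in range(4 * BITS)]
              for _ in range(50)]             # initial population of 50
print(decode(population[0]))
```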

FIGURE 13.14 Contour plot of prediction error (in Å/min) vs. training tolerance and initial weight range (learning rate = 2.8, momentum = 0.35, number of hidden neurons = 6, number of hidden layers = 1). Optimum prediction occurs at high tolerance and narrow initial weight distribution. (Source: Kim, B. and May, G., 1994. An Optimal Neural Network Process Model for Plasma Etching, IEEE Trans. Semi. Manuf., 7(1):12-21. With permission.)

The performance of each individual of the population was evaluated with respect to the constraints imposed by the problem based on the evaluation of a fitness function. To search for parameter values that minimized both network training error and prediction error, the following performance index (PI) was implemented:

$$\mathrm{PI} = K_1 \sigma_t^2 + K_2 \sigma_p^2 \qquad \text{Equation (13.25)}$$

where σt is the RMS training error, σp is the RMS prediction error, and K1 and K2 represent the relative importance of each performance measure. The values chosen for these constants were K1 = 1 and K2 = 10. The desired output was reflected by the following fitness function:

$$F = \frac{1}{1 + \mathrm{PI}} \qquad \text{Equation (13.26)}$$

Maximization of F continued until a final solution was selected after 100 generations. If the optimal solution was not found, the solution with the best fitness value was selected.
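A minimal sketch of this objective, assuming only the weights stated above (the example error values are hypothetical):

```python
# Sketch of the GA objective in Equations 13.25 and 13.26. Only the
# weights K1 = 1 and K2 = 10 come from the text; the errors passed in
# the example call are hypothetical.
K1, K2 = 1.0, 10.0

def performance_index(sigma_t, sigma_p):
    """PI = K1*sigma_t^2 + K2*sigma_p^2 (Equation 13.25)."""
    return K1 * sigma_t**2 + K2 * sigma_p**2

def fitness(sigma_t, sigma_p):
    """F = 1 / (1 + PI) (Equation 13.26); the GA maximizes F."""
    return 1.0 / (1.0 + performance_index(sigma_t, sigma_p))

print(fitness(0.05, 0.08))   # small errors give fitness near 1
```

Note that the transformation F = 1/(1 + PI) maps the minimization of PI onto a bounded maximization problem, which is convenient for fitness-proportionate selection.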

Individual response neural network models were trained to predict PECVD silicon dioxide permittivity, refractive index, residual stress, nonuniformity, and impurity (H2O and SiOH) concentration. The results of genetically optimizing these neural process models are shown in Table 13.6. Analogous results for network optimization by the simplex method are given in Table 13.7. Examination of Tables 13.6 and 13.7 shows that the most significant differences between the two optimization algorithms occur in the number of hidden neurons and learning rates predicted to be optimal.

Tables 13.8 and 13.9 compare σt and σp for the two search methods. (In each table, the "% improvement" column refers to the improvement obtained by using genetic search, presumably the relative reduction in error achieved by the GA with respect to the simplex result.) Although in two cases involving training error minimization the simplex method proved superior, the genetically optimized networks exhibited vastly improved performance in nearly every category for prediction error minimization. The overall average improvement observed in using genetic optimization was 1.6% for network training error and 60.4% for prediction error.

The parameter sets called for in Tables 13.6 and 13.7 are useful for obtaining optimal performance for a single PECVD response, but provide suboptimal results for the remaining responses. For example, Table 13.6 indicates that seven hidden neurons are optimal for permittivity, refractive index, and stress, but only four hidden neurons are necessary for the nonuniformity and impurity concentration models. It is desirable to optimize network parameters for all responses simultaneously. Therefore, a multiple-output neural process model (which includes permittivity, stress, nonuniformity, H2O, and SiOH) was trained with that objective in mind.
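A minimal sketch of the multiple-output idea follows: all five responses share one hidden layer, so a single parameter set serves every output. The layer sizes, sigmoid activation, and linear output layer are illustrative assumptions, not Han and May's exact architecture.

```python
import numpy as np

# Sketch: a single BP network with five output neurons (permittivity,
# stress, nonuniformity, H2O, SiOH), so one set of network parameters
# is shared by all responses. Sizes and activation are assumptions.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 5, 7, 5           # 5 process inputs, 5 responses

W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))   # input-to-hidden weights
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden))  # hidden-to-output weights

def forward(x):
    """One forward pass; all five responses share the hidden layer."""
    h = 1.0 / (1.0 + np.exp(-W1 @ x))     # sigmoid hidden activations
    return W2 @ h                          # linear output layer

x = rng.uniform(0, 1, n_in)               # scaled process conditions
print(forward(x))                          # five predicted responses
```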

TABLE 13.5 Collectively Optimized Network Inputs
Source: Kim, B. and May, G., 1994. An Optimal Neural Network Process Model for Plasma Etching, IEEE Trans. Semi. Manuf., 7(1):12-21. With permission.


FIGURE 13.15 Block diagram of genetic optimization. Genetic algorithms are used to generate populations of neural networks with varying hidden layer neurons, momentum, learning rates, and training tolerances. The best network is selected after several generations according to a performance index. (Source: Han, S. and May, G., 1996. Optimization of Neural Network Structure and Learning Parameters Using Genetic Algorithms, Proc. IEEE Int. Conf. AI Tools, 8:200-206. With permission.)

TABLE 13.6 Network Parameters Optimized by Genetic Algorithms
Source: Han, S. and May, G., 1996. Optimization of Neural Network Structure and Learning Parameters Using Genetic Algorithms, Proc. IEEE Int. Conf. AI Tools, 8:200-206. With permission.

TABLE 13.7 Network Parameters Optimized by Simplex Search
Source: Han, S. and May, G., 1996. Optimization of Neural Network Structure and Learning Parameters Using Genetic Algorithms, Proc. IEEE Int. Conf. AI Tools, 8:200-206. With permission.



Table 13.10 shows optimized parameters for the multiple response PECVD model. Here, the optimal parameters derived by genetic and simplex search differ only slightly. However, they differ noticeably from the parameter sets optimized for individual responses. This is especially true for the number of hidden neurons and the learning rates. Further, slight differences in optimal parameters lead to significant differences in performance. This is indicated in Table 13.11, which shows training and prediction errors for the neural network models trained with the parameter sets in Table 13.10. If the improvements for the multiple response model are factored in, GAs provide an average benefit of 10.0% in training accuracy and 65.6% in prediction accuracy.

13.4.2 Process Optimization

A natural extension of process modeling is using models to optimize the processes (as opposed to the networks) or to generate specific process recipes. To illustrate the importance of process optimization, consider the PECVD of silicon dioxide films [Han et al., 1994]. In this process, one would like to grow a film with the lowest dielectric constant, best uniformity, minimal stress, and lowest impurity concentration possible (Figure 13.16). However, achieving these goals usually requires a series of trade-offs in growth conditions. Optimized neural process models can help a process engineer navigate the complex response surface and provide the necessary combination of process conditions (temperature, pressure, gas composition, etc.) or find the best compromise among potentially conflicting objectives to produce the desired results.
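The following sketch shows the general idea of recipe synthesis under such trade-offs: a trained process model is wrapped in a scalar figure of merit and the input space is searched for the best compromise. The stand-in model, targets, and weights here are all hypothetical.

```python
import random

# Sketch of recipe synthesis: wrap a trained process model in a scalar
# figure of merit and search the input space for the best compromise.
# `film_model` is a stand-in for a trained neural process model; the
# targets and weights are illustrative assumptions.
def film_model(recipe):
    # placeholder for a trained network: recipe -> (stress,
    # nonuniformity, permittivity); here a fake smooth response
    t, p, f = recipe
    return (100 * (t - 0.4), 5 + 10 * abs(p - 0.6), 3.5 + f)

def merit(recipe):
    """Weighted squared deviation from the desired film qualities."""
    stress, nonunif, perm = film_model(recipe)
    return stress**2 + 4.0 * nonunif**2 + 10.0 * (perm - 3.9)**2

# crude random search over normalized (temperature, pressure, flow)
best = min(([random.random() for _ in range(3)] for _ in range(5000)),
           key=merit)
print("suggested recipe (normalized):", best)
```

In practice the random search above would be replaced by one of the search procedures discussed below (genetic, Powell, simplex, or a hybrid), but the objective-wrapping step is the same.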

TABLE 13.8 Training Error Comparison of GA and Simplex Network Optimization
Source: Han, S. and May, G., 1996. Optimization of Neural Network Structure and Learning Parameters Using Genetic Algorithms, Proc. IEEE Int. Conf. AI Tools, 8:200-206. With permission.

TABLE 13.9 Prediction Error Comparison of GA and Simplex Network Optimization
Source: Han, S. and May, G., 1996. Optimization of Neural Network Structure and Learning Parameters Using Genetic Algorithms, Proc. IEEE Int. Conf. AI Tools, 8:200-206. With permission.

TABLE 13.10 Optimal Network Parameters for Multiple Response PECVD Model
Source: Han, S. and May, G., 1996. Optimization of Neural Network Structure and Learning Parameters Using Genetic Algorithms, Proc. IEEE Int. Conf. AI Tools, 8:200-206. With permission.


Han and May used neural process models for the PECVD process to synthesize other novel process recipes. To characterize the PECVD of silicon dioxide (SiO2) films, they first performed a 2^(5-1) fractional factorial experiment with three center-point replications [Box et al., 1978]. Data from these experiments were used to develop neural process models for SiO2 deposition rate, refractive index, permittivity, film stress, wet etch rate, uniformity, silanol concentration, and water concentration. Then the recipe synthesis procedure was performed to generate the necessary deposition conditions to obtain specific film qualities, including zero stress, 100% uniformity, low permittivity, and minimal impurity concentration. This synthesis procedure compared GAs to other search procedures for determining optimal process recipes. GAs offer the advantage of global search, but they are slow to converge.
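For reference, a 2^(5-1) half fraction can be generated by aliasing the fifth factor with the product of the other four. The generator E = ABCD used below is the conventional choice for a resolution-V half fraction; the text does not state which generator Han and May actually used.

```python
from itertools import product

# Sketch: build a 2^(5-1) fractional factorial design plus three
# center points. The generator E = ABCD is the usual half-fraction
# choice; Han and May's actual generator is not stated in the text.
runs = []
for a, b, c, d in product([-1, 1], repeat=4):
    e = a * b * c * d                 # aliased fifth factor
    runs.append((a, b, c, d, e))
runs += [(0, 0, 0, 0, 0)] * 3         # three center-point replications

print(len(runs), "runs")              # 16 factorial runs + 3 centers = 19
```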

More traditional approaches to optimization include calculus-based "hill-climbing" methods, where it is critical to find the best direction for searching to optimize the response surface. One method in this category is Powell's algorithm, which generates successive quadratic approximations of the space to be optimized. This method involves determining a set of n linearly independent, mutually conjugate directions (where n is the dimensionality of the search space). Successive line minimizations put the algorithm at the minimum of the quadratic approximation. For functions that are not exactly quadratic, the algorithm does not find the exact minimum, but repeated cycles of n line minimizations converge in due course to the minimum.
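As a minimal sketch, Powell's method is available in SciPy; the quadratic bowl below merely stands in for a process-model response surface, and the function and starting point are illustrative.

```python
from scipy.optimize import minimize

# Sketch: Powell's direction-set method via SciPy on a toy 2-D surface.
# The quadratic bowl stands in for a neural process model response.
def response(x):
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 0.5)**2

result = minimize(response, x0=[0.0, 0.0], method="Powell")
print(result.x)   # converges near (1.0, -0.5); no gradients required
```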

Another widely used searching technique is Nelder and Mead's simplex method. A regular simplex is defined as a set of (n + 1) mutually equidistant points in n-dimensional space. The main idea of the simplex method is to compare the values of the function to be optimized at the (n + 1) vertices of the simplex and move the simplex iteratively toward the optimal point.
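SciPy's Nelder-Mead implementation makes the (n + 1)-vertex idea concrete by accepting an explicit starting simplex; the toy surface and vertex coordinates below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: Nelder-Mead on the same toy surface as the Powell example,
# supplying the (n + 1) starting vertices explicitly (n = 2, so a
# triangle). The vertex coordinates are arbitrary illustrations.
def response(x):
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 0.5)**2

simplex = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
result = minimize(response, x0=simplex[0], method="Nelder-Mead",
                  options={"initial_simplex": simplex})
print(result.x)   # the triangle migrates toward (1.0, -0.5)
```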

In both Powell's algorithm and the simplex method, the initial starting search point profoundly affects overall performance. With an improper starting point, both algorithms are more likely to be trapped in local optima. However, if the proper initial point is given, the search is very fast. On the other hand, genetic algorithms search out the overall optimal area very fast, but converge slowly to a global optimum. Therefore, hybrid combinations of genetic algorithms with the other two algorithms (Powell's and simplex) sometimes offer improved results in both speed and accuracy. Hybrid algorithms start with genetic algorithms to initially sample the hypersurface and find the global optimum area. After some number of generations, the best point found by the GA is handed over to other algorithms as a starting point. With this initial point, both Powell's algorithm and the simplex method quickly locate the optimum.
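A minimal sketch of this hand-off, using a deliberately crude GA (truncation selection plus Gaussian mutation, no crossover, for brevity) and a toy surface in place of a neural process model:

```python
import random
from scipy.optimize import minimize

# Sketch of the hybrid strategy: a few generations of a crude GA locate
# the promising region, then the best point seeds a local simplex search.
# Population size, generation count, and the toy surface are assumptions.
def response(x):
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 0.5)**2

pop = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(50)]
for _ in range(20):                       # GA phase: selection + mutation
    pop.sort(key=response)
    survivors = pop[:25]
    children = [[g + random.gauss(0, 0.3) for g in random.choice(survivors)]
                for _ in range(25)]
    pop = survivors + children

seed = min(pop, key=response)             # hand-off point from the GA
result = minimize(response, x0=seed, method="Nelder-Mead")
print(seed, "->", result.x)               # local search refines the optimum
```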

Han and May compared five optimization methods to synthesize PECVD recipes: (i) genetic algorithms; (ii) Powell's method; (iii) Nelder and Mead's simplex algorithm; (iv) a hybrid combination of genetic algorithms and Powell's method; and (v) a hybrid combination of genetic algorithms and the simplex algorithm. The desired output characteristics of the PECVD SiO2 film to be produced are reflected by the following fitness function:

Equation (13.27)

TABLE 13.11 Training and Prediction Errors for Multiple Response PECVD Model
Source: Han, S. and May, G., 1996. Optimization of Neural Network Structure and Learning Parameters Using Genetic Algorithms, Proc. IEEE Int. Conf. AI Tools, 8:200-206. With permission.
