

Furui, S. and Rosenberg, A.E. "Speaker Verification." In Digital Signal Processing Handbook, ed. Vijay K. Madisetti and Douglas B. Williams. Boca Raton: CRC Press LLC, 1999.


48.7 Units of Speech for Representing Speakers

48.8 Input Modes
Text-Dependent (Fixed Passwords) • Text Independent (No Specified Passwords) • Text Dependent (Randomly Prompted Passwords)

48.9 Representations
Representations That Preserve Temporal Characteristics • Representations That Do Not Preserve Temporal Characteristics

48.10 Optimizing Criteria for Model Construction

48.11 Model Training and Updating

48.12 Signal Feature and Score Normalization Techniques
Signal Feature Normalization • Likelihood and Normalized Scores • Cohort or Speaker Background Models

48.13 Decision Process
Specifying Decision Thresholds and Measuring Performance • ROC Curves • Adaptive Thresholds • Sequential Decisions (Multi-Attempt Trials)

48.14 Outstanding Issues

Defining Terms

References

48.1 Introduction

Speaker recognition is the process of automatically extracting personal identity information by analysis of spoken utterances. In this section, speaker recognition is taken to be a general process, whereas speaker identification and speaker verification refer to specific tasks or decision modes associated with this process. Speaker identification refers to the task of determining who is speaking, and speaker verification is the task of validating a speaker's claimed identity.

Many applications have been considered for automatic speaker recognition. These include secure access control by voice, customizing services or information to individuals by voice, indexing or labeling speakers in recorded conversations or dialogues, surveillance, and criminal and forensic investigations involving recorded voice samples. Currently, the most frequently mentioned application


is access control. Access control applications include voice dialing, banking transactions over a telephone network, telephone shopping, database access services, information and reservation services, voice mail, and remote access to computers. Speaker recognition technology, as such, is expected

to create new services and make our daily lives more convenient. Another potentially important application of speaker recognition technology is its use for forensic purposes [24].

For access control and other important applications, speaker recognition operates in a speaker verification task decision mode. For this reason the section is entitled speaker verification. However, the term speaker recognition is used frequently in this section when referring to general processes. This section is not intended to be a comprehensive review of speaker recognition technology. Rather, it is intended to give an overview of recent advances and the problems that must be solved in the future. The reader is referred to papers by Doddington [4], Furui [10,11,12,13], O'Shaughnessy [39], and Rosenberg and Soong [48] for more general reviews.

48.2 Personal Identity Characteristics

A universal human faculty is the ability to distinguish one person from another by personal identity characteristics. The most prominent of these characteristics are facial and vocal features. Organized, scientific efforts to make use of personal identifying characteristics for security and forensic purposes began about 100 years ago. The most successful such effort was fingerprint classification, which has gained widespread use in forensic investigations.

Today, there is a rapidly growing technology based on biometrics, the measurement of human physiological or behavioral characteristics, for the purpose of identifying individuals or verifying the claimed or asserted identity of an individual [34]. The goal of these technological efforts is to produce completely automated systems for personal identity identification or verification that are convenient

to use and offer high performance and reliability. Some of the personal identity characteristics which have received serious attention are blood typing, DNA analysis, hand shape, retinal and iris patterns, and signatures, in addition to fingerprints, facial features, and voice characteristics. In general, characteristics that are subject to the least amount of contamination or distortion and variability provide the greatest accuracy and reliability. Difficulties arise, for example, with smudged fingerprints, inconsistent signature handwriting, recording and channel distortions, and inconsistent speaking behavior for voice characteristics. Indeed, behavioral characteristics, intrinsic to signature and voice features, although potentially an important source of identifying information, are also subject to large amounts of variability from one sample to another.

The demand for effective biometric techniques for personal identity verification comes from forensic and security applications. For security applications, especially, there is a great need for techniques that are not intrusive, that are convenient and efficient, and that are fully automated. For these reasons, techniques such as signature verification or speaker verification are attractive even if they are subject

to more sources of variability than other techniques. Speaker verification, in addition, is particularly useful for remote access, since voice characteristics are easily recorded and transmitted over telephone lines.

48.3 Vocal Personal Identity Characteristics

Both physiology and behavior underlie personal identity characteristics of the voice. Physiological correlates are associated with the size and configuration of the components of the vocal tract (see Fig. 48.1).

For example, variations in the size of vocal tract cavities are associated with characteristic variations in the spectral distributions in the speech signal for different speech sounds. The most prominent of these spectral features are the characteristic resonances associated with voiced speech sounds, known as formants [6]. Vocal cord variations are associated with the average pitch or fundamental frequency of voiced speech sounds.


FIGURE 48.1: Simplified diagram of the human vocal tract showing how speech sounds are generated. The size and shape of the articulators differ from person to person.

Variations in the velum and nasal cavities are associated with characteristic variations in the spectrum of nasalized speech sounds. Atypical anatomical variations, in the configuration of the teeth or the structure of the palate, are associated with atypical speech sounds such as lisps or abnormal nasality.

Behavioral correlates of speaker identity in the speech signal are more difficult to specify. "Low-level" behavioral characteristics are associated with individuality in articulating speech sounds, characteristic pitch contours, rhythm, timing, etc. Characteristics of speech that have to do with individual speech sounds, or phones, are referred to as "segmental", while those that pertain to speech phenomena over a sequence of phones are referred to as "suprasegmental". Phonetic or articulatory suprasegmental "settings" distinguishing speakers have been identified which are associated with characteristic "breathy", nasal, and other voice qualities [38]. "High-level" speaker behavioral characteristics refer to individual choice of words and phrases and other aspects of speaking styles.

48.4 Basic Elements of a Speaker Recognition System

The basic elements of a speaker recognition system are shown in Fig. 48.2. An input utterance from an unknown speaker is analyzed to extract speaker characteristic features. The measured features are compared with prototype features obtained from known speaker models.

Speaker recognition systems can operate in either an identification decision mode (Fig. 48.2(a)) or a verification decision mode (Fig. 48.2(b)). The fundamental difference between these two modes is the number of decision alternatives.

FIGURE 48.2: Basic structures of speaker recognition systems.

In the identification mode, a speech sample from an unknown speaker is analyzed and compared with models of known speakers. The unknown speaker is identified as the speaker whose model best matches the input speech sample. In the "closed set" identification mode, the number of decision alternatives is equal to the size of the population. In the "open set" identification mode, a reference model for the unknown speaker may not exist. In this case, an additional alternative, "the unknown does not match any of the models", is required.

In the verification decision mode, an identity claim is made by or asserted for the unknown speaker. The unknown speaker's speech sample is compared with the model for the speaker whose identity is claimed. If the match is good enough, as indicated by passing a threshold test, the identity claim is verified. In the verification mode there are two decision alternatives, accept or reject the identity claim, regardless of the size of the population. Verification can be considered as a special case of the "open set" identification mode in which the known population size is one.
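To make the decision logic concrete, the following minimal Python sketch contrasts the three decision modes just described. The function and variable names are our own illustrations, not from the chapter:

```python
def identify_closed_set(scores):
    """Closed-set identification: the unknown speaker is identified as
    the speaker whose model best matches the input speech sample.
    `scores` maps speaker id -> match score (higher = better)."""
    return max(scores, key=scores.get)

def verify(score, threshold):
    """Verification: two alternatives regardless of population size --
    accept the identity claim iff the score passes the threshold test."""
    return score >= threshold

def identify_open_set(scores, threshold):
    """Open-set identification: the best match must also pass a
    threshold; otherwise the unknown matches none of the models."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```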

Crucial to the operation of a speaker recognition system is the establishment and maintenance of speaker models. One or more enrollment sessions are required in which training utterances are obtained from known speakers. Features are extracted from the training utterances and compiled into models. In addition, if the system operates in the "open set" or verification decision mode, decision thresholds must also be set. Many speaker recognition systems include an updating facility in which test utterances are used to adapt speaker models and decision thresholds.

A list of terms commonly found in the speaker recognition literature can be found at the end of this chapter. In the remaining sections of the chapter, the following subjects are treated: how speaker characteristic features are extracted from speech signals, how these features are used to represent speakers, how speaker models are constructed and maintained, how speech utterances from unknown speakers are compared with speaker models and scored to make speaker recognition decisions, and how speaker verification performance is measured. The chapter concludes with a discussion of outstanding issues in speaker recognition.

48.5 Extracting Speaker Information from the Speech Signal

Explicit measurements of speaker characteristics in the speech signal are often difficult to carry out. Segmenting, labeling, and measuring specific segmental speech events that characterize speakers, such as nasalized speech sounds, is difficult because of variable speech behavior and variable and distorted recording and transmission conditions. Overall qualities, such as breathiness, are difficult to correlate with specific speech signal measurements and are subject to variability in the same way as segmental speech events.

Even though voice characteristics are difficult to specify and measure explicitly, most characteristics are captured implicitly in the kinds of speech measurements that can be performed relatively easily. Such measurements as short-time and long-time spectral energy, overall energy, and fundamental frequency are relatively easy to obtain. They can often resolve differences in speaker characteristics surpassing human discriminability. Although subject to distortion and variability, features based on these analysis tools form the basis for most automatic speaker recognition systems.

The most important analysis tool is short-time spectral analysis. It is no coincidence that short-time spectral analysis also forms the basis for most speech recognition systems [42]. Short-time spectral analysis not only resolves the characteristics that differentiate one speech sound from another, but also many of the characteristics already mentioned that differentiate one speaker from another. There are two principal modes of short-time spectral analysis: filter bank analysis and linear predictive coding (LPC) analysis.

In filter bank analysis, the speech signal is passed through a bank of bandpass filters covering the available range of frequencies associated with the signal. Typically, this range is 200 to 3,000 Hz for telephone band speech and 50 to 8,000 Hz for wide band speech. A typical filter bank for wide band speech contains 16 bandpass filters spaced uniformly 500 Hz apart. The output of each filter is usually implemented as a windowed, short-time Fourier transform [using fast Fourier transform (FFT) techniques] at the center frequency of the filter. The speech is typically windowed using a 10 to 30 ms Hamming window. Instead of uniformly spacing the bandpass filters, a nonuniform spacing is often carried out reflecting perceptual criteria that allot approximately equal perceptual contributions to each such filter. Such mel scale or bark scale filters [42] provide a spacing linear in frequency below 1000 Hz and logarithmic above.
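As an illustration of this analysis, here is a minimal numpy sketch of a mel-scale filter bank applied to windowed, short-time FFT frames. The frame length, hop, FFT size, and filter count are assumed values for wide band speech, not prescriptions from the chapter:

```python
import numpy as np

def mel(f):          # Hz -> mel (linear below ~1 kHz, logarithmic above)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):      # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=16, n_fft=512, sr=16000, fmin=50.0, fmax=8000.0):
    """Triangular bandpass filters spaced uniformly on the mel scale."""
    edges = mel_inv(np.linspace(mel(fmin), mel(fmax), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    return fb

def filterbank_energies(signal, sr=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Windowed, short-time FFT analysis followed by the mel filter bank."""
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    fb = mel_filterbank(sr=sr, n_fft=n_fft)
    frames = []
    for start in range(0, len(signal) - flen + 1, hop):
        frame = signal[start:start + flen] * np.hamming(flen)
        spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        frames.append(np.log(fb @ spectrum + 1e-10))  # log filter-bank energies
    return np.array(frames)
```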

LPC-based spectral analysis is widely used for speech and speaker recognition. The LPC model of the speech signal specifies that a speech sample at time t, s(t), can be represented as a linear sum of the p previous samples plus an excitation term, as follows:

$$ s(t) = a_1 s(t-1) + a_2 s(t-2) + \cdots + a_p s(t-p) + G\,u(t) \qquad (48.1) $$

The LPC coefficients, $a_i$, are computed by solving a set of linear equations resulting from the minimization of the mean-squared error between the signal at time t and the linearly predicted estimate

of the signal. Two generally used methods for solving the equations, the autocorrelation method and the covariance method, are described in Rabiner and Juang [42].

The LPC representation is computationally efficient and easily convertible to other types of spectral representations. While the computational advantage is less important today than it was for early digital implementations of speech and speaker recognition systems, LPC analysis competes well with other spectral analysis techniques and continues to be widely used.
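The following sketch implements the autocorrelation method with the Levinson-Durbin recursion to obtain the coefficients $a_i$ of Eq. (48.1); see Rabiner and Juang [42] for the full derivation. The prediction order p = 12 is an assumed typical value:

```python
import numpy as np

def lpc_autocorrelation(frame, p=12):
    """LPC coefficients a_1..a_p of Eq. (48.1) by the autocorrelation
    method, solved with the Levinson-Durbin recursion."""
    frame = frame * np.hamming(len(frame))
    # autocorrelation r[k] = sum_t frame[t] * frame[t+k]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)          # error-filter coefficients, a[0] = 1
    a[0] = 1.0
    err = r[0]                   # prediction error energy
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:], err           # predictor coefficients a_i, residual energy
```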

An important spectral representation for speech and speaker recognition is the cepstrum. The cepstrum is the (inverse) Fourier transform of the log of the signal spectrum. Thus, the log spectrum can be represented as a Fourier series expansion in terms of a set of cepstral coefficients $c_n$, $\log |S(\omega)| = \sum_n c_n e^{-jn\omega}$. An important property of this representation is that the contributions of two effects which are products in the spectral domain are additive in the cepstral domain. Also, pitch harmonics, which produce prominent ripples in the spectral envelope, are associated with high-order cepstral coefficients. Thus, the set of cepstral coefficients truncated, for example, at order 12 to 24 can be used to reconstruct a relatively smooth version of the speech spectrum. The spectral envelope obtained is associated with vocal tract resonances and does not have the variable, oscillatory effects of the pitch excitation. It is considered that one of the reasons the cepstral representation has been found to be more effective than other representations for speech and speaker recognition is this property of separability of source and tract. Since the excitation function is considered to have speaker-dependent characteristics, it may seem contradictory that a representation which largely removes these effects works well for speaker recognition. However, in short-time spectral analysis the effects of the source spectrum are highly variable, so that they are not especially effective in providing consistent representations of the source spectrum.
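A minimal sketch of the cepstral computation described above, using numpy FFTs: the low-order coefficients are kept, and a smoothed log-spectral envelope can be reconstructed from them. The FFT size and truncation order are assumed values:

```python
import numpy as np

def cepstrum(frame, n_fft=512, order=16):
    """Real cepstrum: inverse Fourier transform of the log magnitude
    spectrum. Truncating to low-order coefficients keeps the smooth
    vocal-tract envelope and discards pitch-harmonic ripple."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    log_spec = np.log(spectrum + 1e-10)
    c = np.fft.irfft(log_spec, n_fft)      # cepstral coefficients c_n
    return c[:order + 1]

def smooth_envelope(c, n_fft=512):
    """Reconstruct a smoothed log spectrum from truncated cepstral
    coefficients (Fourier series expansion of the log spectrum)."""
    padded = np.zeros(n_fft)
    padded[:len(c)] = c
    padded[-(len(c) - 1):] = c[1:][::-1]   # enforce even symmetry
    return np.fft.rfft(padded).real
```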

Other spectral features, such as PARCOR coefficients, log area ratio coefficients, and LSP (line spectral pair) coefficients, have been used for both speech and speaker recognition [42]. Generally speaking, however, the cepstral representation is most widely used and is usually associated with better speaker recognition performance than other representations.

Cruder measures of spectral energy, such as waveform zero-crossing or level-crossing measurements, have also been used for speech and speaker recognition in the interest of saving computation, with some success.

Additional features have been proposed for speaker recognition which are not used often or considered to be marginally useful for speech recognition. For example, pitch and energy features, particularly when measured as a function of time over a sufficiently long utterance, have been shown to be useful for speaker recognition [27]. Such time sequences or "contours" are thought to represent characteristic speaking inflections and rhythms associated with individual speaking behavior. Pitch and energy measurements have an advantage over short-time spectral measurements in that they are more robust to many different kinds of transmission and recording variations and distortions, since they are not sensitive to spectral amplitude variability. However, since speaking behavior can be highly variable due to both voluntary and involuntary activity, pitch and energy can acquire more variability than short-time spectral features and are more susceptible to imitation.

The time course of feature measurements, as represented by so-called feature contours, provides valuable speaker characterizing information. This is because such contours provide overall, suprasegmental information characterizing speaking behavior and also because they contain information on a more local, segmental time scale describing transitions from one speech sound to another. This latter kind of information can be obtained explicitly by measuring the local trajectory in time of a measured feature at each analysis frame. Such measurements can be obtained by averaging successive differences of the feature in a window around each analysis frame, or by fitting a polynomial in time to the successive feature measurements in the window. The window size is typically 5 to 9 analysis frames. The polynomial fit provides a less noisy estimate of the trajectory than averaging successive differences. The order of the polynomial is typically 1 or 2, and the polynomial coefficients are called delta- and delta-delta-feature coefficients. It has been shown in experiments that such dynamic feature measurements are fairly uncorrelated with the original static feature measurements and provide improved speech and speaker recognition performance [9].
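As a sketch of the polynomial-fit approach, the first-order (delta) coefficients below come from a linear least-squares fit over a 5-frame window; applying the same function to the deltas yields delta-delta coefficients. The window size and edge padding are assumed choices:

```python
import numpy as np

def delta_features(features, window=2):
    """First-order (delta) coefficients from a linear least-squares fit
    over 2*window+1 analysis frames, equivalent to the standard
    regression formula d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2).
    `features` is a (frames x coefficients) array."""
    T = len(features)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    deltas = np.zeros_like(features)
    for t in range(T):
        for k in range(1, window + 1):
            deltas[t] += k * (padded[t + window + k] - padded[t + window - k])
    return deltas / denom

# delta-delta coefficients: apply the same regression to the deltas
# ddeltas = delta_features(delta_features(cepstra))
```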

48.6 Feature Similarity Measurements

Much of the originality and distinctiveness in the design of a speaker recognition system is found in how features are combined and compared with reference models. Underlying this design is the basic representation of features in some space and the formation of a distance or distortion measurement to use when one set of features is compared with another. The distortion measure can be used to partition the feature vectors representing a speaker's utterances into regions representative of the most prominent speech sounds for that speaker, as in the vector quantization (VQ) codebook representation (Section 48.9.2). It can be used to segment utterances into speech sound units. And it can be used to score an unknown speaker's utterances against a known speaker's utterance models.

A general approach for calculating a distance between two feature vectors is to make use of a distance metric from the family of $L_p$ norm distances $d_p$, such as the absolute value of the difference between the feature vectors:

$$ d_p\left(f, f'\right) = \left( \sum_{i=1}^{D} \left| f_i - f'_i \right|^p \right)^{1/p} $$

where $f_i$, $f'_i$, $i = 1, 2, \ldots, D$ are the coefficients of two feature vectors $f$ and $f'$. The feature vectors, for example, could comprise the filter-bank outputs or cepstral coefficients described in the previous section. (It is not common, however, to use filter bank outputs directly, as previously mentioned, because of the variability associated with these features due to harmonics from the pitch excitation.)

For example, a weighted Euclidean distance distortion measure for cepstral features of the form

$$ d^2\left(c, c'\right) = \sum_{i=1}^{D} \frac{\left(c_i - c'_i\right)^2}{\sigma_i^2} $$

where $\sigma_i^2$ is an estimate of the variance of the $i$th coefficient, has been shown to provide good performance for both speech and speaker recognition. A still more general formulation is the Mahalanobis distance formulation, which accounts for interactions between coefficients with a full covariance matrix.
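The three distance measures just discussed can be sketched in a few lines of numpy; these are straightforward textbook formulas, not the chapter's own code:

```python
import numpy as np

def lp_distance(f, f2, p=1):
    """L_p norm distance between two feature vectors (p=1: sum of
    absolute differences, p=2: Euclidean)."""
    return np.sum(np.abs(f - f2) ** p) ** (1.0 / p)

def weighted_cepstral_distance(c, c2, var):
    """Euclidean distance with each cepstral coefficient weighted by
    the inverse of an estimate of its variance."""
    return np.sum((c - c2) ** 2 / var)

def mahalanobis_distance(c, c2, cov):
    """Full-covariance generalization accounting for interactions
    between coefficients."""
    diff = c - c2
    return float(diff @ np.linalg.inv(cov) @ diff)
```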

An alternate approach to comparing vectors in a feature space with a distortion measurement is to establish a probabilistic formulation of the feature space. It is assumed that the feature vectors in a subspace associated with, for example, a particular speech sound for a particular speaker, can be specified by some probability distribution. A common assumption is that the feature vector is a random variable $x$ whose probability distribution is Gaussian:

$$ p(x|\lambda) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] \qquad (48.7) $$

where $\lambda = \{\mu, \Sigma\}$ denotes the model mean vector and covariance matrix.

When $x$ is a feature vector sample, $p(x|\lambda)$ is referred to as the likelihood of $x$ with respect to $\lambda$. Suppose there is a population of $n$ speakers, each modeled by a Gaussian distribution of feature vectors, $\lambda_i$, $i = 1, 2, \ldots, n$. In the maximum likelihood formulation, a sample $x$ is associated with speaker $I$ if

$$ p\left(x|\lambda_I\right) > p\left(x|\lambda_i\right), \quad \text{for all } i \neq I \qquad (48.8) $$

where $p(x|\lambda_i)$ is the likelihood of the test vector $x$ for speaker model $\lambda_i$. It is common to use log likelihoods to evaluate Gaussian models. From Eq. (48.7),

$$ \log p(x|\lambda) = -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) - \frac{1}{2}\log|\Sigma| - \frac{D}{2}\log 2\pi $$

A more flexible model represents the feature distribution as a Gaussian mixture, a weighted sum of $M$ component Gaussian densities. The weights $w_i$ are constrained so that $\sum_{i=1}^{M} w_i = 1$. The model parameters $\lambda$ are

$$ \lambda = \{\mu_i, \Sigma_i, w_i,\; i = 1, 2, \ldots, M\} \qquad (48.11) $$

The Gaussian mixture probability function is capable of approximating a wide variety of smooth, continuous probability functions.
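A numpy sketch of Gaussian and Gaussian-mixture scoring, implementing the log-likelihood evaluation and the maximum likelihood decision of Eq. (48.8); the model containers are our own illustrative structures:

```python
import numpy as np

def gaussian_loglik(x, mu, cov):
    """Log likelihood log p(x|lambda) for a single Gaussian model."""
    D = len(x)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.inv(cov) @ diff
                   + logdet + D * np.log(2.0 * np.pi))

def gmm_loglik(x, weights, mus, covs):
    """Log likelihood under a Gaussian mixture model with parameters
    as in Eq. (48.11): p(x|lambda) = sum_i w_i N(x; mu_i, Sigma_i)."""
    logs = [np.log(w) + gaussian_loglik(x, m, c)
            for w, m, c in zip(weights, mus, covs)]
    return float(np.logaddexp.reduce(logs))

def ml_speaker(x, models):
    """Maximum likelihood decision, Eq. (48.8): associate sample x with
    the speaker whose model gives the largest likelihood.
    `models` maps speaker id -> (weights, mus, covs)."""
    return max(models, key=lambda spk: gmm_loglik(x, *models[spk]))
```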

48.7 Units of Speech for Representing Speakers

An important consideration in the design of a speaker recognition system is the choice of a speech unit to model a speaker's utterances. The choice of units includes phonetic or linguistic units such as whole sentences or phrases, words, syllables, and phone-like units. It also includes acoustic units such as subword segments, segmented from utterances and labeled on the basis of acoustic rather than phonetic criteria. Some speaker recognition systems model speakers directly from single feature vectors rather than through an intermediate speech unit representation. Such systems usually operate in a text-independent mode (see Sections 48.8 and 48.9) and seek to obtain a general model of a speaker's utterances from a usually large number of training feature vectors. Direct models might include long-time averages, VQ codebooks, segment and matrix quantization codebooks, or Gaussian mixture models of the feature vectors.

Most speech recognizers of moderate to large vocabulary are based on subword units such as phones, so that large numbers of utterances transcribed as sequences of phones can be represented as concatenations of phone models. For speaker recognition, there is no absolute need to represent utterances in terms of phones or other phonetically based units, because there is no absolute need to account for the linguistic or phonetic content of utterances in order to build speaker recognition models. Generally speaking, systems in which phonetic representations are used are more complex than other representations because they require phonetic transcriptions for both training and testing utterances and because they require accurate and reliable segmentations of utterances in terms of these units. The case in which phonetic representations are required for speaker recognition is the same as for speech recognition: where there is a need to represent utterances as concatenations of smaller units. Speaker recognition systems based on subword units have been described by Rosenberg et al. [46] and Matsui and Furui [31].

48.8 Input Modes

Speaker recognition systems typically operate in one of two input modes: text dependent or text independent. In the text-dependent mode, speakers must provide utterances of the same text for both training and recognition trials. In the text-independent mode, speakers are not constrained to provide specific texts in recognition trials. Since the text-dependent mode can directly exploit the voice individuality associated with each phoneme or syllable, it generally achieves higher recognition performance than the text-independent mode.

48.8.1 Text-Dependent (Fixed Passwords)

The structure of a system using fixed passwords is rather simple; input speech is time aligned with reference templates or models created by using training utterances for the passwords. If the fixed passwords are different from speaker to speaker, the difference can also be used as additional individual information. This helps to increase performance.

48.8.2 Text Independent (No Specified Passwords)

There are several applications in which predetermined passwords cannot be used. In addition, human beings can recognize speakers irrespective of the content of the utterance. Therefore, text-independent methods have recently been actively investigated. Another advantage of text-independent recognition is that it can be done sequentially, until a desired significance level is reached, without the annoyance of having to repeat passwords again and again.

48.8.3 Text Dependent (Randomly Prompted Passwords)

Both text-dependent and text-independent methods have a potentially serious problem. Namely, these systems can be defeated because someone who plays back the recorded voice of a registered speaker uttering key words or sentences into the microphone could be accepted as the registered speaker. To cope with this problem, there are methods in which a small set of words, such as digits, are used as key words, and each user is prompted to utter a given sequence of key words that is randomly chosen every time the system is used [20,47].
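A toy sketch of this randomly prompted key-word scheme: a fresh digit sequence is drawn for each trial, and acceptance requires both that the decoded text match the prompt and that the speaker score pass the verification threshold. All names here are illustrative, not from the cited systems [20,47]:

```python
import secrets

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def random_prompt(length=5):
    """Draw a fresh random key-word sequence for each trial so a
    recorded voice cannot anticipate the prompt."""
    return [secrets.choice(DIGITS) for _ in range(length)]

def accept(decoded_words, prompt, speaker_score, threshold):
    """Accept only if the utterance matches the prompted text AND the
    speaker verification score passes the threshold."""
    return decoded_words == prompt and speaker_score >= threshold
```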

Recently, a text-prompted speaker recognition method was proposed in which password sentences are completely changed every time [31,33]. The system accepts the input utterance only when it judges that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence they will be prompted to say. This method not only can accurately recognize speakers, but can also reject utterances whose text differs from the prompted text, even if it is uttered by a registered speaker. Thus, a recorded and played-back voice can be correctly rejected.


48.9 Representations

48.9.1 Representations That Preserve Temporal Characteristics

The most common approach to automatic speaker recognition in the text-dependent mode uses representations that preserve temporal characteristics. Each speaker is represented by a sequence of feature vectors (generally, short-term spectral feature vectors), analyzed for each test word or phrase. This approach is usually based on template matching techniques in which the time axes of an input speech sample and each reference template of the registered speakers are aligned, and the similarity between them, accumulated from the beginning to the end of the utterance, is calculated.

Trial-to-trial timing variations of utterances of the same talker, both local and overall, can be normalized by aligning the analyzed feature vector sequence of a test utterance to the template feature vector sequence using a dynamic programming (DP) time warping algorithm, or DTW [11,42]. Since the sequence of phonetic events is the same for training and testing, there is an overall similarity among these sequences of feature vectors. Ideally, the intra-speaker differences are significantly smaller than the inter-speaker differences.
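A minimal DTW sketch (illustrative only; practical systems add slope constraints and path weights as in [11,42]) that aligns a test feature-vector sequence to a reference template and accumulates local distances along the best path:

```python
import numpy as np

def dtw_distance(test, template):
    """Dynamic time warping: align the test feature-vector sequence to
    the reference template and accumulate local frame distances along
    the best path, normalizing trial-to-trial timing variation."""
    T, R = len(test), len(template)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, R + 1):
            local = np.linalg.norm(test[i - 1] - template[j - 1])
            D[i, j] = local + min(D[i - 1, j],      # insertion
                                  D[i, j - 1],      # deletion
                                  D[i - 1, j - 1])  # match
    return D[T, R] / (T + R)   # length-normalized overall distance
```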

Figure 48.3 shows an example of a typical structure of a DTW-based system [9]. Initially, 10 LPC cepstral coefficients are extracted every 10 ms from a short sentence of speech. The spectral equalization technique, which is described in Section 48.12.1, is applied to each cepstral coefficient to compensate for transmission distortion and intraspeaker variability. In addition to the normalized cepstral coefficients, delta-cepstral and delta-delta-cepstral coefficients (polynomial expansion coefficients) are extracted every 10 ms. The time function of the set of parameters is brought into time registration with the reference template in order to calculate the distance between them. The overall distance is then compared with a threshold for the verification decision.

Another approach using representations that preserve temporal characteristics is based on the HMM (hidden Markov model) technique [42]. In this approach, a reference model for each speaker is represented by an HMM instead of directly using a time series of feature vectors. An HMM can efficiently model statistical variation in spectral features. Therefore, HMM-based methods have achieved significantly better recognition accuracies than the DTW-based methods [36,47,53].

48.9.2 Representations That Do Not Preserve Temporal Characteristics

In a text-independent system, the words or phrases used in recognition trials generally cannot be predicted. Therefore, it is impossible to model or match speech events at the level of words or phrases. Classical text-independent speaker recognition techniques are based on measurements for which the time dimension is collapsed. Recently, text-independent speaker verification techniques based on short-duration speech events have been studied. The new approaches extract and measure salient acoustic and phonetic events. The bases for these approaches lie in statistical techniques for extracting and modeling reduced sets of optimally representative feature vectors or feature vector sequences or segments. These techniques fall under the related categories of vector quantization (VQ), matrix and segment quantization, probabilistic mixture models, and HMM.

A set of short-term training feature vectors of a speaker can be used directly to represent the essential characteristics of that speaker. However, such a direct representation is impractical when the number of training vectors is large, since the memory and amount of computation required become prohibitively large. Therefore, efficient ways of compressing the training data have been tried using VQ techniques.

In this method, VQ codebooks consisting of a small number of representative feature vectors are used as an efficient means of characterizing speaker-specific features [25,29,45,52]. A speaker-specific codebook is generated by clustering the training feature vectors of each speaker. In the recognition stage, an input utterance is vector-quantized using the codebook of each reference speaker, and the accumulated quantization distortion is used to score the match.
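A sketch of the VQ approach: k-means clustering builds a speaker-specific codebook from training vectors, and the average quantization distortion of a test utterance against each reference codebook serves as the match score. The codebook size and iteration count are assumed values, and plain k-means stands in for whatever clustering procedure the cited systems [25,29,45,52] used:

```python
import numpy as np

def train_codebook(train_vectors, codebook_size=64, iters=20, seed=0):
    """Generate a speaker-specific VQ codebook by k-means clustering
    of the speaker's training feature vectors."""
    rng = np.random.default_rng(seed)
    code = train_vectors[rng.choice(len(train_vectors), codebook_size,
                                    replace=False)]
    for _ in range(iters):
        # assign each training vector to its nearest codeword
        d = np.linalg.norm(train_vectors[:, None] - code[None], axis=2)
        nearest = d.argmin(axis=1)
        # move each codeword to the centroid of its assigned vectors
        for k in range(codebook_size):
            members = train_vectors[nearest == k]
            if len(members):
                code[k] = members.mean(axis=0)
    return code

def vq_distortion(test_vectors, codebook):
    """Average quantization distortion of a test utterance against a
    reference speaker's codebook (lower = better match)."""
    d = np.linalg.norm(test_vectors[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()
```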
