
MINISTRY OF EDUCATION AND TRAINING

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

Hanoi - 2019


MINISTRY OF EDUCATION AND TRAINING

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

Major: Computer Science

Code: 9480101

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

SUPERVISORS:

1. Assoc. Prof. Dr. Nguyen Quoc Cuong

2. Dr. Nguyen Cong Phuong

Hanoi - 2019


DECLARATION OF AUTHORSHIP

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, February 2019

Ph.D. Student

Duong Thi Hien Thanh

SUPERVISORS

Assoc. Prof. Dr. Nguyen Quoc Cuong        Dr. Nguyen Cong Phuong


ACKNOWLEDGEMENT

This thesis has been written during my doctoral study at the International Research Institute Multimedia, Information, Communication, and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank the numerous people who have contributed towards shaping this thesis.

First and foremost, I would like to express my most sincere gratitude to my supervisors, Assoc. Prof. Nguyen Quoc Cuong and Dr. Nguyen Cong Phuong, for their great guidance and support throughout my Ph.D. study. I am grateful to them for devoting their precious time to discussing research ideas, proofreading, and explaining how to write good research papers. I would like to thank them for encouraging my research and empowering me to grow as a research scientist. I could not have imagined having better advisors and mentors for my Ph.D. study.

I would like to express my appreciation to my supervisor in my Master's course, Prof. Nguyen Thanh Thuy, School of Information and Communication Technology - HUST, and Dr. Nguyen Vu Quoc Hung, my supervisor in my Bachelor's course at Hanoi National University of Education. They shaped my knowledge for excelling in my studies.

In the process of implementing and completing my research, I have received much support from the board of MICA directors and my colleagues at the Speech Communication department. Particularly, I am very thankful to Prof. Pham Thi Ngoc Yen, Prof. Eric Castelli, Dr. Nguyen Viet Son, and Dr. Dao Trung Kien, who provided me with the opportunity to join research work at the MICA institute and to have access to the laboratory and research facilities. Without their precious support, it would have been impossible to conduct this research. My warm thanks go to my colleagues at the Speech Communication department of the MICA institute for their useful comments on my study and their unconditional support over four years, both at work and outside of work.

I am very grateful to my internship supervisor Prof. Nobutaka Ono and the members of Ono's Lab at the National Institute of Informatics, Japan, for warmly welcoming me into their lab and for the helpful research collaboration they offered. I much appreciate his help in funding my conference trip and introducing me to the signal processing research communities. I would also like to thank Dr. Toshiya Ohshima, MSc. Yasutaka Nakajima, MSc. Chiho Haruta, and other researchers at Rion Co., Ltd., Japan, for welcoming me to their company and providing me with data for experiments.

I would also like to sincerely thank Dr. Nguyen Quang Khanh, dean of the Information Technology Faculty, and Assoc. Prof. Le Thanh Hue, dean of the Economic Informatics Department, at Hanoi University of Mining and Geology (HUMG), where I am working. I have received financial and time support from my office and leaders for completing my doctoral thesis. Grateful thanks also go to my wonderful colleagues and friends Nguyen Thu Hang, Pham Thi Nguyet, Vu Thi Kim Lien, Vo Thi Thu Trang, Pham Quang Hien, Nguyen The Binh, Nguyen Thuy Duong, Nong Thi Oanh, and Nguyen Thi Hai Yen, who have given unconditional support and help over a long time. A special thank you goes to Dr. Le Hong Anh for his encouragement and precious advice.

Last but not least, I would like to express my deepest gratitude to my family. I am very grateful to my mother-in-law and father-in-law for their support in times of need, and for always allowing me to focus on my work. I dedicate this thesis to my mother and father with special love; they have been great mentors in my life and have constantly encouraged me to be a better person. The struggle and sacrifice of my parents always motivate me to work hard in my studies. I would also like to express my love to my younger sisters and younger brother for their encouragement and help. This work has become more wonderful because of the love and affection that they have provided.

A special love goes to my beloved husband Tran Thanh Huan for his patience and understanding, for always being there for me to share the good and bad times I also appreciate my sons Tran Tuan Quang and Tran Tuan Linh for always cheering me up with their smiles Without love from them, this thesis would not have been completed.

Thank you all!

Hanoi, February 2019

Ph.D. Student

Duong Thi Hien Thanh


CONTENTS

DECLARATION OF AUTHORSHIP i

ACKNOWLEDGEMENT ii

CONTENTS iv

NOTATIONS AND GLOSSARY viii

LIST OF TABLES xi

LIST OF FIGURES xii

INTRODUCTION 1

Chapter 1 AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART 10

1.1 Audio source separation: a solution for the cocktail party problem 10

1.1.1 General framework for source separation 10

1.1.2 Problem formulation 11

1.2 State of the art 13

1.2.1 Spectral models 13

1.2.1.1 Gaussian Mixture Model 14

1.2.1.2 Nonnegative Matrix Factorization 15

1.2.1.3 Deep Neural Networks 16

1.2.2 Spatial models 18

1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD) 18

1.2.2.2 Rank-1 covariance matrix 19

1.2.2.3 Full-rank spatial covariance model 20

1.3 Source separation performance evaluation 21

1.3.1 Energy-based criteria 22

1.3.2 Perceptually-based criteria 23

1.4 Summary 23

Chapter 2 NONNEGATIVE MATRIX FACTORIZATION 24

2.1 NMF introduction 24


2.1.1 NMF in a nutshell 24

2.1.2 Cost function for parameter estimation 26

2.1.3 Multiplicative update rules 27

2.2 Application of NMF to audio source separation 29

2.2.1 Audio spectra decomposition 29

2.2.2 NMF-based audio source separation 30

2.3 Proposed application of NMF to unusual sound detection 32

2.3.1 Problem formulation 33

2.3.2 Proposed methods for non-stationary frame detection 34

2.3.2.1 Signal energy based method 34

2.3.2.2 Global NMF-based method 35

2.3.2.3 Local NMF-based method 35

2.3.3 Experiment 37

2.3.3.1 Dataset 37

2.3.3.2 Algorithm settings and evaluation metrics 37

2.3.3.3 Results and discussion 38

2.4 Summary 43

Chapter 3 SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY CONSTRAINT 44

3.1 General workflow of the proposed approach 44

3.2 GSSM formulation 46

3.3 Model fitting with sparsity-inducing penalties 46

3.3.1 Block sparsity-inducing penalty 47

3.3.2 Component sparsity-inducing penalty 48

3.3.3 Proposed mixed sparsity-inducing penalty 49

3.4 Derived algorithm in unsupervised case 49

3.5 Derived algorithm in semi-supervised case 52

3.5.1 Semi-GSSM formulation 52

3.5.2 Model fitting with mixed sparsity and algorithm 54

3.6 Experiment 54

3.6.1 Experiment data 54

3.6.1.1 Synthetic dataset 55


3.6.1.2 SiSEC-MUS dataset 55

3.6.1.3 SiSEC-BNG dataset 56

3.6.2 Single-channel source separation performance with unsupervised setting 57

3.6.2.1 Experiment settings 57

3.6.2.2 Evaluation method 57

3.6.2.3 Results and discussion 61

3.6.3 Single-channel source separation performance with semi-supervised setting 65

3.6.3.1 Experiment settings 65

3.6.3.2 Evaluation method 65

3.6.3.3 Results and discussion 65

3.7 Summary 66

Chapter 4 MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK 68

4.1 Formulation and modeling 68

4.1.1 Local Gaussian model 68

4.1.2 NMF-based source variance model 70

4.1.3 Estimation of the model parameters 71

4.2 Proposed GSSM-based multichannel approach 72

4.2.1 GSSM construction 72

4.2.2 Proposed source variance fitting criteria 73

4.2.2.1 Source variance denoising 73

4.2.2.2 Source variance separation 74

4.2.3 Derivation of MU rule for updating the activation matrix 75

4.2.4 Derived algorithm 77

4.3 Experiment 79

4.3.1 Dataset and parameter settings 79

4.3.2 Algorithm analysis 80

4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations 80

4.3.2.2 Separation results with different choices of the trade-off parameters 81

4.3.3 Comparison with the state of the art 82


4.4 Summary 91

CONCLUSIONS AND PERSPECTIVES 93

BIBLIOGRAPHY 96

LIST OF PUBLICATIONS 113


NOTATIONS AND GLOSSARY

Standard mathematical symbols

C Set of complex numbers

R Set of real numbers

Z Set of integers

E Expectation of a random variable

Nc Complex Gaussian distribution

Vectors and matrices

A^T Matrix transpose

A^H Matrix conjugate transpose (Hermitian conjugation)

diag(a) Diagonal matrix with a as its diagonal

det(A) Determinant of matrix A

tr(A) Matrix trace

A ⊙ B Element-wise (Hadamard) product of two matrices (of the same dimension)

n Time frame index

t Time sample index


I Number of channels

J Number of sources

L STFT filter length

F Number of frequency bins

N Number of time frames

K Number of spectral basis vectors

Mixing filters

A Matrix of mixing filters

a_j(τ) ∈ ℝ^I Mixing filter of the j-th source to all microphones, τ is the time delay

a_ij(t) ∈ ℝ Filter coefficient at the t-th time index

a_ij ∈ ℝ^L Time-domain filter vector

a_ij ∈ ℂ^L Frequency-domain filter vector

R_j Time-independent covariance matrix of the j-th source

Σ_j Covariance matrix of the j-th source image

Σ̂_x Empirical mixture covariance

V Power spectrogram matrix

W Spectral basis matrix

H Time activation matrix

GSSM Generic source spectral model


Abbreviations

APS Artifacts-related Perceptual Score

BSS Blind Source Separation

DoA Direction of Arrival

DNN Deep Neural Network

EM Expectation Maximization

ICA Independent Component Analysis

IPS Interference-related Perceptual Score

ISR source Image to Spatial distortion Ratio

ISTFT Inverse Short-Time Fourier Transform

IID Interchannel Intensity Difference

ITD Interchannel Time Difference

GCC-PHAT Generalized Cross Correlation - Phase Transform

GMM Gaussian Mixture Model

GSSM Generic Source Spectral Model

KL Kullback-Leibler

LGM Local Gaussian Model

MAP Maximum A Posteriori

MU Multiplicative Update

NMF Non-negative Matrix Factorization

OPS Overall Perceptual Score

PLCA Probabilistic Latent Component Analysis

SAR Signal to Artifacts Ratio

SDR Signal to Distortion Ratio

SIR Signal to Interference Ratio

SiSEC Signal Separation Evaluation Campaign

SNMF Spectral Non-negative Matrix Factorization

SNR Signal to Noise Ratio

STFT Short-Time Fourier Transform

TDOA Time Difference of Arrival

T-F Time-Frequency

TPS Target-related Perceptual Score


LIST OF TABLES

2.1 Total number of different events detected from three recordings in spring 40

2.2 Total number of different events detected from three recordings in summer 41

2.3 Total number of different events detected from three recordings in winter 42

3.1 List of snip songs in the SiSEC-MUS dataset 56

3.2 Source separation performance obtained on the Synthetic and SiSEC-MUS datasets with unsupervised setting 59

3.3 Speech separation performance obtained on the SiSEC-BGN; "*" indicates submissions by the authors and "-" indicates missing information [81, 98, 100] 60

3.4 Speech separation performance obtained on the Synthetic dataset with semi-supervised setting 66

4.1 Speech separation performance obtained on the SiSEC-BGN dev set - comparison with closed baseline methods 85

4.2 Speech separation performance obtained on the SiSEC-BGN dev set - comparison with state-of-the-art methods in SiSEC; "*" indicates submissions by the authors and "-" indicates missing information 86

4.3 Speech separation performance obtained on the test set of the SiSEC-BGN; "*" indicates submissions by the authors [81] 91


LIST OF FIGURES

1 A cocktail party effect 2

2 Audio source separation 3

3 Live recording environments 4

1.1 Source separation general framework 11

1.2 Audio source separation: a solution for the cocktail party problem 13

1.3 IID corresponding to two sources in an anechoic environment 19

2.1 Decomposition model of NMF [36] 25

2.2 Spectral decomposition model based on NMF (K = 2) [66] 29

2.3 General workflow of supervised NMF-based audio source separation 30

2.4 Image of overlapping blocks 34

2.5 General workflow of the NMF-based nonstationary segment extraction 35

2.6 Number of different events detected by the methods from (a) the recordings in spring, (b) the recordings in summer, and (c) the recordings in winter 39

3.1 Proposed weakly-informed single-channel source separation approach 45

3.2 Generic source spectral model (GSSM) construction 47

3.3 Estimated activation matrix H: (a) without a sparsity constraint, (b) with a block sparsity-inducing penalty (3.5), (c) with a component sparsity-inducing penalty (3.6), and (d) with the proposed mixed sparsity-inducing penalty (3.7) 48

3.4 Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of MU iterations 61

3.5 Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of the trade-off parameters 62

3.6 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods over the dev set in SiSEC-BGN 63

3.7 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods over the test set in SiSEC-BGN 63


4.1 General workflow of the proposed source separation approach. The top green dashed box describes the training phase for the GSSM construction. Bottom blue boxes indicate processing steps for source separation. Green dashed boxes indicate the novelty compared to the existing works [6, 38, 107] 73

4.2 Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of EM and MU iterations. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR 81

4.3 Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of the trade-off parameters. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR 82

4.4 Average speech separation performance obtained by the proposed methods and the closest existing algorithms in terms of the energy-based criteria 88

4.5 Average speech separation performance obtained by the proposed methods and the closest existing algorithms in terms of the perceptually-based criteria 88

4.6 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods in terms of the energy-based criteria 89

4.7 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods in terms of the perceptually-based criteria 89

4.8 Boxplot for the speech separation performance obtained by the proposed "GSSM + SV denoising" (P1) and "GSSM + SV separation" (P2) methods 90


INTRODUCTION

In this part, we introduce the motivation and the problem that we focus on throughout this thesis. Then, we emphasize the objectives as well as the scope of our work. In addition, our contributions in this thesis are summarized in order to give a clear view of the achievements. Finally, the structure of the thesis is presented chapter by chapter.

1 Background and Motivation

1.1 Cocktail party problem

Real-world sound scenarios are usually very complicated as they are mixtures of many different sound sources. Fig. 1 depicts the scenario of a typical cocktail party, where there are many people attending, many conversations going on simultaneously, and various disturbances like loud music, people screaming, and a lot of hustle-bustle. Some other similar situations also happen in daily life, for example, in outdoor recordings, where there is interference from a variety of environmental sounds, or in a music concert scenario, where a number of musical instruments are played and the audience gets to listen to the collective sound. In such settings, what is actually heard by the ears is a mixture of various sounds that are generated by various audio sources. The mixing process can contain many sound reflections from walls and ceilings, which is known as reverberation. Humans with normal hearing ability are generally able to locate, identify, and differentiate sound sources which are heard simultaneously so as to understand the conveyed information. However, this task has remained extremely challenging for machines, especially in highly noisy and reverberant environments. Since the cocktail party effect described above prevents both humans and machines from perceiving the target sound sources [2, 12, 145], the creation of machine listening algorithms that can automatically separate sound sources in difficult mixing conditions remains an open problem.

Audio source separation aims at providing machine listeners with a similar function to the human ears by separating and extracting the signals of individual sources from a given mixture. This technique is formally termed blind source separation (BSS) when no prior information about either the sources or the mixing condition is available, and is described in Fig. 2. Audio source separation is also known as an effective solution to the cocktail party problem in the audio signal processing community [85, 90, 138, 143, 152]. Depending on the specific application, some source separation approaches focus on speech separation, in which the speech signal is extracted from a mixture containing background noise and other unwanted sounds. Other methods deal with music separation, in which the singing voice and certain instruments are recovered from a mixture or song containing multiple musical instruments. The separated source signals may be either listened to or further processed, giving rise to many potential applications. Speech separation is mainly used for speech enhancement in hearing aids, hands-free phones, or automatic speech recognition (ASR) in adverse conditions [11, 47, 64, 116, 129], while music separation has many interesting applications, including editing/remixing in music post-production, up-mixing, music information retrieval, rendering of stereo recordings, and karaoke [37, 51, 106, 110].

Figure 1: A cocktail party effect¹

Over the last couple of decades, efforts have been undertaken by the scientific community, from various backgrounds such as signal processing, mathematics, statistics, neural networks, machine learning, etc., to build audio source separation systems as described in [14, 15, 22, 43, 85, 105, 125].

1 Some icons of Fig. 1 are from: http://clipartix.com/.

Figure 2: Audio source separation.

The audio source separation problem has been studied at various levels of complexity, and different approaches and systems have come up. Despite numerous efforts, the problem is not completely solved yet, as the obtained separation results are still far from perfect, especially in challenging conditions such as moving sound sources and high reverberation.

1.2 Basic notations and target challenges

• Overdetermined, determined, and underdetermined mixture

There are three different settings in audio source separation according to the relationship between the number of sources J and the number of microphones I. When the number of microphones is larger than the number of sources, J < I, there are more observable variables than unknowns, and the setting is referred to as the overdetermined case. If J = I, we have as many observable variables as unknowns, and this is the determined case. The more difficult source separation case is when the number of unknowns exceeds the number of observable variables, J > I, which is called the underdetermined case.

Furthermore, if I = 1 the problem is a single-channel case; if I > 1 it is a multichannel case.

• Instantaneous, anechoic, and reverberant mixing environment

Apart from the mixture settings based on the relationship between the number of sources and the number of microphones, audio source separation algorithms can also be distinguished based on the target mixing condition they deal with.


The simplest case deals with instantaneous mixtures, such as certain music mixtures generated by amplitude panning. In this case, there is no time delay; a mixture at a given time is essentially a weighted sum of the source signals at the same time instant. There are two other typical types of live recording environments, anechoic and reverberant, as shown in Fig. 3. In anechoic environments such as a studio or outdoors, the microphones capture only the direct sound propagation from a source. In reverberant environments such as real meeting rooms or chambers, the microphones capture not only the direct sound but also many sound reflections from walls, ceilings, and floors. The modeling of the reverberant environment is much more difficult than the instantaneous and anechoic cases.

Figure 3: Live recording environments²

State-of-the-art audio source separation algorithms perform quite well in instantaneous or noiseless anechoic conditions, but are still far from perfect in the presence of reverberation. These numerical performance results are clearly shown in the recent community-based Signal Separation Evaluation Campaigns (SiSEC) [5, 99, 101, 133, 134] and others [65, 135]. This shows that addressing the separation of reverberant mixtures, a common case in real-world recording applications, remains one of the key scientific challenges in the source separation community. Moreover, when the desired sound is corrupted by high-level background noise, i.e., the Signal-to-Noise Ratio (SNR) is 0 dB or lower, the separation performance is even lower.

2 Some icons of Fig. 3 are from: http://clipartix.com/.


To improve the separation performance, informed approaches have been proposed and have emerged over the last decade in the literature [78, 136]. Such approaches exploit side information about one or all of the sources themselves, or about the mixing condition, in order to guide the separation process. Examples of the investigated side information include deformed or hummed references of one (or more) source(s) in a given mixture [123, 126], text associated with spoken speech [83], scores associated with musical sources [37, 51], and motion associated with audio-visual objects in a video [110].

Following this trend, our research focuses on using a weakly-informed strategy to target the determined/underdetermined and highly reverberant audio source separation challenge. We use very abstract semantic information, just the types of audio sources existing in the mixture, to guide the separation process.

2 Objective and scope

2.1 Objective

The main objective of the thesis is to investigate and develop efficient audio source separation algorithms that can deal with determined/underdetermined mixtures and high reverberation in real-world recording conditions.

In order to do that, we start by studying state-of-the-art approaches so as to select one of the most well-known frameworks that can deal with the targeted challenges. We then develop novel algorithms grounded on the considered modeling framework, i.e., the Local Gaussian Model (LGM) with Nonnegative Matrix Factorization (NMF) as the spectral model, for both single-channel and multichannel cases. In our proposed approach, we exploit information just about the types of audio sources in the mixture to guide the separation process. For instance, in a speech enhancement application, we know that one source in a noisy recording should be speech and another is background noise. We further investigate the algorithms' convergence as well as their sensitivity to the parameter settings in order to guide parameter choices where applicable.

For evaluation, both speech and music separation are considered. We consider speech separation for the speech enhancement task, and both singing voice and musical instrument separation for the music task. In order to compare the obtained separation results fairly with other existing methods, we use a benchmark dataset in addition to our own synthetic dataset. This well-designed benchmark dataset is from the Signal Separation Evaluation Campaign (SiSEC³) for the speech and real-world background noise separation task and the music separation task. Using these datasets allows us to join in our research community's activities. Especially, we target participating in the SiSEC challenge so as to bring our developed algorithms to the international research community.

2.2 Scope

In our study, we aim to recover the original sources (in the single-channel setting) or the spatial images of each source (in the multichannel setting) from the observed audio mixture. The source spatial images are the contributions of those sources to the mixture signal. For example, for speech recordings in real-world environments, the spatial images are the speech signals recorded at the microphones after propagating from the speaker to the microphones. Furthermore, as we focus on weakly-informed source separation, we assume the number of sources and the types of sources are known a priori. For instance, the mixture is composed of speech and noise in the speech separation context, or vocals and musical instruments in the music separation context.

3 Contributions

Aiming to tackle real-world recordings with the challenging settings mentioned earlier, we have proposed novel separation algorithms for both the single-channel and multichannel cases. The achieved results have been described in seven publications. The results of our algorithms were also submitted to the international source separation campaign SiSEC 2016⁴ [81] and obtained the best performance in terms of the energy-based criteria. More specifically, the main contributions are described as follows:

• We have proposed a novel single-channel audio source separation algorithm weakly guided by some source examples. This algorithm exploits the generic source spectral model (GSSM), which represents the spectral characteristics of audio sources, to guide the separation process. With that, a new sparsity-inducing penalty for the cost function has also been proposed. We have validated the speech separation performance of the proposed algorithm in both unsupervised and semi-supervised settings. We have also analyzed the algorithm's convergence as well as its stability with respect to the parameter settings.

4 http://sisec.inria.fr/sisec-2016/

These contributions were published in four scientific papers (papers 1, 2, 4, 5 in "List of publications").

• A novel multichannel audio source separation algorithm weakly guided by some source examples has been proposed. This algorithm exploits the generic source spectral model learned by NMF within the well-established local Gaussian model. We have proposed two new optimization criteria: the first constrains the variances of each source by NMF, while the second constrains the total variances of all sources altogether. The corresponding EM algorithms for parameter estimation have also been derived. We have investigated the sensitivity of the proposed algorithm to parameters as well as its convergence in order to guide parameter settings in practical implementations.

As another important contribution, we participated in the SiSEC challenges so that our proposed approach is visible to the international research community. Evaluated fairly by the SiSEC organizers, our proposed algorithm obtained the best source separation results in terms of the energy-based criteria in SiSEC 2016.

These achievements were described in two papers (papers 6 and 7 in "List of publications").

• In addition to the two main contributions mentioned above, by studying the NMF model and its application in the acoustic processing field, we have proposed novel unsupervised methods for automatically detecting non-stationary segments from single-channel real-world recordings. These methods aim at effective acoustic-event annotation. They were proposed during my research internship at Ono's Lab, National Institute of Informatics, Japan, and were transferred to the RION company in Japan for potential use. This work was published in paper 3 in "List of publications".

4 Structure of thesis

The work presented in this thesis is structured in four chapters as follows:


• Chapter 1: Audio source separation: Formulation and State of the art

We introduce the general framework and the mathematical formulation of the considered audio source separation problem, as well as the notations used in this thesis. This is followed by an overview of the state-of-the-art audio source separation methods, which exploit different spectral models and spatial models. Also, two families of criteria that are used for source separation performance evaluation are presented in this chapter.

• Chapter 2: Nonnegative matrix factorization

This chapter first introduces NMF, which has received a lot of attention in the audio processing community. It is followed by a baseline supervised algorithm based on the NMF model aiming to separate audio sources from the observed mixture. By the end of this chapter, we propose novel methods for automatically detecting non-stationary segments using NMF for effective sound annotation.

• Chapter 3: Proposed single-channel audio source separation approach

We present the proposed weakly-informed audio source separation method for single-channel audio source separation targeting both the unsupervised and semi-supervised settings. The algorithm is based on NMF with mixed sparsity constraints. In this method, the generic spectral characteristics of the sources are first learned from several training signals by NMF. They are then used to guide the similar factorization of the observed power spectrogram into each source. We also propose to combine two existing group sparsity-inducing penalties in the optimization process and adapt the corresponding algorithm for parameter estimation based on multiplicative update (MU) rules. The last section of this chapter is devoted to the experimental evaluation. We show the effectiveness of the proposed approach in both unsupervised and semi-supervised settings.

• Chapter 4: Proposed multichannel audio source separation approach

This chapter is a significant extension of the work presented in chapter 3 to the multichannel case. We describe a novel multichannel audio source separation algorithm weakly guided by some source examples, where the NMF-based GSSM is combined with the full-rank spatial covariance model in a Gaussian modeling paradigm. We then present the generalized expectation-maximization (EM) algorithm for the parameter estimation. Especially, for guiding the estimation of the intermediate source variances in each EM iteration, we investigate the use of two criteria: (1) the estimated variances of each source are constrained by NMF, and (2) the total variances of all sources are constrained by NMF altogether. In the experiments, the separation performance obtained by the proposed algorithms is analyzed and compared with state-of-the-art and baseline algorithms. Moreover, analysis results on the sensitivity of the proposed algorithms to parameter settings as well as their convergence are also addressed in this chapter.

In the last part of the thesis, we present the conclusions and perspectives for future research directions.


CHAPTER 1 AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART

In this chapter, we introduce the audio source separation technique as a solution to the cocktail party problem. After briefly describing the general audio source separation framework, we present some basic settings regarding the mixing conditions and recording environments. Then the state-of-the-art models exploiting spectral cues as well as spatial cues for the source separation process are summarized. Finally, we introduce two families of criteria that are used for source separation performance evaluation.

1.1 Audio source separation: a solution for the cocktail party problem

1.1.1 General framework for source separation

Audio source separation is the signal processing task which consists in recovering the constitutive sounds, called sources, of an observed mixture, which can be single-channel or multichannel [43, 78, 85, 90, 105]. This separation needs a system that is able to perform many processes, such as estimating the number of sources, estimating the required number of frequency bases and convolutive parameters to be assigned to each source, applying separation algorithms, and reconstructing the sources [6, 25, 28, 102, 111, 121, 158, 159]. There are two types of cues that can be exploited for the separation process, called spectral cues and spatial cues. Spectral cues describe the spectral structures of the sources, while spatial cues carry information about the source spatial positions [22, 85, 97]. They will be discussed in more detail in Sections 1.2.1 and 1.2.2, respectively. It can be seen that spectral cues alone are not able to distinguish sources with a similar pitch range and timbre, while spatial cues alone may not be sufficient to distinguish sources from near directions. So most existing systems require the exploitation of both types of cues.

In general, the source separation algorithm is processed in the time-frequency domain after the short-time Fourier transform (STFT) and consists of two modeling cues as in Fig. 1.1: (1) the spectral model exploits the spectral characteristics of the sources, and (2) the spatial model performs modeling and exploitation of the spatial information. Finally, the estimated time-domain source signals are obtained via the inverse short-time Fourier transform (ISTFT).

Figure 1.1: Source separation general framework

1.1.2 Problem formulation

Multichannel audio mixtures are the types of recordings that we obtain when we employ microphone arrays [14, 22, 85, 90, 92]. Let us formulate the multichannel mixture signal, where J sources are observed by an array of I microphones, with indexes j ∈ {1, 2, …, J} and i ∈ {1, 2, …, I} indicating a specific source j and channel i. This mixture signal is denoted by x(t) = [x₁(t), …, x_I(t)]ᵀ ∈ ℝ^{I×1} and is the sum of the contributions from all sources as [85]:

    x(t) = \sum_{j=1}^{J} c_j(t)        (1.1)

where c_j(t) = [c_{1j}(t), …, c_{Ij}(t)]ᵀ ∈ ℝ^{I×1} is the contribution of the j-th source to the microphone array, called the spatial image of this source, and [·]ᵀ denotes matrix or vector transposition. The mixture and source spatial images are time-domain digital signals indexed by t ∈ {0, 1, …, T − 1}, where T is the length of the signal.

Under a physical view, sound sources are typically divided into two types: point sources and diffuse sources. A point source is the case in which sound emits from a single point in space, e.g., an unmoving human speaker, a water drop, a singer singing alone, etc. A diffuse source is the case in which sound comes from a region of space, e.g., water drops in the rain, singers singing in a choir, etc. Diffuse sources can be considered as a collection of point sources [85, 141]. In the case where the j-th source is a point source, the source spatial image c_j(t) is written as [85]

    c_j(t) = \sum_{\tau} a_j(\tau)\, s_j(t - \tau)        (1.2)

where a_j(τ) ∈ ℝ^I is the vector of mixing filters from the j-th source to the microphones and s_j(t) is the single-channel source signal.

Audio source separation systems often operate in the time-frequency (T-F) domain, in which the temporal characteristics and the spectral characteristics of audio can be jointly represented. The most commonly used time-frequency representation is the short-time Fourier transform (STFT) [3, 125]. STFT analysis refers to computing the time-frequency representation from the time-domain waveform by creating overlapping frames along the waveform and applying the discrete Fourier transform to each frame.

Switched to the T-F domain, equation (1.1) can be written as

    x(n, f) = \sum_{j=1}^{J} c_j(n, f)        (1.3)

where c_j(n, f) ∈ ℂ^{I×1} and x(n, f) ∈ ℂ^{I×1} denote the T-F representations computed from c_j(t) and x(t), respectively, n = 1, 2, …, N is the time frame index, and f = 1, 2, …, F is the frequency bin index.

A common assumption in array signal processing is the narrowband assumption on the source signal [118]. Under the narrowband assumption, the convolutive mixing model (1.2) may be approximated by a complex-valued multiplication in each T-F bin (n, f) given by

    c_j(n, f) ≈ a_j(f)\, s_j(n, f)        (1.4)

where c_j(n, f) and s_j(n, f) are the STFT coefficients of c_j(t) and s_j(t), respectively, and a_j(f) is the Fourier transform of a_j(τ).
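To make the T-F formulation concrete, the following minimal sketch simulates a two-source, two-channel instantaneous mixture and checks the additivity of equation (1.3) in the STFT domain. It is an illustration only, not part of the thesis; the sampling rate, signal choices, and per-channel gains are arbitrary assumptions, and scipy is used for the STFT.

```python
import numpy as np
from scipy.signal import stft

# Minimal sketch: two sources, two channels (I = J = 2), instantaneous mixing.
fs = 16000
t = np.arange(fs) / fs
s = np.stack([np.sin(2 * np.pi * 440 * t),          # source 1: a tone
              0.1 * np.random.randn(fs)])           # source 2: noise

# Hypothetical per-channel gains: spatial image c_ij(t) = gain_ij * s_j(t).
gains = np.array([[1.0, 0.3],
                  [0.4, 0.9]])                      # gains[i, j]
c = gains[:, :, None] * s[None, :, :]               # shape (I, J, T)
x = c.sum(axis=1)                                   # mixture, eq. (1.1)

# STFT of the mixture and of each spatial image (computed along the last axis).
_, _, X = stft(x, fs=fs, nperseg=1024)              # shape (I, F, N)
_, _, C = stft(c, fs=fs, nperseg=1024)              # shape (I, J, F, N)

# Eq. (1.3): the mixture STFT equals the sum of the source-image STFTs,
# since the STFT is linear.
print(np.allclose(X, C.sum(axis=1)))                # True
```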

Figure 1.2: Audio source separation: a solution for the cocktail party problem.

Source separation consists in recovering either the J original source signals s_j(t) or their spatial images c_j(t) given the I-channel mixture signal x(t). The objective of our research, as mentioned previously, is to recover the spatial images c_j(t) of the sources from the observed mixture, as shown in Fig. 1.2. Note that in our study, background noise is also considered as a source. This definition applies to both point sources and diffuse sources, in both live recordings and artificially-mixed recordings.

1.2 State of the art

As discussed in Section 1.1.1, a standard architecture for a source separation system includes two models: the spectral model formulates the spectral characteristics of the sources, and the spatial model exploits the spatial information of the sources. An advantage of this architecture is that it offers modularity: we can mix and match any mixing filter estimation technique with any spectral source estimation technique. Besides, some approaches to source separation can also recover the sources by directly exploiting either the spectral cues or the mixing filters. The whole BSS picture built over more than two decades of research is very large, consisting of many different techniques and requiring an intensive survey, see e.g. [22, 54, 85, 112, 138, 141]. In this section, we limit our discussion to some popular spectral and spatial models. They are combined or used individually in the state-of-the-art algorithms in different ways.

1.2.1 Spectral models

This section reviews three typical source spectral models that have been studied extensively in the literature: the spectral Gaussian Mixture Model (Spectral GMM), spectral Nonnegative Matrix Factorization (Spectral NMF), and Deep Neural Networks (DNN).


1.2.1.1 Gaussian Mixture Model

We start with the principles of the Gaussian model-based approaches, known as Spectral GMM [7, 77, 106, 113], where the redundancy and structure of each audio source can be exploited for audio source separation.

The short-time Fourier spectrum of the j-th source is a column vector composed of all elements s_j(n, f), with f = 1, …, F, as s_j(n) = [s_j(n, f)]_f. The Spectral GMM approach models s_j(n) as a multidimensional zero-mean complex-valued K-state Gaussian mixture with probability density function (pdf) given by [7, 106]

    p(s_j(n)) = \sum_{k=1}^{K} \pi_{jk}\, N_c(s_j(n); 0, \Sigma_{jk})        (1.5)

where 0 denotes a vector of zeroes, π_{jk}, which satisfies \sum_{k=1}^{K} \pi_{jk} = 1 for all j, and Σ_{jk} = diag([v_{jk}(f)]_f) are the weight and the diagonal spectral covariance matrix of the k-th state of the j-th source, respectively, and N_c denotes the complex Gaussian density.

The Spectral GMM defines K × F free variances v_{jk}(f) and exploits the global structure of the sources to estimate them. However, the GMM does not explicitly model the amplitude variation of sound sources, so signals having a similar spectral shape but different amplitude levels may result in different estimated spectral variance templates [v_{jk}(f)]_f. To overcome this issue, another version of the GMM was proposed in 2006 [13], called the Spectral Gaussian Scaled Mixture Model (Spectral GSMM). In the Spectral GSMM, a time-varying scaling parameter g_{jk}(n) is incorporated into each Spectral-GMM state. The pdf of the GSMM is then written as [13]

    p(s_j(n)) = \sum_{k=1}^{K} \pi_{jk}\, N_c(s_j(n); 0, g_{jk}(n)\, \Sigma_{jk})        (1.7)


Spectral GMM and Spectral GSMM were applied to single-channel audio source separation [13, 16] and to stereo separation of moving sources [95]. The GMM was also considered for multichannel instantaneous music mixtures [7], where the Spectral-GMMs are learnt from the mixture signals.

1.2.1.2 Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a dimension reduction technique that works with nonnegative data. NMF has been applied to many fields of machine learning and audio signal processing [43, 72, 73, 102, 105, 108, 109, 127]. More detailed descriptions of NMF will be presented in Chapter 2 as a baseline method for our study. In the following, we review NMF as a structured spectral source model applied to audio source separation, known as Spectral NMF.

In the Spectral NMF model, each source s_j is the sum of K_j spectral components (also called frequency bases, basis spectra, or latent components), each modeled as [102]

    c_k(n, f) \sim N_c(0, h_{nk} w_{kf})        (1.9)

where w_{kf} ∈ ℝ₊ denotes the spectral basis representing the spectral structures of the signal, and h_{nk} ∈ ℝ₊ is the distribution of the spectral basis representing the time-varying activations. The source STFT coefficients s_j(n, f) are then modeled as independent zero-mean Gaussian random variables with free variances \sum_{k=1}^{K_j} h_{nk} w_{kf}, given by [138]

    p(s_j(n, f)) = N_c\Big(s_j(n, f);\, 0,\, \sum_{k=1}^{K_j} h_{nk} w_{kf}\Big)        (1.10)


Maximizing the log-likelihood of the source power spectrogram under this model is then equivalent, up to a constant, to minimizing a divergence between the power spectrogram and its factorization:

    \log p(S_j \mid H_j, W_j) \overset{c}{=} -\sum_{n,f} d\big( |s_j(n, f)|^2 \,\|\, [W_j H_j]_{f,n} \big)        (1.11)

where \overset{c}{=} denotes equality up to a constant. The divergence function d may be the Kullback-Leibler (KL) divergence [73], d_{KL}(x \| y) = x \log(x/y) - x + y, or the Itakura-Saito (IS) divergence [40], d_{IS}(x \| y) = x/y - \log(x/y) - 1, among others; this will be presented in more detail in Chapter 2. Here NMF requires the estimation of only N·K_j values of H_j and K_j·F values of W_j instead of the N·F values of the power spectrogram S_j, where N K_j + K_j F ≪ N F. Thus NMF is considered a form of dimension reduction in this context.
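As a quick illustration of these cost functions, the snippet below computes the KL and IS divergences between a power spectrogram and an NMF approximation W H. This is a sketch for intuition only; the random matrices and the small epsilon guard are assumptions, not part of the thesis.

```python
import numpy as np

def d_kl(x, y, eps=1e-12):
    """Kullback-Leibler divergence d_KL(x||y), summed over all entries."""
    x, y = x + eps, y + eps
    return np.sum(x * np.log(x / y) - x + y)

def d_is(x, y, eps=1e-12):
    """Itakura-Saito divergence d_IS(x||y), summed over all entries."""
    x, y = x + eps, y + eps
    return np.sum(x / y - np.log(x / y) - 1)

# Toy power spectrogram V (F x N) and a random rank-K model W @ H.
rng = np.random.default_rng(0)
F, N, K = 513, 100, 8
V = rng.random((F, N))
W, H = rng.random((F, K)), rng.random((K, N))

print("KL:", d_kl(V, W @ H))   # zero iff V == W @ H
print("IS:", d_is(V, W @ H))   # IS is scale-invariant: d_IS(cx||cy) = d_IS(x||y)
```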

Spectral NMF has been applied to single-channel audio source separation [115, 142] and to multichannel audio source separation [102, 104] with different settings. In recent years, several studies have investigated user-guided NMF methods [26, 30, 37, 104, 126, 156] that incorporate specific information about the sources in order to improve the efficiency of the separation algorithm.

1.2.1.3 Deep Neural Networks

Recent studies have shown that deep neural networks (DNNs) are able to model complex functions and perform well on various tasks, including audio signal processing [4, 35, 53, 62, 119, 144, 155, 157]. The two former methods, GMM and NMF, first learn the characteristics of speech and noise signals; those learned models are then used to guide the signal separation process. Deep learning based approaches can learn the separation mask or the separation model by end-to-end training and have gained a significant impact.

In DNN-based approaches, the mixture time-frequency representation is processed to extract relevant features. Given these features as inputs, a DNN is utilized either for directly estimating the time-frequency mask [144] or for estimating the source spectra whose ratio yields a time-frequency mask [4, 56, 132]. Time-frequency masking, as its name suggests, estimates the spatial images by filtering the time-frequency representation of the mixture using a mask. This can be expressed as

    \hat{c}_j(n, f) = \hat{m}_j(n, f)\, x(n, f)        (1.12)

where \hat{m}_j(n, f) is the mask for time frame n and frequency bin f of the j-th source. In the audio enhancement scenario, the best possible binary and soft masks are called the ideal binary mask and the ideal ratio mask, respectively. They are derived from a typical real-valued scalar mask in [33]. The DNN is typically trained by minimizing one of the following criteria:

- The mask estimation error, i.e., the distance between the estimated mask and the ideal mask;

- The error of the signal in the complex-valued T-F domain computed using the estimated mask, i.e., the distance between \hat{m}_j(n, f)\, x(n, f) and the target source spectra s_{targ}(n, f).
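The sketch below illustrates equation (1.12) with an oracle ratio mask: given the (normally unknown) source spectrograms, it builds a soft mask and applies it to the mixture STFT. The Wiener-like mask formula is a common choice and is used here as an assumption, not as the thesis's exact definition.

```python
import numpy as np

def ratio_mask_separate(S_list, eps=1e-12):
    """Oracle soft-mask separation in the T-F domain, cf. eq. (1.12).

    S_list: list of complex STFTs (F x N), one per source.
    Returns the list of estimated source STFTs.
    """
    X = sum(S_list)                                   # mixture STFT, eq. (1.3)
    power = [np.abs(S) ** 2 for S in S_list]
    total = sum(power) + eps
    # Wiener-like ratio mask: m_j = |s_j|^2 / sum_k |s_k|^2
    return [(p / total) * X for p in power]

# Toy example with two random "sources".
rng = np.random.default_rng(1)
S1 = rng.normal(size=(257, 50)) + 1j * rng.normal(size=(257, 50))
S2 = rng.normal(size=(257, 50)) + 1j * rng.normal(size=(257, 50))
C1_hat, C2_hat = ratio_mask_separate([S1, S2])
print(np.allclose(C1_hat + C2_hat, S1 + S2))          # masks sum to ~1
```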

Most studies have addressed the problem of single-channel source separation [18, 52, 56, 132, 150]. Recently, a few studies have exploited DNNs for multichannel sound source separation based on different approaches. In Nugraha's study [96], DNNs are used to estimate the spectral parameters for each source in the EM iteration. Such estimated parameters, together with the spatial parameters, are used to derive a time-varying multichannel filter. The study of Wang et al. [148] combines spectral and spatial features in a deep clustering algorithm for blind source separation. In their approach, phase difference features are included in the input to a deep clustering network; they encode both spatial and spectral information in the embeddings it creates, leading to better-estimated time-frequency masks. Such DNN-based approaches were shown to offer very promising results. However, they require a large amount of labeled data for training, which may not always be available, and the training is usually computationally expensive.

1.2.2 Spatial models

When more recording channels are available thanks to the use of multiple microphones, a multichannel source separation algorithm should be considered, as it allows exploiting important information about the spatial locations of the audio sources. Such spatial information is reflected in the mixing process (usually with reverberation) and can be modeled by, e.g., the interchannel time difference (ITD) and interchannel intensity difference (IID) [31, 63, 86, 112], the rank-1 time-invariant mixing vector in the frequency domain when following the narrowband assumption [66, 102, 121, 151], or the full-rank spatial covariance matrix in the local Gaussian model (LGM) where the narrowband assumption is relaxed [28, 38, 94].

In this part, we present three typical existing models that exploit deterministic or probabilistic parameterizations of the spatial cues: IID/ITD, the rank-1 covariance matrix, and the full-rank spatial covariance model.

1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD)

Spatial models encode any information related to the spatial positions of sources. Many existing BSS algorithms exploit spatial cues such as the phase and amplitude differences of the mixture channels, called the interchannel time difference (ITD) and interchannel intensity difference (IID). The ITD arises because it takes longer for the sound to arrive at the microphone that is farther from the source. The IID arises because some of the incoming sound energy is degraded when reaching the microphone that is farther away from the direction of the source [145].

Assuming that there are two microphones and two sources in an anechoic mixing condition, the IID is illustrated in Fig. 1.3. The source position s₁ is nearer to microphone 1, so the recorded signal level x₁ is higher than x₂, and the corresponding IID when the source amplitude varies is modeled by the solid line s₁. On the contrary, the source position s₂ results in a smaller x₁ than x₂, and the corresponding IID is represented by the dotted line in Fig. 1.3. The observed IID is therefore constant over time and directly related to the source direction of arrival (DoA). The IID/ITD has been widely exploited in the history of both anechoic and convolutive source separation [1, 31, 86, 97, 112, 138, 145]. The state of the art also points out that IID/ITD cues are relevant for instantaneous and anechoic mixtures but far from the actual characteristics of reverberant mixtures.

Figure 1.3: IID corresponding to two sources in an anechoic environment

1.2.2.2 Rank-1 covariance matrix

Given the mixing model (1.4) under the narrowband assumption, the covariance matrix of c_j(n, f), denoted by Σ_j(n, f), is given by [28]

    Σ_j(n, f) = v_j(n, f)\, R_j(n, f)        (1.18)

where v_j(n, f) is the variance of s_j(n, f) and R_j(n, f) is equal to the rank-1 matrix

    R_j(n, f) = a_j(f)\, a_j^H(f)        (1.19)

with a_j(f) the Fourier transform of the mixing filters a_j(τ) and (·)^H indicating conjugate transposition. This rank-1 convolutive parameterization of the spatial covariance matrices has been exploited together with an NMF model of the source variances in [67, 103, 104, 121, 151].

In the case of an anechoic recording environment without reverberation and using omnidirectional microphones, each mixing filter combines a delay τ_ij and a gain κ_ij specified by the distance r_ij from the j-th source to the i-th microphone, with τ_ij = r_ij / c, where c is the speed of sound [50]. The resulting anechoic frequency-domain mixing vector is

    a_j^{an}(f) = \big( \kappa_{1j} e^{-2i\pi f \tau_{1j}}, \ldots, \kappa_{Ij} e^{-2i\pi f \tau_{Ij}} \big)^T        (1.23)
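For intuition, the following sketch builds the anechoic mixing vector of equation (1.23) for one source and two microphones. The geometry, the sampled frequencies, and the inverse-distance gain model κ_ij = 1/r_ij are illustrative assumptions only.

```python
import numpy as np

def anechoic_mixing_vector(freqs, r, c=343.0):
    """Anechoic steering vectors per eq. (1.23).

    freqs: array of frequencies in Hz, shape (F,)
    r: distances from the source to each microphone in meters, shape (I,)
    Returns a complex array of shape (I, F).
    """
    tau = r / c                       # propagation delays tau_ij = r_ij / c
    kappa = 1.0 / r                   # assumed inverse-distance attenuation
    return kappa[:, None] * np.exp(-2j * np.pi * freqs[None, :] * tau[:, None])

freqs = np.linspace(0, 8000, 513)     # frequency bins
r = np.array([1.0, 1.2])              # source is 1.0 m and 1.2 m from the mics
A = anechoic_mixing_vector(freqs, r)
print(A.shape)                        # (2, 513)
```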

1.2.2.3 Full-rank spatial covariance model

In an anechoic or low-reverberation recording environment, one possible interpretation of the narrowband approximation is that the sound of each source as recorded at the microphones comes from a single spatial position at each frequency f, as specified by a_j(f) or a_j^{an}(f) [28]. But this approximation is not valid in a reverberant environment because of the spatial spread of each source, with echoes at many different positions on the walls, ceilings, and floors of the recording room. Full-rank spatial covariance matrices model this spread better.

Assume that the spatial image of each source is composed of two uncorrelated parts: a direct part modeled by a_j^{an}(f) as in (1.23) and a reverberant part [69]. Then the spatial covariance R_j(f) of each source is a full-rank matrix defined as the sum of the covariance of its direct part and the covariance of its reverberant part [28]

    R_j(f) = a_j^{an}(f)\, (a_j^{an})^H(f) + \sigma_{rev}^2\, \Psi(f)        (1.24)

where σ_rev² is the variance of the reverberant part and Ψ_il(f) is a function of the microphone directivity pattern and the distance between the i-th and the l-th microphones (such that Ψ_ii(f) = 1). This full-rank direct+diffuse model assumes that the reverberation recorded at all microphones has the same power but is correlated as characterized by Ψ_il(f).

This model was employed for single-source localization in [50] and considered for multiple-source localization in [93]. The covariance matrix Ψ(f) was usually employed for the modeling of diffuse background noise [60, 87]. For instance, the source separation algorithm in [60] assumed that the sources follow an anechoic model and represented the non-direct part of all sources by a shared diffuse noise component with covariance Ψ(f) and constant variance. This algorithm did not account for the correlation between the variances of the direct part and the non-direct part.

A full-rank unconstrained covariance model was proposed in 2010 [28], which encodes the spatial position of the sources as well as their spatial spread. This model parameterizes the spatial information of each source via a full-rank unconstrained Hermitian positive semi-definite spatial covariance matrix R_j(f) whose coefficients are not deterministically related a priori. This unconstrained parameterization is the most general possible parameterization for a covariance matrix. It generalizes the above three parameterizations in the sense that any matrix taking the form of (1.19), (1.22) or (1.24) can also be considered as a particular form of an unconstrained matrix. Since then, the full-rank unconstrained spatial model has been applied more and more widely, in combination with different spectral models such as NMF [6, 105, 107] and DNNs [96].

1.3 Source separation performance evaluation

The topic of source separation performance evaluation has long been studied in the literature. Several studies have been published both in terms of objective quality [49, 137] and subjective quality [32, 45, 139]. In our study, we focus on two popular families of objective evaluation criteria, which can be applied to any audio mixture and any algorithm and do not require knowledge of the unmixing parameters or filters. These criteria, namely energy ratio criteria and perceptually-motivated criteria, have been widely used in the community as well as in the recent evaluation campaigns [5, 65, 99, 101, 133–135, 140].

Both families of criteria mentioned above are derived from the decomposition of each estimated source image ĉ_ij(t) into four constituents as [140]

    \hat{c}_{ij}(t) = c_{ij}(t) + e_{ij}^{spat}(t) + e_{ij}^{inter}(t) + e_{ij}^{artif}(t)        (1.25)

where c_ij(t) is the true spatial image of the j-th source at the i-th microphone, and e_ij^{spat}(t), e_ij^{inter}(t), and e_ij^{artif}(t) are the error components representing spatial (or filtering) distortion, interference from the other sources, and burbling artifacts, respectively. Based on this decomposition, the measures of each family of criteria are presented in more detail in sub-sections 1.3.1 and 1.3.2.


1.3.1 Energy-based criteria

In this family, the error components in (1.25) are computed by least-squares projection of the estimated source image onto subspaces spanned by the delayed true source images c_{kl}(t − δ), 1 ≤ k ≤ I, 1 ≤ l ≤ J, 0 ≤ δ ≤ L − 1, where L is the projection filter length, set to 32 ms [140]. Then the relative amounts of interference distortion, artifacts distortion, and spatial distortion are measured using three energy ratio criteria expressed in decibels (dB): the Signal to Interference Ratio (SIR), the Sources to Artifacts Ratio (SAR), and the source Image to Spatial distortion Ratio (ISR), defined by [140]

• Signal to Interference Ratio, which measures the residual interference from the other sources:

    SIR_j = 10 \log_{10} \frac{\sum_{i,t} \big(c_{ij}(t) + e_{ij}^{spat}(t)\big)^2}{\sum_{i,t} \big(e_{ij}^{inter}(t)\big)^2}

• Sources to Artifacts Ratio, which estimates the artifacts introduced by the source separation process:

    SAR_j = 10 \log_{10} \frac{\sum_{i,t} \big(c_{ij}(t) + e_{ij}^{spat}(t) + e_{ij}^{inter}(t)\big)^2}{\sum_{i,t} \big(e_{ij}^{artif}(t)\big)^2}

• source Image to Spatial distortion Ratio, which measures the spatial (filtering) distortion:

    ISR_j = 10 \log_{10} \frac{\sum_{i,t} c_{ij}(t)^2}{\sum_{i,t} \big(e_{ij}^{spat}(t)\big)^2}

The total error represents the overall performance of the source separation algorithm and is measured by the Signal to Distortion Ratio (SDR), calculated as follows:

• Signal to Distortion Ratio:

    SDR_j = 10 \log_{10} \frac{\sum_{i,t} c_{ij}(t)^2}{\sum_{i,t} \big(e_{ij}^{spat}(t) + e_{ij}^{inter}(t) + e_{ij}^{artif}(t)\big)^2}
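In practice these energy ratios are rarely coded by hand; the sketch below computes them with the open-source mir_eval implementation of BSS Eval. The toy signals are assumptions; for the spatial-image variant with ISR used in SiSEC, mir_eval also provides bss_eval_images, and the original BSS Eval 3.0 MATLAB toolbox [140] remains the reference implementation.

```python
import numpy as np
import mir_eval

# Toy example: 2 reference sources and their estimates (nsrc x nsamples).
rng = np.random.default_rng(2)
ref = rng.normal(size=(2, 16000))
est = ref + 0.1 * rng.normal(size=(2, 16000))   # slightly distorted estimates

# bss_eval_sources returns SDR, SIR, SAR (in dB) and the best permutation
# between estimated and reference sources.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(ref, est)
print(sdr, sir, sar, perm)
```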

1.3.2 Perceptually-based criteria

The perceptually-motivated criteria, namely the Overall Perceptual Score (OPS), the Target-related Perceptual Score (TPS), the Interference-related Perceptual Score (IPS), and the Artifacts-related Perceptual Score (APS), are obtained from the same decomposition (1.25) using perceptually-motivated similarity measures [32]. These criteria score from 0 to 100, where higher values indicate better performance.

It was shown in [32] that the perceptually-motivated criteria could improve the correlation with subjective scores compared to the energy ratio criteria, and they have often been used in addition to the energy ratio criteria since 2010 in the audio source separation community. The source code of these perceptually-motivated criteria is also available².

1.4 Summary

This chapter has introduced audio source separation as a big picture and formulated the general source separation problem that we focus on in this thesis. From that, we have surveyed the major techniques for exploiting spectral information or spatial information of the sources in the separation process. In addition, two popular families of objective evaluation criteria, which we will use to evaluate the source separation performance of the proposed methods in Chapters 3 and 4, have also been presented.

2 http://bass-db.gforge.inria.fr/peass/


CHAPTER 2 NONNEGATIVE MATRIX FACTORIZATION

Spectral decomposition by NMF has become a popular approach in many audio signal processing tasks, such as source separation, enhancement, and audio detection. This chapter first presents the NMF formulation and its extensions. We then introduce the NMF-based audio spectral decomposition. By the end of this chapter, we present the proposed methods for automatically detecting unusual sounds using NMF, aiming at effective sound annotation.

2.1 NMF introduction

2.1.1 NMF in a nutshell

Nonnegative matrix factorization is a dimension reduction technique that applies to nonnegative data. NMF has been widely known and used since the publication of Lee and Seung in 1999 [72, 73], but it actually appeared nearly 20 years before that under other names such as nonnegative rank factorization [61] or positive matrix factorization [109]. Thanks to [72, 73], NMF has been used extensively for a variety of applications, such as bioinformatics [76], image processing [120], facial recognition [55], speech enhancement [39, 89], direction of arrival (DoA) estimation [131], blind source separation [40, 102, 107, 122, 130, 159], and informed source separation [25, 44, 46, 48]. Comprehensive reviews of NMF can be found in [147, 160]. In the following, we present some details about NMF so as to understand what NMF is and how it works.

Given a nonnegative data matrix V ∈ ℝ₊^{F×N}, NMF seeks the approximate factorization

    V ≈ W H        (2.1)

where W ∈ ℝ₊^{F×K} and H ∈ ℝ₊^{K×N} are nonnegative matrices of dimensions F × K and K × N, respectively. NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of multivariate F-dimensional data vectors, the vectors are placed in the columns of an F × N matrix V, where F is the feature dimension of the data and N is the number of observations or examples in the dataset. NMF approximately factorizes V into an F × K matrix W and a K × N matrix H, as shown in Fig. 2.1, where K is the number of basis vectors (latent components). Usually, K is chosen to be smaller than F and N in order to achieve a decomposition where F K + K N ≪ F N [42, 73]. So W and H are smaller than the original matrix V; they form a lower-rank representation of the original data matrix. That is why NMF is considered a dimensionality reduction technique.

Equation (2.1) can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H, respectively. In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W is called a dictionary matrix, containing the basis that is optimized for the linear approximation of the data in V. H contains the distribution of the basis in W and is called a distribution weight matrix or activation matrix. Usually, relatively few basis vectors are needed to represent many data vectors, so we can achieve a good approximation when the basis vectors successfully discover the latent structure in the data.

To sum up, NMF aims to find the nonnegative basic representative factors which can be used for feature extraction, dimension reduction, eliminating redundant information, and discovering the hidden patterns behind a series of nonnegative vectors.

Figure 2.1: Decomposition model of NMF [36]
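To ground the factorization V ≈ WH, the following is a minimal NumPy sketch of NMF using the classic multiplicative update rules of Lee and Seung [73] for the KL divergence (the multiplicative updates themselves are detailed in Section 2.1.3). The matrix sizes, initialization, and iteration count are arbitrary assumptions for illustration.

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, eps=1e-12):
    """NMF V ~ W @ H via multiplicative updates minimizing the KL divergence [73]."""
    F, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)   # update activations
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)   # update dictionary
    return W, H

V = np.abs(np.random.default_rng(3).normal(size=(64, 40))) ** 2  # toy power spectrogram
W, H = nmf_kl(V, K=5)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))             # relative error
```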
