MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

Hanoi - 2019
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

SUPERVISORS:
1. ASSOC. PROF. DR. NGUYEN QUOC CUONG
2. DR. NGUYEN CONG PHUONG

Hanoi - 2019
DECLARATION OF AUTHORSHIP

I, Duong Thi Hien Thanh, hereby declare that this thesis is my original work and it has been written by me in its entirety. I confirm that:

• This work was done wholly during candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at Hanoi University of Science and Technology or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, February 2019
Ph.D. Student

Duong Thi Hien Thanh

SUPERVISORS

Assoc. Prof. Dr. Nguyen Quoc Cuong        Dr. Nguyen Cong Phuong
ACKNOWLEDGEMENT

This thesis has been written during my doctoral study at the International Research Institute Multimedia, Information, Communication, and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank the numerous people who have contributed towards shaping this thesis.

First and foremost, I would like to express my most sincere gratitude to my supervisors, Assoc. Prof. Nguyen Quoc Cuong and Dr. Nguyen Cong Phuong, for their great guidance and support throughout my Ph.D. study. I am grateful to them for devoting their precious time to discussing research ideas, proofreading, and explaining how to write good research papers. I would like to thank them for encouraging my research and empowering me to grow as a research scientist. I could not have imagined having a better advisor and mentor for my Ph.D. study.

I would like to express my appreciation to my supervisor in my Master's course, Prof. Nguyen Thanh Thuy, School of Information and Communication Technology, HUST, and to Dr. Nguyen Vu Quoc Hung, my supervisor in my Bachelor's course at Hanoi National University of Education. They shaped my knowledge for excelling in my studies.

In the process of carrying out and completing my research, I have received much support from the board of MICA directors and my colleagues at the Speech Communication department. In particular, I am very thankful to Prof. Pham Thi Ngoc Yen, Prof. Eric Castelli, Dr. Nguyen Viet Son, and Dr. Dao Trung Kien, who provided me with the opportunity to join research work at the MICA institute and to have access to the laboratory and research facilities. Without their precious support it would have been impossible to conduct this research. My warm thanks go to my colleagues at the Speech Communication department of the MICA institute for their useful comments on my study and their unconditional support over four years, both at work and outside of work.

I am very grateful to my internship supervisor, Prof. Nobutaka Ono, and the members of Ono's Lab at the National Institute of Informatics, Japan, for warmly welcoming me into their lab and for the helpful research collaboration they offered. I much appreciate his help in funding my conference trip and introducing me to the signal processing research communities. I would also like to thank Dr. Toshiya Ohshima, MSc. Yasutaka Nakajima, MSc. Chiho Haruta, and other researchers at Rion Co., Ltd., Japan, for welcoming me to their company and providing me with data for experiments.

I would also like to sincerely thank Dr. Nguyen Quang Khanh, dean of the Information Technology Faculty, and Assoc. Prof. Le Thanh Hue, dean of the Economic Informatics Department, at Hanoi University of Mining and Geology (HUMG), where I am working. I have received financial and time support from my office and leaders for completing my doctoral thesis. Grateful thanks also go to my wonderful colleagues and friends Nguyen Thu Hang, Pham Thi Nguyet, Vu Thi Kim Lien, Vo Thi Thu Trang, Pham Quang Hien, Nguyen The Binh, Nguyen Thuy Duong, Nong Thi Oanh, and Nguyen Thi Hai Yen, who have given me unconditional support and help over a long time. A special thank-you goes to Dr. Le Hong Anh for his encouragement and precious advice.

Last but not least, I would like to express my deepest gratitude to my family. I am very grateful to my mother-in-law and father-in-law for their support in times of need, and for always allowing me to focus on my work. I dedicate this thesis to my mother and father with special love; they have been great mentors in my life and have constantly encouraged me to be a better person. The struggle and sacrifice of my parents always motivate me to work hard in my studies. I would also like to express my love to my younger sisters and younger brother for their encouragement and help. This work has become more wonderful because of the love and affection that they have provided.

A special love goes to my beloved husband, Tran Thanh Huan, for his patience and understanding, and for always being there for me to share the good and bad times. I also appreciate my sons, Tran Tuan Quang and Tran Tuan Linh, for always cheering me up with their smiles. Without their love, this thesis would not have been completed.

Thank you all!

Hanoi, February 2019
Ph.D. Student
Duong Thi Hien Thanh
CONTENTS

DECLARATION OF AUTHORSHIP i
ACKNOWLEDGEMENT ii
CONTENTS iv
NOTATIONS AND GLOSSARY viii
LIST OF TABLES xi
LIST OF FIGURES xii
INTRODUCTION 1
Chapter 1 AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART 10
1.1 Audio source separation: a solution for the cocktail party problem 10
1.1.1 General framework for source separation 10
1.1.2 Problem formulation 11
1.2 State of the art 13
1.2.1 Spectral models 13
1.2.1.1 Gaussian Mixture Model 14
1.2.1.2 Nonnegative Matrix Factorization 15
1.2.1.3 Deep Neural Networks 16
1.2.2 Spatial models 18
1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD) 18
1.2.2.2 Rank-1 covariance matrix 19
1.2.2.3 Full-rank spatial covariance model 20
1.3 Source separation performance evaluation 21
1.3.1 Energy-based criteria 22
1.3.2 Perceptually-based criteria 23
1.4 Summary 23
Chapter 2 NONNEGATIVE MATRIX FACTORIZATION 24
2.1 NMF introduction 24
2.1.1 NMF in a nutshell 24
2.1.2 Cost function for parameter estimation 26
2.1.3 Multiplicative update rules 27
2.2 Application of NMF to audio source separation 29
2.2.1 Audio spectra decomposition 29
2.2.2 NMF-based audio source separation 30
2.3 Proposed application of NMF to unusual sound detection 32
2.3.1 Problem formulation 33
2.3.2 Proposed methods for non-stationary frame detection 34
2.3.2.1 Signal energy based method 34
2.3.2.2 Global NMF-based method 35
2.3.2.3 Local NMF-based method 35
2.3.3 Experiment 37
2.3.3.1 Dataset 37
2.3.3.2 Algorithm settings and evaluation metrics 37
2.3.3.3 Results and discussion 38
2.4 Summary 43
Chapter 3 SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY CONSTRAINT 44
3.1 General workflow of the proposed approach 44
3.2 GSSM formulation 46
3.3 Model fitting with sparsity-inducing penalties 46
3.3.1 Block sparsity-inducing penalty 47
3.3.2 Component sparsity-inducing penalty 48
3.3.3 Proposed mixed sparsity-inducing penalty 49
3.4 Derived algorithm in the unsupervised case 49
3.5 Derived algorithm in the semi-supervised case 52
3.5.1 Semi-GSSM formulation 52
3.5.2 Model fitting with mixed sparsity and algorithm 54
3.6 Experiment 54
3.6.1 Experiment data 54
3.6.1.1 Synthetic dataset 55
3.6.1.2 SiSEC-MUS dataset 55
3.6.1.3 SiSEC-BGN dataset 56
3.6.2 Single-channel source separation performance with unsupervised setting 57
3.6.2.1 Experiment settings 57
3.6.2.2 Evaluation method 57
3.6.2.3 Results and discussion 61
3.6.3 Single-channel source separation performance with semi-supervised setting 65
3.6.3.1 Experiment settings 65
3.6.3.2 Evaluation method 65
3.6.3.3 Results and discussion 65
3.7 Summary 66
Chapter 4 MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK 68
4.1 Formulation and modeling 68
4.1.1 Local Gaussian model 68
4.1.2 NMF-based source variance model 70
4.1.3 Estimation of the model parameters 71
4.2 Proposed GSSM-based multichannel approach 72
4.2.1 GSSM construction 72
4.2.2 Proposed source variance fitting criteria 73
4.2.2.1 Source variance denoising 73
4.2.2.2 Source variance separation 74
4.2.3 Derivation of MU rule for updating the activation matrix 75
4.2.4 Derived algorithm 77
4.3 Experiment 79
4.3.1 Dataset and parameter settings 79
4.3.2 Algorithm analysis 80
4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations 80
4.3.2.2 Separation results with different choices of λ and γ 81
4.3.3 Comparison with the state of the art 82
CONCLUSION AND PERSPECTIVES 93
BIBLIOGRAPHY 96
LIST OF PUBLICATIONS 113
NOTATIONS AND GLOSSARY
Standard mathematical symbols
C Set of complex numbers
R Set of real numbers
Z Set of integers
E Expectation of a random variable
Nc Complex Gaussian distribution
Vectors and matrices
a Scalar
a Vector
A Matrix
A^T Matrix transpose
A^H Matrix conjugate transpose (Hermitian conjugation)
diag(a) Diagonal matrix with a as its diagonal
det(A) Determinant of matrix A
tr(A) Matrix trace
A ⊙ B Element-wise (Hadamard) product of two matrices of the same dimension, with elements [A ⊙ B]_{ij} = A_{ij} B_{ij}
A^{.(n)} Matrix with entries [A^{.(n)}]_{ij} = A_{ij}^{n} (element-wise power)
‖a‖_1 ℓ1-norm of a vector
‖A‖_1 ℓ1-norm of a matrix
Indices
f Frequency index
i Channel index
j Source index
n Time frame index
t Time sample index
F Number of frequency bins
N Number of time frames
K Number of spectral basis vectors
Mixing filters
A ∈ R^{I×J×L} Array of mixing filters
a_j(τ) ∈ R^I Mixing filter of the j-th source to all microphones, τ is the time delay
a_{ij}(t) ∈ R Filter coefficient at the t-th time index
a_{ij} ∈ R^L Time-domain filter vector
a_{ij} ∈ C^L Frequency-domain filter vector
a_{ij}(f) ∈ C Filter coefficient at the f-th frequency bin
General parameters
x(t) ∈ R^I Time-domain mixture signal
s(t) ∈ R^J Time-domain source signals
c_j(t) ∈ R^I Time-domain j-th source image
s_j(t) ∈ R Time-domain j-th original source signal
x(n, f) ∈ C^I Time-frequency domain mixture signal
s(n, f) ∈ C^J Time-frequency domain source signals
c_j(n, f) ∈ C^I Time-frequency domain j-th source image
v_j(n, f) ∈ R Time-dependent variance of the j-th source
R_j(f) ∈ C^{I×I} Time-independent covariance matrix of the j-th source
Σ_j(n, f) ∈ C^{I×I} Covariance matrix of the j-th source image
Σ̂_x(n, f) ∈ C^{I×I} Empirical mixture covariance
V ∈ R_+^{F×N} Power spectrogram matrix
W ∈ R_+^{F×K} Spectral basis matrix
H ∈ R_+^{K×N} Time activation matrix
U ∈ R_+^{F×K} Generic source spectral model (GSSM)
Abbreviations

APS Artifacts-related Perceptual Score
BSS Blind Source Separation
DoA Direction of Arrival
DNN Deep Neural Network
EM Expectation Maximization
ICA Independent Component Analysis
IPS Interference-related Perceptual Score
IS Itakura-Saito
ISR source Image to Spatial distortion Ratio
ISTFT Inverse Short-Time Fourier Transform
IID Interchannel Intensity Difference
ITD Interchannel Time Difference
GCC-PHAT Generalized Cross Correlation with Phase Transform
GMM Gaussian Mixture Model
GSSM Generic Source Spectral Model
KL Kullback-Leibler
LGM Local Gaussian Model
MAP Maximum A Posteriori
ML Maximum Likelihood
MU Multiplicative Update
NMF Non-negative Matrix Factorization
OPS Overall Perceptual Score
PLCA Probabilistic Latent Component Analysis
SAR Signal to Artifacts Ratio
SDR Signal to Distortion Ratio
SIR Signal to Interference Ratio
SiSEC Signal Separation Evaluation Campaign
SNMF Spectral Non-negative Matrix Factorization
SNR Signal to Noise Ratio
STFT Short-Time Fourier Transform
TDOA Time Difference of Arrival
T-F Time-Frequency
TPS Target-related Perceptual Score
LIST OF TABLES

2.1 Total number of different events detected from three recordings in spring
2.2 Total number of different events detected from three recordings in summer 41
2.3 Total number of different events detected from three recordings in winter
3.1 List of song snippets in the SiSEC-MUS dataset 56
3.2 Source separation performance obtained on the Synthetic and SiSEC-MUS datasets with unsupervised setting 59
3.3 Speech separation performance obtained on the SiSEC-BGN. ∗ indicates submissions by the authors and "-" indicates missing information [81, 98, 100] 60
3.4 Speech separation performance obtained on the Synthetic dataset with semi-supervised setting 66
4.1 Speech separation performance obtained on the SiSEC-BGN dev set - Comparison with closest baseline methods 85
4.2 Speech separation performance obtained on the SiSEC-BGN dev set - Comparison with state-of-the-art methods in SiSEC. ∗ indicates submissions by the authors and "-" indicates missing information 86
4.3 Speech separation performance obtained on the test set of the SiSEC-BGN. ∗ indicates submissions by the authors [81] 91

LIST OF FIGURES
1 A cocktail party effect 2
2 Audio source separation 3
3 Live recording environments 4
1.1 Source separation general framework 11
1.2 Audio source separation: a solution for the cocktail party problem 13
1.3 IID corresponding to two sources in an anechoic environment 19
2.1 Decomposition model of NMF [36] 25
2.2 Spectral decomposition model based on NMF (K = 2) [66] 29
2.3 General workflow of supervised NMF-based audio source separation 30
2.4 Image of overlapping blocks 34
2.5 General workflow of the NMF-based nonstationary segment extraction 35
2.6 Number of different events detected by the methods from (a) the recordings in Spring, (b) the recordings in Summer, and (c) the recordings in Winter 39
3.1 Proposed weakly-informed single-channel source separation approach 45
3.2 Generic source spectral model (GSSM) construction 47
3.3 Estimated activation matrix H: (a) without a sparsity constraint, (b) with a block sparsity-inducing penalty (3.5), (c) with a component sparsity-inducing penalty (3.6), and (d) with the proposed mixed sparsity-inducing penalty (3.7) 48
3.4 Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of MU iterations 61
3.5 Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of λ and γ 62
3.6 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods over the dev set in SiSEC-BGN 63
3.7 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods over the test set in SiSEC-BGN 63
4.1 General workflow of the proposed source separation approach. The top green dashed box describes the training phase for the GSSM construction. Bottom blue boxes indicate processing steps for source separation. Green dashed boxes indicate the novelty compared to the existing works [6, 38, 107] 73
4.2 Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of EM and MU iterations. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR 81
4.3 Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of λ and γ. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR 82
4.4 Average speech separation performance obtained by the proposed methods and the closest existing algorithms in terms of the energy-based criteria 88
4.5 Average speech separation performance obtained by the proposed methods and the closest existing algorithms in terms of the perceptually-based criteria 88
4.6 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods in terms of the energy-based criteria
4.7 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods in terms of the perceptually-based criteria 89
4.8 Boxplot for the speech separation performance obtained by the proposed "GSSM + SV denoising" (P1) and "GSSM + SV separation" (P2) methods 90
INTRODUCTION

In this part, we introduce the motivation and the problem that we focus on throughout this thesis. Then, we emphasize the objectives as well as the scope of our work. In addition, our contributions in this thesis are summarized in order to give a clear view of the achievements. Finally, the structure of the thesis is presented chapter by chapter.
1 Background and Motivation
1.1 Cocktail party problem
Real-world sound scenarios are usually very complicated, as they are mixtures of many different sound sources. Fig. 1 depicts the scenario of a typical cocktail party, where there are many people attending, many conversations going on simultaneously, and various disturbances like loud music, people screaming, and a lot of hustle-bustle. Other similar situations also happen in daily life, for example, in outdoor recordings, where there is interference from a variety of environmental sounds, or in a music concert scenario, where a number of musical instruments are played and the audience gets to listen to the collective sound.

In such settings, what is actually heard by the ears is a mixture of various sounds that are generated by various audio sources. The mixing process can contain many sound reflections from walls and ceilings, which is known as reverberation. Humans with normal hearing ability are generally able to locate, identify, and differentiate sound sources which are heard simultaneously so as to understand the conveyed information. However, this task has remained extremely challenging for machines, especially in highly noisy and reverberant environments. The cocktail party effect described above prevents both humans and machines from perceiving the target sound sources [2, 12, 145], and the creation of machine listening algorithms that can automatically separate sound sources in difficult mixing conditions remains an open problem.
Audio source separation aims at providing machine listeners with a perception similar to that of the human ears by separating and extracting the signals of individual sources from a given mixture. This technique is formally termed blind source separation (BSS) when no prior information about either the sources or the mixing condition is available, as described in Fig. 2. Audio source separation is also known as an effective solution to the cocktail party problem in the audio signal processing community [85, 90, 138, 143, 152]. Depending on the specific application, some source separation approaches focus on speech separation, in which the speech signal is extracted from a mixture containing background noise and other unwanted sounds. Other methods deal with music separation, in which the singing voice and certain instruments are recovered from a mixture or song containing multiple musical instruments. The separated source signals may be either listened to or further processed, giving rise to many potential applications. Speech separation is mainly used for speech enhancement in hearing aids, hands-free phones, or automatic speech recognition (ASR) in adverse conditions [11, 47, 64, 116, 129], while music separation has many interesting applications, including editing/remixing in music post-production, up-mixing, music information retrieval, rendering of stereo recordings, and karaoke [37, 51, 106, 110].
Figure 1: A cocktail party effect

Over the last couple of decades, efforts have been undertaken by the scientific community, from various backgrounds such as signal processing, mathematics, statistics, neural networks, and machine learning, to build audio source separation systems. The problem has been studied at various levels of complexity, and different approaches and systems have come up. Despite numerous efforts, the problem is not completely solved yet, as the obtained separation results are still far from perfect, especially in challenging conditions such as moving sound sources and high reverberation.

Figure 2: Audio source separation
• Overdetermined, determined, and underdetermined mixture
There are three different settings in audio source separation under therelation- ship between the number of sources J and the number ofmicrophones I : In case the number of the microphones is larger than that ofthe sources, J < I , the number of observable variables are more than theunknown variables and hence it is referred to as overdetermined case If J =
I , we have as many observable variables as unknowns, and this is adetermined case The more dificult soure separation case is that the number
of unknowns are more than the number of observable variables, J > I , which
is called the underdetermined case
Furthermore, if I = 1 then it is a single-channel case If I > 1 then it is amulti-channel case
• Instantaneous, anechoic, and reverberant mixing environment
Trang 22Apart from the mixture settings based on the relationship between thenumber of sources and the number of microphones, audio source separationalgorithms can also be distinguished based on the target mixing condition theydeal with
Trang 23The simplest case deals with instantaneous mixtures, such as certain musicmix- tures generated by amplitude panning In this case, there is no timedelay, a mixture at a given time is essentially a weighted sum of the sourcesignals at the same time instant There are two other typical types of the liverecording environments, anechoic and reverberant, as shown in Fig 3 In theanechoic environments such as studio or outdoor, the microphones captureonly the direct sound propagation from a source With reverberantenvironments such as real meeting rooms or chambers, the microphonescapture not only the direct sound but also many sound reflections from walls,ceilings, and floors The modeling of the reverberant environment is muchmore difficult than the instantaneous and
anechoic cases
Figure 3: Live recording environments2
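To make these mixing conditions concrete, the following short sketch (our illustration only; the signals, gains, and filter lengths are made-up placeholders, not the thesis's experimental settings) simulates a two-source, two-microphone mixture, first by instantaneous mixing and then by convolutive mixing with decaying random impulse responses standing in for room reverberation:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                       # sampling rate (Hz)
J, I, T = 2, 2, fs               # 2 sources, 2 microphones, 1 s of audio
s = rng.standard_normal((J, T))  # placeholder source signals

# Instantaneous mixing: x(t) = A s(t) with an I x J gain matrix A.
A = np.array([[0.8, 0.3],
              [0.2, 0.7]])
x_inst = A @ s

# Convolutive (reverberant) mixing: each source reaches each microphone
# through a filter a_ij; decaying random FIRs stand in for room impulse
# responses, so x_i(t) = sum_j (a_ij * s_j)(t).
L = 2048                         # filter length (~128 ms at 16 kHz)
decay = np.exp(-np.arange(L) / (0.02 * fs))
a = rng.standard_normal((I, J, L)) * decay
x_rev = np.zeros((I, T + L - 1))
for i in range(I):
    for j in range(J):
        x_rev[i] += np.convolve(a[i, j], s[j])

print(x_inst.shape, x_rev.shape)  # (2, 16000) (2, 18047)
```

With I = J = 2 this toy setup is the determined case; keeping only one microphone (I = 1) would give the underdetermined single-channel setting addressed in Chapter 3.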
State-of-the-art audio source separation algorithms perform quite well in instantaneous or noiseless anechoic conditions, but their results are still far from perfect as the amount of reverberation grows. These numerical performance results are clearly shown in the recent community-based Signal Separation Evaluation Campaigns (SiSEC) [5, 99, 101, 133, 134] and others [65, 135]. This shows that addressing the separation of reverberant mixtures, a common case in real-world recording applications, remains one of the key scientific challenges in the source separation community. Moreover, when the desired sound is corrupted by high-level background noise, the separation becomes even more challenging.
To improve the separation performance, informed approaches have been proposed and have emerged over the last decade in the literature [78, 136]. Such approaches exploit side information about one or all of the sources themselves, or about the mixing condition, in order to guide the separation process. Examples of the investigated side information include deformed or hummed references of one (or more) source(s) in a given mixture [123, 126], text associated with spoken speech [83], the score associated with musical sources [37, 51], and motion associated with audio-visual objects in a video [110].
Following this trend, our research focuses on using a weakly-informed strategy to target the determined/underdetermined and highly reverberant audio source separation challenges. We use very abstract semantic information, just about the types of audio sources existing in the mixture, to guide the separation process.
2 Objective and scope
2.1 Objective

For evaluation, both speech and music separation are considered. We consider speech separation for the speech enhancement task, and both singing voice and musical instrument separation for the music task. In order to compare the obtained separation results fairly with other existing methods, we use a benchmark dataset in addition to our own synthetic dataset. This well-designed benchmark dataset is from the Signal Separation Evaluation Campaign (SiSEC) for the speech and real-world background noise separation task and the music separation task. Using these datasets allows us to join in our research community's activities. Especially, we target participating in the SiSEC challenge so as to bring our developed algorithms to the international research community.
2.2 Scope
In our study, we aim to recover the original sources (in the single-channel setting) or the spatial images of each source (in the multi-channel setting) from the observed audio mixture. The source spatial images are the contributions of those sources to the mixture signal. For example, for speech recordings in real-world environments, the spatial images are the speech signals recorded at the microphones after propagating from the speaker to the microphones.
Furthermore, as we focus on weakly-informed source separation, we assume that the number of sources and the types of sources are known a priori. For instance, the mixture is composed of speech and noise in the speech separation context, or of vocals and musical instruments in the music separation context.
3 Contributions
Aiming to tackle real-world recordings with the challenging settings mentioned earlier, we have proposed novel separation algorithms for both the single-channel and multi-channel cases. The achieved results have been described in seven publications. The results of our algorithms were also submitted to the international source separation campaign SiSEC 2016 [81] and obtained the best performance in terms of the energy-based criteria. More specifically, the main contributions are described as follows:
• We have proposed a novel single-channel audio source separation algorithm weakly guided by some source examples. This algorithm exploits the generic source spectral model (GSSM), which represents the spectral characteristics of audio sources, to guide the separation process. With that, a new mixed sparsity-inducing penalty has been introduced for the model fitting. We have evaluated the speech separation performance of the proposed algorithm in both unsupervised and semi-supervised settings. We have also analyzed the algorithm's convergence as well as its stability with respect to the parameter settings.
These contributions were published in four scientific papers (papers 1, 2, 4, and 5 in "List of publications").
• A novel multi-channel audio source separation algorithm weakly guided by some source examples has been proposed. This algorithm exploits the use of the generic source spectral model learned by NMF within the well-established local Gaussian model. We have proposed two new optimization criteria: the first one constrains the variances of each source by NMF, while the second criterion constrains the total variances of all sources altogether. The corresponding EM algorithms for parameter estimation have also been derived. We have investigated the sensitivity of the proposed algorithm to parameters as well as its convergence in order to guide parameter settings in practical implementations.
As another important contribution, we participated in the SiSEC challenges so that our proposed approach is visible to the international research community. Evaluated fairly by the SiSEC organizers, our proposed algorithm obtained the best source separation results in terms of the energy-based criteria in the SiSEC 2016 campaign.
4 Structure of thesis
The work presented in this thesis is structured in four chapters as follows:
• Chapter 1: Audio source separation: Formulation and state of the art
We introduce the general framework and the mathematical formulation of the considered audio source separation problem as well as the notations used in this thesis. This is followed by an overview of the state-of-the-art audio source separation methods, which exploit different spectral models and spatial models. Also, the two families of criteria that are used for source separation performance evaluation are presented in this chapter.
• Chapter 2: Nonnegative matrix factorization
This chapter first introduces NMF, which has received a lot of attention in the audio processing community; a minimal code sketch of its multiplicative updates is given at the end of this introduction. It is followed by a baseline supervised algorithm based on the NMF model aiming to separate audio sources from the observed mixture. At the end of this chapter, we propose novel methods for automatically detecting non-stationary segments using NMF for effective sound annotation.
• Chapter 3: Proposed single-channel audio source separation approach
We present the proposed weakly-informed method for single-channel audio source separation, targeting both the unsupervised and semi-supervised settings. The algorithm is based on NMF with mixed sparsity constraints. In this method, the generic spectral characteristics of the sources are first learned from several training signals by NMF. They are then used to guide the similar factorization of the observed power spectrogram into each source. We also propose to combine two existing group sparsity-inducing penalties in the optimization process and adapt the corresponding algorithm for parameter estimation based on the multiplicative update (MU) rule. The last section of this chapter is devoted to the experimental evaluation. We show the effectiveness of the proposed approach in both unsupervised and semi-supervised settings.
• Chapter 4: Proposed multichannel audio source separation approach
This chapter is a significant extension of the work presented in Chapter 3 to the multi-channel case. We describe a novel multichannel audio source separation algorithm weakly guided by some source examples, where the NMF-based GSSM is combined with the full-rank spatial covariance model in a Gaussian modeling paradigm. We then present the generalized expectation-maximization (EM) algorithm for the parameter estimation. Especially, for guiding the estimation of the intermediate source variances in each EM iteration, we investigate the use of two criteria: (1) the estimated variances of each source are constrained by NMF, and (2) the total variances of all sources are constrained by NMF altogether. Through experiments, the separation performances obtained by the proposed algorithms are analyzed and compared with state-of-the-art and baseline algorithms. Moreover, analysis results regarding the sensitivity of the proposed algorithms to parameter settings as well as their convergence are also addressed in this chapter.
In the last part of the thesis, we present the conclusion and perspectives for future research directions.
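As a companion to the chapter overview above, the following minimal sketch illustrates the multiplicative-update NMF that underlies Chapters 2 to 4 (our simplified illustration under the Itakura-Saito divergence listed in the glossary; the matrix sizes are arbitrary, and the thesis's actual algorithms further add the GSSM and sparsity penalties):

```python
import numpy as np

def is_nmf(V, K=10, n_iter=100, eps=1e-12):
    """Factorize a power spectrogram V (F x N) as V ~ W @ H with
    nonnegative W (F x K) and H (K x N), using the standard
    multiplicative update (MU) rules for the Itakura-Saito divergence."""
    F, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        Vh = W @ H
        H *= (W.T @ (V / Vh**2)) / (W.T @ (1.0 / Vh))  # update activations
        Vh = W @ H
        W *= ((V / Vh**2) @ H.T) / ((1.0 / Vh) @ H.T)  # update spectral basis
    return W, H

# Mock power spectrogram (in the thesis this is |STFT|^2 of the mixture).
V = np.abs(np.random.randn(513, 200)) ** 2 + 1e-12
W, H = is_nmf(V)
print(W.shape, H.shape)  # (513, 10) (10, 200)
```

The multiplicative form of the updates guarantees that W and H stay nonnegative as long as they are initialized with positive values.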
Chapter 1
AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART

1.1 Audio source separation: a solution for the cocktail party problem

1.1.1 General framework for source separation

Two types of cues can be exploited for the separation process, called spectral cues and spatial cues. Spectral cues describe the spectral structures of the sources, while spatial cues are information about the source spatial positions [22, 85, 97]. They will be discussed in more detail in Sections 1.2.1 and 1.2.2, respectively. It can be seen that spectral cues alone are not able to distinguish sources with similar pitch range and timbre, while spatial cues alone cannot distinguish sources located at nearby positions. A general source separation framework operates in the time-frequency domain after the short-time Fourier transform (STFT) and consists of two modeling cues as in Fig. 1.1: (1) the spectral model exploits the spectral characteristics of the sources, and (2) the spatial model performs the modeling and exploitation of spatial information. Finally, the estimated time-domain source signals are obtained via the inverse short-time Fourier transform (ISTFT).
Figure 1.1: Source separation general framework
1.1.2 Problem formulation
Multichannel audio mixtures are the types of recordings that we obtain when we employ microphone arrays [14, 22, 85, 90, 92]. Let us formulate the multichannel mixture signal, where J sources are observed by an array of I microphones, with indexes j ∈ {1, 2, ..., J} and i ∈ {1, 2, ..., I} to indicate a specific source j and channel i. This mixture signal is denoted by x(t) = [x_1(t), ..., x_I(t)]^T ∈ R^{I×1} and is the sum of the contributions from all sources as [85]:

x(t) = \sum_{j=1}^{J} c_j(t)    (1.1)

where c_j(t) ∈ R^{I×1} is the spatial image of the j-th source.

From a physical point of view, sound sources are typically divided into two types: point sources and diffuse sources. A point source is the case in which sound emits from a single point in space, e.g., an unmoving human speaker, a water drop, or a singer singing alone. A diffuse source is the case in which sound comes from a region of space, e.g., water drops in the rain or singers singing in a choir. Diffuse sources can be considered as a collection of point sources [85, 141]. In the case where the j-th source is a point source, its spatial image can be modeled by the convolutive mixing process

c_j(t) = \sum_{\tau} a_j(\tau) s_j(t - \tau)    (1.2)

where s_j(t) is the j-th original source signal and a_j(τ) ∈ R^I denotes the mixing filters modeling the acoustic path from the j-th source to all microphones.
Audio source separation systems often operate in the time-frequency (T-F) domain, in which the temporal characteristics and the spectral characteristics of audio can be jointly represented. The most commonly used time-frequency representation is the short-time Fourier transform (STFT) [3, 125]. STFT analysis refers to computing the time-frequency representation from the time-domain waveform by creating overlapping frames along the waveform and applying the discrete Fourier transform to each frame.
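A minimal illustration of this framing procedure follows (our sketch; the frame length, hop size, and window are arbitrary choices, not settings used in the thesis experiments):

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """Naive STFT: slice overlapping Hann-windowed frames and apply
    the FFT to each one. Returns an F x N complex matrix with
    F = n_fft // 2 + 1 frequency bins and N time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    X = np.empty((n_fft // 2 + 1, n_frames), dtype=complex)
    for n in range(n_frames):
        frame = x[n * hop : n * hop + n_fft] * window
        X[:, n] = np.fft.rfft(frame)
    return X

x = np.random.randn(16000)   # 1 s of audio at 16 kHz (placeholder)
X = stft(x)
V = np.abs(X) ** 2           # power spectrogram, the input used by NMF
print(X.shape)               # (513, 30)
```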
Switching to the T-F domain, equation (1.1) can be written as

x(n, f) = \sum_{j=1}^{J} c_j(n, f)    (1.3)

where c_j(n, f) ∈ C^{I×1} and x(n, f) ∈ C^{I×1} denote the T-F representations computed from c_j(t) and x(t), respectively, n = 1, 2, ..., N is the time frame index, and f = 1, 2, ..., F is the frequency bin index.
A common assumption in array signal processing is the narrowband assumption on the source signal [118]. Under the narrowband assumption, the convolutive mixing model (1.2) may be approximated by a complex-valued multiplication in each T-F bin (n, f), given by

c_j(n, f) ≈ a_j(f) s_j(n, f)    (1.4)

where c_j(n, f) and s_j(n, f) are the STFT coefficients of c_j(t) and s_j(t), respectively, and a_j(f) is the Fourier transform of a_j(τ).
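Equation (1.4) can be checked numerically; the sketch below (ours, with a made-up signal and filter) compares the STFT of a convolved source image against the per-bin product of the filter's frequency response and the source's STFT. The approximation is accurate when the mixing filter is short relative to the STFT frame:

```python
import numpy as np
from scipy.signal import stft

fs, n_fft = 16000, 1024
rng = np.random.default_rng(1)
s = rng.standard_normal(fs)                                  # source s_j(t)
a = rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0)  # short filter

c = np.convolve(a, s)[:fs]                # source image c_j(t) = (a_j * s_j)(t)
_, _, C = stft(c, fs=fs, nperseg=n_fft)   # STFT of the source image
_, _, S = stft(s, fs=fs, nperseg=n_fft)   # STFT of the source
A = np.fft.rfft(a, n_fft)                 # filter frequency response a_j(f)

C_approx = A[:, None] * S                 # narrowband model, equation (1.4)
err = np.linalg.norm(C - C_approx) / np.linalg.norm(C)
print(f"relative error of the narrowband approximation: {err:.3f}")
```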
Source separation consists in recovering either the J original source signals s_j(t) or their spatial images c_j(t) given the I-channel mixture signal x(t). The objective of our research, as mentioned previously, is to recover the spatial image c_j(t) of each source from the observed mixture, as shown in Fig. 1.2. Note that in our study, background noise is also considered as a source. This definition applies to both point sources and diffuse sources in both live recordings and artificially-mixed recordings.

Figure 1.2: Audio source separation: a solution for the cocktail party problem
1.2 State of the art
As discussed in Section 1.1.1, a standard architecture for a source separation system includes two models: the spectral model formulates the spectral characteristics of the sources, and the spatial model exploits the spatial information of the sources. An advantage of this architecture is that it offers modularity, and we can mix and match any mixing filter estimation technique with any spectral source estimation technique. Besides, some approaches to source separation can also recover the sources by directly exploiting either the spectral sources or the mixing filters. The whole BSS picture built over more than two decades of research is very large, consisting of many different techniques and requiring an intensive survey; see, e.g., [22, 54, 85, 112, 138, 141]. In this section, we limit our discussion to some popular spectral and spatial models. They are combined or used individually in the state-of-the-art algorithms in different ways.
1.2.1 Spectral models