(Doctoral Dissertation) Audio Source Separation Using a Generic Source Spectral Model Based on Nonnegative Matrix Factorization.pdf


DOCUMENT INFORMATION

Title: Audio source separation exploiting NMF-based generic source spectral model
Author: Duong Thi Hien Thanh
Supervisors: Assoc. Prof. Dr. Nguyen Quoc Cuong, Dr. Nguyen Cong Phuong
University: Hanoi University of Science and Technology
Major: Computer Science
Document type: Doctoral dissertation
Year of publication: 2019
City: Hanoi
Number of pages: 129
File size: 1.79 MB


Structure

  • Chapter 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART
    • 1.1 Audio source separation: a solution for the cocktail party problem
      • 1.1.1 General framework for source separation
      • 1.1.2 Problem formulation
    • 1.2 State of the art
      • 1.2.1 Spectral models
        • 1.2.1.1 Gaussian Mixture Model
        • 1.2.1.2 Nonnegative Matrix Factorization
        • 1.2.1.3 Deep Neural Networks
      • 1.2.2 Spatial models
        • 1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD)
        • 1.2.2.2 Rank-1 covariance matrix
        • 1.2.2.3 Full-rank spatial covariance model
    • 1.3 Source separation performance evaluation
      • 1.3.1 Energy-based criteria
      • 1.3.2 Perceptually-based criteria
    • 1.4 Summary
  • Chapter 2. NONNEGATIVE MATRIX FACTORIZATION
    • 2.1 NMF introduction
      • 2.1.1 NMF in a nutshell
      • 2.1.2 Cost function for parameter estimation
      • 2.1.3 Multiplicative update rules
    • 2.2 Application of NMF to audio source separation
      • 2.2.1 Audio spectra decomposition
      • 2.2.2 NMF-based audio source separation
    • 2.3 Proposed application of NMF to unusual sound detection
      • 2.3.1 Problem formulation
      • 2.3.2 Proposed methods for non-stationary frame detection
        • 2.3.2.1 Signal energy based method
        • 2.3.2.2 Global NMF-based method
        • 2.3.2.3 Local NMF-based method
      • 2.3.3 Experiment
        • 2.3.3.1 Dataset
        • 2.3.3.2 Algorithm settings and evaluation metrics
        • 2.3.3.3 Results and discussion
    • 2.4 Summary
  • Chapter 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY
    • 3.1 General workflow of the proposed approach
    • 3.2 GSSM formulation
    • 3.3 Model fitting with sparsity-inducing penalties
      • 3.3.1 Block sparsity-inducing penalty
      • 3.3.2 Component sparsity-inducing penalty
      • 3.3.3 Proposed mixed sparsity-inducing penalty
    • 3.4 Derived algorithm in unsupervised case
    • 3.5 Derived algorithm in semi-supervised case
      • 3.5.1 Semi-GSSM formulation
      • 3.5.2 Model fitting with mixed sparsity and algorithm
    • 3.6 Experiment
      • 3.6.1 Experiment data
        • 3.6.1.1 Synthetic dataset
        • 3.6.1.2 SiSEC-MUS dataset
        • 3.6.1.3 SiSEC-BGN dataset
      • 3.6.2 Single-channel source separation performance with unsupervised setting
        • 3.6.2.1 Experiment settings
        • 3.6.2.2 Evaluation method
        • 3.6.2.3 Results and discussion
      • 3.6.3 Single-channel source separation performance with semi-supervised setting
        • 3.6.3.1 Experiment settings
        • 3.6.3.2 Evaluation method
        • 3.6.3.3 Results and discussion
    • 3.7 Summary
  • Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK
    • 4.1 Formulation and modeling
      • 4.1.1 Local Gaussian model
      • 4.1.2 NMF-based source variance model
      • 4.1.3 Estimation of the model parameters
    • 4.2 Proposed GSSM-based multichannel approach
      • 4.2.1 GSSM construction
      • 4.2.2 Proposed source variance fitting criteria
        • 4.2.2.1 Source variance denoising
        • 4.2.2.2 Source variance separation
      • 4.2.3 Derivation of MU rule for updating the activation matrix
      • 4.2.4 Derived algorithm
    • 4.3 Experiment
      • 4.3.1 Dataset and parameter settings
      • 4.3.2 Algorithm analysis
        • 4.3.2.1 Algorithm convergence: separation results as functions of the number of iterations
        • 4.3.2.2 Separation results with different choices of λ and γ
      • 4.3.3 Comparison with the state of the art
    • 4.4 Summary

  List of Tables
    • 2.3 Total number of different events detected from three recordings in winter
    • 3.1 List of song snippets in the SiSEC-MUS dataset
    • 3.2 Source separation performance obtained on the Synthetic and SiSEC-MUS datasets
    • 3.3 Speech separation performance obtained on the SiSEC-BGN. ∗ indicates submissions by the authors
    • 3.4 Speech separation performance obtained on the Synthetic dataset with semi-supervised setting
    • 4.1 Speech separation performance obtained on the SiSEC-BGN-devset …
    • 4.2 Speech separation performance obtained on the SiSEC-BGN-devset …
    • 4.3 Speech separation performance obtained on the test set of the SiSEC-BGN. ∗ indicates submissions by the authors [81]

  List of Figures
    • 1.1 Source separation general framework
    • 1.2 Audio source separation: a solution for the cocktail party problem
    • 1.3 IID corresponding to two sources in an anechoic environment
    • 2.1 Decomposition model of NMF [36]
    • 2.2 Spectral decomposition model based on NMF (K = 2) [66]
    • 2.3 General workflow of supervised NMF-based audio source separation
    • 2.4 Image of overlapping blocks
    • 2.5 General workflow of the NMF-based nonstationary segment extraction
    • 2.6 Number of different events detected by the methods from (a) the …
    • 3.1 Proposed weakly-informed single-channel source separation approach
    • 3.2 Generic source spectral model (GSSM) construction
    • 3.3 Estimated activation matrix H: (a) without a sparsity constraint, (b) …
    • 3.4 Average separation performance obtained by the proposed method with …
    • 3.5 Average separation performance obtained by the proposed method with …
    • 3.6 Average speech separation performance obtained by the proposed method …
    • 3.7 Average speech separation performance obtained by the proposed method …
    • 4.2 Average separation performance obtained by the proposed method over …
    • 4.3 Average separation performance obtained by the proposed method over …
    • 4.4 Average speech separation performance obtained by the proposed method …
    • 4.5 Average speech separation performance obtained by the proposed method …
    • 4.6 Average speech separation performance obtained by the proposed method …
    • 4.7 Average speech separation performance obtained by the proposed method …
    • 4.8 Boxplot for the speech separation performance obtained by the proposed …

Contents

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART

Audio source separation: a solution for the cocktail party problem

1.1.1 General framework for source separation

Audio source separation is a signal processing task focused on recovering individual sounds, or sources, from an observed mixture, which can be either single-channel or multichannel. This process requires sophisticated systems capable of estimating the number of sources, determining the appropriate frequency basis and convolutive parameters for each source, applying effective separation algorithms, and accurately reconstructing the original signals.

Audio separation techniques utilize two primary types of cues: spectral cues, which describe the spectral structures of the sources, and spatial cues, which provide information about the sources' spatial positions. While spectral cues help characterize a source's spectral content, they are insufficient alone to distinguish sources with similar pitch and timbre. Conversely, spatial cues offer spatial localization but may not reliably differentiate sources coming from nearby directions. Therefore, most existing systems combine both spectral and spatial cues to achieve more accurate source separation.

Source separation algorithms typically operate in the time-frequency domain after applying the short-time Fourier transform (STFT). These algorithms rely on two key modeling cues: the spectral model, which leverages the spectral characteristics of the sources, and the spatial model, which utilizes spatial information to improve separation accuracy. The process culminates in reconstructing the time-domain source signals through the inverse short-time Fourier transform (ISTFT).

Figure 1.1: Source separation general framework.

Multichannel audio mixtures are the types of recordings that we obtain when we employ microphone arrays [14, 22, 85, 90, 92]. Let us formulate the multichannel mixture signal, where $J$ sources are observed by an array of $I$ microphones, with indexes $j \in \{1, 2, \ldots, J\}$ and $i \in \{1, 2, \ldots, I\}$ indicating a specific source $j$ and channel $i$. This mixture signal is denoted by $\mathbf{x}(t) = [x_1(t), \ldots, x_I(t)]^T \in \mathbb{R}^{I \times 1}$ and is the sum of the contributions from all sources as [85]:

$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t), \qquad (1.1)$$

where $\mathbf{c}_j(t) = [c_{1j}(t), \ldots, c_{Ij}(t)]^T$ represents the contribution of the $j$-th source to the microphone array, known as the spatial image of that source, and $[\cdot]^T$ denotes matrix or vector transposition. Both the mixture signal and the individual source spatial images are digital signals in the time domain, indexed by $t \in \{0, 1, \ldots, T-1\}$, where $T$ is the total length of the signals. This framework is the basis for analyzing and separating sources in multi-microphone audio processing applications.

Sound sources are generally categorized into two types: point sources and diffuse sources. Point sources emit sound from a single, fixed point in space, such as a solo singer, a water droplet, or a person speaking alone. In contrast, diffuse sources originate from a region in space, like rain droplets or a choir, and can be modeled as a collection of multiple point sources. For a specific point source, its spatial image can be written as

$$\mathbf{c}_j(t) = \sum_{\tau=0}^{\infty} \mathbf{a}_j(\tau)\, s_j(t-\tau), \qquad (1.2)$$

where $\mathbf{a}_j(\tau) = [a_{1j}(\tau), \ldots, a_{Ij}(\tau)]^T \in \mathbb{R}^{I \times 1}$, $j = 1, \ldots, J$, are mixing filters modeling the acoustic path from the $j$-th source to the $I$ microphones, $\tau$ is the time delay, and $s_j(t)$ is the single-channel source signal.
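To make the mixing model concrete, here is a minimal numerical sketch of (1.1) and (1.2) in Python, assuming synthetic white-noise sources and random finite-length mixing filters; all variable names and sizes are illustrative, not values from the thesis:

```python
# Minimal sketch of the convolutive mixing model (1.1)-(1.2).
import numpy as np

rng = np.random.default_rng(0)
I, J, T, La = 2, 3, 16000, 256            # channels, sources, samples, filter length

s = rng.standard_normal((J, T))           # single-channel source signals s_j(t)
a = rng.standard_normal((J, I, La)) / La  # mixing filters a_ij(tau), illustrative

# Spatial image of source j at mic i: c_ij(t) = sum_tau a_ij(tau) s_j(t - tau)
c = np.zeros((J, I, T))
for j in range(J):
    for i in range(I):
        c[j, i] = np.convolve(s[j], a[j, i])[:T]

x = c.sum(axis=0)                         # mixture x_i(t) = sum_j c_ij(t), eq. (1.1)
print(x.shape)                            # (I, T)
```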

Audio source separation systems typically operate in the time-frequency (T-F) domain, allowing for joint analysis of the temporal and spectral features of audio signals. The short-time Fourier transform (STFT) is the most commonly used T-F representation, as it converts time-domain waveforms into a combined time and frequency domain, facilitating effective source separation. STFT analysis involves dividing the waveform into overlapping frames and applying the Fourier transform to each frame, capturing the essential time-varying spectral information of audio signals.

Switching to the T-F domain, equation (1.1) can be written as

$$\mathbf{x}(n, f) = \sum_{j=1}^{J} \mathbf{c}_j(n, f), \qquad (1.3)$$

where $\mathbf{c}_j(n, f) \in \mathbb{C}^{I \times 1}$ and $\mathbf{x}(n, f) \in \mathbb{C}^{I \times 1}$ denote the T-F representations computed from $\mathbf{c}_j(t)$ and $\mathbf{x}(t)$, respectively, $n = 1, 2, \ldots, N$ is the time frame index, and $f = 1, 2, \ldots, F$ is the frequency bin index.

In array signal processing, a common assumption is that the source signals are narrowband. This simplifies the convolutive mixing model to an approximation by a complex-valued multiplication in each frequency bin, expressed as $\mathbf{c}_j(n, f) \approx \mathbf{a}_j(f)\, s_j(n, f)$, where $\mathbf{c}_j(n, f)$ and $s_j(n, f)$ are the STFT coefficients of the respective time-domain signals, and $\mathbf{a}_j(f)$ is the Fourier transform of the mixing filter $\mathbf{a}_j(\tau)$.

Source separation consists in recovering either the $J$ original source signals $s_j(t)$ or their spatial images $\mathbf{c}_j(t)$ given the $I$-channel mixture signal $\mathbf{x}(t)$. The objective of our research, as mentioned previously, is to recover the spatial images $\mathbf{c}_j(t)$ of the sources from the observed mixture, as shown in Fig. 1.2. Note that in our study, background noise is also considered a source. This definition applies to both point sources and diffuse sources, in both live recordings and artificially-mixed recordings.

Figure 1.2: Audio source separation: a solution for the cocktail party problem.

State of the art

A standard architecture for source separation systems involves two key models: the spectral model, which captures the spectral characteristics of the sources, and the spatial model, which leverages spatial information for better separation. This modular approach allows flexibility in combining different filtering and source estimation techniques, enhancing the effectiveness of separation algorithms. Additionally, some methods directly utilize the source spectra or the mixing filters themselves to recover the sources. Over the past two decades, numerous techniques have been developed within the blind source separation (BSS) field, making it a complex and expansive area covered by comprehensive surveys. In this discussion, we focus on popular spectral and spatial models that are either combined or used individually in advanced source separation algorithms.

This section discusses three widely studied source spectral models in the literature: the Spectral Gaussian Mixture Model (Spectral GMM), Spectral Nonnegative Matrix Factorization (Spectral NMF), and Deep Neural Networks (DNNs). Each offers its own advantages for representing spectral data: the Spectral GMM provides a probabilistic approach to modeling complex spectral variations, Spectral NMF decomposes spectra into meaningful components for improved interpretability, and DNNs leverage learned representations to capture intricate spectral features, making them highly effective in various applications.

Gaussian model-based approaches, such as the Spectral GMM, leverage the redundancy and inherent structure of each audio source to enhance audio source separation. By modeling the spectral characteristics with Gaussian mixtures, these methods effectively distinguish the individual sources, improving separation accuracy. Spectral GMM techniques are widely recognized for their ability to exploit spectral patterns, making them a robust choice in audio signal processing.

The short-time Fourier spectrum of the $j$-th source is represented as a column vector comprising all frequency elements $s_j(n, f)$ for $f = 1, \ldots, F$, denoted as $\mathbf{s}_j(n) = [s_j(n, f)]_f$. The Spectral GMM approach models $\mathbf{s}_j(n)$ as a multidimensional, zero-mean, complex-valued $K$-state Gaussian mixture with a probability density function (pdf) detailed in [7, 106]:

$$p(\mathbf{s}_j(n)) = \sum_{k=1}^{K} \delta_{jk}\, \mathcal{N}_c\big(\mathbf{s}_j(n); \mathbf{0}, \boldsymbol{\Sigma}_{jk}\big),$$

where $\delta_{jk}$ is a weight satisfying $\sum_{k=1}^{K} \delta_{jk} = 1$ for each source $j$, $\mathbf{0}$ denotes the zero mean, and $\boldsymbol{\Sigma}_{jk}$ is a diagonal spectral covariance matrix with entries $v_{jk}(f)$ capturing the spectral variance of the $k$-th state of the $j$-th source. This formulation models the spectral characteristics of the sources using weighted Gaussian components with the specified covariance structure.

This model can be interpreted as a two-step generation process. At each time frame of the $j$-th source, a state $k(n)$ is first selected with probability $\delta_{jk(n)}$. Given the selected state, the STFT coefficient vector $\mathbf{s}_j(n)$ is generated from a zero-mean Gaussian distribution with covariance matrix $\boldsymbol{\Sigma}_{jk(n)}$. The core goal of source separation is then to compute the posterior probability of all possible states at each time frame, enabling the sources to be accurately distinguished and isolated.

The Spectral GMM utilizes $K \times F$ free variances $v_{jk}(f)$ and leverages the global structure of sources to estimate these spectral variances. However, the traditional GMM does not explicitly account for amplitude variations in sound sources, which can lead to different estimated spectral variance templates for signals with similar spectral shapes but varying amplitude levels. To address this limitation, an improved version of GMM was proposed in 2006, enhancing the model's ability to handle amplitude variations effectively.

The Spectral Gaussian Scaled Mixture Model (Spectral GSMM) extends the Spectral GMM by incorporating a time-varying scaling parameter $g_{jk}(n)$ into each GMM component. This allows the model to adjust dynamically to temporal amplitude variations in the data. The pdf of the GSMM is formulated to include these scaling parameters, enabling more accurate modeling of signals whose spectral shapes are similar but whose levels vary over time.

Spectral GMM and Spectral GSMM have been effectively applied to single-channel audio source separation [13, 16]. These models have also been used for the stereo separation of moving sources [95]. Additionally, GMMs have been employed for multichannel instantaneous music mixtures, with the Spectral GMMs learned directly from the mixture signals to facilitate better source discrimination and separation [7].

Nonnegative matrix factorization (NMF) is a powerful dimension reduction technique tailored for analyzing nonnegative data, widely utilized across various machine learning and audio signal processing applications. Its effectiveness in extracting meaningful features has made it a popular choice in numerous fields; a detailed presentation is given in Chapter 2, as it is a foundational technique for this study.

In the following, we will review NMF as a structured spectral source model applied to audio source separation, known as Spectral NMF.

In the Spectral NMF model, each source $s_j$ is the sum of $K_j$ spectral basis components (also called frequency basis, basis spectra, or latent components) and is written as [102]

$$s_j(n, f) = \sum_{k=1}^{K_j} c_k(n, f).$$

The spectral basis components $c_k(n, f)$ are assumed to be mutually independent within each time-frequency bin, following a zero-mean Gaussian distribution with variances $h_{nk} w_{kf}$, where the spectral basis $w_{kf}$ captures the spectral structures of the signal and $h_{nk}$ models the time-varying activations. The source STFT coefficients $s_j(n, f)$ are then themselves independent zero-mean Gaussian variables whose variances sum over the spectral bases:

$$p(s_j(n, f)) = \mathcal{N}_c\Big(s_j(n, f);\, 0,\, \sum_{k=1}^{K_j} h_{nk} w_{kf}\Big).$$

Denoting by $\mathbf{S}_j = [s_j(n, f)]_{nf}$ the $N \times F$ matrix of STFT coefficients of the $j$-th source, $\mathbf{H}_j = [h_{nk}]_{nk}$ with dimension $N \times K_j$, and $\mathbf{W}_j = [w_{kf}]_{kf}$ with dimension $K_j \times F$, ML estimation of the latent variables $\mathbf{H}_j$ and $\mathbf{W}_j$ is equivalent to NMF of the power spectrogram $|\mathbf{S}_j|^2$ into $\mathbf{H}_j \mathbf{W}_j$ according to the divergence function $d$ as follows [40]:

$$-\log p(\mathbf{S}_j \,|\, \mathbf{H}_j, \mathbf{W}_j) \stackrel{c}{=} \sum_{n,f} d\big(|s_j(n, f)|^2 \,\big\|\, [\mathbf{H}_j \mathbf{W}_j]_{nf}\big), \qquad (1.11)$$

where $\stackrel{c}{=}$ denotes equality up to a constant, and the divergence function $d$ may be the Kullback-Leibler (KL) divergence $d_{KL}(x \| y) = x \log(x/y) - x + y$ or the Itakura-Saito (IS) divergence $d_{IS}(x \| y) = x/y - \log(x/y) - 1$; more detailed explanations of these divergences are provided in Chapter 2. NMF simplifies the estimation by requiring only the $N K_j$ entries of $\mathbf{H}_j$ and the $K_j F$ entries of $\mathbf{W}_j$ instead of all $N F$ entries of the power spectrogram $|\mathbf{S}_j|^2$, and $N K_j + K_j F \ll N F$. Thus NMF is considered a form of dimension reduction in this context.

Spectral NMF has been applied to single-channel audio source separation [115, 142] and multichannel audio source separation [102, 104] with different settings. In recent years, several studies have investigated user-guided NMF methods [26, 30, 37, 104, 126, 156] that incorporate specific information about the sources in order to improve the efficiency of the separation algorithm.

Recent studies demonstrate that deep neural networks (DNNs) excel at modeling complex functions and achieve high performance across various tasks, including audio signal processing. Traditional methods like GMM and NMF focus on learning the characteristics of the speech and noise signals to guide signal separation. In contrast, deep learning approaches leverage end-to-end training to directly learn separation masks or models, resulting in a significant improvement in speech separation performance.

In DNN-based speech separation, the mixture's time-frequency representation is pre-processed to extract relevant features, which are then input into a deep neural network. The DNN either directly estimates the time-frequency mask or infers the source spectra from which the mask is derived. Time-frequency masking filters the mixture's representation with a mask to estimate the spatial images, expressed as $\hat{\mathbf{c}}_j(n, f) = \hat{m}_j(n, f)\, \mathbf{x}(n, f)$, where the mask $\hat{m}_j(n, f)$ is specific to each source and each T-F bin. In audio enhancement, ideal real-valued scalar masks, either binary or ratio masks, are computed to optimize separation quality: the ratio mask is defined as $m^{\text{rat}}_j(f, n) = \|\mathbf{c}_j(n, f)\| / \|\mathbf{x}(n, f)\|$, and the binary mask is derived by applying a threshold to the ratio mask.
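As an illustration of T-F masking, the following minimal Python sketch computes a ratio mask and a thresholded binary mask and applies them to a mixture STFT; the toy arrays, the `eps` guard, and the 0.5 threshold are assumptions for the example, not values from the thesis:

```python
# Minimal sketch of time-frequency masking with ratio and binary masks.
import numpy as np

def ratio_mask(c_j, x, eps=1e-12):
    """Ratio mask m_j(n,f) = |c_j(n,f)| / |x(n,f)|, computed per T-F bin."""
    return np.abs(c_j) / (np.abs(x) + eps)

def binary_mask(c_j, x, threshold=0.5):
    """Binary mask obtained by thresholding the ratio mask."""
    return (ratio_mask(c_j, x) >= threshold).astype(float)

# Applying a mask to the mixture gives the source image estimate:
# c_hat_j(n,f) = m_j(n,f) * x(n,f)
rng = np.random.default_rng(1)
x = rng.standard_normal((513, 100)) + 1j * rng.standard_normal((513, 100))
c_j = 0.5 * x                       # toy "source" taking half the mixture amplitude
c_hat = ratio_mask(c_j, x) * x
```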

Recent studies have investigated and compared various DNN architectures and training criteria to enhance speech separation performance. These methods experiment with different deep neural networks to estimate a real-valued ratio mask $\hat{m}^{\text{rat}}_{\text{targ}}(f, n)$ for the target source. The networks are trained by minimizing one of the following error functions:

- The error of the spectra computed using the estimated mask:

$$\big(\hat{m}^{\text{rat}}_{\text{targ}}(f, n)\, |x(f, n)| - |s_{\text{targ}}(f, n)|\big)^2, \qquad (1.16)$$

where $s_{\text{targ}}(f, n)$ is the target source spectrum.

- The error of the signal in the complex-valued T-F domain computed using the estimated mask.

Source separation performance evaluation

The topic of source separation performance evaluation has long been studied in the literature. Several studies have been published, covering both objective quality metrics and subjective listening tests.

Our study focuses on two widely used families of objective audio quality evaluation criteria: energy ratio criteria and perceptually-motivated criteria. These metrics are versatile, applicable to any audio mixture and algorithm, and do not require prior knowledge of the unmixing parameters or filters. Both families have been extensively adopted within the research community and are commonly utilized in recent evaluation campaigns for assessing audio separation performance.

Both families rely on the decomposition of each estimated source image into four components: the true spatial image, spatial distortion, inter-source interference, and artifacts. Specifically, each estimated source image $\hat{c}_{ij}(t)$ is decomposed as

$$\hat{c}_{ij}(t) = c_{ij}(t) + e^{\text{spat}}_{ij}(t) + e^{\text{interf}}_{ij}(t) + e^{\text{artif}}_{ij}(t), \qquad (1.25)$$

where $c_{ij}(t)$ is the true spatial image, and $e^{\text{spat}}_{ij}(t)$, $e^{\text{interf}}_{ij}(t)$, and $e^{\text{artif}}_{ij}(t)$ capture the spatial distortion, interference, and artifacts, respectively. This decomposition forms the basis for the two families of criteria used to evaluate source separation performance, with detailed measures provided in Sections 1.3.1 and 1.3.2.

Based on the decomposition (1.25), the three distortion components of the energy ratio criteria family are computed as follows: the spatial distortion $e^{\text{spat}}_{ij}(t)$ is the difference between the projected signal and the original signal; the interference distortion $e^{\text{interf}}_{ij}(t)$ is the difference between the all-channel projection and the single-channel projection; and the artifact distortion $e^{\text{artif}}_{ij}(t)$ is the discrepancy between the original signal and the all-channel projection. The projections are performed using least-squares projectors onto subspaces defined by previous signal samples, with a filter length of 32 milliseconds.

The relative levels of the interference distortion, artifacts distortion, and spatial distortion are measured by three energy ratio criteria, expressed in decibels (dB) and defined in [140]:

• Source to Interference Ratio:

$$\text{SIR}_{ij} = 10 \log_{10} \frac{\sum_t \big(c_{ij}(t) + e^{\text{spat}}_{ij}(t)\big)^2}{\sum_t e^{\text{interf}}_{ij}(t)^2}. \qquad (1.29)$$

This measure objectifies the suppression of the interfering sources in the separation.

• Sources to Artifacts Ratio:

$$\text{SAR}_{ij} = 10 \log_{10} \frac{\sum_t \big(c_{ij}(t) + e^{\text{spat}}_{ij}(t) + e^{\text{interf}}_{ij}(t)\big)^2}{\sum_t e^{\text{artif}}_{ij}(t)^2}. \qquad (1.30)$$

This measure estimates the artifacts introduced by the source separation process.

• Source Image to Spatial distortion Ratio:

$$\text{ISR}_{ij} = 10 \log_{10} \frac{\sum_t c_{ij}(t)^2}{\sum_t e^{\text{spat}}_{ij}(t)^2}. \qquad (1.31)$$

This measure represents the suppression of the spatial distortions.

The total error represents the overall performance of the source separation algorithm and is measured by the Signal to Distortion Ratio (SDR), calculated as

$$\text{SDR}_{ij} = 10 \log_{10} \frac{\sum_t c_{ij}(t)^2}{\sum_t \big(e^{\text{spat}}_{ij}(t) + e^{\text{interf}}_{ij}(t) + e^{\text{artif}}_{ij}(t)\big)^2}. \qquad (1.32)$$
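The following minimal Python sketch computes the four energy ratios above from the decomposition (1.25), assuming the four time-domain components are already available (e.g., from a BSS Eval style projection); the small `eps` constant is an assumption added for numerical safety:

```python
# Minimal sketch of the energy ratio criteria (1.29)-(1.32).
import numpy as np

def _db(num, den, eps=1e-12):
    """10*log10 of an energy ratio between two time-domain components."""
    return 10.0 * np.log10((np.sum(num**2) + eps) / (np.sum(den**2) + eps))

def bss_eval_image(c, e_spat, e_interf, e_artif):
    sir = _db(c + e_spat, e_interf)            # (1.29)
    sar = _db(c + e_spat + e_interf, e_artif)  # (1.30)
    isr = _db(c, e_spat)                       # (1.31)
    sdr = _db(c, e_spat + e_interf + e_artif)  # (1.32)
    return sdr, sir, sar, isr
```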

These criteria were implemented in Matlab and distributed for public use [41]¹. They are the most commonly used metrics in the source separation community so far.

In addition to the energy ratio criteria, we evaluate the quality of the estimated source image signals using perceptually-motivated objective measures, as described in [32]. These measures decompose the estimated signals into three distortion components (target distortion, interference distortion, and artifact distortion), similar to equations (1.26), (1.27), and (1.28), while incorporating the PEMO-Q perceptual salience measure [57]. Based on these components, four metrics are derived: the Overall Perceptual Score (OPS), the Artifacts-related Perceptual Score (APS), the Interference-related Perceptual Score (IPS), and the Target-related Perceptual Score (TPS), which correspond to the SDR, SAR, SIR, and ISR criteria, respectively.

These criteria are scored from 0 to 100, where higher values indicate better performance.

Perceptually-motivated criteria have been shown to correlate better with subjective audio quality scores than the energy ratio criteria, and since 2010 they have often complemented those metrics in the audio source separation community. The source code implementing these perceptually-motivated metrics is also publicly available².

Summary

This chapter provides an overview of audio source separation, presenting the fundamental problem and outlining key technical approaches that leverage spectral and spatial information for effective source separation. It also introduces the two widely-used families of objective evaluation criteria essential for assessing the performance of the proposed separation methods discussed in Chapters 3 and 4.

¹ http://bass-db.gforge.inria.fr/bss_eval/
² http://bass-db.gforge.inria.fr/peass/

NONNEGATIVE MATRIX FACTORIZATION

NMF introduction

Nonnegative matrix factorization (NMF) is a powerful dimensionality reduction technique designed for nonnegative data, which gained widespread recognition after Lee and Seung's influential work in 1999, although it originally appeared nearly two decades earlier under different names, such as nonnegative rank factorization and positive matrix factorization. Since those foundational publications, NMF has been broadly applied across numerous fields, including bioinformatics, image processing, facial recognition [55], speech enhancement [39, 89], direction of arrival (DoA) estimation [131], blind source separation [40, 102, 107, 122, 130, 159], and informed source separation [25, 44, 46, 48]. Comprehensive reviews of NMF can be found in [147, 160].

In the following, we present some details about NMF so as to understand what NMF is and how it works.

Given a data matrix $\mathbf{V} \in \mathbb{R}_+^{F \times N}$ of dimensions $F \times N$ with nonnegative entries, NMF aims at finding two nonnegative matrices $\mathbf{W}$ and $\mathbf{H}$ such that $\mathbf{W}\mathbf{H}$ is approximately equal to $\mathbf{V}$ as [73]

$$\mathbf{V} \approx \mathbf{W}\mathbf{H}, \qquad (2.1)$$

where $\mathbf{W}$ is an $F \times K$ matrix and $\mathbf{H}$ is a $K \times N$ matrix, both containing only nonnegative elements. NMF is particularly useful for the statistical analysis of multivariate data: it decomposes the data matrix $\mathbf{V}$, whose columns contain the data vectors, into basis components, where $F$ represents the characteristics of the data and $N$ the number of observations or dataset samples. The factorization seeks $K$ latent components, with $K$ typically chosen smaller than both $F$ and $N$ so that $F \times K + K \times N \ll F \times N$ [42, 73]. Since $\mathbf{W}$ and $\mathbf{H}$ are smaller than the original matrix $\mathbf{V}$, they form a lower-rank representation of the original data matrix; that is why NMF is considered a dimensionality reduction technique.

Equation (2.1) can be expressed column-wise as $\mathbf{v} \approx \mathbf{W}\mathbf{h}$, where $\mathbf{v}$ and $\mathbf{h}$ are the corresponding columns of matrices $\mathbf{V}$ and $\mathbf{H}$, respectively. This means each data vector $\mathbf{v}$ is approximated by a linear combination of the columns of $\mathbf{W}$, weighted by the components of $\mathbf{h}$. $\mathbf{W}$ is known as the dictionary matrix, containing the basis vectors optimized for the linear approximation of the data. The matrix $\mathbf{H}$ represents the distribution of these basis vectors across the data samples and is referred to as the activation or weight matrix. Typically, a small number of basis vectors suffices to represent many data vectors accurately, enabling effective data representation by uncovering the latent structure within the dataset.

NMF is thus designed to identify basic nonnegative factors that facilitate feature extraction and dimensionality reduction. By eliminating redundant information, NMF uncovers hidden patterns within datasets composed of nonnegative vectors. This enhances data interpretability and plays a crucial role in various applications such as pattern recognition, data compression, and machine learning.

Figure 2.1: Decomposition model of NMF [36].

2.1.2 Cost function for parameter estimation

To decompose a matrix $\mathbf{V}$ into matrices $\mathbf{W}$ and $\mathbf{H}$, we want the approximation in equation (2.1) to be as close as possible. This can be achieved by solving the optimization problem [40]

$$\min_{\mathbf{H} \ge 0,\, \mathbf{W} \ge 0} D(\mathbf{V} \| \mathbf{W}\mathbf{H}), \qquad (2.2)$$

where $D(\mathbf{V} \| \mathbf{W}\mathbf{H})$ is the cost function. Denoting $\hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$, this cost function is defined by

$$D(\mathbf{V} \| \hat{\mathbf{V}}) = \sum_{f=1}^{F} \sum_{n=1}^{N} d\big([\mathbf{V}]_{fn} \,\big\|\, [\hat{\mathbf{V}}]_{fn}\big). \qquad (2.3)$$

In (2.3), $d(x \| y)$ is a divergence function, which may be the Euclidean distance (EUC), the Kullback-Leibler (KL) divergence, the Itakura-Saito (IS) divergence, the β-divergence, the α-divergence, the γ-divergence, or the Bregman divergence, with the α-β-divergence as an overarching framework. Among these, the EUC distance, the KL divergence, and the IS divergence are the most popular choices. The β-divergence is a versatile generalization that encompasses these three measures,

$$d_\beta(x \| y) = \frac{x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta - 1}}{\beta (\beta - 1)}, \qquad (2.4a)$$

defined for $\beta \in \mathbb{R} \setminus \{0, 1\}$, with the KL and IS divergences obtained as the limit cases $\beta \to 1$ and $\beta \to 0$.

If $\beta = 2$ this becomes the EUC distance, if $\beta = 1$ the KL divergence, and if $\beta = 0$ the IS divergence. For clarity, these distances can be written as follows:

• EUC distance: $d_{EUC}(x \| y) = \frac{1}{2}(x - y)^2$ (2.5)

• KL divergence: $d_{KL}(x \| y) = x \log\frac{x}{y} - x + y$ (2.6)

• IS divergence: $d_{IS}(x \| y) = \frac{x}{y} - \log\frac{x}{y} - 1$ (2.7)

Choosing the appropriate NMF cost function depends on the data type: the Euclidean distance is symmetric and sensitive to component magnitudes, while the KL and IS divergences are asymmetric and measure a relative entropy, allowing $x$ and $y$ to be viewed as normalized probability distributions. In our study, we focus on NMF with the IS divergence, a special case of the β-divergence, which has proven effective for decomposing audio spectra.
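The element-wise divergences above are easy to check numerically. The sketch below implements (2.5)-(2.7) as special cases of the β-divergence in Python, assuming strictly positive inputs; the closed-form expression used for general β follows the standard β-divergence formula (2.4a):

```python
# Minimal sketch of the divergences (2.5)-(2.7) via the beta-divergence.
import numpy as np

def beta_divergence(x, y, beta):
    """Element-wise beta-divergence d_beta(x || y) for positive x, y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if beta == 2:                       # EUC distance (2.5)
        return 0.5 * (x - y) ** 2
    if beta == 1:                       # KL divergence (2.6)
        return x * np.log(x / y) - x + y
    if beta == 0:                       # IS divergence (2.7)
        return x / y - np.log(x / y) - 1
    # General case (2.4a), beta not in {0, 1}
    return (x**beta + (beta - 1) * y**beta
            - beta * x * y**(beta - 1)) / (beta * (beta - 1))
```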

In 2001, Lee and Seung addressed the minimization problem in equation (2.2) by exploring gradient descent algorithms based on β-divergence cost functions. Their work showed how traditional gradient descent update rules can be turned into multiplicative update (MU) algorithms. Expressing the derivative of $D$ with respect to a parameter $\theta$ as the difference of two nonnegative components,

$$\nabla_\theta D(\theta) = \nabla^+_\theta D(\theta) - \nabla^-_\theta D(\theta), \qquad (2.8)$$

the gradient descent update rules of $\theta$ can be turned into the MU rules as

$$\theta \leftarrow \theta \cdot \frac{\nabla^-_\theta D(\theta)}{\nabla^+_\theta D(\theta)}. \qquad (2.9)$$

Applied to the β-divergence, the derivative of $d_\beta(x \| y)$ (i.e., equation (2.4a)) with respect to $y$ is calculated as

$$\nabla_y\, d_\beta(x \| y) = y^{\beta - 2}(y - x). \qquad (2.10)$$

Because $y$ represents $\mathbf{W}\mathbf{H}$, the partial derivatives with respect to $\mathbf{H}$ and $\mathbf{W}$, respectively, are written as

$$\nabla_{\mathbf{H}} D(\mathbf{V} \| \mathbf{W}\mathbf{H}) = \mathbf{W}^T \big((\mathbf{W}\mathbf{H})^{(\beta-2)} \odot (\mathbf{W}\mathbf{H} - \mathbf{V})\big), \qquad (2.11)$$

$$\nabla_{\mathbf{W}} D(\mathbf{V} \| \mathbf{W}\mathbf{H}) = \big((\mathbf{W}\mathbf{H})^{(\beta-2)} \odot (\mathbf{W}\mathbf{H} - \mathbf{V})\big)\, \mathbf{H}^T, \qquad (2.12)$$

where $\mathbf{A}^{(n)}$ denotes the matrix with entries $[\mathbf{A}]^n_{ij}$ and $\mathbf{A}^T$ is the transposition of matrix $\mathbf{A}$. Following equation (2.9), the multiplicative updates of $\mathbf{H}$ and $\mathbf{W}$ are written as

$$\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{W}^T \big((\mathbf{W}\mathbf{H})^{(\beta-2)} \odot \mathbf{V}\big)}{\mathbf{W}^T (\mathbf{W}\mathbf{H})^{(\beta-1)}}, \qquad (2.13)$$

$$\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\big((\mathbf{W}\mathbf{H})^{(\beta-2)} \odot \mathbf{V}\big)\, \mathbf{H}^T}{(\mathbf{W}\mathbf{H})^{(\beta-1)}\, \mathbf{H}^T}, \qquad (2.14)$$

where $\odot$ denotes the element-wise Hadamard product and the division is also element-wise.

The NMF algorithm with the MU rules used to estimate $\mathbf{W}$ and $\mathbf{H}$ is described in Algorithm 1. The inputs of the algorithm are the matrix $\mathbf{V}$ and the number of spectral bases $K$; $\beta$ determines the divergence used in the algorithm ($\beta = 0$ corresponds to the IS divergence, $\beta = 1$ to the KL divergence), and $niter$ is the number of iterations. $\mathbf{H}^{(0)}$ and $\mathbf{W}^{(0)}$ are initialized randomly with nonnegative values, and $t = 0$.

Lee and Seung showed that $D_\beta(\mathbf{V} \| \mathbf{W}\mathbf{H})$ does not increase under these updates for $\beta = 2$ (Euclidean distance) and $\beta = 1$ (KL divergence), establishing the stability of the criterion during the iterations. Kompass (2007) extended this proof to the entire range of $\beta$ between 1 and 2, demonstrating the consistent non-increasing behavior of $D_\beta(\mathbf{V} \| \mathbf{W}\mathbf{H})$ across these parameters.

Fevotte et al. (2009) observed that the criterion remains non-increasing in practice for $\beta < 1$ and $\beta > 2$, in particular for the case $\beta = 0$, which corresponds to the IS divergence. Although a general proof of convergence is lacking, the simplicity of the MU rules has significantly contributed to the widespread popularity of NMF.
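As a concrete companion to Algorithm 1, here is a minimal Python sketch of NMF with the MU rules (2.13)-(2.14), assuming a nonnegative input matrix; the `eps` flooring and the default parameter values are assumptions for the example rather than settings from the thesis:

```python
# Minimal sketch of Algorithm 1: NMF via multiplicative updates (2.13)-(2.14).
import numpy as np

def nmf_mu(V, K, beta=0, niter=200, eps=1e-12, seed=0):
    """Factorize V (F x N, nonnegative) into W (F x K) and H (K x N).
    beta = 0 gives the IS divergence used throughout this thesis."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps        # random nonnegative initialization
    H = rng.random((K, N)) + eps
    for _ in range(niter):
        WH = W @ H + eps
        H *= (W.T @ (WH**(beta - 2) * V)) / (W.T @ WH**(beta - 1))  # (2.13)
        WH = W @ H + eps
        W *= ((WH**(beta - 2) * V) @ H.T) / (WH**(beta - 1) @ H.T)  # (2.14)
    return W, H
```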

Application of NMF to audio source separation

NMF has been widely used for supervised source separation in the literature. Accordingly, a short-time Fourier transform (STFT) is applied to the original time-domain signal $x(t)$, and the magnitude or power of the STFT coefficients is computed, resulting in a nonnegative matrix $\mathbf{V}$. The basic idea is to consider the matrix $\mathbf{V}$ as a combination of a spectral basis matrix $\mathbf{W}$ and an activation matrix $\mathbf{H}$ as $\mathbf{V} = \mathbf{W}\mathbf{H}$. The columns of $\mathbf{W}$, the spectral bases, represent the distinct spectral characteristics of the audio signal, while the matrix $\mathbf{H}$ contains their corresponding time gains. In this decomposition model, $K$ is the number of spectral bases, $F$ the number of frequency bins, and $N$ the total number of time frames. A simple NMF model with two spectral bases is illustrated in Fig. 2.2 [66]: the two spectral bases form the dictionary matrix $\mathbf{W}$, capturing different spectral features of the audio signal, and the activation matrix $\mathbf{H}$ returns the mixing proportions of the two spectral bases in each time frame.

Figure 2.2: Spectral decomposition model based on NMF (K = 2) [66].

The number of spectral bases $K$ significantly influences the efficiency of audio spectral analysis. A larger $K$ captures more detailed spectral features but increases computational complexity and makes parameter estimation more challenging. Conversely, a smaller $K$ may overlook essential sound characteristics, impacting analysis accuracy. Therefore, selecting an appropriate $K$ as a tuning parameter is crucial and should be guided by prior knowledge of the specific sound type. For example, research indicates that an optimal $K$ is around 32 for speech signals and approximately 16 for environmental noise.

2.2.2 NMF-based audio source separation

We introduce in this section a conventional supervised audio source separation method based on NMF, one of the most popular models for audio signals [43, 105, 124]. The general pipeline operates in the frequency domain after applying the STFT and comprises two main phases: first, learning NMF source spectral models from training examples; second, decomposing the observed audio mixture guided by these pre-learned models. This process enables effective source separation by leveraging the learned spectral representations.

Figure 2.3: General workflow of supervised NMF-based audio source separation.

We consider a single-channel signal separation scenario involving $J$ sources, where $\mathbf{X} \in \mathbb{C}^{F \times N}$ represents the STFT coefficients of the observed mixture signal and $\mathbf{S}_j \in \mathbb{C}^{F \times N}$ denotes the STFT coefficients of the individual source signals for $j = 1, \ldots, J$. In the frequency domain, the mixing model is written as

$$\mathbf{X} = \sum_{j=1}^{J} \mathbf{S}_j. \qquad (2.15)$$

Denoting by $\mathbf{V} = |\mathbf{X}|^2$ the power spectrogram of the mixture, where $|\mathbf{X}|^p$ is the matrix with entries $[\mathbf{X}]^p_{il}$, NMF aims at decomposing the $F \times N$ nonnegative matrix $\mathbf{V}$ into two nonnegative matrices $\mathbf{W} \in \mathbb{R}_+^{F \times K}$ and $\mathbf{H} \in \mathbb{R}_+^{K \times N}$ as described in Section 2.2.1. The factorization minimizes the Itakura-Saito divergence (2.7), under the constraint that both $\mathbf{W}$ and $\mathbf{H}$ remain nonnegative.

The parameters θ = {W,H} are usually initialized with random non-negative values and are iteratively updated via the well-known MU rules [40].

In the supervised setting, a spectral model $\mathbf{W}_j$ for each source ($j = 1, \ldots, J$) is first learned from the corresponding training data by the optimization process of Algorithm 1, based on criterion (2.2). These models individually capture the spectral characteristics of each source. Once trained, the spectral models of all sources are concatenated into $\mathbf{W} = [\mathbf{W}_1, \ldots, \mathbf{W}_J]$, which serves as the dictionary for decomposing the mixture.

In the testing phase (the source separation process), this spectral model $\mathbf{W}$ is fixed, and the time activation matrix $\mathbf{H}$ is estimated via the MU rule. Note that $\mathbf{H}$ is also partitioned into blocks as

$$\mathbf{H} = [\mathbf{H}^T_1, \mathbf{H}^T_2, \ldots, \mathbf{H}^T_J]^T, \qquad (2.17)$$

where $\mathbf{H}_j$ denotes the block characterizing the time activations of the $j$-th source, $j = 1, \ldots, J$, and $\mathbf{A}^T$ is the transposition of matrix $\mathbf{A}$.

Algorithm 2: Baseline NMF-based audio source separation algorithm

Require: Mixture signal $x(t)$; training data of all sources $\{s_j(t)\}$, $j = 1 : J$
Ensure: Source images $\hat{c}_j(t)$ separated from $x(t)$

for $j = 1, \ldots, J$ do
  - Estimate the spectral basis matrix $\mathbf{W}_j$ for the $j$-th source from the training example $s_j(t)$ by Algorithm 1
end for
- Estimate $\mathbf{H}$ from the mixture signal $x(t)$ by Algorithm 1 ($\mathbf{W}$ is fixed)
- Estimate $\hat{\mathbf{S}}_j$ by Wiener filtering (2.18)
- $\hat{s}_j(t) = \text{ISTFT}(\hat{\mathbf{S}}_j)$

Once the parameters $\theta = \{\mathbf{W}, \mathbf{H}\}$ are obtained, the source STFT coefficients are computed by Wiener filtering as

$$\hat{\mathbf{S}}_j = \frac{\mathbf{W}_j \mathbf{H}_j}{\mathbf{W}\mathbf{H}} \odot \mathbf{X}, \qquad (2.18)$$

where $\odot$ denotes the element-wise Hadamard product and the division is element-wise. The time-domain source estimates are then obtained through the inverse STFT. This algorithm, summarized in Algorithm 2, serves as a baseline for comparison with the proposed approach introduced in Chapter 3. It estimates the parameters via the MU rules, guided by a spectral basis dictionary $\mathbf{W}$ trained beforehand. Consequently, this supervised NMF-based technique requires training data and cannot be applied in scenarios lacking such data.
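Putting Algorithm 2 together, the following minimal Python sketch runs the training and testing phases and applies the Wiener filter (2.18), assuming the `nmf_mu` function from the earlier sketch and precomputed STFTs/power spectrograms; the choice of K = 32 per source and the other defaults are illustrative:

```python
# Minimal sketch of Algorithm 2: supervised NMF-based source separation.
import numpy as np

def separate(X, V_train, K=32, niter=200, eps=1e-12):
    """X: mixture STFT (F x N, complex); V_train: per-source power spectrograms."""
    # Training phase: learn a spectral dictionary W_j per source (Algorithm 1).
    Ws = [nmf_mu(Vj, K, beta=0, niter=niter)[0] for Vj in V_train]
    W = np.concatenate(Ws, axis=1)                  # W = [W_1, ..., W_J]
    # Testing phase: W fixed, estimate H on the mixture by MU updates of H only.
    V = np.abs(X) ** 2
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(niter):
        WH = W @ H + eps
        H *= (W.T @ (WH**(-2) * V)) / (W.T @ WH**(-1))   # (2.13) with beta = 0
    # Wiener filtering (2.18), block by block over the sources.
    S_hat, k0 = [], 0
    for Wj in Ws:
        k1 = k0 + Wj.shape[1]
        S_hat.append((Wj @ H[k0:k1]) / (W @ H + eps) * X)
        k0 = k1
    return S_hat                                     # apply ISTFT to each for s_hat_j(t)
```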

Proposed application of NMF to unusual sound detection

Our primary goal is to utilize Non-negative Matrix Factorization (NMF) for effective audio source separation by modeling the spectral characteristics of audio signals. We focus on how NMF captures frequently occurring features in long audio recordings. Additionally, we propose a sound detection method based on NMF, aimed at identifying unusual or anomalous sounds within audio data. This approach leverages NMF's ability to discern common and distinctive spectral patterns, improving the detection of atypical audio events.

Audio event detection and audio scene analysis are important tasks in acoustic signal processing that have recently attracted much attention. For example, detection and classification of acoustic scenes and events (DCASE) was organized as an IEEE Audio and Acoustic Signal Processing (AASP) challenge in 2016, 2017, and 2018¹. Sound classification has traditionally relied on feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCC) combined with classifiers such as Gaussian Mixture Models (GMM) through supervised training. Recently, there has been a shift towards deep learning architectures, which require large, accurately annotated datasets to classify various sound types effectively. However, data annotation remains a significant challenge, as it involves manually listening to recordings and labeling short segments (e.g., one second) with sound categories such as vehicles, birds, airplanes, or voices, a process that is both time-consuming and tedious.

To make sound annotation more accurate and efficient, we introduce methods for automatically detecting non-stationary sound segments in an unsupervised manner. This approach reduces annotation costs by minimizing the manual effort, leveraging simple analysis techniques and assumptions about the sound dynamics. The proposed methods enable identification of non-stationary segments, streamlining the annotation process and improving overall sound data analysis.

¹ http://dcase.community/challenge2018/index

In real-world environments, background sounds, such as the constant call of cicadas in summer parks, often coexist with various short-term acoustic events. These audio signals typically comprise a steady, stationary background noise combined with sporadic sound events, making it challenging to identify and annotate the different types of sounds within long recordings. As a result, analyzing lengthy audio recordings, which can last one or two hours, requires significant time and effort to detect and accurately label all relevant acoustic events.

As seen in Sections 2.1 and 2.2, NMF is capable of modelling the spectral characteristics of an audio signal by the matrix $\mathbf{W}$ (as a spectral basis dictionary) with $K$ characteristics (the number of spectral bases). Thus, if we apply NMF with only one spectral basis, it is expected that the stationary background sound will be well represented by this one NMF spectral basis, while non-stationary audio events will not. The residual divergence is then a good measure for detecting non-stationary segments, which are expected to correspond to audio events. Finally, a human listener only has to listen to the detected non-stationary segments for annotation. Such annotated segments include a good variety of sounds, and they form a good dataset for the training step of supervised source separation algorithms.

In this study, we analyze a single-channel audio signal to identify multiple time segments containing non-stationary acoustic events. As outlined in Section 2.2.2, the approach works on the complex-valued matrix of STFT coefficients, denoted as $\mathbf{X} \in \mathbb{C}^{F \times N}$, representing the observed signal. This enables the detection of dynamic acoustic events within the audio stream by examining the time-frequency representation.

Let $nsec$ be the duration of the segment that we want to extract (e.g., $nsec$ equals 5 or 10 seconds, depending on the typical duration of the non-stationary acoustic events). The size of a block of the matrix $\mathbf{V}$ corresponding to one extracted segment is $F \times B$, where $B = \lfloor f_s \cdot nsec / nshift \rfloor$, $f_s$ is the sampling rate of the audio signal, $nshift$ is the frame shift used in the STFT, and $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$, ensuring precise segmentation for the acoustic event analysis.

The power spectrogram of the input signal is denoted as $\mathbf{V} = |\mathbf{X}|^2$. NMF is performed to decompose the matrix $\mathbf{V}$ into two matrices $\mathbf{W} \in \mathbb{R}_+^{F \times K}$ and $\mathbf{H} \in \mathbb{R}_+^{K \times N}$ as in equation (2.2) with the IS divergence (2.7). The parameters are initialized with random nonnegative values and are iteratively updated via the MU rules (2.13) and (2.14).

2.3.2 Proposed methods for non-stationary frame detection

This section introduces three proposed methods for extracting short audio segments from real-world recordings containing environmental noise and diverse audio events. One method relies solely on the signal energy, while the other two utilize NMF with a single spectral basis to identify non-stationary audio events. These extracted segments are intended to capture the interesting, transient sounds that are crucial for accurate audio event detection.

This method relies on the premise that environmental noise, like silence and wind sounds, typically exhibits lower energy than non-stationary acoustic events such as human speech, car sounds, and bird songs. During the recordings, the environmental noise remains at a low energy level, whereas the targeted acoustic events often produce higher energy. By extracting high-power segments directly from the power spectrogram matrix, this approach effectively isolates the segments likely to contain non-stationary audio events.

Figure 2.4: Image of overlapping blocks.

We first calculate the total energy of each overlapping block of the matrix $\mathbf{V}$, as illustrated in Fig. 2.4:

$$p_t = \sum_{f=1}^{F} \sum_{b=1}^{B} V_{f,(t-1)B_0 + b}, \qquad (2.19)$$

where $t = 1, \ldots, T$ is the block index, $b$ is the frame index in each block, and $B_0$ is the block shift.

After calculating the overall energy vector $\mathbf{p} = [p_1, \ldots, p_T]$ across all blocks, we extract the audio segments with high energy values. These segments are likely to contain non-stationary audio events, making them the relevant candidates for event detection and analysis.
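A minimal Python sketch of the block energy computation (2.19) and the selection of high-energy blocks follows; the helper names and the `n_keep` parameter are illustrative assumptions:

```python
# Minimal sketch of the signal energy based method (2.19).
import numpy as np

def block_energies(V, B, B0):
    """p_t = sum over the (F x B) block starting at frame t*B0, eq. (2.19)."""
    F, N = V.shape
    T = 1 + (N - B) // B0                      # number of overlapping blocks
    return np.array([V[:, t * B0 : t * B0 + B].sum() for t in range(T)])

def top_segments(p, n_keep=10):
    """Indices of the n_keep highest-energy blocks, candidates for annotation."""
    return np.argsort(p)[::-1][:n_keep]
```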

This method utilizes an NMF model with a single spectral basis to model the stationary signal that appears most frequently in the data, typically the background noise. By focusing on this simplified model, we can identify the audio events that are poorly estimated by the background model, enabling the detection of transient sound events. The entire process operates in the frequency domain and follows the pipeline illustrated in Fig. 2.5.

Figure 2.5: General workflow of the NMF-based nonstationary segment extraction.

After the STFT, NMF is performed with the IS divergence. Then, the residual divergence matrix between the model and the observation is computed as

$$\mathbf{D} = \mathbf{V}\, ./\, (\mathbf{W}\mathbf{H}) - \log\big(\mathbf{V}\, ./\, (\mathbf{W}\mathbf{H})\big) - 1, \qquad (2.20)$$

where $./$ denotes the element-by-element division of matrix entries.

Similar to the signal energy based method, we calculate the sum divergence of each block of the matrix $\mathbf{D}$ corresponding to the segment duration $nsec$ as follows:

$$q_t = \sum_{f=1}^{F} \sum_{b=1}^{B} D_{f,(t-1)B_0 + b}. \qquad (2.21)$$

After calculating the total divergence vector $\mathbf{q} = [q_1, \ldots, q_T]$ across all blocks, we extract the audio segments with high divergence values, as these are less accurately modeled by the NMF. Such high-divergence segments are likely to correspond to non-stationary audio events, making them crucial for audio segmentation and event detection.
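Combining (2.20) and (2.21), the sketch below scores blocks by their residual IS divergence under a rank-1 NMF background model, reusing `nmf_mu` and `block_energies` from the earlier sketches; all parameter defaults are illustrative:

```python
# Minimal sketch of the global NMF-based method (2.20)-(2.21).
import numpy as np

def nonstationary_scores(V, B, B0, niter=200, eps=1e-12):
    """Rank-1 IS-NMF models the stationary background; large residual
    divergence over a block flags a candidate non-stationary event."""
    W, H = nmf_mu(V, K=1, beta=0, niter=niter)      # one spectral basis
    WH = W @ H + eps
    D = V / WH - np.log(V / WH + eps) - 1           # residual IS divergence (2.20)
    return block_energies(D, B, B0)                 # block divergences q_t (2.21)
```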

Since the background noise may change over time, applying NMF to the entire signal may not be optimal for long recordings. To address this, we explore a localized NMF approach that processes short portions of the audio, such as one- or two-minute intervals. NMF is applied to each portion of the data, and the residual divergence matrix $\mathbf{D}$ is constructed locally. This localized strategy adapts better to variations in the acoustic background, resulting in more reliable detection in non-stationary environments.

Algorithm 3: Global/Local NMF algorithm

Initialize $\mathbf{H}^{(m)}$, $\mathbf{W}^{(m)}$ randomly; $\mathbf{H}^{(m)}$ is a one-row matrix and $\mathbf{W}^{(m)}$ is a one-column matrix.
// Update NMF parameters
for $i = 1, \ldots, niter$ do
  - Update $\mathbf{W}^{(m)}$ and $\mathbf{H}^{(m)}$ via the MU rules (2.13) and (2.14)
end for

Summary

This chapter introduces the NMF model, a widely used technique in audio signal processing due to its ability to effectively represent the spectral characteristics of audio signals. We also present a supervised NMF-based algorithm as a baseline method for audio source separation. Additionally, methods for automatically detecting non-stationary segments are proposed, with experimental results from outdoor recordings across three seasons confirming NMF's capability to model audio spectral features accurately. However, NMF-based separation requires source-specific training data, which may not always be practical. Therefore, the next chapter explores a weakly-informed approach that leverages abstract semantic information about the source types to enhance separation performance.

The proposed methods for automatically detecting non-stationary segments were presented in publication [3] listed in the thesis's "List of Publications". These techniques have also been transferred to RION Japan, highlighting their potential for practical industry applications.

SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY

General workflow of the proposed approach

Current studies on audio source separation have shown that fully blind techniques do not provide sufficient results. Some existing informed source separation techniques use specific information (e.g., a music score or speech transcript) [6, 16]. Weakly-informed approaches instead exploit abstract semantic information about the source types, which is especially useful in scenarios where little training data is available. They rely on a small number of training examples, typically three to five short recordings, each a few seconds long (e.g., 5 seconds), of the same types as the sources in the mixture. These examples are used to learn the generic source spectral model, which is then exploited to guide the separation process. Note that in the rest of the thesis, GSSM will be used as shorthand for the Generic Source Spectral Model.

We address the single-channel signal separation problem involving $J$ sources, where $\mathbf{X} \in \mathbb{C}^{F \times N}$ and $\mathbf{S}_j \in \mathbb{C}^{F \times N}$ are the complex-valued matrices of STFT coefficients of the observed mixture signal $x(t)$ and the $j$-th source signal $c_j(t)$, respectively, and the mixing model is described by equation (2.15). The challenge is to estimate the individual source signals $c_j(t)$ from the single-channel mixture $x(t)$ without relying on any training data.

Knowing the source types in the mixture and having recorded examples of those types is practical, especially in scenarios like speech separation from noisy environments. For effective separation, it is essential to gather multiple examples of each source, as a single recording may not capture all variations; noise, for instance, is often poorly defined. Fortunately, the required training data is minimal: typically three speech files and four noise files, each lasting 5 to 10 seconds, making this approach feasible with limited resources.

Figure 3.1: Proposed weakly-informed single-channel source separation approach.

We propose a weakly-informed single-channel audio source separation method based on NMF, utilizing a limited number of training examples to guide the separation process. The approach operates in the T-F domain after applying the STFT and follows a two-phase pipeline: (1) learning the generic source spectral model (GSSM) from the training data via NMF, and (2) decomposing the observed mixture guided by the pre-trained model. The training data consists of audio samples of the same types as the sources, such as speech and environmental sounds, e.g., three speech recordings (one male, two female) and four environmental sound clips (wind, street noise, cafeteria, birdsong), which enable effective GSSM learning. This method allows for efficient source separation with minimal training data, leveraging prior models learned through NMF.

GSSM formulation

In our study, we denote by $s^l_j(t)$ the $l$-th single-channel learning example for the $j$-th source, with its corresponding spectrogram obtained via the STFT denoted as $\mathbf{S}^l_j$. These spectrograms are employed to train the associated NMF spectral dictionaries, denoted by $\mathbf{W}^l_j$, by optimizing a criterion similar to equation (2.2):

$$\min_{\mathbf{H}^l_j \ge 0,\, \mathbf{W}^l_j \ge 0} D\big(\mathbf{S}^l_j \,\big\|\, \mathbf{W}^l_j \mathbf{H}^l_j\big), \qquad (3.1)$$

where $\mathbf{H}^l_j$ is the time activation matrix. Given $\mathbf{W}^l_j$ for all examples $l = 1, \ldots, L_j$ of the $j$-th source, the GSSM for the $j$-th source is constructed as

$$\mathbf{U}_j = [\mathbf{W}^1_j, \ldots, \mathbf{W}^{L_j}_j], \qquad (3.2)$$

and the GSSM for all the sources is then computed by concatenating the source models:

$$\mathbf{U} = [\mathbf{U}_1, \ldots, \mathbf{U}_J]. \qquad (3.3)$$

For effective speech and noise separation, it is essential to gather multiple speech examples representing different voices, such as three samples of male and female voices, to make the model robust. Similarly, diverse noise types, including outdoor environments, cafeterias, waterfalls, and street sounds, provide comprehensive training data, with four examples covering these scenarios. The GSSM is built from these varied training examples, as illustrated in Fig. 3.2.
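The GSSM construction (3.1)-(3.3) amounts to learning one NMF dictionary per example and concatenating them. A minimal Python sketch follows, assuming the `nmf_mu` function from the Chapter 2 sketch and per-example power spectrograms; the value of K is illustrative:

```python
# Minimal sketch of GSSM construction (3.1)-(3.3).
import numpy as np

def learn_gssm(examples_per_source, K=16, niter=200):
    """examples_per_source: list over sources j of lists of spectrograms S_j^l."""
    U_blocks = []
    for examples in examples_per_source:
        # Learn one dictionary per example by IS-NMF, criterion (3.1).
        Ws = [nmf_mu(S, K, beta=0, niter=niter)[0] for S in examples]
        U_blocks.append(np.concatenate(Ws, axis=1))   # U_j = [W_j^1,...,W_j^Lj] (3.2)
    return np.concatenate(U_blocks, axis=1)           # U = [U_1, ..., U_J]      (3.3)
```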

3.3 Model fitting with sparsity-inducing penalties

As the number of examples increases, the GSSM U becomes large and often redundant, since different examples share common spectral patterns. A sparsity constraint on the activations is therefore essential during model fitting: it restricts the decomposition to the relevant subset of the large matrix U constructed in (3.3), so that each source in the mixture spectrogram is represented by the spectral bases that actually fit it [58, 74, 142].

In other words, the mixture spectrogram V = |X|^{·2} is decomposed by solving the following optimization problem:

    min_{H ≥ 0} D(V ‖ UH) + λΩ(H),   (3.4)

Figure 3.2: Generic source spectral model (GSSM) construction.

where Ω(H) denotes a penalty function imposing sparsity on the activation matrix H, and λ is a trade-off parameter determining the contribution of the penalty. When λ = 0, H is not sparse and the entire generic model is used, as illustrated in Fig. 3.3a. Recent work in audio source separation has considered the two following penalty functions.

3.3.1 Block sparsity-inducing penalty

Inspired by the universal background model of Reynolds et al., Sun and Mysore proposed the block sparsity-inducing penalty to eliminate irrelevant training examples, i.e., those whose spectral characteristics do not match the targeted source in the mixture, and applied it to single-channel, speaker-independent source separation. This penalty promotes the selection of the relevant training examples containing the desired spectral features, and plays a crucial role in the separation performance.

The block sparsity-inducing penalty is defined as

    Ω₁(H) = Σ_{g=1}^{G} log(ε + ‖H^{(g)}‖₁),   (3.5)

where H^{(g)} denotes the subset of activation coefficients corresponding to the g-th block, G is the total number of blocks, ε is a small non-zero constant, and ‖·‖₁ denotes the ℓ₁-norm. Each block corresponds to one training example, so the total number of blocks over all sources is G = Σ_{j=1}^{J} L_j. This penalty enforces activations only for the relevant examples: the activation coefficients of poorly fitting examples are driven towards zero.

Figure 3.3: Example of the estimated activation matrix H: (a) without any sparsity constraint, (b) with the block sparsity-inducing penalty (3.5), (c) with the component sparsity-inducing penalty (3.6), and (d) with the proposed mixed sparsity-inducing penalty (3.7). As visualized in Fig. 3.3b, the block sparsity penalty drives entire blocks of H to zero, a form of structured sparsity that improves the interpretability of the decomposition.

3.3.2 Component sparsity-inducing penalty

In 2014, El Badawy et al. [8] introduced the component sparsity-inducing penalty function defined as

    Ω₂(H) = Σ_{k=1}^{K} log(ε + ‖h_k‖₁),   (3.6)

where h_k denotes the k-th row of H and K is the total number of rows. As discussed in [8], this penalty addresses the fact that only some spectral components learned from an example may fit the target source in the mixture, while the remaining components do not. Instead of activating an entire block, the Ω₂(H) penalty promotes the selection of only the most relevant spectral components of U. An example of the matrix H after convergence is shown in Fig. 3.3c, where only some rows remain active; similar illustrations can be found in earlier works.

3.3.3 Proposed mixed sparsity-inducing penalty

The block sparsity-inducing penalty promotes sparsity across entire blocks of the GSSM: for each training example, all spectral bases in the corresponding block are either retained or removed together. This may discard relevant features scattered across different blocks, or retain less relevant components inside a kept block. The component sparsity-inducing penalty, on the other hand, enforces sparsity on individual rows of the activation matrix and therefore extracts scattered features better; however, it removes irrelevant components more slowly, since every row of the large matrix must be evaluated individually, which reduces the efficiency of the sparsity enforcement.

Motivated by the complementary advantages of these two state-of-the-art penalty functions, we propose to combine them in a more general form as

    Ω(H) = γ Σ_{g=1}^{G} log(ε + ‖H^{(g)}‖₁) + (1 − γ) Σ_{k=1}^{K} log(ε + ‖h_k‖₁),   (3.7)

where γ ∈ [0, 1] weights the contribution of each term in the mixed group sparsity constraint.

The new penalty function (3.7) generalizes both the block sparsity-inducing penalty (3.5) and the component sparsity-inducing penalty (3.6): when γ = 1, (3.7) reduces to (3.5), and when γ = 0, (3.7) reduces to (3.6). Fig. 3.3d shows an example of the activation matrix H after convergence when the novel penalty (3.7) is used. Some blocks converge to zero thanks to the first term of (3.7), while within the remaining blocks some components are driven to zero by the second term.
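As a small illustration, the mixed penalty (3.7) can be computed as follows; the function name and the representation of blocks as lists of row indices are our own conventions for this sketch.

    import numpy as np

    EPS = 1e-12  # the constant epsilon in (3.5)-(3.7)

    def mixed_sparsity_penalty(H, blocks, gamma):
        """Omega(H) in (3.7): gamma * block term + (1 - gamma) * component term.

        H      : (K, N) nonnegative activation matrix
        blocks : list of row-index arrays, one per training example (the H^(g))
        gamma  : weight in [0, 1]; gamma=1 gives (3.5), gamma=0 gives (3.6)
        """
        block_term = sum(np.log(EPS + H[g].sum()) for g in blocks)
        component_term = np.log(EPS + H.sum(axis=1)).sum()  # one log per row h_k
        return gamma * block_term + (1.0 - gamma) * component_term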

3.4 Derived algorithm in unsupervised case

This section derives the multiplicative update (MU) rule for the matrix H when the new penalty function (3.7) is used in the optimization criterion (3.4). Although the approach involves a training phase, it can be considered unsupervised, since training only learns generic models from various example signals rather than from the actual sources in the mixture.

Let L(H) denote the minimization criterion in (3.4), where Ω(H) is the mixed group sparsity constraint defined in (3.7) and D(·‖·) is the IS divergence given in (2.7). The derivation starts by computing the partial derivative of L(H) with respect to each entry h_kn.

This gradient ∇_{h_kn} L(H) can be written as the difference of two nonnegative parts, denoted by ∇⁺_{h_kn} L(H) ≥ 0 and ∇⁻_{h_kn} L(H) ≥ 0:

    ∇_{h_kn} L(H) = ∇⁺_{h_kn} L(H) − ∇⁻_{h_kn} L(H).   (3.10)
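The explicit expressions of the two parts are not legible in this copy; for the IS divergence and the penalty (3.7), the standard derivation gives the following forms (our reconstruction, with v̂ = UH and g(k) the block containing row k):

    \nabla^{+}_{h_{kn}} L(H)
      = \sum_{f=1}^{F} \frac{u_{fk}}{\hat{v}_{fn}}
      + \lambda \left( \frac{\gamma}{\epsilon + \|H^{(g(k))}\|_1}
                     + \frac{1-\gamma}{\epsilon + \|h_k\|_1} \right),
    \qquad
    \nabla^{-}_{h_{kn}} L(H)
      = \sum_{f=1}^{F} \frac{u_{fk}\, v_{fn}}{\hat{v}_{fn}^{2}}.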

Following a standard approach for MU rule derivation [40, 73], h_kn is updated as

    h_kn ← h_kn ( ∇⁻_{h_kn} L(H) / ∇⁺_{h_kn} L(H) )^η,   (3.11)

where η = 0.5 following the derivation in [42, 74], which was shown to produce an accelerated descent algorithm. Putting (3.10) into (3.11) and rewriting it in matrix form, we obtain the update of H as

    H ← H ⊙ ( Uᵀ((UH)^{·−2} ⊙ V) / ( Uᵀ(UH)^{·−1} + λ(γY + (1 − γ)Z) ) )^{·η},   (3.12)

where V̂ = UH, Y = [Y₁ᵀ, ..., Y_Gᵀ]ᵀ with Y_g (g = 1, ..., G) a uniform matrix of the same size as H^{(g)} whose entries are 1/(ε + ‖H^{(g)}‖₁), and Z = [z₁ᵀ, ..., z_Kᵀ]ᵀ with z_k (k = 1, ..., K) a uniform row vector of the same size as h_k whose entries are 1/(ε + ‖h_k‖₁).

The overall parameter estimation algorithm, which optimizes (3.4) with the proposed penalty (3.7) via the derived MU rule (3.12) in a majorization-minimization fashion, is summarized in Algorithm 4, where Y^{(g)} is a uniform matrix of the same size as H^{(g)} and z_k a uniform row vector of the same size as h_k.

Algorithm 4: Unsupervised NMF with mixed sparsity-inducing penalty
Require: V, U, λ, γ
Initialize H randomly with nonnegative values
repeat
    // Taking into account block sparsity-inducing penalty
    for g = 1, ..., G do
        Y^{(g)} ← 1/(ε + ‖H^{(g)}‖₁)
    end for
    // Taking into account component sparsity-inducing penalty
    for k = 1, ..., K do
        z_k ← 1/(ε + ‖h_k‖₁)
    end for
    Update H by the MU rule (3.12)
until convergence
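A compact NumPy sketch of Algorithm 4 is given below. It assumes the GSSM U and the block structure produced by the earlier learn_gssm sketch; variable names and defaults are ours. After convergence, source spectrograms can be recovered by Wiener-like filtering, e.g. Ŝ_j = (U_j H_j) / (UH) ⊙ X.

    import numpy as np

    EPS = 1e-12

    def separate_unsupervised(V, U, blocks, lam, gamma, eta=0.5, n_iter=100, seed=0):
        """Estimate H by the MU rule (3.12) under the mixed penalty (3.7).

        V      : (F, N) mixture power spectrogram
        U      : (F, K) pre-trained GSSM (concatenated dictionaries)
        blocks : row-index arrays partitioning the K rows of H into G blocks
        """
        rng = np.random.default_rng(seed)
        K, N = U.shape[1], V.shape[1]
        H = rng.random((K, N)) + EPS
        for _ in range(n_iter):
            Y = np.zeros_like(H)                  # block term: 1/(eps + ||H^(g)||_1)
            for g in blocks:
                Y[g] = 1.0 / (EPS + H[g].sum())
            Z = 1.0 / (EPS + H.sum(axis=1, keepdims=True))  # component term per row
            Vh = U @ H + EPS
            num = U.T @ (Vh ** -2 * V)
            den = U.T @ (Vh ** -1) + lam * (gamma * Y + (1.0 - gamma) * Z)
            H *= (num / den) ** eta
        return H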

3.5 Derived algorithm in semi-supervised case

This section considers a different scenario: for some sources in the mixture, clean signals are available for training, while the remaining sources are not well defined. For example, in a speech enhancement system used for controlling robots or devices, the speaker is often known in advance, so his or her voice can be pre-recorded for training. Noise in real-world environments, however, is highly non-stationary and varies over time and location, making it impractical to model accurately during training; its characteristics should therefore not be fixed in the training process.

Assuming the availability of clean training signals for P sources, we only need to construct the GSSM for the remaining Q = J − P sources. We refer to this model as the semi-GSSM in the rest of the chapter.

Let W_p be the spectral basis matrix learned by NMF from the clean training signal of the p-th source, with p = 1, ..., P. The spectral basis model obtained from all P clean signals is

    W = [W₁, ..., W_P].   (3.13)

The GSSM for the q-th source, which does not have a clean training signal, is learned from L_q examples as in (3.1) and (3.2); the GSSM for all Q sources is then computed as in (3.3):

    Ũ = [Ũ₁, ..., Ũ_Q].   (3.14)

Finally, the semi-GSSM for all J sources is constructed by

    U_s = [W, Ũ].   (3.15)

The activation matrix corresponding to U_s also consists of two parts:

    H_s = [Hᵀ, H̃ᵀ]ᵀ,   (3.16)

where H is the part of the activation matrix corresponding to the P sources having clean training signals, and H̃ corresponds to the Q sources whose GSSM is learned from the collected example signals.

Algorithm 5: Semi-supervised NMF with mixed sparsity-inducing penalty
Require: V, U_s, λ, γ
Initialize H_s randomly with nonnegative values
repeat
    // Taking into account block sparsity-inducing penalty
    for g = 1, ..., G do
        Y^{(g)} ← 1/(ε + ‖H̃^{(g)}‖₁)
    end for
    // Taking into account component sparsity-inducing penalty
    for k = 1, ..., K do
        z_k ← 1/(ε + ‖h̃_k‖₁)
    end for
    Update H_s
until convergence

3.5.2 Model fitting with mixed sparsity and algorithm

In the semi-supervised case, a sparsity constraint is again needed so that the decomposition uses only the relevant part of the model, but it should act only on the activations of the sources modeled by the GSSM. The mixture spectrogram V = |X|^{·2} is decomposed by solving

    min_{H_s ≥ 0} D(V ‖ U_s H_s) + λΩ(H̃),   (3.17)

where Ω(H̃) denotes a penalty function imposing sparsity on the subset H̃ of the activation matrix H_s. The remainder of H_s, i.e., H, is updated according to the usual optimization criterion (2.2).

The proposed mixed sparsity-inducing penalty function (3.7) is applied to H̃ as

    Ω(H̃) = γ Σ_{g=1}^{G} log(ε + ‖H̃^{(g)}‖₁) + (1 − γ) Σ_{k=1}^{K} log(ε + ‖h̃_k‖₁),   (3.18)

where G = Σ_{q=1}^{Q} L_q is the total number of training examples for the Q sources.

The semi-supervised algorithm is summarized in Algorithm 5, where Y^{(g)} is a uniform matrix of the same size as H̃^{(g)}, and z_k a uniform row vector of the same size as h̃_k.
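For completeness, a minimal sketch of Algorithm 5 under the same conventions follows; the penalty gradient is set to zero on the rows of H, so those rows follow the usual IS-NMF update, while only the rows of H̃ are penalized.

    import numpy as np

    EPS = 1e-12

    def separate_semi_supervised(V, W, U_tilde, blocks, lam, gamma,
                                 eta=0.5, n_iter=100, seed=0):
        """Semi-supervised variant: U_s = [W, U_tilde] (3.15), penalty only on H_tilde.

        W       : (F, Kp) dictionaries learned from the clean signals of the P sources
        U_tilde : (F, Kq) GSSM of the Q sources without clean training data
        blocks  : row-index arrays over the Kq rows of H_tilde
        """
        rng = np.random.default_rng(seed)
        Us = np.concatenate([W, U_tilde], axis=1)          # (3.15)
        Kp, Kq, N = W.shape[1], U_tilde.shape[1], V.shape[1]
        Hs = rng.random((Kp + Kq, N)) + EPS                # Hs = [H; H_tilde] (3.16)
        for _ in range(n_iter):
            Ht = Hs[Kp:]
            Pen = np.zeros_like(Ht)
            for g in blocks:
                Pen[g] = gamma / (EPS + Ht[g].sum())
            Pen += (1.0 - gamma) / (EPS + Ht.sum(axis=1, keepdims=True))
            P = np.concatenate([np.zeros((Kp, N)), Pen])   # zero penalty on H rows
            Vh = Us @ Hs + EPS
            Hs *= (Us.T @ (Vh ** -2 * V) / (Us.T @ (Vh ** -1) + lam * P)) ** eta
        return Hs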

3.6 Experiment

To validate the proposed approach, we used audio samples from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) and the Signal Separation Evaluation Campaign (SiSEC) to train the GSSM. Testing was conducted on three datasets: an artificially mixed dataset of speech and noise, and two benchmark datasets from the SiSEC campaign, which are curated by the audio source separation community and widely used for evaluation. The datasets are described below.

1 http://parole.loria.fr/DEMAND/

2 http://sisec.wiki.irisa.fr

• Training data: We built two training sets, one for speech and one for noise. The speech training set contains three recordings (two female voices and one male voice), each 10 seconds long, so as to cover different vocal characteristics. The noise training set contains environmental recordings (kitchen, metro, and field), with durations ranging from 5 to 15 seconds, so as to cover varied background noise scenarios.

• Test data: The test set comprises 12 single-channel speech-and-noise mixtures mixed at 0 dB SNR, sampled at 16,000 Hz and lasting between 5 and 10 seconds. The speech consists of female and male English utterances from the SiSEC dataset, and the noise samples were taken from the DEMAND database (one channel out of the 16 available). Some mixtures combine two different noise types, e.g., traffic and wind, ocean waves and birdsong, restaurant sounds and guitar music, forest birds and cars, and city noise and music. All sources are present continuously throughout each mixture.
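Mixing at a prescribed SNR simply rescales the noise before summation. The helper below is a hypothetical illustration of that step (our own, not the script used to build the dataset):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db=0.0):
        """Scale noise so that 10*log10(P_speech/P_noise) = snr_db, then sum."""
        n = min(len(speech), len(noise))
        s, v = speech[:n], noise[:n]
        p_s, p_v = np.mean(s ** 2), np.mean(v ** 2)
        v = v * np.sqrt(p_s / (p_v * 10.0 ** (snr_db / 10.0)))
        return s + v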

• Training data: We built a voice training set and a music training set. The voice training set contains recordings of three different voices (one male and two female speakers), each lasting approximately 10 seconds. The music training set contains nine audio files covering three bass sounds, three drum sounds, and three other instrument sounds, with durations ranging from 10 to 15 seconds.

• Test data: The test set contains five song snippets, listed in Table 3.1.

3 Speech files are from the Signal Separation Evaluation Campaign (SiSEC): http://sisec.wiki.irisa.fr.

4 Some noise files are from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND): http://parole.loria.fr/DEMAND/.

These snippets are excerpts from the "Professionally-produced music recordings" (MUS) dataset of task 6 in SiSEC 2016, a standard benchmark for evaluating voice/music separation algorithms.

Table 3.1: List of song snippets in the SiSEC-MUS dataset.

No.  Song                                  Duration (s)
2    Tamy - Que pena / Tanto faz           15
3    Another Dreamer - The ones we love    25
4    Fort Minor - Remember the name        25

• Training data: We use the training sets for speech and noise as presented in section 3.6.1.1.

• Test data: We used the benchmark dataset of the "Two-channel mixtures of speech and real-world background noise" (BGN) task within the SiSEC 2016 campaign. This dataset comprises 29 stereo mixtures, each 10 seconds long and sampled at 16 kHz, featuring male and female speech mixed with real-world noises recorded in public environments: cafeterias (Ca), squares (Sq), and subways (Su). The recordings differ in reverberation level, the cafeteria and subway recordings being more reverberant than those from squares. The signal-to-noise ratio of the mixtures was randomly set between −17 and +12 dB, providing diverse acoustic conditions. The dataset is divided into two subsets: the "devset" for development and the "testset" for evaluation.

- The devset includes 9 mixtures: three with Ca noise, four with Sq noise, and two with Su noise.

- The testset contains 20 mixtures: eight with Ca noise, eight with Sq noise, and four with Su noise.

6 https://sisec.inria.fr/sisec-2016/2016-professionally-produced-music-recordings/.

7 https://sisec.inria.fr/sisec-2016/bgn-2016/

3.6.2 Single-channel source separation performance with unsupervised setting

This section evaluates the source separation performance of the proposed unsupervised algorithm on the three datasets described in Section 3.6.1. The training sets are used to learn the GSSM following the procedure of Section 3.2, and the pre-trained GSSM then guides the decomposition of the observed mixtures in the test sets via Algorithm 4.

3.6.2.1 Experiment settings

The STFT was computed using a sliding window with a frame length of 1024 samples and 50% overlap. The number of NMF components was set to 32 for speech/vocals, 16 for noise, 15 for bass and drums, and 25 for the other sounds. The number of MU iterations was fixed to 100 during training, while in testing it was varied from 1 to 100 in order to assess the convergence of the algorithm. To evaluate the sensitivity of the proposed method, the trade-off parameter λ, which controls the sparsity penalty, was varied in {1, 10, 25, 50, 100, 200, 500}, and the penalty weighting factor γ in {0, 0.2, 0.4, 0.6, 0.8, 1}.

3.6.2.2 Evaluation method

The separation results were assessed using the BSS-EVAL criteria, namely the source-to-distortion ratio (SDR), the source-to-interference ratio (SIR), and the source-to-artifacts ratio (SAR), all measured in dB. These metrics quantify the overall distortion, the residual interference, and the artifacts introduced by the separation, respectively; higher values indicate better separation quality. The results were averaged over all sources. The BSS-EVAL criteria are widely adopted in the source separation community.
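The experiments rely on the BSS-EVAL criteria; an equivalent computation is available in the Python package mir_eval, as sketched below (the wrapper function is ours):

    import numpy as np
    import mir_eval

    def evaluate_separation(reference, estimated):
        """Average SDR/SIR/SAR in dB over sources using the BSS-EVAL criteria.

        reference, estimated : arrays of shape (n_sources, n_samples)
        """
        sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
            np.asarray(reference), np.asarray(estimated))
        return sdr.mean(), sir.mean(), sar.mean()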

We first compare the separation performance of the proposed algorithm with the following closely-related baseline algorithms on the Synthetic and SiSEC-MUS datasets:

• NMF without training: the baseline NMF approach described in Section 2.2, in which the spectral models for speech/vocals and noise/music are initialized with random nonnegative values and refined through the iterative updates (2.14) and (2.13).

• NMF without sparsity: the NMF-based algorithm of Section 2.2 without any sparsity constraint. The speech/vocal spectral model was trained on a single file obtained by concatenating all the corresponding samples of the training set described in Section 3.6.1.2; similarly, the noise/music spectral model was trained on a single file obtained by concatenating the noise samples of the noise training set.

• NMF - Block sparsity: proposed framework, combining NMF with the block sparsity constraint (3.5) [128].

• NMF - Component sparsity: proposed framework, combining NMF with the component sparsity constraint (3.6) [8].

On the SiSEC-BGN dataset, the results of the proposed algorithm were submitted to the SiSEC 2016 campaign, enabling a benchmark against other state-of-the-art methods that have participated in the campaign since 2013:

• Martinez-Munoz's algorithm (in SiSEC 2013): uses a source-filter model for the speech source, while the noise is modeled as a combination of pseudo-stationary broadband noise, impulsive noise, and pitched interferences; the parameters are estimated with NMF multiplicative update (MU) rules.

• Bryan's algorithm [17] (in SiSEC 2013): an interactive method that exploits human annotations on the mixture spectrogram to guide and refine the separation, using probabilistic latent component analysis (PLCA), which is equivalent to NMF.

• López's algorithm [82] (in SiSEC 2015): uses spectral subtraction, designing the demixing matrix and the post-filters based on a single-channel source separation method.

• Liu’s method [81] (in SiSEC 2016): performs Time Difference of Arrival (TDOA) clustering based on Generalized Cross Correlation Phase Transform (GCC-PHAT).

Table 3.2: Source separation performance obtained on the Synthetic and SiSEC-MUS datasets with the unsupervised setting.

                       Speech/Vocal       Noise/Music
Dataset   Method       SDR  SIR  SAR      SDR  SIR  SAR
[numeric entries not recoverable in this copy]

Table 3.3: Speech separation performance obtained on the SiSEC-BGN dataset. "*" indicates submissions by the authors and "-" indicates missing information [81, 98, 100]. Devset columns: Ca1, Sq1, Su1; testset columns: Ca1, Ca2, Sq1, Sq2, Su1, Su2.

Martinez-Munoz* (SiSEC 2013)
  SDR  devset: 5.4, 9.6, 1.5 (avg 6.4)     testset: 3.4, 3.7, 9.0, 10.9, 5.0, 2.2 (avg 6.1)
  SIR  devset: 15.4, 17.3, 5.8 (avg 14.1)  testset: 14.6, 17.1, 18.6, 20.5, 23.2, 5.9 (avg 17.1)
  SAR  devset: 6.1, 10.7, 5.8 (avg 7.9)    testset: 4.2, 4.0, 9.9, 11.5, 5.2, 6.0 (avg 7.0)
Bryan* [17] (SiSEC 2013)
  SDR  devset: 5.6, 10.2, 4.2 (avg 7.3)    testset: 3.7, 3.8, 13.1, 12.9, 5.6, 5.6 (avg 7.8)
  SIR  devset: 18.4, 15.6, 13.6 (avg 16.1) testset: 13.9, 16.5, 21.8, 18.2, 21.4, 23.0 (avg 18.5)
  SAR  devset: 5.9, 12.1, 4.9 (avg 8.4)    testset: 4.5, 4.2, 13.7, 14.6, 5.7, 5.7 (avg 8.5)
López* (SiSEC 2015)
  SDR  devset: -                           testset: 4.0, 4.5, 5.1, 11.0, -3.8, 3.9 (avg 4.9)
  SIR  devset: -                           testset: 14.9, 16.1, 9.6, 16.3, -1.6, 8.8 (avg 12.1)
  SAR  devset: -                           testset: 4.7, 5.0, 8.6, 13.0, 4.3, 6.3 (avg 7.3)
Liu* (SiSEC 2016)
  SDR  devset: 1.9, -3.0, -10.6 (avg -3.1) testset: 1.6, 2.7, -4.4, 1.9, -12.6, -1.2 (avg -1.0)
  SIR  devset: 4.0, -2.9, -9.7 (avg -2.1)  testset: 4.5, 7.7, -4.3, 2.4, -12.2, 0.1 (avg 0.9)
  SAR  devset: 7.5, 16.4, 6.9 (avg 11.3)   testset: 6.5, 5.5, 18.8, 16.9, 10.3, 8.0 (avg 11.4)
Proposed* (SiSEC 2016)
  [values not recoverable in this copy]

Figure 3.4: Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of MU iterations.

1) The convergence and stability of the algorithm

Figure 3.4 illustrates the convergence of the proposed algorithm on the Synthetic dataset: all performance metrics (SDR, SIR, and SAR) increase with the number of MU iterations and saturate after approximately 20 iterations. This confirms that the algorithm converges reliably and behaves stably.

3.7 Summary

This chapter introduced a novel single-channel source separation method that does not require clean training data, addressing a key limitation of conventional supervised approaches and making separation practical in data-limited scenarios. The main contributions are summarized as follows:

• We have proposed weakly-informed audio source separation algorithms for both the unsupervised and semi-supervised settings.

• A sparsity-inducing penalty which combines two existing group sparsity-inducing penalties has been proposed in order to take into account the advantages of both of them.

• We have analyzed the convergence of the algorithm and its robustness to variations of the hyper-parameters λ and γ, which is essential for tuning these parameters in practical implementations.

This chapter focused on single-channel audio source separation, where the mixture is monaural and NMF models the spectral characteristics of the sources. When multiple microphones are available, multichannel source separation becomes attractive, since it can additionally exploit spatial information about the sources. The next chapter therefore extends the spectral NMF model with a spatial component to improve separation in multichannel recordings. The novel contributions of this chapter have been published in the papers listed in the thesis's "List of publications".

Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK
